In this paper, we propose a fault-tolerance mechanism for microprocessors, which detects transient faults and recovers from them. The investigation of fault-tolerance techniques for microprocessors is driven by two issues: One regards deep submicron fabrication technologies. Future semiconductor technologies could become more susceptible to alpha particles and other cosmic radiation. The other is the increasing popularity of mobile platforms. Cellular telephones are currently used for applications which are critical to our financial security, such as mobile banking, mobile trading, and making airline ticket reservations. Such applications demand that computer systems work correctly. In light of this, we propose a mechanism which is based on an instruction reissue technique for incorrect data speculation recovery and utilizes time redundancy, and evaluate our proposal using a timing simulator.
Dirk FIMMEL Jan MULLER Renate MERKER
We present a new approach to the loop scheduling problem, which excels previous solutions in two important aspects: The resource constraints are formulated using flow graphs, and the initiation interval λ is treated as a rational variable. The approach supports heterogeneous processor architectures and pipelined functional units, and the Integer Linear Programming implementation produces an optimum loop schedule, whereby a minimum λ is achieved. Our flow graph model facilitates the cyclic binding of loop operations to functional units. Compared to previous research results, the solution can provide faster loop schedules and a significant reduction of the problem complexity and solution time.
This paper proposes constructive timing-violation (CTV) and evaluates its potential. It can be utilized both for increasing clock frequency and for reducing energy consumption. Increasing clock frequency over that determined by the critical paths causes timing violations. On the other hand, while supply voltage reduction can result in substantial power savings, it also causes larger gate delay and thus clock must be slow down in order not to violate timing constraints of critical paths. However, if any tolerant mechanisms are provided for the timing violations, it is not necessary to keep the constraints. Rather, the violations would be constructive for high clock frequency or for energy savings. From these observations, we propose the CTV, which is supported by the tolerant mechanism based on contemporary speculative execution mechanisms. We evaluate the CTV using a cycle-by-cycle simulator and present its considerably promising potential.
Kirilka NIKOLOVA Atusi MAEDA Masahiro SOWA
A parallel program with a fixed degree of parallelism cannot be executed efficiently, or at all, by a parallel computer with a different degree of parallelism. This will cause a problem in the distribution of software applications in the near future when parallel computers with various degrees of parallelism will be widely used. In this paper we propose a way to make the machine code of the programs parallelism-independent, i.e. executable in minimum time on parallel computers with any degree of parallelism. We propose and evaluate three parallelism-independent scheduling algorithms for direct acyclic graphs (DAGs) of tasks with non-uniform execution times. To prove their efficiency, we performed simulations both with random DAGs and DAGs extracted from real applications. We evaluate them in terms of schedule length, computation time and size of the scheduled program. Their results are compared to those of the traditional CP/MISF algorithm which is used separately for each number of processors.
Byung In MOON Dong Ryul RYU Jong Wook HONG Tae Young LEE Sangook MOON Yong Surk LEE
We have designed a 32-bit RISC microprocessor with 16-/32-bit fixed-point DSP functionality. This processor, called YD-RISC, combines both general-purpose microprocessor and digital signal processor (DSP) functionality using the reduced instruction set computer (RISC) design principles. It has functional units for arithmetic operation, digital signal processing (DSP) and memory access. They operate in parallel in order to remove stall cycles after DSP or load/store instructions, which usually need one or more issue latency cycles in addition to the first issue cycle. High performance was achieved with these parallel functional units while adopting a sophisticated five-stage pipeline structure. The pipelined DSP unit can execute one 32-bit multiply-accumulate (MAC) or 16-bit complex multiply instruction every one or two cycles through two 17-b 17-b multipliers and an operand examination logic circuit. Power-saving techniques such as power-down mode and disabling execution blocks allow low power consumption. In the design of this processor, we use logic synthesis and automatic place-and-route. This top-down approach shortens design time, while a high clock frequency is achieved by refining the processor architecture.
Wujian ZHANG Runde ZHOU Tsunehachi ISHITANI Ryota KASAI Toshio KONDO
This paper describes an improved multiresolution telescopic search algorithm (MRTlcSA) for block-matching motion estimation. The algorithm uses images with full and reduced bit resolution, and uses motion-track and adaptive-search-window strategies. Simulation results show that the proposed algorithm has low computational complexity and achieves good image quality. We have developed a systolic-architecture-based search engine that has split data paths. In the case of low bit-resolution, the throughput is increased by enhancing the operating parallelism. The new motion estimator works at a low clock frequency and a low supply voltage, and therefore has low power consumption.
Shinsuke KOBAYASHI Yoshinori TAKEUCHI Akira KITAJIMA Masaharu IMAI
In this paper, an architecture of multi-threaded processor for embedded systems is proposed and evaluated comparing with other processors for embedded systems. The experimental results show the trade-off of hardware costs and execution times among processors. Taking proposed multi-threaded processor into account as an embedded processor, design space of embedded systems are enlarged and more suitable architecture can be selected under some design constraints.
Kirilka NIKOLOVA Atusi MAEDA Masahiro SOWA
All the existing scheduling algorithms order the instructions of the program in such a way that it can be executed in minimal time only for one fixed number of processors. In this paper we propose a new scheduling method, called Parallelism-Independent Scheduling Method, which enables the execution of the scheduled program on parallel computers with any degree of parallelism in near-optimal time. We propose three Parallelism-Independent algorithms, which have the following phases: obtaining a parallel schedule by using a list scheduling heuristics, optimization of the parallel schedule by rearranging the tasks in each level, so that they can be executed efficiently with different degrees of parallelism, serialization of the parallel schedule, and insertion of markers for the parallel execution limits. The three algorithms differ in their optimization phase. To prove the efficiency of our algorithms, we have made simulations with random directed acyclic graphs with different size and degree of parallelism. We compared the results in terms of schedule length to those obtained using the Critical Path Algorithm separately for each degree of parallelism.
Hak-Jun KIM Sun-Mo KIM Sang-Bang CHOI
This research presents a novel analytic model to predict the instruction execution rate of superscalar processors using the queuing model with finite-buffer size and synchronous operation mode. The proposed model is also able to analyze the performance relationship between cache and pipeline. The proposed model takes into account various kinds of architectural parameters such as instruction-level parallelism, branch probability, the accuracy of branch prediction, cache miss, and etc. To prove the correctness of the model, we performed extensive simulations and compared the results with the analytic model. Simulation results showed that the proposed model can estimate the average execution rate accurately within 10% error in most cases. The proposed model can explain the causes of performance bottleneck which cannot be uncovered by the simulation method only. The model is also able to show the effect of the cache miss on the performance of out-of-order issue superscalar processors, which can provide an valuable information in designing a balanced system.
In order to improve microprocessor performance, we propose to utilize histories of dynamic instruction sequences. A lot of special purpose memories integrated in a processor chip hold the histories. In this paper, we describe the usefulness of using two special purpose memories: Non-Consecutive basic block Buffer (NCB) and Reference Prediction Table (RPT). The NCB improves instruction fetching efficiency in order to relieve control dependences. The RPT predicts data addresses in order to speculate data dependences. From the simulation study, it has been found that the proposed mechanisms improve processor performance by up to 49. 2%.
Hideo MATSUDA Toru KAWABATA Yukio KANEDA
In this paper we propose a new method for parallel execution of Prolog programs and present its implementation on a distributed memory parallel computer, Fujitsu AP1000. In our method a number of processes (named Prolog engines) explore different branches of a search tree (named tasks) in parallel, which is the same as OR-parallelism. Unlike OR-parallelism, the mapping between Prolog engines and tasks is statically determined like data parallelism. Each Prolog engine can decide which task is executed by the engine without communicating with the other engines. In many search problems, however, such static task mapping may cause imbalance on the processing time of each engine since the computational costs to explore branches may vary substantially. To cope with this issue, we devise a method to adjust the task imbalance by periodical exchanging how many tasks were processed for each engine. Also for reducing communication overhead in load balancing, we limit the scope of engines that exchange the load information each other. The effectiveness of our method is evaluated by measuring execution times for N Queens and the Traveling Salesman Problem on the AP1000. Using 512 processors, we obtained 355-fold speedup for N Queens and 343-fold speedup on the Traveling Salesman Problem.
High speed simulation of neural networks can be achieved through parallel implementations capable of exploiting their massive inherent parallelism. In this paper, we show how this inherent parallelism can be effectively exploited on parallel data-driven systems. By using these systems, the asynchronous parallelism of neural networks can be naturally specified by the functional data-driven programs, and maximally exploited by pipelined and scalable data-driven processors. We shall demonstrate the suitability of data-driven systems for the parallel simulation of neural networks through a parallel implementation of the widely used back propagation networks. The implementation is based on the exploitation of the network and training set parallelisms inherent in these networks, and is evaluated using an image data compression network.
This paper describes the derivation of a parallel program from a nondeterministic sequential program using a bottom-up parser as an example. The derivation procedure consists of two stages: exploitation of AND-parallelism and exploitation of OR-parallelism. An interpreter of the sequential parser BUP is first transformed so that processes for the nodes in a parsing tree can run in parallel. Then, the resultant program is transformed so that a nondeterministic search of a parsing tree can be done in parallel. The former stage is performed by hand-simulation, and the latter is accomplished by the compiler of ANDOR-, which is an AND/OR parallel logic programming language. The program finally derived, written in KL1 (Kernel Language of the FGCS Project), achieves an all-solution search without side effects. The program generated corresponds to an interpreter of PAX, a revised parallel version of BUP. This correspondence shows that the derivation method proposed in this paper is effective for creating efficient parallel programs.