PARAdeg has been defined as a measure of the parallelism inherent in a program net. The computation of PARAdeg has been studied, but how well PARAdeg reflects the parallelism of program nets has not been evaluated quantitatively. In this paper, we perform this evaluation by applying a genetic algorithm to measure firing completion times for 400 program nets when PARAdeg processors, fewer processors, and more processors are provided. Our experimental results show that firing completion times decrease rapidly as the number of processors increases up to PARAdeg and only slowly as it increases beyond PARAdeg, which implies that PARAdeg is a reasonable standard for measuring the parallelism of program nets.
Bong-Joon JUNG Kwang-Il PARK Kyu Ho PARK
In static multiprocessor scheduling, heuristic algorithms have been widely used. In exchange for their execution speed, most of them produce unpromising solutions because they search only part of the solution space. In this paper, we propose a scheduling algorithm using the genetic algorithm (GA), a well-known stochastic search algorithm. The proposed algorithm, named ordered-deme GA (OGA), is based on the multiple-subpopulation GA, in which a global population is divided into several subpopulations (demes) and each deme evolves independently. To find better schedules, the OGA orders the demes from the highest to the lowest and migrates both the best and the worst individuals at the same time. In addition, the OGA adaptively assigns a different mutation probability to each deme to improve search capability. We compare the OGA with well-known heuristic algorithms and other GAs on random task graphs and on task graphs from real numerical problems. The results indicate that the OGA mostly finds better schedules than the others, although it is slower in terms of execution time.
Kirilka NIKOLOVA Atusi MAEDA Masahiro SOWA
All existing scheduling algorithms order the instructions of a program so that it can be executed in minimal time for only one fixed number of processors. In this paper we propose a new scheduling method, called the Parallelism-Independent Scheduling Method, which enables the scheduled program to be executed in near-optimal time on parallel computers with any degree of parallelism. We propose three Parallelism-Independent algorithms, each with the following phases: obtaining a parallel schedule with a list scheduling heuristic; optimizing the parallel schedule by rearranging the tasks in each level so that they can be executed efficiently with different degrees of parallelism; serializing the parallel schedule; and inserting markers for the parallel execution limits. The three algorithms differ in their optimization phase. To demonstrate the efficiency of our algorithms, we ran simulations on random directed acyclic graphs of different sizes and degrees of parallelism. We compared the results, in terms of schedule length, with those obtained using the Critical Path Algorithm separately for each degree of parallelism.
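As an illustration of the first phase only, the following is a minimal sketch of list scheduling for a task DAG, assuming unit-time tasks and a critical-path priority. The function name `list_schedule`, the unit task time, and the example graph are assumptions made for this sketch; they are not the authors' three algorithms, and the optimization, serialization, and marker-insertion phases are not shown.

```python
from collections import defaultdict

def list_schedule(num_tasks, edges, num_procs):
    """Schedule unit-time DAG tasks on num_procs processors (hypothetical sketch).

    Returns a list of time steps; each step lists the task ids run in that step.
    Ready tasks are prioritized by the longest path from the task to a sink.
    """
    succs = defaultdict(list)
    indeg = [0] * num_tasks
    for u, v in edges:
        succs[u].append(v)
        indeg[v] += 1

    # Topological order, then longest path to a sink (counted in tasks).
    deg = indeg[:]
    stack = [t for t in range(num_tasks) if deg[t] == 0]
    order = []
    while stack:
        u = stack.pop()
        order.append(u)
        for v in succs[u]:
            deg[v] -= 1
            if deg[v] == 0:
                stack.append(v)
    rank = [1] * num_tasks
    for u in reversed(order):
        for v in succs[u]:
            rank[u] = max(rank[u], 1 + rank[v])

    # Greedy list scheduling: run up to num_procs highest-priority ready tasks per step.
    remaining = indeg[:]
    ready = [t for t in range(num_tasks) if remaining[t] == 0]
    schedule = []
    while ready:
        ready.sort(key=lambda t: (-rank[t], t))
        step, ready = ready[:num_procs], ready[num_procs:]
        schedule.append(step)
        for u in step:
            for v in succs[u]:
                remaining[v] -= 1
                if remaining[v] == 0:
                    ready.append(v)  # becomes runnable from the next step on
    return schedule

# Example graph (an assumption): two sources feed task 2, which forks to 3 and 4, joined by 5.
EDGES = [(0, 2), (1, 2), (2, 3), (2, 4), (3, 5), (4, 5)]
for p in (1, 2, 3):
    print(p, "processors:", len(list_schedule(6, EDGES, p)), "steps")
```

Running the same graph for several processor counts shows why a schedule tuned for one degree of parallelism need not suit another: the schedule length depends on how many ready tasks each step can absorb.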
Satoshi FUJITA Masafumi YAMASHITA
In this paper, we consider the static multiprocessor scheduling problem for a class of multiprocessor systems consisting of m (≥ 1) identical processors connected by a complete network. The objective of this survey is to give a panoramic view of the theoretical and practical approaches to solving the problem that have been studied extensively over the past three decades.
Kentaro SANO Hiroyuki KITAJIMA Hiroaki KOBAYASHI Tadao NAKAMURA
A data-parallel processing approach is promising for real-time volume rendering because of the massive parallelism in volume rendering. In data-parallel volume rendering, the local results that processing elements (PEs) generate from their allocated subvolumes are integrated to form a final image. In general, this integration causes an unavoidable overhead due to communication among PEs. This paper proposes a data-parallel shear-warp volume rendering algorithm combined with an adaptive volume subdivision method to reduce the communication overhead and improve processing efficiency. We implement the parallel algorithm on a message-passing multiprocessor system for performance evaluation. The experimental results show that the adaptive volume subdivision method can reduce the overhead and achieve higher efficiency than a conventional slab subdivision method.
Tadaaki YAMAUCHI Lance HAMMOND Oyekunle A. OLUKOTUN Kazutami ARIMOTO
A microprocessor integrated with DRAM on the same die has the potential to improve system performance by reducing memory latency and improving memory bandwidth. In this paper we evaluate the performance of a single chip multiprocessor integrated with DRAM when the DRAM is organized as on-chip main memory and as on-chip cache. We compare the performance of this architecture with that of a more conventional chip which only has SRAM-based on-chip cache. The DRAM-based architecture with four processors outperforms the SRAM-based architecture on floating point applications which are effectively parallelized and have large working sets. This performance difference is significantly better than that possible in a uniprocessor DRAM-based architecture, which performs only slightly faster than an SRAM-based architecture on the same applications. In addition, on multiprogrammed workloads, in which independent processes are assigned to every processor in a single chip multiprocessor, the large bandwidth of on-chip DRAM can handle the inter-access contention better. These results demonstrate that a multiprocessor takes better advantage of the large bandwidth provided by the on-chip DRAM than a uniprocessor.
Kosuke YOSHIOKA Makoto HIRAI Kozo KIMURA Tokuzo KIYOHARA
In this paper, we introduce a processor called the Media Core Processor (MCP), which targets a system solution for consumer multimedia products. The MCP is a heterogeneous multiprocessor system designed to guarantee full-frame MPEG decoding and to reduce power consumption. In our processor architecture, each processing unit is optimized to support various characteristics of media processing. All processing units work in parallel in a macro-pipeline manner, thereby achieving high utilization of the processing units. A performance evaluation shows that full-frame audio/video decoding can be realized at a 54 MHz operating frequency without any support from external hardware or a CPU. In addition, the high programmability of the MCP provides flexibility and reduces the time-to-market.
In a shared-memory multiprocessor, shared data are usually accessed in a critical section that is protected by a lock variable. Therefore, the order in which multiple processors access the shared data corresponds to the order in which they acquire ownership of the lock variable. This paper presents a selective write-update protocol, in which data modified in a critical section are stored in a write cache and, at a synchronization point, are transferred only to the processor that will execute the critical section after the current processor. By using QOLB synchronization primitives, the next processor can be determined at execution time. We prove that the selective write-update protocol ensures data coherency for parallel programs that comply with release consistency, and we evaluate the performance of the protocol by analytical modeling and program-driven simulation. The simulation results show that our protocol can reduce the number of coherence misses in a critical section while avoiding the multicast of write-update requests on an interconnection network. In addition, we observe that synchronization latency can be decreased by reducing both the execution time of a critical section and the number of write-update requests. The simulation results also show that our protocol provides better performance than a write-invalidate protocol and a write-update protocol as the number of processors increases.
Hiroshi MATSUOKA Kazuaki OKAMOTO Hideo HIRONO Mitsuhisa SATO Takashi YOKOTA Shuichi SAKAI
In this paper we describe the pipeline design and enhanced hardware for fast message handling in the RICA-1 processor, a processing element (PE) of the RWC-1 multiprocessor. The RWC-1 is based on the reduced inter-processor communication architecture (RICA), in which communication is combined with computation in the processor pipeline. The pipeline is enhanced with hardware mechanisms to support fine-grain parallel execution. The data paths of the RICA-1 superscalar processor are shared between communication and instruction execution to minimize implementation cost. A 128-PE system was built in January 1998, and it is currently used for hardware debugging, software development, and performance evaluation.
Takashi MATSUMURA Morikazu NAKAMURA Juma OKECH Kenji ONAGA
In this paper we consider parallel and distributed computation of genetic algorithms on loosely-coupled multiprocessor systems. Loosely-coupled systems are more suitable for massively parallel processing and are easier to implement in VLSI than tightly-coupled ones. However, the communication overhead of parallel processing is more serious for loosely-coupled systems. We propose a parallel and distributed execution method for genetic algorithms on loosely-coupled multiprocessor systems with fixed network topologies, in which each processor element carries out genetic operations on its own chromosome set and communicates only with its neighbors in order to reduce communication overhead. We evaluate the proposed method on multiprocessor systems with ring, torus, and hypercube topologies for benchmark problem instances. From the results, we find that the ring topology is the most suitable for the proposed parallel and distributed execution, since the variety of chromosomes is preserved much better in the ring than in the others. Moreover, we also propose a new network topology called cone, which is a hierarchical connection of ring topologies. We show its effectiveness by experimental evaluation.
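The neighbor-only communication described above can be sketched as a single migration step on a ring. This is a minimal illustration under stated assumptions: the function names `ring_neighbors` and `migrate_on_ring`, the rule "send the best chromosome, replace the receiver's worst if the newcomer is fitter", and the bit-count fitness are all hypothetical and are not taken from the paper.

```python
def ring_neighbors(i, n):
    """Left and right neighbors of processor i on a ring of n processors."""
    return [(i - 1) % n, (i + 1) % n]

def migrate_on_ring(populations, fitness):
    """One migration step (hypothetical rule, assumed for illustration).

    Each processor sends a copy of its best chromosome to its ring neighbors;
    a receiver replaces its worst chromosome only if the incoming one is fitter.
    Returns new populations; the input lists are not modified.
    """
    n = len(populations)
    best = [max(pop, key=fitness) for pop in populations]  # best before migration
    new_pops = [list(pop) for pop in populations]
    for i in range(n):
        for j in set(ring_neighbors(i, n)):
            if j == i:
                continue  # a ring of one processor has no real neighbor
            pop = new_pops[j]
            worst = min(range(len(pop)), key=lambda k: fitness(pop[k]))
            if fitness(best[i]) > fitness(pop[worst]):
                pop[worst] = best[i]
    return new_pops

# Example (assumed data): four processors, fitness = number of 1 bits.
POPS = [
    [(1, 1, 1), (0, 0, 0)],
    [(0, 0, 1), (0, 0, 0)],
    [(0, 1, 0), (0, 0, 0)],
    [(1, 0, 0), (0, 0, 0)],
]
AFTER = migrate_on_ring(POPS, fitness=sum)
print([max(sum(c) for c in pop) for pop in AFTER])
```

In this example the fittest chromosome on processor 0 reaches only processors 1 and 3 after one step, while processor 2 is unaffected. Slow spreading of this kind is one plausible reading of why a sparsely connected ring keeps more chromosome variety than denser topologies.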
Shin-Mu TSENG Y. H. CHIN Wei-Pang YANG
We present a new scheduling policy named Value-based Processor Allocation (VPA-k) for scheduling value-based transactions in a multiprocessor real-time database system. The value of a transaction represents the profit the transaction contributes to the system if it is completed before its deadline. Under the VPA-k policy, transactions with higher values are given higher priority and execute first, while at most k percent of the total processors are dynamically allocated to urgent transactions. Simulation experiments show that the VPA-k policy substantially outperforms other scheduling policies in both maximizing the total obtained value and minimizing the number of missed transactions.
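One possible reading of this policy can be sketched as follows. The abstract does not define "urgent" or the exact allocation rule, so everything specific here is an assumption: the function name `vpa_k_allocate`, the `slack` field (time remaining before the deadline), the `urgent_slack` threshold, and the choice to serve urgent transactions in order of smallest slack.

```python
def vpa_k_allocate(transactions, num_procs, k, urgent_slack):
    """Pick which transactions run now (hypothetical sketch, not the paper's rule).

    transactions: list of dicts with keys 'id', 'value', 'slack'.
    At most k percent of the processors go to urgent transactions
    (assumed here to mean slack <= urgent_slack); the remaining processors
    go to the highest-value transactions.
    """
    urgent_cap = (num_procs * k) // 100
    urgent = sorted(
        (t for t in transactions if t["slack"] <= urgent_slack),
        key=lambda t: t["slack"],
    )
    chosen = [t["id"] for t in urgent[:urgent_cap]]
    taken = set(chosen)
    by_value = sorted(
        (t for t in transactions if t["id"] not in taken),
        key=lambda t: -t["value"],
    )
    chosen += [t["id"] for t in by_value[: num_procs - len(chosen)]]
    return chosen

# Example (assumed data): 4 processors, k = 25, so one processor may serve urgent work.
TXNS = [
    {"id": "A", "value": 10, "slack": 50},
    {"id": "B", "value": 8, "slack": 40},
    {"id": "C", "value": 1, "slack": 2},
    {"id": "D", "value": 2, "slack": 1},
    {"id": "E", "value": 9, "slack": 30},
]
print(vpa_k_allocate(TXNS, num_procs=4, k=25, urgent_slack=5))
```

The cap keeps low-value urgent transactions from crowding out high-value ones: in the example only the most urgent transaction gets the reserved processor, and the other three processors go to the highest values.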
Myoung Kwon TCHEUN Seung Ryoul MAENG Jung Wan CHO
To reduce memory access latency on shared-memory multiprocessors, several prefetching schemes have been proposed. The sequential prefetching scheme is a simple hardware-controlled scheme that exploits the sequentiality of memory accesses to predict which blocks will be read in the near future. Aggressive sequential prefetching prefetches many blocks on each miss to reduce the miss rate, and it performs well for application programs with high sequentiality. Conservative sequential prefetching, in contrast, prefetches only a few blocks on each miss to avoid prefetching useless blocks, and it performs better than aggressive sequential prefetching for application programs with low sequentiality. We analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching under various memory and network latencies, and we propose a new adaptive sequential prefetching scheme. By simply adding a small table to the sequential prefetching scheme, the proposed scheme prefetches a large number of blocks for application programs with high sequentiality, reducing the miss rate significantly, and prefetches a small number of blocks for application programs with low sequentiality, avoiding the loading of useless blocks.
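The idea of adapting the number of prefetched blocks to observed sequentiality can be sketched as below. The abstract does not describe the contents of the small table or the adaptation rule, so this sketch assumes both: the class name `AdaptivePrefetcher`, a table of block numbers where a sequential stream would miss next, and a double-or-halve rule for the prefetch degree are all hypothetical.

```python
from collections import deque

class AdaptivePrefetcher:
    """Hypothetical adaptive sequential prefetcher (a sketch, not the paper's design)."""

    def __init__(self, table_size=4, min_degree=1, max_degree=8):
        # Small table of block numbers where a sequential stream would miss next.
        self.expected = deque(maxlen=table_size)
        self.min_degree = min_degree
        self.max_degree = max_degree
        self.degree = min_degree

    def on_miss(self, block):
        """Return the block numbers to prefetch after a miss on `block`."""
        if block in self.expected:
            # The stream continued: earlier prefetches were useful, so be more aggressive.
            self.expected.remove(block)
            self.degree = min(self.degree * 2, self.max_degree)
        else:
            # No sequential pattern observed: be more conservative.
            self.degree = max(self.degree // 2, self.min_degree)
        prefetched = list(range(block + 1, block + 1 + self.degree))
        # If the stream continues, the next miss is just past the prefetched run.
        self.expected.append(block + 1 + self.degree)
        return prefetched

pf = AdaptivePrefetcher()
for miss in (100, 102, 105, 500, 900):
    print(miss, "->", pf.on_miss(miss))
```

With sequential misses (100, 102, 105) the prefetch degree grows from 1 to 4, and with scattered misses (500, 900) it shrinks back to 1, which mirrors the aggressive and conservative behaviors contrasted in the text.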
Toshihiro HANAWA Takayuki KAMEI Hideki YASUKAWA Katsunobu NISHIMURA Hideharu AMANO
A novel approach to the cache-coherent Multistage Interconnection Network (MIN), called the MINC (MIN with Cache control mechanism), is proposed. In the MINC, the directory is located only in the shared memory, using the Reduced Hierarchical Bit-map Directory schemes (RHBDs). In the RHBD, the bit-map directory is reduced and carried in the packet header for quick multicasting without accessing the directory in each hierarchy. To reduce the unnecessary packets caused by compacting the bit map in the RHBD, a small cache called the pruning cache is introduced in the switching element. Simulation reveals that the pruning cache works most effectively when it is provided in every switching element of the first stage, and that it reduces congestion by more than 50% with only 4 entries. The MINC cache control chip with 16 inputs/outputs is implemented on an LPGA (Laser Programmable Gate Array) and works with a 66 MHz clock.
Takafumi AOKI Shinichi SHIONOYA Tatsuo HIGUCHI
This paper explores the potential of multiwave interconnections (optical interconnections that employ wavelength components as multiplexable information carriers) for constructing next-generation multiprocessor systems using MCM technology. A hypercube-based multiprocessor network called the multiwave hypercube (MWHC) is proposed, where multiwave interconnections provide highly-flexible dynamic communication channels among processing elements. A performance analysis shows that the use of multiwavelength optics makes possible the reduction of network complexity on an MCM substrate, while supporting low-latency message routing.
In this paper, we propose an efficient task scheduling scheme, called CTS (Class-based Task Scheduling), to obtain high performance in terms of high system utilization and low waiting times for tasks. While a better submesh allocation scheme can improve system performance, an allocation policy alone cannot improve performance significantly, because the FCFS task scheduling policy leads to large external fragmentation. The CTS strategy maintains four separate queues, one for each incoming task class. This avoids the blocking property incurred in FCFS scheduling. To reduce external fragmentation, a job in the CTS strategy tends to wait for an occupied submesh of the same size instead of using a new submesh. Simulation results indicate that the proposed scheduling strategy improves performance over the FCFS scheduling policy by significantly reducing the average waiting delay.
This paper focuses on recovering from transient processor faults in pipelined multiprocessor systems. A pipelined machine may employ out-of-order execution and branch prediction to increase performance, so a precise computation state may not be available. We propose an efficient scheme for maintaining a precise computation state in a pipelined machine. The goal of this paper is to implement checkpointing and rollback recovery using the precise interrupt technique in a pipelined system. A detailed analysis is included to demonstrate the effectiveness of this method.
Nakun SEONG Naihoon JUNG Byungho KIM Hyunsoo YOON
This paper presents intelligent memory, a new memory architecture capable of providing efficient lock-free synchronization. In the intelligent memory, a sequence of operations on a shared object associated with a memory module can be processed without any intervention, so that synchronization is provided by executing the critical section itself in that memory module. To this end, we present a memory architecture for the intelligent memory with a minimal instruction set and develop a programming model, called Critical Section Procedure (CSP), which consists of shared data structures and operations on them. Intelligent memory is intended to eliminate wasted processing time, such as busy waiting in spin locks and retries due to process contention in existing lock-free synchronization schemes. Simulation results show that the intelligent memory provides better throughput than the spin lock and the existing lock-free synchronization schemes.
SCI (Scalable Coherent Interface) is a pointer-based coherent directory scheme for massively parallel multiprocessors. Large message latency is one of the problems with SCI because of its linked-list structure: the search latency of messages can grow linearly with the number of processors. In this paper, we focus on a hierarchical architecture and propose a new scheme, EST (Extending SCI-Tree), which may reduce message traffic and also take advantage of the topology's properties. Simulation results show that the EST scheme is effective in reducing message latency and communication cost compared with other schemes.
Dingchao LI Akira MIZUNO Yuji IWAHORI Naohiro ISHII
This paper describes a new approach to the scheduling problem that assigns tasks of a parallel program described as a task graph onto parallel machines. The approach handles interprocessor communication and heterogeneity, based on using both the theoretical results developed so far and a lookahead scheduling strategy. The experimental results on randomly generated task graphs demonstrate the effectiveness of this scheduling heuristic.
Myung-Mook HAN Shoji TATSUMI Yasuhiko KITAMURA Takaaki OKUMOTO
In this paper we discuss a constrained optimization problem that is often encountered in geometric optimization. Since problems of this kind occur frequently, constrained genetic optimization has become a very important research topic. This paper proposes a new methodology for handling constraints with the Genetic Algorithm on a multiprocessor system (FIN) that has a self-similar network.