The shared last level cache (SLLC) in tiled chip multiprocessors (TCMPs) provides a low off-chip miss rate, but it incurs a long on-chip access latency. In a two-level cache hierarchy, data replication stores replicas of L1 victims in the local LLC (L2 cache) so that subsequent accesses to them hit in the local LLC with short latency. Many data replication mechanisms have been proposed, but they do not consider both the reuse behavior of L1 victims and the capability of the LLC to receive replicas. As a result, they either produce many useless replicas or increase LLC pressure, which limits the improvement of system performance. In this paper, we propose a two-level cache aware adaptive data replication mechanism (TCDR), which controls replication based on both prediction of L1 victim reuse behavior and monitoring of the LLC replica reception capability. TCDR not only increases the accuracy of L1 replica selection, but also avoids the pressure that replication puts on the LLC. The results show that TCDR improves system performance with reasonable hardware overhead.
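For intuition only, the following sketch shows one way such a replication decision could be structured: a small reuse predictor votes on whether an L1 victim is likely to be re-referenced, and a pressure counter tracks how well the local LLC is absorbing replicas. The table size, counter policy, and threshold are assumptions for illustration, not the structures actually used by TCDR.

/* Hypothetical sketch of an adaptive replication decision in the spirit of
 * the mechanism above: replicate an L1 victim into the local LLC only when
 * (a) a simple reuse predictor expects the victim to be re-referenced and
 * (b) a pressure monitor indicates the LLC can still absorb replicas. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PRED_ENTRIES 1024          /* reuse predictor: 2-bit counters      */
#define PRESSURE_LIMIT 64          /* assumed pressure threshold           */

static uint8_t reuse_counter[PRED_ENTRIES];
static unsigned llc_pressure;      /* e.g., useless replicas per epoch     */

static unsigned pred_index(uint64_t block_addr)
{
    return (unsigned)(block_addr % PRED_ENTRIES);
}

/* Called when an L1 victim is about to be dropped or written back. */
bool should_replicate(uint64_t block_addr)
{
    bool reuse_likely = reuse_counter[pred_index(block_addr)] >= 2;
    bool llc_has_room = llc_pressure < PRESSURE_LIMIT;
    return reuse_likely && llc_has_room;
}

/* Training hooks: strengthen on a replica hit, weaken on a useless replica. */
void on_replica_hit(uint64_t block_addr)
{
    uint8_t *c = &reuse_counter[pred_index(block_addr)];
    if (*c < 3) (*c)++;
    if (llc_pressure > 0) llc_pressure--;
}

void on_replica_evicted_unused(uint64_t block_addr)
{
    uint8_t *c = &reuse_counter[pred_index(block_addr)];
    if (*c > 0) (*c)--;
    llc_pressure++;
}

int main(void)
{
    on_replica_hit(0x40);                       /* pretend block 0x40 was reused twice */
    on_replica_hit(0x40);
    printf("replicate 0x40? %d\n", should_replicate(0x40));
    printf("replicate 0x80? %d\n", should_replicate(0x80));
    return 0;
}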
Jiaheng LIU Ryusuke EGAWA Hiroyuki TAKIZAWA
As the number of cores on a processor increases, cache hierarchies contain more cache levels and a larger last level cache (LLC). Thus, the power and energy consumption of the cache hierarchy becomes non-negligible. Meanwhile, because the cache usage behaviors of individual applications can be different, it is possible to achieve higher energy efficiency of the computing system by determining the appropriate cache configuration for each application. This paper proposes a cache control mechanism that improves energy efficiency by adjusting the cache hierarchy to each application. Our mechanism first bypasses and disables a less-significant cache level, then partially disables the LLC, and finally adjusts the associativity if the application suffers from a large number of conflict misses. The mechanism can achieve significant energy savings at the cost of a small performance degradation. The evaluation results show that our mechanism improves energy efficiency by 23.9% and 7.0% on average over the baseline and the cache-level bypassing mechanism, respectively. In addition, even if LLC resource contention occurs, the proposed mechanism is still effective in improving energy efficiency.
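As a rough illustration of the staged policy described above, the sketch below chooses a configuration from simple per-application statistics: bypass a cache level whose hit rate is negligible, shrink the LLC when the working set is small, and boost associativity when conflict misses dominate. All metrics and thresholds here are invented for illustration and are not the ones used in the paper.

/* Illustrative only: a staged cache reconfiguration decision driven by
 * hypothetical per-application counters (not the paper's actual policy). */
#include <stdio.h>

struct cache_stats {
    double l2_hit_rate;        /* hit rate of the intermediate level       */
    double llc_miss_rate;      /* overall LLC miss rate                    */
    double conflict_fraction;  /* fraction of LLC misses due to conflicts  */
};

struct cache_config {
    int bypass_l2;             /* 1 = bypass and power-gate the L2         */
    int llc_active_ways;       /* number of LLC ways left enabled          */
    int llc_assoc_boost;       /* 1 = trade capacity for associativity     */
};

static struct cache_config decide(const struct cache_stats *s, int llc_ways)
{
    struct cache_config c = { 0, llc_ways, 0 };

    if (s->l2_hit_rate < 0.05)           /* stage 1: L2 contributes little    */
        c.bypass_l2 = 1;
    if (s->llc_miss_rate < 0.02)         /* stage 2: working set fits easily  */
        c.llc_active_ways = llc_ways / 2;
    if (s->conflict_fraction > 0.5)      /* stage 3: conflict misses dominate */
        c.llc_assoc_boost = 1;

    return c;
}

int main(void)
{
    struct cache_stats s = { 0.03, 0.01, 0.6 };
    struct cache_config c = decide(&s, 16);
    printf("bypass L2: %d, active LLC ways: %d, assoc boost: %d\n",
           c.bypass_l2, c.llc_active_ways, c.llc_assoc_boost);
    return 0;
}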
Namyong JUNG Hyeongboo BAEK Donghyouk LIM Jinkyu LEE
As real-time embedded systems are required to accommodate various tasks with different levels of criticality, scheduling algorithms for MC (Mixed-Criticality) systems have been widely studied in the real-time systems community. Most studies have focused on MC uniprocessor systems, whereas there have been only a few studies supporting MC multiprocessor systems. In particular, although the ZL (Zero-Laxity) policy has been known to be an effective technique for improving the schedulability performance of base scheduling algorithms on SC (Single-Criticality) multiprocessor systems, the effectiveness of the ZL policy on MC multiprocessor systems has not been revealed to date. In this paper, we make the first attempt to realize the potential of the ZL policy for MC multiprocessor systems. To this end, we design the ZL policy for MC multiprocessor systems and apply it to EDF (Earliest Deadline First), yielding EDZL (Earliest Deadline first until Zero-Laxity) tailored for MC multiprocessor systems. Then, we develop a schedulability analysis for EDZL (as well as its base algorithm EDF) to support its timing guarantees. Our simulation results show a significant schedulability improvement of EDZL over EDF, demonstrating the effectiveness of the ZL policy for MC multiprocessor systems.
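To make the zero-laxity rule concrete, the sketch below shows the basic priority test that EDZL layers on top of EDF in the single-criticality setting: a job whose laxity (time to deadline minus remaining execution) has dropped to zero is promoted above all others, and jobs are otherwise ordered by EDF. The job fields and the two-task example are illustrative assumptions; the MC-specific design developed in the paper adds criticality-level handling on top of this.

/* Minimal sketch of the zero-laxity rule underlying EDZL (illustrative,
 * single-criticality version without the paper's MC extensions). */
#include <stdio.h>

struct job {
    const char *name;
    int deadline;   /* absolute deadline        */
    int remaining;  /* remaining execution time */
};

static int laxity(const struct job *j, int now)
{
    return j->deadline - now - j->remaining;
}

/* Nonzero if job a should run before job b at time 'now'. */
static int runs_before(const struct job *a, const struct job *b, int now)
{
    int za = laxity(a, now) <= 0;       /* a has zero (or negative) laxity */
    int zb = laxity(b, now) <= 0;
    if (za != zb)
        return za;                      /* zero-laxity jobs are promoted   */
    return a->deadline < b->deadline;   /* otherwise plain EDF             */
}

int main(void)
{
    /* EDF alone would pick A (earlier deadline), but B has zero laxity. */
    struct job a = { "A", 6, 2 }, b = { "B", 9, 8 };
    int now = 1;
    printf("at t=%d, %s runs first\n", now,
           runs_before(&a, &b, now) ? a.name : b.name);
    return 0;
}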
Hyeongboo BAEK Donghyouk LIM Jinkyu LEE
RTA (Response Time Analysis) is a popular technique to guarantee the timing requirements of a real-time system, and therefore the RTA framework has been widely studied for popular scheduling algorithms such as EDF (Earliest Deadline First) and FP (Fixed Priority). While a number of extended RTA techniques have been introduced, some of them cannot be safely used because their correctness has not been proved and their empirical performance has not been evaluated. In this letter, we address the state-of-the-art slack reclamation technique in the existing generic RTA framework for multiprocessors. We present a mathematical proof of its correctness and an empirical performance evaluation, neither of which has been presented to date.
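For context, a common form of the generic multiprocessor RTA computes an upper bound R_k on the response time of task \tau_k on m processors as a fixed point of the iteration

    R_k^{(l+1)} = C_k + \left\lfloor \frac{1}{m} \sum_{i \neq k} \min\!\left( W_i\!\left(R_k^{(l)}\right),\; R_k^{(l)} - C_k + 1 \right) \right\rfloor ,

starting from R_k^{(0)} = C_k and declaring \tau_k schedulable if the iteration converges with R_k \le D_k, where C_k and D_k are the worst-case execution time and relative deadline of \tau_k and W_i(\cdot) upper-bounds the workload of task \tau_i in a window of the given length. Slack reclamation tightens W_i by exploiting the slack D_i - R_i of tasks whose response-time bounds are already known. This formulation is given only as background on the generic framework; it is not necessarily the exact analysis treated in the letter.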
Hao XIAO Ning WU Fen GE Guanyu ZHU Lei ZHOU
This paper presents a synchronization mechanism that effectively implements lock and barrier protocols in a decentralized manner through explicit message passing. In the proposed solution, a simple and efficient synchronization control mechanism supports queued synchronization without contention. By using state-of-the-art Application-Specific Instruction-set Processor (ASIP) technology, we embed the synchronization functionality into a baseline processor, so that the proposed mechanism incurs ultra-low overhead. Experimental results show that the proposed synchronization achieves ultra-low latency and almost ideal scalability as the number of processors increases.
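As a software analogue of queued, message-passing synchronization, the sketch below models a lock whose home node keeps a FIFO of requesters, so that waiters receive a grant instead of repeatedly contending on shared state. The message handlers and data layout are illustrative assumptions; in the paper the corresponding functionality is embedded in the processor as ASIP instructions.

/* Software model of a queued, message-passing lock (illustrative only). */
#include <stdio.h>

#define NPROC 4

static int queue[NPROC], head, tail, count;  /* FIFO kept at the home node */
static int granted[NPROC];                   /* per-processor grant flag    */

/* "ACQUIRE" message from processor p arrives at the home node. */
static void msg_acquire(int p)
{
    queue[tail] = p;
    tail = (tail + 1) % NPROC;
    if (count++ == 0)                 /* lock was free: grant immediately */
        granted[p] = 1;
}

/* "RELEASE" message from the current holder arrives at the home node. */
static void msg_release(void)
{
    head = (head + 1) % NPROC;
    if (--count > 0)                  /* hand the lock to the next waiter */
        granted[queue[head]] = 1;
}

int main(void)
{
    msg_acquire(2);                   /* P2 gets the lock immediately */
    msg_acquire(0);                   /* P0 queues behind P2          */
    msg_release();                    /* P2 releases; P0 is granted   */
    printf("P0 granted: %d, P1 granted: %d\n", granted[0], granted[1]);
    return 0;
}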
Masayuki SATO Ryusuke EGAWA Hiroyuki TAKIZAWA Hiroaki KOBAYASHI
Chip multiprocessors (CMPs) improve performance by simultaneously executing multiple threads on integrated multiple cores. However, since these cores commonly share one cache, inter-thread cache conflicts often limit the performance improvement of multi-threading. This paper focuses on two causes of inter-thread cache conflicts. In the shared caches of CMPs, data fetched by one thread are frequently evicted by another thread. Such an eviction, called an inter-thread kickout (ITKO), is one of the major causes of inter-thread cache conflicts. The other cause is capacity shortage, which occurs when one cache is shared by threads demanding large cache capacities. If the total capacity demanded by the threads exceeds the actual cache capacity, the threads compete for the limited cache capacity, resulting in capacity shortage. To address inter-thread cache conflicts, we must take into account both ITKOs and capacity shortage. Therefore, this paper proposes a capacity-aware thread scheduling method combined with cache partitioning. In the proposed method, inter-thread cache conflicts due to ITKOs and capacity shortage are decreased by cache partitioning and thread scheduling, respectively. The proposed scheduling method estimates the capacity demand of each thread with the estimation method used in the cache partitioning mechanism. Based on this estimation, the thread scheduler decides which threads share each cache so as to avoid capacity shortage. Evaluation results suggest that the proposed method can improve overall performance by up to 8.1%, and the performance of individual threads by up to 12%. The results also show that both cache partitioning and thread scheduling are indispensable for avoiding ITKOs and capacity shortage simultaneously. Accordingly, the proposed method can significantly reduce inter-thread cache conflicts and hence improve performance.
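As a toy illustration of capacity-aware thread combination, the sketch below pairs the most demanding thread with the least demanding one so that, where possible, the combined demand of the threads sharing a cache stays within its capacity. The greedy pairing rule and the numbers are assumptions for illustration, not the scheduler actually evaluated in the paper.

/* Illustrative capacity-aware pairing of threads onto shared caches. */
#include <stdio.h>
#include <stdlib.h>

static int cmp_desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;
}

int main(void)
{
    int demand[] = { 768, 512, 256, 128 };   /* estimated demand in KB */
    int nthreads = 4, cache_kb = 1024;

    qsort(demand, nthreads, sizeof demand[0], cmp_desc);

    /* Pair the i-th largest demand with the i-th smallest. */
    for (int i = 0; i < nthreads / 2; i++) {
        int total = demand[i] + demand[nthreads - 1 - i];
        printf("cache %d: demands %d+%d KB -> %s\n", i,
               demand[i], demand[nthreads - 1 - i],
               total <= cache_kb ? "fits" : "capacity shortage");
    }
    return 0;
}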
Hao XIAO Tsuyoshi ISSHIKI Arif Ullah KHAN Dongju LI Hiroaki KUNIEDA Yuko NAKASE Sadahiro KIMURA
Ultra-wideband (UWB) technology has attracted much attention recently due to its high data rate and low emission power. Its media access control (MAC) protocol, WiMedia MAC, promises many facilities for high-speed and high-quality wireless communication. However, these benefits in turn involve a large computational load, which makes it difficult for the traditional uniprocessor-based implementation approach to provide the required performance, while the constrained cost and power budget makes commercial multiprocessor solutions unrealistic. In this paper, a low-cost and energy-efficient multiprocessor system-on-chip (MPSoC), which tackles at once the aspects of system design, software migration, and hardware architecture, is presented for the implementation of the UWB MAC layer. Experimental results show that the proposed MPSoC, based on four simple RISC processors and a shared-memory infrastructure, achieves up to 45% performance improvement and 65% power saving while taking 15% less area than the uniprocessor implementation.
Hideki MIWA Ryutaro SUSUKITA Hidetomo SHIBAMURA Tomoya HIRAO Jun MAKI Makoto YOSHIDA Takayuki KANDO Yuichiro AJIMA Ikuo MIYOSHI Toshiyuki SHIMIZU Yuji OINAGA Hisashige ANDO Yuichi INADOMI Koji INOUE Mutsumi AOYAGI Kazuaki MURAKAMI
In the near future, the interconnection networks of massively parallel computer systems will connect more than a hundred thousand computing nodes. Performance evaluation of such interconnection networks can provide real insights that help the development of efficient communication libraries. Hence, to evaluate the performance of such interconnection networks, simulation tools that can model the networks in sufficient detail, support a user-friendly interface for describing communication patterns, provide users with enough performance information, and complete simulations within a reasonable time are a real necessity. This paper introduces a novel interconnection network simulator, NSIM, for evaluating the performance of extreme-scale interconnection networks. The simulator implements a simplified simulation model so as to run faster without loss of accuracy. Unlike existing simulators, NSIM is built on the execution-driven simulation approach. The simulator also provides an MPI-compatible programming interface. Thus, the simulator can emulate parallel program execution and correctly simulate point-to-point and collective communications that are dynamically changed by network congestion. The experimental results in this paper show the sufficient accuracy of the simulator through comparison with a real machine. We also confirmed that the simulator is capable of evaluating ultra-large-scale interconnection networks, consumes less memory, and runs faster than an existing simulator. This paper also introduces a simulation service built on a cloud environment. Without installing NSIM, users can simulate interconnection networks with various configurations by using a web browser.
Utilizing a heterogeneous multiprocessor system has become a popular design paradigm to build an embedded system at a low cost. However, reliability, namely vulnerability to soft errors, has not been taken into account in the conventional IC (integrated circuit) design flow, whereas chip area, performance, and power consumption have been. This paper presents a system design paradigm in which a heterogeneous multiprocessor system is synthesized and its chip area is minimized under real-time and reliability constraints. First, we define an SEU vulnerability factor as a vulnerability measure for computer systems so that task-wise reliability can be evaluated over various processor structures. Next, we build a mixed integer linear programming (MILP) model for minimizing the chip area of a heterogeneous multiprocessor system under real-time and SEU vulnerability constraints. Finally, we show several experimental results on our synthesis approach. The experimental results show that our design paradigm achieves automatic generation of cost-competitive and reliable heterogeneous multiprocessor systems.
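To indicate the general shape of such a formulation, a heavily simplified area-minimization MILP (assuming independent tasks and a single common deadline, which is an illustrative simplification of the paper's more detailed model) can be written with binary variables x_p (processor p is instantiated) and y_{i,p} (task i is assigned to processor p):

    \min \sum_{p} a_p\, x_p
    \text{subject to}\quad \sum_{p} y_{i,p} = 1 \;\;\forall i,
    \qquad y_{i,p} \le x_p \;\;\forall i,p,
    \qquad \sum_{i} c_{i,p}\, y_{i,p} \le D \;\;\forall p,
    \qquad \sum_{i}\sum_{p} v_{i,p}\, y_{i,p} \le V_{\max},

where a_p is the area of processor p, c_{i,p} and v_{i,p} are the execution time and SEU vulnerability factor of task i on processor p, D is the common deadline, and V_{\max} is the vulnerability budget.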
In this paper, we consider the problem of assigning n independent tasks onto m identical processors in such a way that the overall execution time of the tasks is minimized. Unlike the conventional task assignment problem, we assume that the execution time of each task is not fixed in advance, and only upper and lower bounds on the execution time are given at compile time. In the following, we first provide a theoretical analysis of several conventional scheduling policies in terms of their worst-case slowdown compared with the outcome of an optimal off-line scheduling policy. It is shown that the best known algorithm in the literature achieves a worst-case performance ratio of 1 + 1/f(n), where f(n) = O(n^{2/3}) for any fixed m, which approaches one as n goes to infinity. We then propose a new scheme that achieves a better worst-case ratio of 1 + 1/g(n), where g(n) = Θ(n/log n) for any fixed m, which approaches one more quickly than previous schemes.
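Stated in symbols, one way to formalize the measure used above is the worst-case performance ratio of an assignment policy A,

    \rho(A) = \sup_{I} \frac{T_A(I)}{T_{\mathrm{OPT}}(I)},

where the supremum is over problem instances I (task sets with their execution-time bounds and realized execution times), T_A(I) is the makespan produced by A using only the bounds, and T_{\mathrm{OPT}}(I) is the makespan of an optimal off-line schedule that knows the exact execution times. The results above then say that the best previous scheme guarantees \rho = 1 + 1/f(n) with f(n) = O(n^{2/3}), while the proposed scheme guarantees \rho = 1 + 1/g(n) with g(n) = \Theta(n/\log n), which converges to 1 faster as n grows.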
Utilizing a heterogeneous multiprocessor system has become a popular design paradigm to build an embedded system at a low cost within a short development time. A reliability issue for embedded systems, vulnerability to single event upsets (SEUs), has become a matter of concern as technology advances. This paper discusses the reliability inherent in heterogeneous multiprocessors and proposes task scheduling for minimizing their SEU vulnerability. We experimentally show that increasing the performance of a CPU core deteriorates its reliability. Based on this observation, we propose task scheduling for reducing the SEU vulnerability of a heterogeneous multiprocessor system. The experimental results demonstrate that our task scheduling technique can substantially reduce SEU vulnerability under real-time constraints.
Mohammad ZALFANY URFIANTO Tsuyoshi ISSHIKI Arif ULLAH KHAN Dongju LI Hiroaki KUNIEDA
A simple extension used to assist the decomposition of task-level concurrency within C programs is presented in this paper. The concurrency decomposition is meant to serve as the entry point of the design flow for Multiprocessor System-on-Chip (MPSoC) architectures. Our methodology allows the (re)use of readily available reference C programs and enables easy and rapid exploration of alternative task partitioning strategies, a crucial step that greatly influences the overall quality of the designed MPSoC. A test case using a JPEG encoder application has been performed, and the results are presented in this paper.
Mohammad ZALFANY URFIANTO Tsuyoshi ISSHIKI Arif ULLAH KHAN Dongju LI Hiroaki KUNIEDA
This paper presents a Multiprocessor System-on-Chip (MPSoC) architecture used as an execution platform for the new C-language based MPSoC design framework we are currently developing. The MPSoC architecture is based on an existing SoC platform with a commercial RISC core acting as the host CPU. We extend the existing SoC with a multiprocessor-array block that is used as the main engine to run parallel applications modeled in our design framework. Utilizing several optimizations provided by our compiler, efficient inter-communication between processing elements with minimum overhead is implemented. A host interface is designed to integrate the existing RISC core with the multiprocessor array. The experimental results show that an efficacious integration is achieved, proving that the designed communication module can be used to efficiently incorporate off-the-shelf processors as processing elements in MPSoC architectures designed with our framework.
Hiroaki SHIKANO Jun SHIRAKO Yasutaka WADA Keiji KIMURA Hironori KASAHARA
A power-aware, compiler-controllable chip multiprocessor (CMP) is presented, and its performance and power consumption are evaluated with the optimally scheduled advanced multiprocessor (OSCAR) parallelizing compiler. The CMP is equipped with power control registers that change the clock frequency and power supply voltage of functional units, including processor cores, memories, and an interconnection network. The OSCAR compiler carries out coarse-grain task parallelization of programs and reduces power consumption using the architectural power control support and the compiler's power saving scheme. The performance evaluation shows that MPEG-2 encoding on the proposed CMP with four CPUs results in 82.6% power reduction in real-time execution mode with a deadline constraint based on its sequential execution time. Furthermore, MP3 encoding on a heterogeneous CMP with four CPUs and four accelerators results in 53.9% power reduction at a 21.1-fold speed-up over its sequential execution in the fastest execution mode.
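As general background on why lowering clock frequency and supply voltage together is so effective (this is standard CMOS background, not a formula from the paper), the dynamic power of a circuit scales roughly as

    P_{\mathrm{dyn}} \approx \alpha\, C\, V_{dd}^{2}\, f,

where \alpha is the switching activity, C the switched capacitance, V_{dd} the supply voltage, and f the clock frequency. Since the attainable f falls roughly in proportion to V_{dd}, scaling both together reduces power close to cubically while lengthening execution only about linearly, which is why slack under a real-time deadline can be traded for large power savings.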
Wei SUN Chen YU Xavier DEFAGO Yasushi INOGUCHI
The scheduling of real-time tasks with fault-tolerance requirements has been an important problem in multiprocessor systems. The primary-backup (PB) approach is often used as a fault-tolerant technique to guarantee the deadlines of tasks despite the presence of faults. In this paper we propose a dynamic PB-based task scheduling approach, wherein an allocation parameter is used to search the available time slots for a newly arriving task, and previously scheduled tasks can be re-scheduled when there is no available time slot for the newly arriving task. In order to improve schedulability, we also propose an overloading strategy for PB-overloading and backup-backup (BB) overloading. Our proposed task scheduling algorithm is compared with existing scheduling algorithms in the literature through simulation studies. The results show that the task rejection ratio of our real-time task scheduling algorithm is almost 50% lower than that of the compared algorithms.
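To give a flavor of PB-based allocation, the sketch below schedules the primary copy of an arriving task in the earliest feasible window and then places its backup on a different processor, starting no earlier than the primary's finish time so that the backup only needs to run if the primary fails. The per-processor "free from" model, the absence of overloading, and the numbers are deliberate simplifications of the slot search and PB/BB overloading proposed in the paper.

/* Simplified sketch of primary-backup (PB) slot allocation (illustrative). */
#include <stdio.h>

#define NPROC 3

int main(void)
{
    int free_from[NPROC] = { 4, 0, 2 };  /* earliest idle time per processor */
    int arrival = 1, exec = 3, deadline = 12;

    /* Primary: earliest start among all processors. */
    int p_cpu = 0;
    for (int i = 1; i < NPROC; i++)
        if (free_from[i] < free_from[p_cpu]) p_cpu = i;
    int p_start = free_from[p_cpu] > arrival ? free_from[p_cpu] : arrival;
    int p_end = p_start + exec;

    /* Backup: earliest start >= p_end on a different processor. */
    int b_cpu = -1, b_start = 0;
    for (int i = 0; i < NPROC; i++) {
        if (i == p_cpu) continue;
        int s = free_from[i] > p_end ? free_from[i] : p_end;
        if (b_cpu < 0 || s < b_start) { b_cpu = i; b_start = s; }
    }

    if (b_start + exec <= deadline)
        printf("primary on P%d at t=%d, backup on P%d at t=%d\n",
               p_cpu, p_start, b_cpu, b_start);
    else
        printf("task rejected: backup would miss deadline %d\n", deadline);
    return 0;
}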
Sangchul HAN Heeheon KIM Xuefeng PIAO Minkyu PARK Seongje CHO Yookun CHO
This letter proves the finish time predictability of the EDZL (Earliest Deadline Zero Laxity) scheduling algorithm, a variant of EDF, for multiprocessor real-time systems. Based on this result, it also shows that EDZL can successfully schedule any periodic task set whose total utilization is not greater than (m+1)/2, where m is the number of processors.
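Written out symbolically, the utilization condition above says that a periodic task set \tau = \{\tau_1, \ldots, \tau_n\} with worst-case execution times C_i and periods T_i is schedulable by EDZL on m processors if

    U(\tau) = \sum_{i=1}^{n} \frac{C_i}{T_i} \;\le\; \frac{m+1}{2}.

For example, on m = 4 processors the bound is 2.5, i.e., EDZL is guaranteed to meet all deadlines as long as the total utilization does not exceed 2.5.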
Jinseok KONG Pen-Chung YEW Gyungho LEE
Directory-based cache coherence schemes are commonly used in large-scale shared-memory multiprocessors, but most of them rely on heuristics to avoid large hardware requirements. We propose using physical address mapping on directories to significantly reduce the required directory size. This approach allows the directory size to grow as O(cn log2 n), as in optimal pointer-based directory schemes [11], where n is the number of nodes in the system and c is the number of cache lines in each cache memory. Performance aspects of the proposed scheme are studied in detail using simulation.
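A rough accounting explains the cited growth rate (this is a back-of-the-envelope argument, not the derivation in the paper): with n nodes each holding c cache lines, at most cn cached copies exist in the system at any time, and identifying the node that holds a copy requires a pointer of \lceil \log_2 n \rceil bits, so the total pointer storage needed is on the order of

    c\, n\, \lceil \log_2 n \rceil \text{ bits} = O(cn \log_2 n),

which grows far more slowly with n than full bit-vector directories, whose per-line overhead is proportional to n.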
Qi-Wei GE Chen LI Mitsuru NAKATA
This paper investigates the usefulness of a new priority list for two-processor scheduling problem of program nets. Firstly, we discuss the weakness of a previously proposed priority list and then introduce a new priority list. Through simulation experiment we show that the new priority list is better than the previous one and can generate the same length of schedules as GA scheduling, which implies the new priority list can generate approximately optimal schedules.
Jong Wook KWAK Hyong Jin BAN Chu Shik JHON
In this letter, we propose "Torus Ring," a modified version of the 2-level hierarchical ring. The Torus Ring has the same complexity as the hierarchical ring, since the only difference is the way it connects the local rings. It has an advantage over the hierarchical ring when the destination of a packet is an adjacent local ring, especially in the backward direction. Even when the destination of a network packet is assumed to be uniformly distributed across the processing nodes, the average number of hops in the Torus Ring is equal to that of the hierarchical ring. In practice, however, the performance gain of the Torus Ring is expected to increase due to the spatial locality of application programs in real parallel programming environments. In the simulation results, the latencies of the interconnection network are reduced by up to 19% with moderate ring utilization ratios.
Minkyu PARK Sangchul HAN Heeheon KIM Seongje CHO Yookun CHO
Multiprocessor architectures are becoming common in real-time systems as the workload of real-time systems increases. Recently, new deadline-based (EDF-based) multiprocessor scheduling algorithms have been devised, and comparative studies of their performance are necessary. In this paper, we compare EDZL, a hybrid of EDF and LLF, with other deadline-based scheduling algorithms such as EDF, EDF-US[m/(2m-1)], and fpEDF. We show that EDZL schedules all task sets schedulable by EDF. The experimental results show that the number of preemptions of EDZL is comparable to that of EDF, and that the schedulable utilization bound of EDZL is higher than those of the other algorithms we consider.