
Keyword Search Result

[Keyword] multiprocessors (18 hits)

1-18 of 18 hits
  • A Two-Level Cache Aware Adaptive Data Replication Mechanism for Shared LLC

    Qianqian WU  Zhenzhou JI

    LETTER-Computer System
    Publicized: 2022/03/25  Vol: E105-D No:7  Page(s): 1320-1324

    The shared last level cache (SLLC) in tiled chip multiprocessors (TCMP) provides a low off-chip miss rate, but it incurs a long on-chip access latency. In a two-level cache hierarchy, data replication stores replicas of L1 victims in the local LLC (L2 cache) so that subsequent accesses enjoy a short local LLC access latency. Many data replication mechanisms have been proposed, but they do not consider both L1 victim reuse behavior and LLC replica reception capability; they either produce many useless replicas or increase LLC pressure, which limits the achievable improvement in system performance. In this paper, we propose a two-level cache aware adaptive data replication mechanism (TCDR), which controls replication based on both prediction of L1 victim reuse behavior and monitoring of LLC replica reception capability. TCDR not only increases the accuracy of L1 replica selection, but also avoids putting replication pressure on the LLC. The results show that TCDR improves system performance with reasonable hardware overhead.
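
    The decision TCDR has to make on each L1 eviction can be pictured with a small sketch. The per-block reuse counter, the pressure metric, and both thresholds below are invented for illustration; the paper's actual predictor and monitor are hardware structures with their own policies.

        # Hypothetical model of a replicate-or-not decision combining
        # reuse prediction with LLC reception-capability monitoring.
        class ReplicationController:
            def __init__(self, reuse_threshold=2, pressure_limit=0.75):
                self.reuse_counter = {}              # per-block reuse history
                self.reuse_threshold = reuse_threshold
                self.pressure_limit = pressure_limit

            def record_reuse(self, block):
                # Called when a previously evicted L1 block is accessed again.
                self.reuse_counter[block] = self.reuse_counter.get(block, 0) + 1

            def should_replicate(self, victim, replicas_in_llc, llc_capacity):
                # Replicate only if reuse history predicts future reuse AND
                # the local LLC still has capability to receive replicas.
                reusable = self.reuse_counter.get(victim, 0) >= self.reuse_threshold
                pressure = replicas_in_llc / llc_capacity
                return reusable and pressure < self.pressure_limit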

  • A Conflict-Aware Capacity Control Mechanism for Deep Cache Hierarchy

    Jiaheng LIU  Ryusuke EGAWA  Hiroyuki TAKIZAWA

    PAPER-Computer System
    Publicized: 2022/03/09  Vol: E105-D No:6  Page(s): 1150-1163

    As the number of cores on a processor increases, cache hierarchies contain more cache levels and a larger last level cache (LLC), so the power and energy consumption of the cache hierarchy becomes non-negligible. Meanwhile, because individual applications exhibit different cache usage behaviors, the energy efficiency of the computing system can be improved by determining the appropriate cache configuration for each application. This paper proposes a cache control mechanism that improves energy efficiency by adjusting the cache hierarchy to each application. Our mechanism first bypasses and disables a less-significant cache level, then partially disables the LLC, and finally adjusts the associativity if the application suffers from a large number of conflict misses. The mechanism achieves significant energy savings at the cost of a small performance degradation. The evaluation results show that our mechanism improves energy efficiency by 23.9% and 7.0% on average over the baseline and the cache-level bypassing mechanisms, respectively. In addition, even when LLC resource contention occurs, the proposed mechanism remains effective at improving energy efficiency.
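
    The three stages can be pictured as a single-pass tuning function. The threshold values, statistic names, and way-halving policy below are assumptions made for the sketch; the paper's concrete control policies will differ.

        # Illustrative staged control: bypass a level, shrink the LLC,
        # then restore associativity if conflict misses dominate.
        def tune_cache(stats, config):
            # Stage 1: disable a less-significant level that filters
            # too few accesses to justify its energy cost.
            if stats["l2_hit_rate"] < 0.05:
                config["l2_enabled"] = False
            # Stage 2: partially disable LLC ways while misses stay negligible.
            if stats["llc_miss_rate"] < 0.01 and config["llc_ways"] > 1:
                config["llc_ways"] //= 2
            # Stage 3: if conflict misses dominate, raise associativity back up.
            elif stats["conflict_miss_ratio"] > 0.5:
                config["llc_ways"] = min(config["llc_ways"] * 2, config["max_ways"])
            return config

        stats = {"l2_hit_rate": 0.02, "llc_miss_rate": 0.005, "conflict_miss_ratio": 0.1}
        config = {"l2_enabled": True, "llc_ways": 16, "max_ways": 16}
        print(tune_cache(stats, config))
        # -> {'l2_enabled': False, 'llc_ways': 8, 'max_ways': 16}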

  • Distributed Synchronization for Message-Passing Based Embedded Multiprocessors

    Hao XIAO  Ning WU  Fen GE  Guanyu ZHU  Lei ZHOU

    LETTER-Architecture
    Vol: E98-D No:2  Page(s): 272-275

    This paper presents a synchronization mechanism that effectively implements the lock and barrier protocols in a decentralized manner through explicit message passing. The proposed solution uses a simple and efficient control mechanism to support queued synchronization without contention. By using state-of-the-art Application-Specific Instruction-set Processor (ASIP) technology, we embed the synchronization functionality into a baseline processor, giving the proposed mechanism ultra-low overhead. Experimental results show that the proposed synchronization achieves ultra-low latency and almost ideal scalability as the number of processors increases.
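
    A toy, message-level model of a queued lock in this decentralized spirit is sketched below. The message names (ACQUIRE, RELEASE, GRANT) and the home-node protocol are assumptions of the sketch, not the paper's ASIP design.

        # Queued lock hand-off via explicit messages: requesters queue at a
        # home node and the lock passes FIFO, with no spinning or contention.
        from collections import deque

        class HomeNode:
            def __init__(self):
                self.waiters = deque()
                self.held_by = None

            def on_message(self, msg, sender):
                out = []                               # messages to send
                if msg == "ACQUIRE":
                    if self.held_by is None:
                        self.held_by = sender
                        out.append(("GRANT", sender))  # lock was free
                    else:
                        self.waiters.append(sender)    # queue, no retries
                elif msg == "RELEASE":
                    assert sender == self.held_by
                    if self.waiters:
                        nxt = self.waiters.popleft()   # FIFO hand-off
                        self.held_by = nxt
                        out.append(("GRANT", nxt))
                    else:
                        self.held_by = None
                return out

        home = HomeNode()
        print(home.on_message("ACQUIRE", 0))   # [('GRANT', 0)]
        print(home.on_message("ACQUIRE", 1))   # [] -- queued
        print(home.on_message("RELEASE", 0))   # [('GRANT', 1)]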

  • A Capacity-Aware Thread Scheduling Method Combined with Cache Partitioning to Reduce Inter-Thread Cache Conflicts

    Masayuki SATO  Ryusuke EGAWA  Hiroyuki TAKIZAWA  Hiroaki KOBAYASHI

    PAPER-Computer System
    Vol: E96-D No:9  Page(s): 2047-2054

    Chip multiprocessors (CMPs) improve performance by executing multiple threads simultaneously on multiple integrated cores. However, since these cores commonly share one cache, inter-thread cache conflicts often limit the performance improvement from multi-threading. This paper focuses on two causes of inter-thread cache conflicts. In the shared caches of CMPs, data fetched by one thread are frequently evicted by another thread. Such an eviction, called an inter-thread kickout (ITKO), is one of the major causes of inter-thread cache conflicts. The other cause is capacity shortage, which occurs when one cache is shared by threads demanding large cache capacities. If the total capacity demanded by the threads exceeds the actual cache capacity, the threads compete for the limited cache capacity, resulting in capacity shortage. To address inter-thread cache conflicts, we must take into account both ITKOs and capacity shortage. Therefore, this paper proposes a capacity-aware thread scheduling method combined with cache partitioning. In the proposed method, inter-thread cache conflicts due to ITKOs and capacity shortage are reduced by cache partitioning and thread scheduling, respectively. The scheduling method estimates the capacity demand of each thread with the estimation method used in the cache partitioning mechanism, and based on this estimate the thread scheduler decides which threads share each cache so as to avoid capacity shortage. Evaluation results suggest that the proposed method can improve overall performance by up to 8.1%, and the performance of individual threads by up to 12%. The results also show that both cache partitioning and thread scheduling are indispensable for avoiding ITKOs and capacity shortage simultaneously. Accordingly, the proposed method can significantly reduce inter-thread cache conflicts and hence improve performance.
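
    The scheduler's pairing decision can be illustrated with a greedy sketch. The demand numbers, the two-threads-per-cache restriction, and the greedy pairing rule are simplifying assumptions; the real method derives demands from its cache-partitioning estimator.

        # Combine threads on shared caches so summed capacity demand
        # stays within the cache size, avoiding capacity shortage.
        def pair_threads(demands, cache_capacity):
            # demands: {thread_id: estimated cache demand in KB}
            order = sorted(demands, key=demands.get, reverse=True)
            pairs, leftovers = [], []
            while order:
                big = order.pop(0)
                # largest partner that still fits alongside 'big'
                partner = next((t for t in order
                                if demands[big] + demands[t] <= cache_capacity), None)
                if partner is not None:
                    order.remove(partner)
                    pairs.append((big, partner))
                else:
                    leftovers.append(big)   # runs without a cache-sharing peer
            return pairs, leftovers

        demands = {"A": 1600, "B": 300, "C": 1200, "D": 700}
        print(pair_threads(demands, cache_capacity=2048))
        # -> ([('A', 'B'), ('C', 'D')], [])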

  • Scheduling Task In-Trees on Distributed Memory Systems

    Sanjeev BASKIYAR

    PAPER-Theory and Models of Software
    Vol: E84-D No:6  Page(s): 685-691

    Tree task structures occur frequently in many applications where parallelization may be desirable. We present a formal treatment of non-preemptively scheduling task trees on distributed memory multiprocessors and show that the fundamental problems of scheduling (i) a task tree without inter-task communication on a fixed number of processors and (ii) a task tree with inter-task communication on an unbounded number of processors are NP-complete. For task trees that satisfy certain constraints, we present an optimal scheduling algorithm. The algorithm is shown to be optimal over a wider set of task trees than in previous works.

  • Error Models and Fault-Secure Scheduling in Multiprocessor Systems

    Koji HASHIMOTO  Tatsuhiro TSUCHIYA  Tohru KIKUNO

    PAPER-Fault Tolerance
    Vol: E84-D No:5  Page(s): 635-650

    A schedule for a parallel program is said to be 1-fault-secure if a system using the schedule can either produce correct output for the program or detect the presence of any faults in a single processor. Although several fault-secure scheduling algorithms have been proposed, they all apply only to a class of tree-structured task graphs with uniform computation cost. Moreover, they assume a stringent error model, called the redeemable error model, that considers extremely unlikely cases. In this paper, we first propose two new, more plausible error models which restrict the manner of error propagation. We then present three fault-secure scheduling algorithms, one for each of the three models. Unlike previous algorithms, the proposed algorithms can deal with any task graph with arbitrary computation and communication costs. Through experiments, we evaluate these algorithms and study the impact of the error models on the lengths of fault-secure schedules.

  • Scheduling DAGs on Message Passing m-Processor Systems

    Sanjeev BASKIYAR

    PAPER-Computer Systems
    Vol: E83-D No:7  Page(s): 1497-1507

    Scheduling directed acyclic task graphs (DAGs) onto multiprocessors is known to be an intractable problem. Although there have been several heuristic algorithms for scheduling DAGs onto multiprocessors, few address mapping onto a given number of completely connected processors with the objective of minimizing the finish time. We present an efficient algorithm called ClusterMerge to statically schedule directed acyclic task graphs onto a homogeneous, completely connected MIMD system with a given number of processors. The algorithm clusters tasks in a DAG using a longest-path heuristic and then iteratively merges these clusters until their number equals the number of available processors. Each resulting cluster is then scheduled on a separate processor. Using simulations, we demonstrate that ClusterMerge produces schedules with the same or lower execution times than those of other researchers, while using fewer processors. We also discuss pitfalls in the various approaches to defining the longest path in a directed acyclic task graph.
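
    A condensed, stdlib-only sketch of the two ClusterMerge phases as summarized above follows. The chain-following rule, tie-breaking, and the lightest-first merge criterion are guesses for illustration, not the paper's exact heuristic.

        # Phase 1: cluster tasks greedily along longest (critical) paths;
        # Phase 2: merge the lightest clusters until exactly m remain.
        from functools import lru_cache

        def cluster_merge(succ, cost, m):
            # succ: {task: [children]}, cost: {task: compute cost}
            @lru_cache(maxsize=None)
            def blevel(t):   # longest path length from t to an exit task
                return cost[t] + max((blevel(c) for c in succ.get(t, [])), default=0)

            clusters, placed = [], set()
            for t in sorted(cost, key=blevel, reverse=True):
                if t in placed:
                    continue
                chain = [t]
                placed.add(t)
                while True:
                    nxt = max((c for c in succ.get(chain[-1], []) if c not in placed),
                              key=blevel, default=None)
                    if nxt is None:
                        break
                    chain.append(nxt)
                    placed.add(nxt)
                clusters.append(chain)

            weight = lambda cl: sum(cost[t] for t in cl)
            while len(clusters) > m:
                clusters.sort(key=weight)
                lightest = clusters.pop(0)
                clusters[0] = clusters[0] + lightest   # fold into next-lightest
            return clusters

        succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
        cost = {"A": 2, "B": 3, "C": 1, "D": 2}
        print(cluster_merge(succ, cost, m=2))
        # -> [['A', 'B', 'D'], ['C']]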

  • Data-Parallel Volume Rendering with Adaptive Volume Subdivision

    Kentaro SANO  Hiroyuki KITAJIMA  Hiroaki KOBAYASHI  Tadao NAKAMURA

    PAPER-Computer Graphics
    Vol: E83-D No:1  Page(s): 80-89

    A data-parallel processing approach is promising for real-time volume rendering because of the massive parallelism in volume rendering. In data-parallel volume rendering, the local results that processing elements (PEs) generate from their allocated subvolumes are integrated to form the final image. This integration causes an unavoidable communication overhead among PEs. This paper proposes a data-parallel shear-warp volume rendering algorithm combined with an adaptive volume subdivision method to reduce the communication overhead and improve processing efficiency. We implement the parallel algorithm on a message-passing multiprocessor system for performance evaluation. The experimental results show that the adaptive volume subdivision method reduces the overhead and achieves higher efficiency than a conventional slab subdivision method.
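
    One way to picture adaptive subdivision, as opposed to equal-thickness slabs, is to cut the volume in proportion to an estimated per-region workload. The per-slice load estimates and the one-axis cut below are stand-ins for illustration; the paper's subdivision operates on its own cost model.

        # Split z-slices into slabs of roughly equal estimated work.
        def adaptive_slabs(load_per_slice, n_pes):
            total = sum(load_per_slice)
            target = total / n_pes
            bounds, acc, start = [], 0.0, 0
            for z, load in enumerate(load_per_slice):
                acc += load
                if acc >= target and len(bounds) < n_pes - 1:
                    bounds.append((start, z + 1))   # slab covers slices [start, z]
                    start, acc = z + 1, 0.0
            bounds.append((start, len(load_per_slice)))
            return bounds

        print(adaptive_slabs([1, 1, 8, 8, 1, 1, 1, 1], n_pes=2))
        # -> [(0, 4), (4, 8)]: work is balanced, not slab thickness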

  • Selective Write-Update: A Method to Relax Execution Constraints in a Critical Section

    Jae Bum LEE  Chu Shik JHON

    PAPER-Computer Systems
    Vol: E81-D No:11  Page(s): 1186-1194

    In a shared-memory multiprocessor, shared data are usually accessed in a critical section that is protected by a lock variable. Therefore, the order of accesses by multiple processors to the shared data corresponds to the order in which they acquire ownership of the lock variable. This paper presents a selective write-update protocol, in which data modified in a critical section are stored in a write cache and, at a synchronization point, transferred only to the processor that will execute the critical section after the current processor. By using QOLB synchronization primitives, the next processor can be determined at execution time. We prove that the selective write-update protocol ensures data coherency for parallel programs that comply with release consistency, and we evaluate the performance of the protocol by analytical modeling and program-driven simulation. The simulation results show that our protocol can reduce the number of coherence misses in a critical section while avoiding the multicast of write-update requests on an interconnection network. In addition, we observe that synchronization latency can be decreased by reducing both the execution time of a critical section and the number of write-update requests. The simulation results show that our protocol outperforms both a write-invalidate protocol and a write-update protocol as the number of processors increases.
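
    A toy rendering of the selective write-update idea is sketched below: updates made inside the critical section are buffered and, on release, sent only to the processor that the lock queue identifies as the next owner. The data structures and the lock-queue representation are illustrative assumptions.

        # Buffer critical-section stores; forward them point-to-point to
        # the next lock holder instead of multicasting to all sharers.
        class SelectiveWriteUpdate:
            def __init__(self):
                self.write_cache = {}          # addr -> value, filled in the section

            def store(self, addr, value):
                self.write_cache[addr] = value  # buffered, not broadcast

            def release(self, lock_queue, caches):
                updates = dict(self.write_cache)
                self.write_cache.clear()
                if lock_queue:                   # next owner known (QOLB-style queue)
                    nxt = lock_queue[0]
                    caches[nxt].update(updates)  # single point-to-point transfer
                return updates

        caches = {0: {}, 1: {}, 2: {}}
        p0 = SelectiveWriteUpdate()
        p0.store(0x100, 42)
        p0.release(lock_queue=[1], caches=caches)
        print(caches[1])   # {256: 42} -- only the next lock holder is updated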

  • A Simple Hardware Prefetching Scheme Using Sequentiality for Shared-Memory Multiprocessors

    Myoung Kwon TCHEUN  Seung Ryoul MAENG  Jung Wan CHO

    PAPER-Computer Hardware and Design
    Vol: E80-D No:11  Page(s): 1055-1063

    To reduce memory access latency on shared-memory multiprocessors, several prefetching schemes have been proposed. The sequential prefetching scheme is a simple hardware-controlled scheme that exploits the sequentiality of memory accesses to predict which blocks will be read in the near future. Aggressive sequential prefetching fetches many blocks on each miss to reduce miss rates, and performs well for application programs with high sequentiality. Conservative sequential prefetching, however, fetches only a few blocks on each miss to avoid prefetching useless blocks, and thus performs better than aggressive sequential prefetching for application programs with low sequentiality. We analyze the relationship between the sequentiality of application programs and the effectiveness of sequential prefetching under various memory and network latencies, and propose a new adaptive sequential prefetching scheme. By simply adding a small table to the sequential prefetching scheme, the proposed scheme prefetches a large number of blocks for application programs with high sequentiality, significantly reducing miss rates, and prefetches a small number of blocks for application programs with low sequentiality, avoiding the loading of useless blocks.
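
    The adaptive idea can be modeled with a small table that tracks, per memory region, whether recent misses were sequential, growing or shrinking the prefetch degree accordingly. The region granularity, degree bounds, and doubling/halving policy are arbitrary choices for this sketch.

        # Per-region sequentiality table driving an adaptive prefetch degree.
        class AdaptiveSequentialPrefetcher:
            def __init__(self, max_degree=8):
                self.last_block = {}   # region -> last missed block number
                self.degree = {}       # region -> current prefetch degree
                self.max_degree = max_degree

            def on_miss(self, block, region_bits=10):
                region = block >> region_bits
                d = self.degree.get(region, 1)
                if self.last_block.get(region) == block - 1:
                    d = min(d * 2, self.max_degree)   # sequential: be aggressive
                else:
                    d = max(d // 2, 1)                # not sequential: back off
                self.degree[region] = d
                self.last_block[region] = block
                return [block + i for i in range(1, d + 1)]  # blocks to prefetch

        pf = AdaptiveSequentialPrefetcher()
        print(pf.on_miss(100))   # [101] -- degree starts at 1
        print(pf.on_miss(101))   # [102, 103] -- sequential, degree doubled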

  • A Lookahead Heuristic for Heterogeneous Multiprocessor Scheduling with Communication Costs

    Dingchao LI  Akira MIZUNO  Yuji IWAHORI  Naohiro ISHII

    PAPER
    Vol: E80-D No:4  Page(s): 489-494

    This paper describes a new approach to the scheduling problem of assigning the tasks of a parallel program, described as a task graph, onto parallel machines. The approach handles interprocessor communication and heterogeneity, building on both previously developed theoretical results and a lookahead scheduling strategy. Experimental results on randomly generated task graphs demonstrate the effectiveness of this scheduling heuristic.

  • Parallel Parsing on a Loosely Coupled Multiprocessor

    Dong-Yul RA  Jong-Hyun KIM

    PAPER-Algorithm and Computational Complexity
    Vol: E79-D No:12  Page(s): 1620-1628

    In this paper, we introduce a parallel algorithm for parsing context-free languages. Our algorithm can handle arbitrary context-free grammars since it is based on Earley's algorithm. It can operate on any loosely coupled multiprocessor that provides a one-way ring topology. The algorithm uses p processors to parse an input string of length n, where 1 ≤ p ≤ n, and is shown to require O(n^3/p) time. It uses a simple job allocation strategy, yet achieves high load balancing and uses the processors efficiently.

  • The MDX (Multi-Dimensional X'bar): A Class of Networks for Large Scale Multiprocessors

    Atsushi MURATA  Taisuke BOKU  Hideharu AMANO

    PAPER-Interconnection Networks
    Vol: E79-D No:8  Page(s): 1116-1123

    Recent advances in semiconductor technology have made it possible to produce medium-sized crossbars at reasonable cost. To make the best use of the high bandwidth of such crossbars, indirect networks such as the base-m n-cube and HyperCross have been proposed and studied. In these networks, a node is connected to other nodes through crossbars in multiple dimensions. Although these networks are used in commercial machines, there has been almost no discussion of the class of networks that includes them. In this paper, a network class called the Multi-Dimensional X'bar (MDX), which includes the above two networks, is defined. Several new networks in this class are proposed, and the relationship between these networks and direct networks/multistage interconnection networks is discussed. Finally, routing methods for these new networks are proposed and the average distance is evaluated. The discussion and evaluation show that the MDX provides higher bandwidth than the corresponding multistage interconnection network with less hardware than the corresponding direct network.

  • Bottleneck Identification Methodology for Performance-Oriented Design of Shared-Bus Multiprocessors

    Chiung-San LEE  Tai-Ming PARNG

    PAPER-Computer Systems
    Vol: E78-D No:8  Page(s): 982-991

    A bottleneck identification methodology is proposed for the performance-oriented design of shared-bus multiprocessors, which are composed of several major subsystems (e.g., off-chip cache, bus, memory, I/O). The subsystem with the longest access time per instruction is the one that limits processor performance and creates a bottleneck in the system. The methodology also supports further refined analysis of the access time of the bottleneck subsystem to help identify the causes of the bottleneck. An example performance model of a particular shared-bus multiprocessor architecture with separate address and data buses is developed to illustrate the key idea of the bottleneck identification methodology. Access conflicts in subsystems and DMA transfers are also considered in the model.
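
    The core comparison reduces to ranking subsystems by access time per instruction (ATPI) and picking the maximum. The subsystem names match the abstract, but the cycle counts below are made up for illustration.

        # Identify the subsystem whose per-instruction access time is longest.
        def find_bottleneck(atpi):
            # atpi: {subsystem: average access time per instruction (cycles)}
            bottleneck = max(atpi, key=atpi.get)
            return bottleneck, atpi[bottleneck]

        atpi = {"off-chip cache": 0.8, "bus": 1.9, "memory": 1.2, "I/O": 0.1}
        print(find_bottleneck(atpi))   # ('bus', 1.9) -> the bus limits performance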

  • Neural Network Multiprocessors Applied with Dynamically Reconfigurable Pipeline Architecture

    Takayuki MORISHITA  Iwao TERAMOTO

    PAPER-Processors
    Vol: E77-C No:12  Page(s): 1937-1943

    Processing elements (PEs) with a dynamically reconfigurable pipeline architecture allow high-speed computation of a widely used neural model, the multi-layer perceptron with the backpropagation (BP) learning rule. The architecture, originally proposed for a single chip, is extended to a multiprocessor structure. Each PE holds an element of the synaptic weight matrix and the input vector. Multiple local buses, a swapping mechanism for the weight matrix and input vector, and transfer commands between processing elements allow the implementation of neural networks larger than the physical PE array. Peak performance, estimated from measurements of a single processing element, is 21.2 MCPS in the evaluation phase and 8.0 MCUPS during the learning phase at a clock frequency of 50 MHz. In this model, multi-layer perceptrons with 768 neurons and 131072 synapses are trained by the BP learning rule, corresponding to 1357 MCPS and 512 MCUPS with 64 processing elements and 32 neurons in each PE.
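
    The 64-PE figures quoted above are consistent with linear scaling of the single-PE rates, as a two-line check confirms:

        # Scaling check on the reported rates (single-PE rate x PE count).
        print(21.2 * 64)   # 1356.8 -> reported as 1357 MCPS (evaluation phase)
        print(8.0 * 64)    # 512.0  -> reported as 512 MCUPS (learning phase)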

  • Parallel Implementations of Back Propagation Networks on a Dynamic Data-Driven Multiprocessor

    Ali M. ALHAJ  Hiroaki TERADA

    PAPER-Computer Systems
    Vol: E77-D No:5  Page(s): 579-588

    The data-driven model of computation is well suited to flexible and highly parallel simulation of neural networks. First, the operational semantics of data-driven languages preserve the locality and functionality of neural networks and naturally describe their inherent parallelism. Second, asynchronous data-driven execution facilitates the implementation of large and scalable multiprocessor systems, which are necessary to obtain considerable simulation speedups. In this paper, we present a dynamic data-driven multiprocessor system and demonstrate its suitability for the parallel simulation of back propagation neural networks. Two parallel implementations are described and evaluated using an image data compression network. The system is scalable, and as a result, performance improved proportionally with the number of processors.

  • A Batcher-Double-Omega Network with Combining

    Kalidou GAYE  Hideharu AMANO

    PAPER-Computer Networks
    Vol: E75-D No:3  Page(s): 307-314

    The Batcher banyan network is well known as a non-blocking switching fabric. However, it is conflict-free only when no two packets share the same destination. To cope with arbitrary combinations of packets, an additional network or a special control sequence is required, which increases hardware cost or degrades performance. The Batcher Double Omega network with Combining (BDOC) is an elegant solution to this problem. It consists of a Batcher sorter and two double-sized Omega networks. As in the Batcher banyan network, packets are sorted by destination label in the Batcher sorter. In the first Omega network, called the distributer, a packet is routed by a tag corresponding to the sum of its label at the output of the Batcher sorter and its destination label. In the second (inverse) Omega network, called the concentrator, the original destination label is used as the routing tag, and packets are routed without any conflict. The BDOC is useful as an interconnection network connecting processors and memory modules in a multiprocessor. Unlike conventional multistage interconnection networks for multiprocessors, packets are transferred in a serial and synchronized manner. The simple structure of the switching element enables high-speed operation, which reduces the latency caused by serial communication. With pipelined circuit switching, the address and data packets share the same control signal, and the structure of the switching element is much simplified. Moreover, packet combining, which avoids hot-spot contention, is easily realized in the concentrator.
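
    The tag computation described above can be illustrated numerically. The sorted-position-plus-destination rule follows the abstract, but the modulo over the double-sized Omega network and the data layout are assumptions of this sketch.

        # Compute (distributer tag, concentrator tag) per packet: packets are
        # sorted by destination; the distributer tag is position + destination.
        def bdoc_tags(destinations, n_outputs):
            ranked = sorted(range(len(destinations)), key=lambda i: destinations[i])
            tags = {}
            for pos, i in enumerate(ranked):
                dist_tag = (pos + destinations[i]) % (2 * n_outputs)  # distributer
                tags[i] = (dist_tag, destinations[i])                 # concentrator
            return tags

        print(bdoc_tags([3, 1, 3, 0], n_outputs=4))
        # -> {3: (0, 0), 1: (2, 1), 0: (5, 3), 2: (6, 3)}
        # Note the two packets for destination 3 get distinct distributer
        # tags (5 and 6), so they can later be combined without conflict.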

  • Optimal Grain Size Determination for Tree-Structured Parallel Programs

    Tsuyoshi KAWAGUCHI

    PAPER
    Vol: E75-D No:1  Page(s): 35-43

    In this paper we study the problem of scheduling a tree-structured program on multiprocessors so as to minimize the total execution time, including communication delay between processors. The problem assumes that a sufficiently large number of processors are available. It is known that if the program structures are restricted to out-trees, the problem can be solved in O(n^2) time, where n denotes the number of modules of a program. However, the problem is NP-hard if the program structures are allowed to be in-trees. Until now, no optimal algorithm except an obvious one was known for the latter case, although some approximation algorithms had been proposed. We present an optimization algorithm with a nontrivial time bound of O(1.52^n n log n) for the in-tree case.