
Keyword Search Results

[Keyword] multi-core processor (6 hits)

Showing results 1-6 of 6
  • An Online Thermal-Pattern-Aware Task Scheduler in 3D Multi-Core Processors

    Chien-Hui LIAO  Charles H.-P. WEN  

     
    PAPER
    Vol. E100-A, No. 12, pp. 2901-2910

    Hotspots occur frequently in 3D multi-core processors (3D-MCPs) and can adversely affect both the reliability and the lifetime of a system. We present a new thermally constrained task scheduler based on thermal-pattern-aware voltage assignment (TPAVA) to reduce hotspots and optimize performance in 3D-MCPs. By analyzing the temperature profiles of different voltage assignments, TPAVA pre-emptively assigns different initial operating-voltage levels to cores to curb temperature increases in 3D-MCPs. The proposed task scheduler consists of an online allocation strategy and a new voltage-scaling strategy. In particular, the online allocation strategy uses the temperature-variation rates of the cores and takes into account two important thermal behaviors of 3D-MCPs, which allows it to effectively minimize hotspot occurrences in both thermally homogeneous and heterogeneous 3D-MCPs. Furthermore, a new vertical-grouping voltage-scaling (VGVS) strategy that considers thermal correlation in 3D-MCPs handles thermal emergencies. Experimental results indicate that, compared with a previous online thermally constrained task scheduler, the proposed scheduler reduces hotspot occurrences by approximately 66% (71%) and improves throughput by approximately 8% (2%) in thermally homogeneous (heterogeneous) 3D-MCPs. These results indicate that the proposed scheduler is an effective technique for suppressing hotspots and optimizing throughput in 3D-MCPs subject to thermal constraints.
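
    A minimal sketch of the two scheduling ideas above, assuming a simple core/stack model: the online allocator prefers the core whose temperature is rising most slowly, and the VGVS step lowers the voltage of a whole vertical stack at once because stacked cores are thermally correlated. All names and thresholds here are illustrative assumptions, not details from the paper.

        # Illustrative sketch only; T_HOT, Core, and the stack model are assumptions.
        T_HOT = 85.0  # assumed hotspot threshold in degrees Celsius

        class Core:
            def __init__(self):
                self.temp = 40.0                # current temperature
                self.temp_variation_rate = 0.0  # measured dT/dt
                self.voltage_level = 3          # discrete V-level, 0 = lowest

            def run(self, task):
                pass  # placeholder for dispatching the task to this core

        def allocate(task, cores):
            # Online allocation: among cores below the hotspot threshold,
            # pick the one heating up most slowly (smallest dT/dt).
            candidates = [c for c in cores if c.temp < T_HOT] or cores
            target = min(candidates, key=lambda c: c.temp_variation_rate)
            target.run(task)

        def vgvs(stacks):
            # Vertical-grouping voltage scaling: a thermal emergency in any
            # layer lowers the voltage of the whole vertical stack, since
            # stacked cores heat each other.
            for stack in stacks:
                if max(c.temp for c in stack) >= T_HOT:
                    for c in stack:
                        c.voltage_level = max(0, c.voltage_level - 1)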

  • Cache-Aware Virtual Machine Scheduling on Multi-Core Architecture

    Cheol-Ho HONG  Young-Pil KIM  Seehwan YOO  Chi-Young LEE  Chuck YOO  

     
    PAPER-Software System
    Vol. E95-D, No. 10, pp. 2377-2392

    Facing practical limits on increasing processor frequencies, manufacturers have turned to multi-core designs in their commercial products. In multi-core implementations, the cores in a physical package share the last-level caches to improve inter-core communication. To exploit this facility efficiently, operating systems must employ cache-aware schedulers. Unfortunately, virtualization software, a foundation technology of cloud computing, is not yet cache-aware or does not fully exploit the locality of the last-level caches. In this paper, we propose a cache-aware virtual machine scheduler for multi-core architectures. The proposed scheduler exploits the locality of the last-level caches to improve the performance of concurrent applications running on virtual machines. First, we provide a space-partitioning algorithm that migrates and clusters communicating virtual CPUs (VCPUs) into the same cache domain. Second, we provide a time-partitioning algorithm that co-schedules, or schedules in sequence, the clustered VCPUs. Finally, we present a theoretical analysis proving that our scheduling algorithm supports concurrent applications more efficiently than the default credit scheduler in Xen. We implemented our virtual machine scheduler in a recent Xen hypervisor with para-virtualized Linux-based operating systems, and we show that our approach can improve the performance of concurrent virtual machines by up to 19% compared with the credit scheduler.
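
    A rough sketch of the space-partitioning step, under assumed data structures (comm[v] maps a VCPU's peers to traffic volume): each VCPU is greedily pulled into the cache domain of its busiest already-placed peer. This is an illustration of the clustering idea, not the paper's algorithm verbatim; the time-partitioning step would then co-schedule the VCPUs that end up in the same domain.

        def cluster_vcpus(vcpus, comm, domains):
            # comm: dict of dicts, comm[v][p] = traffic between VCPUs v and p.
            load = {d: 0 for d in domains}
            placement = {}
            # Place the chattiest VCPUs first so clusters form around them.
            for v in sorted(vcpus, key=lambda v: -sum(comm[v].values())):
                peers = [p for p in comm[v] if p in placement]
                if peers:
                    # Migrate next to the busiest already-placed peer.
                    best = max(peers, key=lambda p: comm[v][p])
                    placement[v] = placement[best]
                else:
                    # Otherwise start a cluster in the least-loaded domain.
                    placement[v] = min(domains, key=lambda d: load[d])
                load[placement[v]] += 1
            return placement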

  • Efficient Implementation of OFDM Inner Receiver on a Programmable Multi-Core Processor Platform

    Wenhua FAN  Chen CHEN  Yun CHEN  Zhiyi YU  Xiaoyang ZENG  

     
    PAPER
    Vol. E95-B, No. 4, pp. 1241-1248

    This paper presents an efficient implementation of an OFDM inner receiver on a programmable multi-core processor platform, with CMMB as the target application. The platform consists of an array of programmable SIMD processors interconnected in a 2-D mesh network, which provides high performance and is well suited to wireless communication applications. Implemented on one cluster of 8 cores, the receiver performs symbol-timing, carrier-frequency-offset, and sampling-frequency-offset synchronization as well as channel estimation and equalization. Multiple optimization techniques are explored to improve system throughput, including task-level parallelism across cores, data-level parallelism on the SIMD cores, minimization of memory accesses, and route-length-minimizing task mapping. In addition, an efficient memory strategy and dedicated instructions for complex-valued computation further increase performance. Simulation results show that the inner receiver can achieve a throughput of up to 120 Mbps when operating at 750 MHz.
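
    An assumed example of how such a pipeline could be laid out on one 8-core cluster: the stage names follow the abstract, while the stage-to-core mapping, the split helper, and the run_on_core dispatcher are hypothetical illustrations of task- and data-level parallelism, not the paper's actual placement.

        # Hypothetical stage-to-core mapping for one 8-core cluster.
        PIPELINE = [
            ("symbol_timing",      [0]),
            ("cfo_sfo_sync",       [1, 2]),
            ("channel_estimation", [3, 4, 5]),
            ("equalization",       [6, 7]),
        ]

        def split(samples, n):
            # Evenly partition one OFDM symbol's samples into n chunks.
            k = (len(samples) + n - 1) // n
            return [samples[i:i + k] for i in range(0, len(samples), k)]

        def process(symbol, run_on_core):
            # run_on_core(core_id, stage, chunk) stands in for dispatching a
            # kernel to one SIMD core across the 2-D mesh network.
            data = symbol
            for stage, core_ids in PIPELINE:
                chunks = split(data, len(core_ids))           # data-level split
                parts = [run_on_core(c, stage, ch)
                         for c, ch in zip(core_ids, chunks)]  # task-level run
                data = [x for part in parts for x in part]    # gather results
            return data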

  • Computation-Communication Overlap of Linpack on a GPU-Accelerated PC Cluster

    Junichi OHMURA  Takefumi MIYOSHI  Hidetsugu IRIE  Tsutomu YOSHINAGA  

     
    PAPER
    Vol. E94-D, No. 12, pp. 2319-2327

    In this paper, we propose an approach to enhancing the performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node links. For one node with a quad-core Intel Xeon W3520 processor and an NVIDIA Tesla C1060 GPU card, we implement a CPU-GPU parallel double-precision general matrix-matrix multiplication (dgemm) operation and achieve performance improvements of 34% over the GPU-only case and 64% over the CPU-only case. For an entire 16-node cluster, each node of which is configured as above and connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark and achieve a performance improvement of 28% over the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. In our scheme, the main inter-node communication and the data transfer to GPU device memory are overlapped with the main computation task on the CPU cores. This relies on the multi-core processors found in almost all of today's high-performance computers: one CPU core handles communication tasks while the remaining CPU cores and the GPU simultaneously handle computation tasks. To enable overlap between inter-node communication and computation, we remove their tight dependence by breaking the main computation task into smaller tasks and rescheduling them. Since part of the CPU computation power is thus used for non-computation tasks, we experimentally find the optimal computation ratio for the CPUs; this ratio differs from that in the single-node parallel dgemm case.
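
    A self-contained toy of the overlap principle, not HPL code: a simulated inter-node broadcast runs on one thread while a blocked matrix update proceeds on the main thread, so the communication cost is hidden behind computation. In the real scheme each block update is further split between the GPU and the CPU cores by an experimentally tuned ratio.

        import threading, time

        def broadcast_next_panel():
            time.sleep(0.05)  # stands in for slow gigabit-Ethernet traffic

        def update_block(a, b, c, lo, hi):
            # Naive update of columns [lo, hi); real code would instead call
            # dgemm on the CPU cores and on the GPU.
            n = len(a)
            for i in range(n):
                for j in range(lo, hi):
                    c[i][j] += sum(a[i][k] * b[k][j] for k in range(n))

        n, nb = 64, 16
        A = [[1.0] * n for _ in range(n)]
        B = [[1.0] * n for _ in range(n)]
        C = [[0.0] * n for _ in range(n)]

        comm = threading.Thread(target=broadcast_next_panel)
        comm.start()                    # communication is launched first...
        for lo in range(0, n, nb):      # ...and hidden behind block updates
            update_block(A, B, C, lo, lo + nb)
        comm.join()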

  • 7T SRAM Enabling Low-Energy Instantaneous Block Copy and Its Application to Transactional Memory

    Shunsuke OKUMURA  Yuki KAGIYAMA  Yohei NAKATA  Shusuke YOSHIMOTO  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER-Circuit Design
    Vol. E94-A, No. 12, pp. 2693-2700

    This paper proposes a 7T SRAM that realizes a block-level simultaneous-copy feature. The proposed SRAM can be used for data transfer between local memories, for example as checkpoint-data storage or in transactional memory. The 1-Mb SRAM comprises 32-kb blocks, within which 16 kb of data can be copied in 33.3 ns at 1.2 V. The proposed scheme reduces the energy consumed by copying by 92.7% compared with the conventional read-modify-write approach. Applied to transactional memory, the proposed scheme can reduce the number of write-back cycles by 98.7% compared with a conventional memory system.
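
    As a purely software analogy (the paper's mechanism is a circuit-level 7T SRAM feature, not code), the sketch below shows why a block-level copy helps transactional memory: commit and abort each become one copy of the whole block instead of many per-line write-back cycles.

        class TxBlock:
            def __init__(self, data):
                self.committed = list(data)  # stable copy of the block
                self.working = list(data)    # speculative copy a transaction edits

            def write(self, addr, value):
                self.working[addr] = value   # stays speculative until commit

            def commit(self):
                # One block-level copy replaces many per-line
                # read-modify-write write-back cycles.
                self.committed = list(self.working)

            def abort(self):
                # Roll back by copying in the other direction.
                self.working = list(self.committed)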

  • Task Allocation with Algorithm Transformation for Reducing Data-Transfer Bottlenecks in Heterogeneous Multi-Core Processors: A Case Study of HOG Descriptor Computation

    Hasitha Muthumala WAIDYASOORIYA  Daisuke OKUMURA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-High-Level Synthesis and System-Level Design
    Vol. E93-A, No. 12, pp. 2570-2580

    Heterogeneous multi-core processors are attractive for media-processing applications because of their ability to draw on the strengths of different cores to improve overall performance. However, data-transfer bottlenecks, together with task-allocation constraints imposed by accelerator-incompatible operations, prevent us from realizing the full potential of heterogeneous multi-core processors. This paper presents a task-allocation method based on algorithm transformation that increases the freedom of task allocation. We use approximation methods such as CORDIC algorithms to map accelerator-incompatible operations onto accelerator cores. Experimental results for HOG descriptor computation show that the proposed task-allocation method reduces the data-transfer time by more than 82% and the total processing time by more than 79% compared with the conventional task-allocation method.
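
    As an example of the kind of algorithm transformation involved, the sketch below computes the gradient magnitude and angle needed by HOG with CORDIC in vectoring mode, using only shifts, adds, and a small angle table; this is what allows an operation such as atan to run on an accelerator core without native divide or trigonometric hardware. The 16-iteration count and the float arithmetic are illustrative stand-ins for a fixed-point hardware loop.

        import math

        ITER = 16
        ANGLES = [math.atan(2.0 ** -i) for i in range(ITER)]
        GAIN = 1.0
        for i in range(ITER):
            GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))  # CORDIC scale factor

        def cordic_vectoring(x, y):
            # Rotate (x, y) onto the x-axis; 2.0**-i stands in for a hardware
            # right-shift. Assumes x >= 0 (HOG code would fold in a quadrant
            # correction for the other half-plane).
            angle = 0.0
            for i, a in enumerate(ANGLES):
                if y > 0:   # rotate clockwise toward the x-axis
                    x, y, angle = x + y * 2.0 ** -i, y - x * 2.0 ** -i, angle + a
                else:       # rotate counter-clockwise
                    x, y, angle = x - y * 2.0 ** -i, y + x * 2.0 ** -i, angle - a
            return x / GAIN, angle  # (gradient magnitude, direction in radians)

        # e.g. cordic_vectoring(1.0, 1.0) returns approximately (1.414, 0.785)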