The search functionality is under construction.
The search functionality is under construction.

Keyword Search Result

[Keyword] multicore processor(7hit)

1-7hit
  • Local Memory Mapping of Multicore Processors on an Automatic Parallelizing Compiler

    Yoshitake OKI  Yuto ABE  Kazuki YAMAMOTO  Kohei YAMAMOTO  Tomoya SHIRAKAWA  Akimasa YOSHIDA  Keiji KIMURA  Hironori KASAHARA  

     
    PAPER

      Vol:
    E103-C No:3
      Page(s):
    98-109

    Utilization of local memory from real-time embedded systems to high performance systems with multi-core processors has become an important factor for satisfying hard deadline constraints. However, challenges lie in the area of efficiently managing the memory hierarchy, such as decomposing large data into small blocks to fit onto local memory and transferring blocks for reuse and replacement. To address this issue, this paper presents a compiler optimization method that automatically manage local memory of multi-core processors. The method selects and maps multi-dimensional data onto software specified memory blocks called Adjustable Blocks. These blocks are hierarchically divisible with varying sizes defined by the features of the input application. Moreover, the method introduces mapping structures called Template Arrays to maintain the indices of the decomposed multi-dimensional data. The proposed work is implemented on the OSCAR automatic parallelizing compiler and evaluations were performed on the Renesas RP2 8-core processor. Experimental results from NAS Parallel Benchmark, SPEC benchmark, and multimedia applications show the effectiveness of the method, obtaining maximum speed-ups of 20.44 with 8 cores utilizing local memory from single core sequential versions that use off-chip memory.

  • An Energy-Efficient Task Scheduling for Near Real-Time Systems on Heterogeneous Multicore Processors

    Takashi NAKADA  Hiroyuki YANAGIHASHI  Kunimaro IMAI  Hiroshi UEKI  Takashi TSUCHIYA  Masanori HAYASHIKOSHI  Hiroshi NAKAMURA  

     
    PAPER-Software System

      Pubricized:
    2019/11/01
      Vol:
    E103-D No:2
      Page(s):
    329-338

    Near real-time periodic tasks, which are popular in multimedia streaming applications, have deadline periods that are longer than the input intervals thanks to buffering. For such applications, the conventional frame-based schedulings cannot realize the optimal scheduling due to their shortsighted deadline assumptions. To realize globally energy-efficient executions of these applications, we propose a novel task scheduling algorithm, which takes advantage of the long deadline period. We confirm our approach can take advantage of the longer deadline period and reduce the average power consumption by up to 18%.

  • vCanal: Paravirtual Socket Library towards Fast Networking in Virtualized Environment

    Dongwoo LEE  Changwoo MIN  Young IK EOM  

     
    PAPER-Software System

      Pubricized:
    2015/11/11
      Vol:
    E99-D No:2
      Page(s):
    360-369

    Virtualization is no longer an emerging research area since the virtual processor and memory operate as efficiently as the physical ones. However, I/O performance is still restricted by the virtualization overhead caused by the costly and complex I/O virtualization mechanism, in particular by massive exits occurring on the guest-host switch and redundant processing of the I/O stacks at both guest and host. A para-virtual device driver may reduce the number of exits to the hypervisor, whereas the network stacks in the guest OS are still duplicated. Previous work proposed a socket-outsourcing technique that bypasses the redundant guest network stack by delivering the network request directly to the host. However, even by bypassing the redundant network paths in the guest OS, the obtained performance was still below 60% of the native device, since notifications of completion still depended on the hypervisor. In this paper, we propose vCanal, a novel network virtualization framework, to improve the performance of network access in the virtual machine toward that of the native machine. Implementation of vCanal reached 96% of the native TCP throughput, increasing the UDP latency by only 4% compared to the native latency.

  • Evaluation of an FPGA-Based Heterogeneous Multicore Platform with SIMD/MIMD Custom Accelerators

    Yasuhiro TAKEI  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E96-A No:12
      Page(s):
    2576-2586

    Heterogeneous multi-core architectures with CPUs and accelerators attract many attentions since they can achieve power-efficient computing in various areas from low-power embedded processing to high-performance computing. Since the optimal architecture is different from application to application, finding the most suitable accelerator is very important. In this paper, we propose an FPGA-based heterogeneous multi-core platform with custom accelerators for power-efficient computing. Using the proposed platform, we evaluate several applications and accelerators to identify many key requirements of the applications and properties of the accelerators. Such an evaluation is very important to select and optimize the most suitable accelerator according to the requirements of an application to achieve the best performance.

  • A Fully Programmable Reed-Solomon Decoder on a Multi-Core Processor Platform

    Bei HUANG  Kaidi YOU  Yun CHEN  Zhiyi YU  Xiaoyang ZENG  

     
    PAPER-Computer Architecture

      Vol:
    E95-D No:12
      Page(s):
    2939-2947

    Reed-Solomon (RS) codes are widely used in digital communication and storage systems. Unlike usual VLSI approaches, this paper presents a high throughput fully programmable Reed-Solomon decoder on a multi-core processor. The multi-core processor platform is a 2-Dimension mesh array of Single Instruction Multiple Data (SIMD) cores, and it is well suited for digital communication applications. By fully extracting the parallelizable operations of the RS decoding process, we propose multiple optimization techniques to improve system throughput, including: task level parallelism on different cores, data level parallelism on each SIMD core, minimizing memory access, and route length minimized task mapping techniques. For RS(255, 239, 8), experimental results show that our 12-core implementation achieve a throughput of 4.35 Gbps, which is much better than several other published implementations. From the results, it is predictable that the throughput is linear with the number of cores by our approach.

  • Scalable Cache-Optimized Concurrent FIFO Queue for Multicore Architectures

    Changwoo MIN  Hyung Kook JUN  Won Tae KIM  Young Ik EOM  

     
    LETTER

      Vol:
    E95-D No:12
      Page(s):
    2956-2957

    A concurrent FIFO queue is a widely used fundamental data structure for parallelizing software. In this letter, we introduce a novel concurrent FIFO queue algorithm for multicore architecture. We achieve better scalability by reducing contention among concurrent threads, and improve performance by optimizing cache-line usage. Experimental results on a server with eight cores show that our algorithm outperforms state-of-the-art algorithms by a factor of two.

  • A 98 GMACs/W 32-Core Vector Processor in 65 nm CMOS

    Xun HE  Xin JIN  Minghui WANG  Dajiang ZHOU  Satoshi GOTO  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E94-A No:12
      Page(s):
    2609-2618

    This paper presents a high-performance dual-issue 32-core SIMD platform for image and video processing. The SIMD cores support 8/16 bits SIMD MAC instructions, and vertical vector access. Eight cores with a 4-ports L2 cache are connected by CIB bus as a cluster. Four clusters are connected by mesh network. This hierarchical network can provide more than 192 GB/s low latency inter-core BW in average. The 4-ports L2 cache architecture is also designed to provide 192 GB/s L2 cache BW. To reduce coherence operation in large-scale SMP, an application specified protocol is proposed. Compared with MOESI, 67.8% of L1 cache energy can be saved in 32 cores case. The whole system including 32 vector cores, 256 KB L2 cache, 64-bit DDRII PHY and two PLL units, occupy 25 mm2 in 65 nm CMOS. It can achieve a peak performance of 375 GMACs and 98 GMACs/W at 1.2 V.