The search functionality is under construction.
The search functionality is under construction.

Keyword Search Result

[Keyword] multi-core(37hit)

21-37hit(37hit)

  • Acceleration of Block Matching on a Low-Power Heterogeneous Multi-Core Processor Based on DTU Data-Transfer with Data Re-Allocation

    Yoshitaka HIRAMATSU  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Toru NOJIRI  Kunio UCHIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Integrated Electronics

      Vol:
    E95-C No:12
      Page(s):
    1872-1882

    The large data-transfer time among different cores is a big problem in heterogeneous multi-core processors. This paper presents a method to accelerate the data transfers exploiting data-transfer-units together with complex memory allocation. We used block matching, which is very common in image processing, to evaluate our technique. The proposed method reduces the data-transfer time by more than 42% compared to the earlier works that use CPU-based data transfers. Moreover, the total processing time is only 15 ms for a VGA image with 1616 pixel blocks.

  • Cache-Aware Virtual Machine Scheduling on Multi-Core Architecture

    Cheol-Ho HONG  Young-Pil KIM  Seehwan YOO  Chi-Young LEE  Chuck YOO  

     
    PAPER-Software System

      Vol:
    E95-D No:10
      Page(s):
    2377-2392

    Facing practical limits to increasing processor frequencies, manufacturers have resorted to multi-core designs in their commercial products. In multi-core implementations, cores in a physical package share the last-level caches to improve inter-core communication. To efficiently exploit this facility, operating systems must employ cache-aware schedulers. Unfortunately, virtualization software, which is a foundation technology of cloud computing, is not yet cache-aware or does not fully exploit the locality of the last-level caches. In this paper, we propose a cache-aware virtual machine scheduler for multi-core architectures. The proposed scheduler exploits the locality of the last-level caches to improve the performance of concurrent applications running on virtual machines. For this purpose, we provide a space-partitioning algorithm that migrates and clusters communicating virtual CPUs (VCPUs) in the same cache domain. Second, we provide a time-partitioning algorithm that co-schedules or schedules in sequence clustered VCPUs. Finally, we present a theoretical analysis that proves our scheduling algorithm is more efficient in supporting concurrent applications than the default credit scheduler in Xen. We implemented our virtual machine scheduler in the recent Xen hypervisor with para-virtualized Linux-based operating systems. We show that our approach can improve performance of concurrent virtual machines by up to 19% compared to the credit scheduler.

  • C- and L-Band Parallel Configuration Optical Fiber Amplifier Employing Bundled Er3+-Doped Fiber

    Makoto YAMADA  Masaharu UNO  Hirotaka ONO  

     
    LETTER-Fiber-Optic Transmission for Communications

      Vol:
    E95-B No:10
      Page(s):
    3294-3297

    We propose a new configuration for a parallel fiber amplifier that can amplify both the C- and L-bands simultaneously by employing bundled Er3+-doped fiber (EDF). The bundled EDF is a candidate amplification medium for multi-core optical fiber amplifiers. Our parallel fiber amplifier is another application of the multi-core amplification medium. The amplifier achieves almost the same signal gain of 20 dB for both the C- and L-bands by using a bundled EDF, which is realized by bundling seven identical single-core EDFs.

  • Process Scheduling Based Memory Energy Management for Multi-Core Mobile Devices

    Tiefei ZHANG  Tianzhou CHEN  

     
    PAPER-Systems and Control

      Vol:
    E95-A No:10
      Page(s):
    1700-1707

    The energy consumption is always a serious problem for mobile devices powered by battery. As the capacity and density of off-chip memory continuous to scale, its energy consumption accounts for a considerable amount of the whole system energy. There are therefore strong demands for energy efficient techniques towards memory system. Different from previous works, we explore the different power management modes of the off-chip memory by process scheduling for the multi-core mobile devices. In particular, we schedule the processes based on their memory access characteristics to maximize the number of the memory banks being in low power mode. We propose a fast approximation algorithm to solve the scheduling process problem for the dual-core mobile device. And for those equipped with more than two cores, we prove that the scheduling process problem is NP-Hard, and propose two heuristic algorithms. The proposed algorithms are evaluated through a series of experiments, for which we have encouraging results.

  • Efficient Implementation of OFDM Inner Receiver on a Programmable Multi-Core Processor Platform

    Wenhua FAN  Chen CHEN  Yun CHEN  Zhiyi YU  Xiaoyang ZENG  

     
    PAPER

      Vol:
    E95-B No:4
      Page(s):
    1241-1248

    This paper presents an efficient implementation of OFDM inner receiver on a programmable multi-core processor platform with CMMB as an application. The platform consists of an array of programmable SIMD processors interconnected in a 2-D mesh network, which can provide high performance and is quite suitable for wireless communication applications. Implemented on one cluster with 8 cores, the receiver includes symbol timing, carrier frequency offset and sampling frequency offset synchronization, channel estimation and equalization. Multiple optimization techniques are explored to improve system throughput such as: task-level parallelism on many cores, data-level parallelism on SIMD cores, minimization of memory access and route-length-minimization task mapping techniques. Besides, efficient memory strategy and specific instructions for complex computation increase the performance. The simulation results show that the inner receiver could achieve a throughput of up to 120 Mbps when operating at 750 MHz.

  • Hybrid Wired/Wireless On-Chip Network Design for Application-Specific SoC

    Shouyi YIN  Yang HU  Zhen ZHANG  Leibo LIU  Shaojun WEI  

     
    PAPER

      Vol:
    E95-C No:4
      Page(s):
    495-505

    Hybrid wired/wireless on-chip network is a promising communication architecture for multi-/many-core SoC. For application-specific SoC design, it is important to design a dedicated on-chip network architecture according to the application-specific nature. In this paper, we propose a heuristic wireless link allocation algorithm for creating hybrid on-chip network architecture. The algorithm can eliminate the performance bottleneck by replacing multi-hop wired paths by high-bandwidth single-hop long-range wireless links. The simulation results show that the hybrid on-chip network designed by our algorithm improves the performance in terms of both communication delay and energy consumption significantly.

  • Minimum-Energy Semi-Static Scheduling of a Periodic Real-Time Task on DVFS-Enabled Multi-Core Processors

    Wan Yeon LEE  Hyogon KIM  Heejo LEE  

     
    LETTER

      Vol:
    E94-D No:12
      Page(s):
    2389-2392

    The proposed scheduling scheme minimizes the energy consumption of a real-time task on the multi-core processor with the dynamic voltage and frequency scaling capability. The scheme allocates a pertinent number of cores to the task execution, inactivates unused cores, and assigns the lowest frequency meeting the deadline. For a periodic real-time task with consecutive real-time instances, the scheme prepares the minimum-energy solutions for all input cases at off-line time, and applies one of the prepared solutions to each real-time instance at runtime.

  • Computation-Communication Overlap of Linpack on a GPU-Accelerated PC Cluster

    Junichi OHMURA  Takefumi MIYOSHI  Hidetsugu IRIE  Tsutomu YOSHINAGA  

     
    PAPER

      Vol:
    E94-D No:12
      Page(s):
    2319-2327

    In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and a NVIDIA Tesla C1060 GPU card, we implement a CPU–GPU parallel double-precision general matrix–matrix multiplication (dgemm) operation, and achieve a performance improvement of 34% compared with the GPU-only case and 64% compared with the CPU-only case. For an entire 16-node cluster, each node of which is the same as the above and is connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark, and achieve a performance improvement of 28% compared with the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. Our overlap GPU acceleration solution uses overlaps in which the main inter-node communication and data transfer to the GPU device memory are overlapped with the main computation task on the CPU cores. These overlaps use multi-core processors, which almost all of today's high-performance computers use. In particular, as well as using a CPU core for communication tasks, we also simultaneously use other CPU cores and the GPU for computation tasks. In order to enable overlap between inter-node communication and computation tasks, we eliminate their close dependence by breaking the main computation task into smaller tasks and rescheduling. Based on a scheme in which part of the CPU computation power is simultaneously used for tasks other than computation tasks, we experimentally find the optimal computation ratio for CPUs; this ratio differs from the case of parallel dgemm operation of one node.

  • Hybrid Parallel Extraction of Isosurface Components from 3D Rectilinear Volume Data

    Bong-Soo SOHN  

     
    LETTER-Computer Graphics

      Vol:
    E94-D No:12
      Page(s):
    2553-2556

    We describe an efficient algorithm that extracts a connected component of an isosurface, or a contour, from a 3D rectilinear volume data. The efficiency of the algorithm is achieved by three factors: (i) directly working with rectilinear grids, (ii) parallel utilization of a multi-core CPU for extracting active cells, the cells containing the contour, and (iii) parallel utilization of a many-core GPU for computing the geometries of a contour surface in each active cell using CUDA. Experimental results show that our hybrid parallel implementation achieved up to 20x speedup over existing methods on an ordinary PC. Our work coupled with the Contour Tree framework is useful for quickly segmenting, displaying, and analyzing a feature of interest in 3D rectilinear volume data without being distracted by other features.

  • 7T SRAM Enabling Low-Energy Instantaneous Block Copy and Its Application to Transactional Memory

    Shunsuke OKUMURA  Yuki KAGIYAMA  Yohei NAKATA  Shusuke YOSHIMOTO  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER-Circuit Design

      Vol:
    E94-A No:12
      Page(s):
    2693-2700

    This paper proposes 7T SRAM which realizes block-level simultaneous copying feature. The proposed SRAM can be used for data transfer between local memories such as checkpoint data storage and transactional memory. The 1-Mb SRAM is comprised of 32-kb blocks, in which 16-kb data can be copied in 33.3 ns at 1.2 V. The proposed scheme reduces energy consumption in copying by 92.7% compared to the conventional read-modify-write manner. By applying the proposed scheme to transactional memory, the number of write back cycles is possibly reduced by 98.7% compared with the conventional memory system.

  • An Investigation on Crosstalk in Multi-Core Fibers by Introducing Random Fluctuation along Longitudinal Direction

    Katsuhiro TAKENAGA  Yoko ARAKAWA  Shoji TANIGAWA  Ning GUAN  Shoichiro MATSUO  Kunimasa SAITOH  Masanori KOSHIBA  

     
    PAPER

      Vol:
    E94-B No:2
      Page(s):
    409-416

    The length dependence of the crosstalk in multi-core fibers has been investigated by introducing random fluctuation along longitudinal direction. The power coupling coefficients in the coupled-power theory in heterogeneous multi-core fiber with seven cores were estimated based on consideration of the power coupling coefficients of the homogeneous multi-core fiber. The crosstalk can be quantitatively evaluated by employing coupled-power theory instead of coupled-mode theory.

  • Photonic Crystal Multi-Core Fibers for Future High-Capacity Transmission Systems Open Access

    Kazunori MUKASA  Katsunori IMAMURA  Yukihiro TSUCHIDA  Ryuichi SUGIZAKI  

     
    INVITED PAPER

      Vol:
    E94-B No:2
      Page(s):
    376-383

    This paper describes recent developments of photonic crystal fibers (PCFs), which can realize ultra wide-band transmission or large Aeff, as well as photonic crystal multi-core fibers (PC-MCFs), which have large potentials as future high-capacity transmission lines using Space Division Multiplexing.

  • Energy-Saving Stochastic Scheduling of a Real-Time Parallel Task with Varying Computation Amount on Multi-Core Processors

    Wan Yeon LEE  Kyong Hoon KIM  

     
    LETTER-Systems and Control

      Vol:
    E94-A No:2
      Page(s):
    842-845

    The proposed scheduling scheme minimizes the mean energy consumption of a real-time parallel task, where the task has the probabilistic computation amount and can be executed concurrently on multiple cores. The scheme determines a pertinent number of cores allocated to the task execution and the instant frequency supplied to the allocated cores. Evaluation shows that the scheme saves manifest amount of the energy consumed by the previous method minimizing the mean energy consumption on a single core.

  • Task Allocation with Algorithm Transformation for Reducing Data-Transfer Bottlenecks in Heterogeneous Multi-Core Processors: A Case Study of HOG Descriptor Computation

    Hasitha Muthumala WAIDYASOORIYA  Daisuke OKUMURA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E93-A No:12
      Page(s):
    2570-2580

    Heterogeneous multi-core processors are attracted by the media processing applications due to their capability of drawing strengths of different cores to improve the overall performance. However, the data transfer bottlenecks and limitations in the task allocation due to the accelerator-incompatible operations prevents us from gaining full potential of the heterogeneous multi-core processors. This paper presents a task allocation method based on algorithm transformation to increase the freedom of task allocation. We use approximation methods such as CORDIC algorithms to map the accelerator-incompatible operations to accelerator cores. According to the experimental results using HOG descriptor computation, the proposed task allocation method reduces the data transfer time by more than 82% and the total processing time by more than 79% compared to the conventional task allocation method.

  • Efficient Parallel Learning of Hidden Markov Chain Models on SMPs

    Lei LI  Bin FU  Christos FALOUTSOS  

     
    INVITED PAPER

      Vol:
    E93-D No:6
      Page(s):
    1330-1342

    Quad-core cpus have been a common desktop configuration for today's office. The increasing number of processors on a single chip opens new opportunity for parallel computing. Our goal is to make use of the multi-core as well as multi-processor architectures to speed up large-scale data mining algorithms. In this paper, we present a general parallel learning framework, Cut-And-Stitch, for training hidden Markov chain models. Particularly, we propose two model-specific variants, CAS-LDS for learning linear dynamical systems (LDS) and CAS-HMM for learning hidden Markov models (HMM). Our main contribution is a novel method to handle the data dependencies due to the chain structure of hidden variables, so as to parallelize the EM-based parameter learning algorithm. We implement CAS-LDS and CAS-HMM using OpenMP on two supercomputers and a quad-core commercial desktop. The experimental results show that parallel algorithms using Cut-And-Stitch achieve comparable accuracy and almost linear speedups over the traditional serial version.

  • A Performance/Energy Analysis and Optimization of Multi-Core Architectures with Voltage Scaling Techniques

    Jeong-Gun LEE  Wook SHIN  Suk-Jin KIM  Eun-Gu JUNG  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E93-A No:6
      Page(s):
    1215-1225

    In this paper, we develop asymptotic analysis and simulation models to better understand the characteristics of performance and energy consumption in a multi-core processor design in which dynamic voltage scaling is used. Our asymptotic model is derived using Amdahl's law, Rent's rule and power equations to derive the optimum number of cores and their voltage levels. Our model can predict the possible impact of different multi-core processor configurations on the performance and energy consumption for given workload characteristics (e.g. available parallelism) and process technology parameters (e.g. ratios of dynamic and static energies to total energy). Through the asymptotic analysis and optimization based on the models, we can observe an asymptotic relationship between design parameters such as "the number of cores," "core size" and "voltage scaling strategies" of a multi-core architecture with regards to performance and energy consumption at an initial phase of the design.

  • Multi-Core/Multi-IP Technology for Embedded Applications Open Access

    Naohiko IRIE  Toshihiro HATTORI  

     
    INVITED PAPER

      Vol:
    E92-C No:10
      Page(s):
    1232-1239

    SoC has driven the evolution of embedded systems or consumer electronics. Multi-core/multi-IP is the key technology to integrate many functions on a SoC for future embedded applications. In this paper, the transition of SoC and its required functions for cellular phones as an example is described. And the state-of-the-art multi-core technology of homogeneous type and heterogeneous type are shown. When many cores and IPs are integrated on a chip, collaboration between cores and IPs becomes important to meet requirement. To realize it, "MPSoC Platform" concept and elementary technology for this platform is described.

21-37hit(37hit)