IEICE global.ieice.org Site

Keyword Search Result

[Keyword] multi-core(37hit)

21-37hit(37hit)

Acceleration of Block Matching on a Low-Power Heterogeneous Multi-Core Processor Based on DTU Data-Transfer with Data Re-Allocation
Yoshitaka HIRAMATSU Hasitha Muthumala WAIDYASOORIYA Masanori HARIYAMA Toru NOJIRI Kunio UCHIYAMA Michitaka KAMEYAMA

PAPER-Integrated Electronics

Vol: E95-C No:12
Page(s): 1872-1882
The large data-transfer time among different cores is a major problem in heterogeneous multi-core processors. This paper presents a method to accelerate data transfers by exploiting data-transfer units (DTUs) together with complex memory allocation. We used block matching, which is very common in image processing, to evaluate our technique. The proposed method reduces the data-transfer time by more than 42% compared with earlier work that uses CPU-based data transfers. Moreover, the total processing time is only 15 ms for a VGA image with 16×16 pixel blocks.
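Block matching itself can be sketched as follows. This is a minimal, illustrative full search using the sum of absolute differences (SAD) in plain Python; the function names, the search-window size, and the choice of SAD as the cost are assumptions for illustration and are not taken from the paper, which concerns the data transfers around this kernel.

```python
def sad(ref, cur, rx, ry, cx, cy, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(
        abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
        for j in range(n)
        for i in range(n)
    )


def best_match(ref, cur, cx, cy, n=16, search=4):
    """Exhaustively search ref for the block of cur whose corner is (cx, cy).

    Returns (dx, dy, cost) for the displacement with the lowest SAD inside
    a +/- search window, skipping candidates that leave the frame.
    Returns None if no candidate fits inside the frame.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue  # candidate block would fall outside the frame
            cost = sad(ref, cur, rx, ry, cx, cy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best
```

Every candidate block reads a different, overlapping region of the reference frame, which is why the placement of image data in memory dominates the cost on a multi-core target.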
Cache-Aware Virtual Machine Scheduling on Multi-Core Architecture
Cheol-Ho HONG Young-Pil KIM Seehwan YOO Chi-Young LEE Chuck YOO

PAPER-Software System

Vol: E95-D No:10
Page(s): 2377-2392
Facing practical limits to increasing processor frequencies, manufacturers have resorted to multi-core designs in their commercial products. In multi-core implementations, cores in a physical package share the last-level caches to improve inter-core communication. To efficiently exploit this facility, operating systems must employ cache-aware schedulers. Unfortunately, virtualization software, which is a foundation technology of cloud computing, is not yet cache-aware or does not fully exploit the locality of the last-level caches. In this paper, we propose a cache-aware virtual machine scheduler for multi-core architectures. The proposed scheduler exploits the locality of the last-level caches to improve the performance of concurrent applications running on virtual machines. First, we provide a space-partitioning algorithm that migrates and clusters communicating virtual CPUs (VCPUs) in the same cache domain. Second, we provide a time-partitioning algorithm that co-schedules clustered VCPUs or schedules them in sequence. Finally, we present a theoretical analysis that proves our scheduling algorithm is more efficient in supporting concurrent applications than the default credit scheduler in Xen. We implemented our virtual machine scheduler in the recent Xen hypervisor with para-virtualized Linux-based operating systems. We show that our approach can improve performance of concurrent virtual machines by up to 19% compared to the credit scheduler.
C- and L-Band Parallel Configuration Optical Fiber Amplifier Employing Bundled Er³⁺-Doped Fiber
Makoto YAMADA Masaharu UNO Hirotaka ONO

LETTER-Fiber-Optic Transmission for Communications

Vol: E95-B No:10
Page(s): 3294-3297
We propose a new configuration for a parallel fiber amplifier that can amplify both the C- and L-bands simultaneously by employing bundled Er³⁺-doped fiber (EDF). The bundled EDF is a candidate amplification medium for multi-core optical fiber amplifiers. Our parallel fiber amplifier is another application of the multi-core amplification medium. The amplifier achieves almost the same signal gain of 20 dB for both the C- and L-bands by using a bundled EDF, which is realized by bundling seven identical single-core EDFs.
Process Scheduling Based Memory Energy Management for Multi-Core Mobile Devices
Tiefei ZHANG Tianzhou CHEN

PAPER-Systems and Control

Vol: E95-A No:10
Page(s): 1700-1707
Energy consumption is always a serious problem for battery-powered mobile devices. As the capacity and density of off-chip memory continue to scale, its energy consumption accounts for a considerable share of the whole system's energy. There is therefore strong demand for energy-efficient techniques for the memory system. Unlike previous work, we exploit the different power management modes of the off-chip memory through process scheduling on multi-core mobile devices. In particular, we schedule the processes based on their memory access characteristics to maximize the number of memory banks in low-power mode. We propose a fast approximation algorithm to solve the process scheduling problem for dual-core mobile devices. For devices equipped with more than two cores, we prove that the process scheduling problem is NP-hard and propose two heuristic algorithms. The proposed algorithms are evaluated through a series of experiments, for which we have encouraging results.
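The idea of scheduling by memory access characteristics can be illustrated for the dual-core case with a simple greedy pairing. This is only a sketch under stated assumptions: each process is described by the set of memory banks it touches, two processes run concurrently, and fewer active banks means more banks can stay in low-power mode. The function name and the greedy rule are hypothetical and are not the approximation algorithm proposed in the paper.

```python
from itertools import combinations


def pair_processes(bank_sets):
    """Greedily pair processes for a dual-core device.

    bank_sets maps a process name to the set of memory bank ids it accesses.
    Each step picks the pair whose combined bank set is smallest, so that
    as many banks as possible can remain in low-power mode while the pair runs.
    Returns a list of (process_a, process_b, active_bank_count).
    """
    remaining = dict(bank_sets)
    pairs = []
    while len(remaining) >= 2:
        a, b = min(
            combinations(sorted(remaining), 2),
            key=lambda p: len(remaining[p[0]] | remaining[p[1]]),
        )
        pairs.append((a, b, len(remaining[a] | remaining[b])))
        del remaining[a], remaining[b]
    return pairs
```

A greedy choice like this is not guaranteed to be optimal; it only shows how bank usage can drive the pairing decision.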
Efficient Implementation of OFDM Inner Receiver on a Programmable Multi-Core Processor Platform
Wenhua FAN Chen CHEN Yun CHEN Zhiyi YU Xiaoyang ZENG

PAPER

Vol: E95-B No:4
Page(s): 1241-1248
This paper presents an efficient implementation of an OFDM inner receiver on a programmable multi-core processor platform, with CMMB as the application. The platform consists of an array of programmable SIMD processors interconnected in a 2-D mesh network, which provides high performance and is well suited to wireless communication applications. Implemented on one cluster with 8 cores, the receiver includes symbol timing, carrier frequency offset and sampling frequency offset synchronization, channel estimation and equalization. Multiple optimization techniques are explored to improve system throughput, such as task-level parallelism on many cores, data-level parallelism on SIMD cores, minimization of memory access, and route-length-minimizing task mapping. In addition, an efficient memory strategy and specific instructions for complex computation increase the performance. The simulation results show that the inner receiver can achieve a throughput of up to 120 Mbps when operating at 750 MHz.
Hybrid Wired/Wireless On-Chip Network Design for Application-Specific SoC
Shouyi YIN Yang HU Zhen ZHANG Leibo LIU Shaojun WEI

PAPER

Vol: E95-C No:4
Page(s): 495-505
A hybrid wired/wireless on-chip network is a promising communication architecture for multi-/many-core SoCs. For application-specific SoC design, it is important to design a dedicated on-chip network architecture according to the nature of the application. In this paper, we propose a heuristic wireless link allocation algorithm for creating a hybrid on-chip network architecture. The algorithm can eliminate the performance bottleneck by replacing multi-hop wired paths with high-bandwidth single-hop long-range wireless links. The simulation results show that the hybrid on-chip network designed by our algorithm significantly improves performance in terms of both communication delay and energy consumption.
Minimum-Energy Semi-Static Scheduling of a Periodic Real-Time Task on DVFS-Enabled Multi-Core Processors
Wan Yeon LEE Hyogon KIM Heejo LEE

LETTER

Vol: E94-D No:12
Page(s): 2389-2392
The proposed scheduling scheme minimizes the energy consumption of a real-time task on a multi-core processor with dynamic voltage and frequency scaling (DVFS) capability. The scheme allocates a pertinent number of cores to the task execution, inactivates unused cores, and assigns the lowest frequency that meets the deadline. For a periodic real-time task with consecutive real-time instances, the scheme prepares the minimum-energy solutions for all input cases offline, and applies one of the prepared solutions to each real-time instance at runtime.
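The selection of a core count and the lowest frequency meeting the deadline can be sketched as below. The execution-time and power models are assumptions made for illustration only (an Amdahl-style serial fraction, dynamic power proportional to the cube of the frequency, and a constant static power per active core); the function name and all parameter defaults are hypothetical and do not come from the paper.

```python
def pick_configuration(work_cycles, deadline_s, freqs_hz, max_cores,
                       serial_fraction=0.1, static_w=0.05, k=1e-27):
    """Return (cores, frequency_hz, energy_j) with the lowest modeled energy.

    For each core count, only the lowest frequency that meets the deadline
    is considered; unused cores are assumed to be switched off entirely.
    Returns None when no configuration meets the deadline.
    """
    best = None
    for cores in range(1, max_cores + 1):
        # Assumed model: the serial part does not speed up with more cores.
        cycles = work_cycles * (serial_fraction + (1 - serial_fraction) / cores)
        for f in sorted(freqs_hz):
            t = cycles / f
            if t <= deadline_s:
                # Assumed model: per-core power = k * f^3 (dynamic) + static.
                energy = cores * (k * f ** 3 + static_w) * t
                if best is None or energy < best[2]:
                    best = (cores, f, energy)
                break  # lowest feasible frequency found for this core count
    return best
```

Because the inputs are known in advance for a periodic task, a table of such results could be computed offline and looked up at runtime, which mirrors the semi-static approach described in the abstract.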
Computation-Communication Overlap of Linpack on a GPU-Accelerated PC Cluster
Junichi OHMURA Takefumi MIYOSHI Hidetsugu IRIE Tsutomu YOSHINAGA

PAPER

Vol: E94-D No:12
Page(s): 2319-2327
In this paper, we propose an approach to obtaining enhanced performance of the Linpack benchmark on a GPU-accelerated PC cluster connected via relatively slow inter-node connections. For one node with a quad-core Intel Xeon W3520 processor and an NVIDIA Tesla C1060 GPU card, we implement a CPU–GPU parallel double-precision general matrix–matrix multiplication (dgemm) operation, and achieve a performance improvement of 34% compared with the GPU-only case and 64% compared with the CPU-only case. For an entire 16-node cluster, each node of which is the same as the above and is connected with two gigabit Ethernet links, we use a computation-communication overlap scheme with GPU acceleration for the Linpack benchmark, and achieve a performance improvement of 28% compared with the GPU-accelerated high-performance Linpack benchmark (HPL) without overlapping. Our overlap GPU acceleration solution uses overlaps in which the main inter-node communication and data transfer to the GPU device memory are overlapped with the main computation task on the CPU cores. These overlaps use multi-core processors, which almost all of today's high-performance computers use. In particular, as well as using a CPU core for communication tasks, we also simultaneously use other CPU cores and the GPU for computation tasks. In order to enable overlap between inter-node communication and computation tasks, we eliminate their close dependence by breaking the main computation task into smaller tasks and rescheduling. Based on a scheme in which part of the CPU computation power is simultaneously used for tasks other than computation tasks, we experimentally find the optimal computation ratio for CPUs; this ratio differs from the case of parallel dgemm operation of one node.
Hybrid Parallel Extraction of Isosurface Components from 3D Rectilinear Volume Data
Bong-Soo SOHN

LETTER-Computer Graphics

Vol: E94-D No:12
Page(s): 2553-2556
We describe an efficient algorithm that extracts a connected component of an isosurface, or a contour, from 3D rectilinear volume data. The efficiency of the algorithm is achieved by three factors: (i) working directly with rectilinear grids, (ii) parallel utilization of a multi-core CPU for extracting active cells, that is, the cells containing the contour, and (iii) parallel utilization of a many-core GPU for computing the geometries of a contour surface in each active cell using CUDA. Experimental results show that our hybrid parallel implementation achieved up to 20x speedup over existing methods on an ordinary PC. Our work, coupled with the Contour Tree framework, is useful for quickly segmenting, displaying, and analyzing a feature of interest in 3D rectilinear volume data without being distracted by other features.
7T SRAM Enabling Low-Energy Instantaneous Block Copy and Its Application to Transactional Memory
Shunsuke OKUMURA Yuki KAGIYAMA Yohei NAKATA Shusuke YOSHIMOTO Hiroshi KAWAGUCHI Masahiko YOSHIMOTO

PAPER-Circuit Design

Vol: E94-A No:12
Page(s): 2693-2700
This paper proposes a 7T SRAM that provides a block-level simultaneous copy feature. The proposed SRAM can be used for data transfer between local memories, for example in checkpoint data storage and transactional memory. The 1-Mb SRAM comprises 32-kb blocks, in which 16-kb data can be copied in 33.3 ns at 1.2 V. The proposed scheme reduces the energy consumed in copying by 92.7% compared to the conventional read-modify-write manner. By applying the proposed scheme to transactional memory, the number of write-back cycles can potentially be reduced by 98.7% compared with the conventional memory system.
An Investigation on Crosstalk in Multi-Core Fibers by Introducing Random Fluctuation along Longitudinal Direction
Katsuhiro TAKENAGA Yoko ARAKAWA Shoji TANIGAWA Ning GUAN Shoichiro MATSUO Kunimasa SAITOH Masanori KOSHIBA

PAPER

Vol: E94-B No:2
Page(s): 409-416
The length dependence of the crosstalk in multi-core fibers has been investigated by introducing random fluctuation along the longitudinal direction. The power coupling coefficients in the coupled-power theory for a heterogeneous multi-core fiber with seven cores were estimated based on consideration of the power coupling coefficients of the homogeneous multi-core fiber. The crosstalk can be quantitatively evaluated by employing coupled-power theory instead of coupled-mode theory.
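For orientation, the general form of the coupled-power equations can be written as below. The symbols are assumptions chosen for illustration (P_m is the average power in core m, h_mn is the power coupling coefficient between cores m and n, z is the position along the fiber, L is the fiber length); the two-core closed form assumes a constant coupling coefficient h and power launched into one core only, and is not a result quoted from the paper.

```latex
\frac{dP_m}{dz} = \sum_{n \neq m} h_{mn}\,\bigl[P_n(z) - P_m(z)\bigr]

% Two cores, constant h, P_1(0) = P_0, P_2(0) = 0:
P_1(z) = \tfrac{P_0}{2}\bigl(1 + e^{-2hz}\bigr), \qquad
P_2(z) = \tfrac{P_0}{2}\bigl(1 - e^{-2hz}\bigr)

\mathrm{XT}(L) = \frac{P_2(L)}{P_1(L)} = \tanh(hL) \approx hL \quad (hL \ll 1)
```

Under these assumptions the crosstalk grows approximately linearly with fiber length while hL is small, which is the kind of length dependence the abstract refers to.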
Photonic Crystal Multi-Core Fibers for Future High-Capacity Transmission Systems Open Access
Kazunori MUKASA Katsunori IMAMURA Yukihiro TSUCHIDA Ryuichi SUGIZAKI

INVITED PAPER

Vol: E94-B No:2
Page(s): 376-383
This paper describes recent developments in photonic crystal fibers (PCFs), which can realize ultra-wide-band transmission or large Aeff, as well as photonic crystal multi-core fibers (PC-MCFs), which have great potential as future high-capacity transmission lines using Space Division Multiplexing.
Energy-Saving Stochastic Scheduling of a Real-Time Parallel Task with Varying Computation Amount on Multi-Core Processors
Wan Yeon LEE Kyong Hoon KIM

LETTER-Systems and Control

Vol: E94-A No:2
Page(s): 842-845
The proposed scheduling scheme minimizes the mean energy consumption of a real-time parallel task, where the task has a probabilistic computation amount and can be executed concurrently on multiple cores. The scheme determines a pertinent number of cores to allocate to the task execution and the instantaneous frequency supplied to the allocated cores. Evaluation shows that the scheme saves a substantial amount of energy compared with the previous method, which minimizes the mean energy consumption on a single core.
Task Allocation with Algorithm Transformation for Reducing Data-Transfer Bottlenecks in Heterogeneous Multi-Core Processors: A Case Study of HOG Descriptor Computation
Hasitha Muthumala WAIDYASOORIYA Daisuke OKUMURA Masanori HARIYAMA Michitaka KAMEYAMA

PAPER-High-Level Synthesis and System-Level Design

Vol: E93-A No:12
Page(s): 2570-2580
Heterogeneous multi-core processors are attractive for media processing applications because they can draw on the strengths of different cores to improve the overall performance. However, data-transfer bottlenecks and limitations in task allocation caused by accelerator-incompatible operations prevent us from exploiting the full potential of heterogeneous multi-core processors. This paper presents a task allocation method based on algorithm transformation to increase the freedom of task allocation. We use approximation methods such as CORDIC algorithms to map the accelerator-incompatible operations to accelerator cores. According to the experimental results using HOG descriptor computation, the proposed task allocation method reduces the data transfer time by more than 82% and the total processing time by more than 79% compared to the conventional task allocation method.
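As an illustration of the kind of approximation mentioned above, the following is a minimal sketch of CORDIC in vectoring mode, which produces the magnitude and orientation of a gradient vector using only add and scale-by-power-of-two steps. It is written in floating-point Python for readability; the function name, the iteration count, and the restriction to x > 0 are assumptions, and the paper's actual mapping onto accelerator cores is not described in the abstract.

```python
import math


def cordic_vectoring(x, y, iterations=16):
    """Approximate (magnitude, angle in radians) of the vector (x, y), x > 0.

    Each step rotates the vector towards the x-axis by atan(2**-i).
    In hardware the multiplications by 2**-i become bit shifts and the
    atan values come from a small precomputed table.
    """
    angle = 0.0
    for i in range(iterations):
        shift = 2.0 ** -i
        if y > 0:
            x, y = x + y * shift, y - x * shift  # rotate clockwise
            angle += math.atan(shift)
        else:
            x, y = x - y * shift, y + x * shift  # rotate counter-clockwise
            angle -= math.atan(shift)
    # The rotations stretch the vector by a constant gain; divide it out.
    gain = math.prod(math.sqrt(1 + 4.0 ** -i) for i in range(iterations))
    return x / gain, angle
```

For example, `cordic_vectoring(3.0, 4.0)` approximates the magnitude 5 and the angle `atan2(4, 3)`, with accuracy improving as the iteration count grows.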
Efficient Parallel Learning of Hidden Markov Chain Models on SMPs
Lei LI Bin FU Christos FALOUTSOS

INVITED PAPER

Vol: E93-D No:6
Page(s): 1330-1342
Quad-core CPUs have become a common desktop configuration in today's offices. The increasing number of processors on a single chip opens new opportunities for parallel computing. Our goal is to make use of multi-core as well as multi-processor architectures to speed up large-scale data mining algorithms. In this paper, we present a general parallel learning framework, Cut-And-Stitch, for training hidden Markov chain models. In particular, we propose two model-specific variants, CAS-LDS for learning linear dynamical systems (LDS) and CAS-HMM for learning hidden Markov models (HMM). Our main contribution is a novel method to handle the data dependencies due to the chain structure of hidden variables, so as to parallelize the EM-based parameter learning algorithm. We implement CAS-LDS and CAS-HMM using OpenMP on two supercomputers and a quad-core commercial desktop. The experimental results show that parallel algorithms using Cut-And-Stitch achieve comparable accuracy and almost linear speedups over the traditional serial version.
A Performance/Energy Analysis and Optimization of Multi-Core Architectures with Voltage Scaling Techniques
Jeong-Gun LEE Wook SHIN Suk-Jin KIM Eun-Gu JUNG

PAPER-VLSI Design Technology and CAD

Vol: E93-A No:6
Page(s): 1215-1225
In this paper, we develop asymptotic analysis and simulation models to better understand the characteristics of performance and energy consumption in a multi-core processor design in which dynamic voltage scaling is used. Our asymptotic model uses Amdahl's law, Rent's rule and power equations to derive the optimum number of cores and their voltage levels. Our model can predict the possible impact of different multi-core processor configurations on the performance and energy consumption for given workload characteristics (e.g. available parallelism) and process technology parameters (e.g. ratios of dynamic and static energies to total energy). Through the asymptotic analysis and optimization based on the models, we can observe an asymptotic relationship between design parameters such as "the number of cores," "core size" and "voltage scaling strategies" of a multi-core architecture with regard to performance and energy consumption at an initial phase of the design.
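Two of the standard relations named in the abstract are shown below in their textbook forms. The symbols are assumptions for illustration (p is the parallelizable fraction of the workload, N the number of cores, α the switching activity factor, C the switched capacitance, V the supply voltage, f the clock frequency); the paper's own model, including its use of Rent's rule, is not given in the abstract and is not reproduced here.

```latex
% Amdahl's law: speedup on N cores with parallel fraction p
S(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}, \qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}

% Dynamic power of CMOS logic
P_{\mathrm{dyn}} = \alpha\, C\, V^{2} f
```

Together these show the basic trade-off: adding cores yields diminishing speedup once the serial fraction dominates, while lowering the voltage reduces dynamic power quadratically.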
Multi-Core/Multi-IP Technology for Embedded Applications Open Access
Naohiko IRIE Toshihiro HATTORI

INVITED PAPER

Vol: E92-C No:10
Page(s): 1232-1239
SoCs have driven the evolution of embedded systems and consumer electronics. Multi-core/multi-IP is the key technology for integrating many functions on an SoC for future embedded applications. In this paper, the transition of SoCs and their required functions is described, using cellular phones as an example. State-of-the-art multi-core technologies of both homogeneous and heterogeneous types are also presented. When many cores and IPs are integrated on a chip, collaboration between cores and IPs becomes important for meeting requirements. To realize this, the "MPSoC Platform" concept and the elementary technologies for this platform are described.
