IEICE global.ieice.org Site

Keyword Search Result

[Keyword] CPU(15hit)

1-15hit

Real-Time Image Processing Based on Service Function Chaining Using CPU-FPGA Architecture
Yuta UKON Koji YAMAZAKI Koyo NITTA

PAPER-Network System

Publicized: 2019/08/05
Vol: E103-B No: 1
Page(s): 11-19
Advanced information-processing services based on cloud computing are in great demand. However, users want to be able to customize cloud services for their own purposes. To provide image-processing services that can be optimized for the purpose of each user, we propose a technique for chaining image-processing functions in a CPU-field programmable gate array (FPGA) coupled server architecture. One of the most important requirements for combining multiple image-processing functions on a network is low latency in server nodes. However, a large delay occurs in the conventional CPU-FPGA architecture due to the overheads of packet reordering for ensuring the correctness of image processing and of data transfer between the CPU and FPGA at the application level. This paper presents a CPU-FPGA server architecture with a real-time packet reordering circuit for low-latency image processing. To confirm the efficiency of our idea, we evaluated the latency of histogram of oriented gradients (HOG) feature calculation as an offloaded image-processing function. The results show that the latency is about 26 times lower than that of the conventional CPU-FPGA architecture. Moreover, the throughput decreased by less than 3.7% under the worst-case condition where 90% of the packets are randomly swapped at a 40-Gbps input rate. Finally, we demonstrated that a real-time video monitoring service can be provided by combining image-processing functions using our architecture.
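The packet reordering step mentioned in this abstract can be illustrated with a minimal software sketch. It assumes each packet carries a sequence number starting at 0; the function name `reorder` and the packet layout are hypothetical, and the paper's actual mechanism is a hardware circuit that this sketch does not model.

```python
def reorder(packets):
    """Yield payloads in sequence order from (seq, payload) pairs that may arrive out of order."""
    pending = {}   # packets that arrived before their predecessors
    next_seq = 0   # next sequence number allowed to leave the buffer (assumed to start at 0)
    for seq, payload in packets:
        pending[seq] = payload
        # Release every packet that is now contiguous with what was already emitted.
        while next_seq in pending:
            yield pending.pop(next_seq)
            next_seq += 1


# Hypothetical arrival order with neighbouring packets swapped.
arrived = [(1, "b"), (0, "a"), (3, "d"), (2, "c")]
in_order = list(reorder(arrived))
```

The buffer only grows while a gap is outstanding, so the added delay is limited to the time spent waiting for the missing packet.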
Reducing CPU Power Consumption with Device Utilization-Aware DVFS for Low-Latency SSDs
Satoshi IMAMURA Eiji YOSHIDA Kazuichi OE

PAPER-Computer System

Publicized: 2019/06/18
Vol: E102-D No: 9
Page(s): 1740-1749
Emerging solid state drives (SSDs) based on a next-generation memory technology have recently been released on the market. In this work, we call them low-latency SSDs because their device latency is an order of magnitude lower than that of conventional NAND flash SSDs. Although low-latency SSDs can drastically reduce the I/O latency perceived by an application, the overhead of OS processing included in the I/O latency has become noticeable because of the very low device latency. Since the OS processing is executed on a CPU core, the core's operating frequency should be maximized to reduce the OS overhead. However, a higher core frequency causes higher CPU power consumption during I/O accesses to low-latency SSDs. Therefore, we propose the device utilization-aware DVFS (DU-DVFS) technique, which periodically monitors the utilization of a target block device and applies dynamic voltage and frequency scaling (DVFS) to CPU cores executing I/O-intensive processes only when the block device is fully utilized. In this case, DU-DVFS can reduce the CPU power consumption without hurting performance because the delay of OS processing incurred by decreasing the core frequency can be hidden. Our evaluation with 28 I/O-intensive workloads on a real server containing an Intel® Optane™ SSD demonstrates that DU-DVFS reduces the CPU power consumption by 41.4% on average (up to 53.8%) with negligible performance degradation, compared to a standard DVFS governor on Linux. Moreover, the evaluation with multiprogrammed workloads composed of I/O-intensive and non-I/O-intensive programs shows that DU-DVFS is also effective for them because it can apply DVFS only to CPU cores executing I/O-intensive processes.
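The decision rule described in this abstract (lower the core frequency only while the block device is fully utilized) might be sketched as follows. The function name, the 0.99 threshold, and the two frequency values are illustrative assumptions; how utilization is sampled and how frequencies are actually applied on a real system are not modeled here.

```python
def choose_core_frequency(device_utilization, f_min_mhz, f_max_mhz, full_threshold=0.99):
    """Pick a core frequency for one monitoring period.

    device_utilization: fraction of the period the block device was busy (0.0 to 1.0).
    full_threshold: assumed cut-off above which the device counts as fully utilized.
    """
    if device_utilization >= full_threshold:
        # The device is the bottleneck, so slower OS processing can be hidden.
        return f_min_mhz
    # Otherwise keep the core fast so OS overhead does not add to I/O latency.
    return f_max_mhz


# Hypothetical utilization samples from five consecutive monitoring periods.
samples = [0.35, 0.80, 1.00, 1.00, 0.60]
plan = [choose_core_frequency(u, 1200, 3000) for u in samples]
```

A practical policy would probably step through intermediate frequencies rather than switch between two levels; the two-level rule is used here only to keep the idea visible.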
Real-Time and Energy-Efficient Face Detection on CPU-GPU Heterogeneous Embedded Platforms
Chanyoung OH Saehanseul YI Youngmin YI

PAPER-Real-time Systems

Publicized: 2018/09/18
Vol: E101-D No: 12
Page(s): 2878-2888
As energy efficiency has become a major design constraint or objective, heterogeneous manycore architectures have emerged as mainstream target platforms not only in server systems but also in embedded systems. Manycore accelerators such as GPUs are also becoming popular in embedded domains, as are heterogeneous CPU cores. However, as the number of cores in an embedded GPU is far smaller than that of a server GPU, it is important to utilize both heterogeneous multi-core CPUs and GPUs to achieve the desired throughput with minimal energy consumption. In this paper, we present a case study of mapping LBP-based face detection onto a recent CPU-GPU heterogeneous embedded platform, which exploits both task parallelism and data parallelism to achieve maximal energy efficiency under a real-time constraint. We first present the parallelization technique of each task for GPU execution, then we propose performance and energy models for both task-parallel and data-parallel executions on heterogeneous processors, which are used in design space exploration for the optimal mapping. The design space is huge since it covers not only processor heterogeneity such as CPU-GPU and big.LITTLE, but also various data partitioning ratios for the data-parallel execution on these heterogeneous processors. In our case study of LBP face detection on Exynos 5422, the estimation errors of the proposed performance and energy models were on average -2.19% and -3.67%, respectively. By systematically finding the optimal mappings with the proposed models, we could achieve 28.6% less energy consumption compared to the manual mapping, while still meeting the real-time constraint.
A GPU-Based Rasterization Algorithm for Boolean Operations on Polygons
Yi GAO Jianxin LUO Hangping QIU Bin TANG Bo WU Weiwei DUAN

LETTER-Fundamentals of Information Systems

Publicized: 2017/09/29
Vol: E101-D No: 1
Page(s): 234-238
This paper presents a new GPU-based rasterization algorithm for Boolean operations that handles arbitrary closed polygons. We construct an efficient data structure for interoperation of the CPU and GPU and propose a fast GPU-based contour extraction method to ensure the performance of our algorithm. We then design a novel traversing strategy to achieve an error-free calculation of intersection points for correct Boolean operations. Finally, we give a detailed evaluation, and the results show that our algorithm has higher performance than existing algorithms when processing polygons with a large number of vertices.
Optimizing Hash Join with MapReduce on Multi-Core CPUs
Tong YUAN Zhijing LIU Hui LIU

PAPER-Data Engineering, Web Information Systems

Publicized: 2016/02/04
Vol: E99-D No: 5
Page(s): 1316-1325
In this paper, we exploit the MapReduce framework and other optimizations to improve the performance of hash join algorithms on multi-core CPUs, including the no-partition hash join and the partition hash join. We first implement hash join algorithms with a shared-memory MapReduce model on multi-core CPUs, including the partition phase, build phase, and probe phase. Then we design an improved cuckoo hash table for our hash join, which consists of a cuckoo hash table and a chained hash table. Based on our implementation, we also propose two optimizations, one for the usage of SIMD instructions, and the other for the partition phase. Through experimental results and analysis, we find that the partition hash join often outperforms the no-partition hash join, and that our hash join algorithm is faster than previous work by an average of 30%.
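For readers unfamiliar with the build and probe phases named in this abstract, here is a minimal single-threaded sketch of a no-partition hash join. It uses Python's built-in dictionary in place of the paper's improved cuckoo hash table, and it omits MapReduce, SIMD, and partitioning; the function name and sample rows are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows):
    """Join two lists of (key, value) rows on key and return (key, build_value, probe_value) tuples."""
    # Build phase: index the (usually smaller) relation by join key.
    table = defaultdict(list)
    for key, value in build_rows:
        table[key].append(value)

    # Probe phase: look up each row of the other relation in the hash table.
    result = []
    for key, value in probe_rows:
        for match in table.get(key, ()):
            result.append((key, match, value))
    return result


# Hypothetical relations: customers form the build side, orders the probe side.
customers = [(1, "Ann"), (3, "Bob")]
orders = [(1, "pen"), (2, "ink"), (1, "pad")]
joined = hash_join(customers, orders)
```

A partition hash join would first split both relations by a hash of the key so that each build table fits in cache, then run these same two phases per partition.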
Blocked United Algorithm for the All-Pairs Shortest Paths Problem on Hybrid CPU-GPU Systems
Kazuya MATSUMOTO Naohito NAKASATO Stanislav G. SEDUKHIN

PAPER-Parallel and Distributed Computing

Vol: E95-D No: 12
Page(s): 2759-2768
This paper presents a blocked united algorithm for the all-pairs shortest paths (APSP) problem. This algorithm simultaneously computes both the shortest-path distance matrix and the shortest-path construction matrix for a graph. It is designed for a high-speed APSP solution on hybrid CPU-GPU systems. In our implementation, the two most compute-intensive parts of the algorithm are performed on the GPU. The first part solves the APSP sub-problem for a block of sub-matrices, and the other part is a matrix-matrix “multiplication” for the APSP problem. Moreover, the amount of data communication between CPU (host) memory and GPU memory is reduced by reusing blocks once they have been sent to the GPU. When the problem size (the number of vertices in a graph) is large enough compared to the block size, our implementation of the blocked algorithm requires exchanging three blocks between the CPU and GPU during a block computation on the GPU. We measured the performance of the algorithm implementation on two different CPU-GPU systems. A system containing an Intel Sandy Bridge CPU (Core i7 2600K) and an AMD Cayman GPU (Radeon HD 6970) achieves a performance of up to 1.1 TFlop/s in single precision.
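The idea of computing a distance matrix and a path-construction matrix in the same pass can be sketched with the plain (unblocked, CPU-only) Floyd-Warshall recurrence. This is not the paper's blocked GPU algorithm; representing the construction matrix as a next-hop matrix, and all names below, are assumptions made for illustration.

```python
INF = float("inf")


def apsp(weights):
    """Return (dist, nxt): shortest distances and next-hop matrix for a dense weight matrix."""
    n = len(weights)
    dist = [row[:] for row in weights]
    # nxt[i][j] is the vertex that follows i on a shortest path to j (None if unreachable).
    nxt = [[j if weights[i][j] != INF else None for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                through_k = dist[i][k] + dist[k][j]  # min-plus "multiplication" step
                if through_k < dist[i][j]:
                    dist[i][j] = through_k
                    nxt[i][j] = nxt[i][k]
    return dist, nxt


def path(nxt, src, dst):
    """Rebuild the vertex sequence of a shortest path from the next-hop matrix."""
    if nxt[src][dst] is None:
        return []
    hops = [src]
    while src != dst:
        src = nxt[src][dst]
        hops.append(src)
    return hops


# Hypothetical 4-vertex directed cycle 0 -> 1 -> 2 -> 3 -> 0.
graph = [
    [0, 4, INF, INF],
    [INF, 0, 1, INF],
    [INF, INF, 0, 2],
    [1, INF, INF, 0],
]
dist, nxt = apsp(graph)
route = path(nxt, 0, 3)
```

A blocked variant applies the same update to square tiles of the matrices, which is what makes the inner work resemble a matrix-matrix product suitable for a GPU.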
CPU Model-Based Mechatronics/Hardware/Software Co-design Technology for Real-Time Embedded Control Systems
Makoto ISHIKAWA George SAIKALIS Shigeru OHO

PAPER-VLSI Design Technology

Vol: E90-C No: 10
Page(s): 1992-2001
We review practical case studies of a method for developing highly reliable real-time embedded control systems using CPU model-based hardware/software co-simulation. We take an approach that enables us to fully simulate a virtual mechanical control system including a mechatronics plant, microcontroller hardware, and object code level software. This full virtual system approach simulates control system behavior, especially that of the microcontroller hardware and software. It enables design space exploration of the microarchitecture, control design validation, robustness evaluation of the system, and software optimization before component design. It also avoids potential problems. The advantage of this work is that it comprises all the components in a typical control system, enabling the designers to analyze effects across different domains, for example mechanical analysis of behavior due to differences in controller microarchitecture. To further improve system design, evaluation, and analysis, we implemented an integrated behavior analyzer in the development environment. This analyzer can graphically display processor behavior during the simulation, such as task-level CPU load, interrupt statistics, and the software variable transition chart, without affecting simulation results. It also provides useful information on the system behavior. This virtual system analysis does not require software modification, does not change the control timing, and does not require any processing power from the target microcontroller. Therefore, this method is suitable for real-time embedded control system design, in particular automotive control system design that requires a high level of reliability, robustness, quality, and safety. In this study, a Renesas SH-2A microcontroller model was developed on the CoMET™ platform from VaST Systems Technology. An electronic throttle control (ETC) system and an engine control system were chosen to prove this concept.
The electronic throttle body (ETB) model on the Saber® simulator from Synopsys® and the engine model on the MATLAB®/Simulink® simulator from MathWorks can be simulated with the SH-2A model using a newly developed co-simulation interface between MATLAB®/Simulink® and CoMET™. Though the SH-2A chip was being developed as the project was being executed, we were able to complete the OSEK OS development, control software design, and verification of the entire system using the virtual environment. After a working sample chip was released in a later stage of the project, we found that the software could run on both the actual ETC system and the engine control system without critical problems. This demonstrates that our models and simulation environment are sufficiently credible and trustworthy.
CPU Load Predictions on the Computational Grid
Yuanyuan ZHANG Wei SUN Yasushi INOGUCHI

PAPER-Grid Computing

Vol: E90-D No: 1
Page(s): 40-47
To make the best use of the resources in a shared grid environment, an application scheduler must predict the available performance on each resource. In this paper, we examine the problem of predicting available CPU performance in a time-shared grid system. We present and evaluate a new method to predict the one-step-ahead CPU load in a grid. Our prediction strategy forecasts the future CPU load based on the trend of variation over several past steps and in previous similar patterns, and uses a polynomial fitting method. Our experimental results on large load traces collected from four different kinds of machines demonstrate that this new prediction strategy achieves average prediction errors that are between 22% and 86% lower than those incurred by four previous methods.
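As a rough illustration of one-step-ahead prediction by polynomial fitting, the sketch below extrapolates the polynomial that passes exactly through the most recent samples. This is an assumption-laden stand-in: the abstract does not give the fitting details, and the paper's strategy also uses past trends and similar patterns, which are not modeled. The function name and the default degree are hypothetical.

```python
from math import comb


def predict_next(history, degree=2):
    """Predict the next sample from equally spaced load measurements.

    Fits the unique polynomial of the given degree through the last degree + 1
    samples and evaluates it one step ahead.
    """
    m = degree + 1
    if len(history) < m:
        raise ValueError("need at least degree + 1 samples")
    recent = history[-m:]
    # For a polynomial of degree d, the (d + 1)-th finite difference is zero,
    # which gives the next value as an alternating binomial sum of past values.
    return sum((-1) ** (j + 1) * comb(m, j) * recent[-j] for j in range(1, m + 1))


# Hypothetical load trace that happens to follow n^2, so a quadratic fit is exact.
quadratic_guess = predict_next([1, 4, 9, 16])
linear_guess = predict_next([10, 20], degree=1)
```

On noisy traces an exact interpolating polynomial tends to amplify noise, so a least-squares fit over a longer window would likely be the more robust choice in practice.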
A Feed-Forward Dynamic Voltage Control Algorithm for Low Power MPEG4 on Multi-Regulated Voltage CPU
Hideo OHIRA Kentaro KAWAKAMI Miwako KANAMORI Yasuhiro MORITA Masayuki MIYAMA Masahiko YOSHIMOTO

PAPER

Vol: E87-C No: 4
Page(s): 457-465
In this paper, we describe a feed-forward dynamic voltage/clock-frequency control method that enables low-power MPEG4 on a multi-regulated voltage CPU by combining the characteristics of the CPU and of the video encoding processing. This method theoretically achieves minimum power consumption, close to the hardware-level power consumption. The processing performance required for MPEG4 visual encoding depends entirely on the activity of the sequence: a high-motion sequence requires high performance and a low-motion sequence requires low performance. If the required performance is predictable, lower power consumption can be achieved by dynamically setting an adequate voltage and clock frequency at every frame. The proposed method predicts the required processing performance of a future frame using our unique feed-forward analysis method and dynamically controls the voltage and frequency at every frame according to the feed-forward analysis value. The simulation results indicate that the proposed feed-forward analysis method adequately predicts the required processing performance of every future frame and makes it possible to minimize the power consumption of software-based MPEG4 visual encoding. When the CPU has frequency-voltage characteristics of 1.8 V @400 MHz to 1.0 V @189 MHz, the proposed method reduces the power consumption by approximately 37% for high-motion sequences and 65% for low-motion sequences compared with the conventional software video encoding method.
A 100 MIPS High Speed and Low Power Digital Signal Processor
Hiroshi TAKAHASHI Shigeshi ABIKO Shintaro MIZUSHIMA Yuji OZAWA Kenichi TASHIRO Shigetoshi MURAMATSU Masahiro FUSUMADA Akemi TODOROKI Youichi TANAKA Masayasu ITOIGAWA Isao MORIOKA Hiroyuki MIZUNO Miki KOJIMA Giovanni NASO Emmanuel EGO Frank CHIRAT

PAPER

Vol: E80-C No: 12
Page(s): 1546-1552
A 100 MIPS high speed and low power fixed point Digital Signal Processor (DSP) has been developed using 0.45 µm CMOS TLM technology. The DSP contains a 16-bit 32K full CMOS static RAM with a hierarchical low power architecture. The device is a RAM-based DSP with a total of 4.2 million transistors and a new low power design and process, which enabled an approximately 50% reduction in power compared to conventional DSPs at 40 MHz. In order to cover very wide application requirements, this DSP is capable of operating at 1.0 V for the DSP core and 3.3 V for I/O. This was achieved by new level shifter circuitry to interface with cost-effective 3 V external commodity products, and an 80% power reduction was confirmed at Core VDD=2.0 V, I/O VDD=3.3 V at 40 MHz. This paper describes the new features of the high speed and low power DSP.
A Rate Regulating Scheme for Scheduling Multimedia Tasks
Kisok KONG Manhee KIM Hyogun LEE Joonwon LEE

PAPER-Computer Systems

Vol: E80-D No: 12
Page(s): 1166-1175
This paper presents a proportional-share CPU scheduler that can support multimedia applications in a general-purpose workstation environment. For this purpose, we have extended the stride scheduler, which was originally designed for conventional tasks. New scheduling parameters are introduced to specify the timing requirements of multimedia applications. Through the use of the rate regulator, the accuracy error of the scheduling is reduced to O(1). Separate task groups are proposed to represent both relative shares and absolute shares. The proposed scheduler is evaluated using a simulation study. The results show that the proposed scheduler achieves improved accuracy and adaptability as well as flexibility.
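The stride scheduler that this work extends can be summarized in a few lines. The sketch below shows only basic stride scheduling, in which each task advances a "pass" value by a stride inversely proportional to its share and the task with the smallest pass runs next. The paper's rate regulator, timing parameters, and task groups are not modeled, and the task names and ticket counts are hypothetical.

```python
import heapq

STRIDE1 = 1 << 20  # large constant from which per-task strides are derived


def stride_schedule(tickets, quanta):
    """Return how many time quanta each task receives under basic stride scheduling.

    tickets: mapping from task name to its positive integer share.
    """
    heap = []
    for name, share in tickets.items():
        stride = STRIDE1 // share
        heapq.heappush(heap, (stride, name, stride))  # entries are (pass, name, stride)
    counts = {name: 0 for name in tickets}
    for _ in range(quanta):
        pass_value, name, stride = heapq.heappop(heap)  # smallest pass runs next
        counts[name] += 1
        heapq.heappush(heap, (pass_value + stride, name, stride))
    return counts


# Hypothetical workload: a video task with three times the share of an editor.
counts = stride_schedule({"video": 3, "editor": 1}, 400)
```

With a 3:1 ticket ratio the video task receives about three quarters of the quanta, and the allocation stays close to the ideal proportion at every point in the schedule rather than only on average.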
Embedded System Cost Optimization via Data Path Width Adjustment
Barry SHACKLEFORD Mitsuhiro YASUDA Etsuko OKUSHI Hisao KOIZUMI Hiroyuki TOMIYAMA Akihiko INOUE Hiroto YASUURA

PAPER-High Level Synthesis

Vol: E80-D No: 10
Page(s): 974-981
Entire systems embedded in a chip and consisting of a processor, memory, and system-specific peripheral hardware are now commonly contained in commodity electronic devices. Cost minimization of these systems is of paramount economic importance to manufacturers of these devices. By employing a variable configuration processor in conjunction with a multi-precision compiler generator, we show that there are situations in which considerable system cost reduction can be obtained by synthesizing a CPU that is narrower than the largest variable in the application program.
A Circuit Library for Low Power and High Speed Digital Signal Processor
Hiroshi TAKAHASHI Shigeshi ABIKO Shintaro MIZUSHIMA Yuni OZAWA

PAPER

Vol: E78-C No: 12
Page(s): 1717-1725
A new high performance digital signal processor (DSP) that lowers power consumption, reduces chip count, and enables system cost savings for wireless communications applications was developed. The new device contains high performance, hard-wired functionality with a specialized instruction set to effectively implement the worldwide digital cellular standard algorithms, including GSM, PDC and NADC, and also features both full rate and future half rate processing by software modules. The device provides a wide operating voltage range from 1.5 V to 5.5 V using a 5 V process, based on the market requirement of a 5 V supply voltage, even though the power supply voltage in most applications will shift to 3 V. Several circuits were newly developed to achieve low power consumption and high speed operation at both 5 V and 3 V process using the same data base. The device also features over 50 MIPS of processing power with low power consumption and 100 nA stand-by current at either 3 V or 5 V. One remarkable advantage is a flexible CPU core approach for future spin-off devices with different ROM/RAM configurations and peripheral modules without requiring any CPU design changes. This paper describes the architecture of a low power and high speed design with effective hardware and software module implementations.
PEAS-I: A Hardware/Software Codesign System for ASIP Development
Jun SATO Alauddin Y. ALOMARY Yoshimichi HONMA Takeharu NAKATA Akichika SHIOMI Nobuyuki HIKICHI Masaharu IMAI

PAPER-Computer Aided Design (CAD)

Vol: E77-A No: 3
Page(s): 483-491
This paper describes the current implementation and experimental results of a hardware/software codesign system for ASIP (Application Specific Integrated Processor) development: the PEAS-I system. The PEAS-I system accepts a set of application programs written in the C language, an associated data set, a module database, and design constraints such as chip area and power consumption. The system then generates an optimized CPU core design in the form of an HDL description as well as a set of application program development tools such as a C compiler, an assembler, and a simulator. Another important feature of the PEAS-I system is that it can give accurate estimates of chip area and performance before the detailed design of the ASIP is completed. According to the experimental results, the PEAS-I system has been found to be highly effective and efficient for ASIP development.
A Design of Static Operatable Low-Power 16-bit Microprocessor
Hiroaki KANEKO Takashi MIYAZAKI Hideki SUGIMOTO

PAPER-Low-Voltage Operation

Vol: E75-C No: 10
Page(s): 1188-1195
This paper describes a 16-bit microprocessor that uses circuit and process technology to realize static operation with low power consumption. The microprocessor, called the V30HL, achieves four times the performance per unit of power consumption while keeping complete software/hardware compatibility with standard 16-bit microprocessors. The microprocessor also operates in the range of DC-8 MHz with a 2.7-5.5 V supply.
