IEICE global.ieice.org Site

Keyword Search Result

[Keyword] High performance (30 hits)

Hits 1-20 of 30

Hybrid Electrical/Optical Switch Architectures for Training Distributed Deep Learning in Large-Scale
Thao-Nguyen TRUONG Ryousei TAKANO

PAPER-Information Network

Publicized:
2021/04/23
Vol:
E104-D No:8
Page(s):
1332-1339
Data parallelism is the dominant method used to train deep learning (DL) models on High-Performance Computing systems such as large-scale GPU clusters. When training a DL model on a large number of nodes, inter-node communication becomes a bottleneck because of its relatively higher latency and lower link bandwidth compared with intra-node communication. Although some communication techniques have been proposed to cope with this problem, all of these approaches aim to deal with the large message size issue while diminishing the effect of the limitation of the inter-node network. In this study, we investigate the benefit of increasing inter-node link bandwidth by using hybrid switching systems, i.e., Electrical Packet Switching and Optical Circuit Switching. We found that the typical data transfer of synchronous data-parallel training is long-lived and rarely changes, so it can be sped up with optical switching. Simulation results on the SimGrid simulator show that our approach speeds up the training of deep learning applications, especially at large scale.
A Generalized Theory Based on the Turn Model for Deadlock-Free Irregular Networks
Ryuta KAWANO Ryota YASUDO Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO

PAPER-Computer System

Publicized:
2019/10/08
Vol:
E103-D No:1
Page(s):
101-110
Recently proposed irregular networks can reduce the latency for both on-chip and off-chip systems with a large number of computing nodes and thus can improve the performance of parallel applications. However, these networks usually suffer from deadlocks in routing packets when using a naive minimal path routing algorithm. To solve this problem, we focus on a recently proposed theory that generalizes the turn model to maintain the network performance with deadlock-freedom. Applying its theorems to arbitrary topologies, including fully irregular networks, remains a challenge. In this paper, we advance the theorems to completely general ones. Moreover, we provide a feasible implementation of a deadlock-free routing method based on our advanced theorem. Experimental results show that the routing method based on our proposed theorem can improve the network throughput by up to 138% compared to a conventional deterministic minimal routing method. Moreover, when utilized as the escape path in Duato's protocol, it can improve the throughput by up to 26.3% compared with the conventional up*/down* routing.
Improving Per-Node Computing Efficiency by an Adaptive Lock-Free Scheduling Model
Zhishuo ZHENG Deyu QI Naqin ZHOU Xinyang WANG Mincong YU

PAPER-Fundamentals of Information Systems

Publicized:
2018/07/06
Vol:
E101-D No:10
Page(s):
2423-2435
Job scheduling on many-core computers with tens or even hundreds of processing cores is one of the key technologies in High Performance Computing (HPC) systems. Although many scheduling algorithms have been proposed, scheduling remains a challenge for efficiently executing jobs that are assigned to a single computing node with diverse scheduling objectives. Moreover, the increasing scale and the need for rapid response to changing requirements are hard to meet with existing scheduling models in an HPC node. To address these issues, we propose a novel adaptive scheduling model that is applied to a single node with a many-core processor; this model solves the problems of scheduling efficiency and scalability through an adaptive optimistic control mechanism. This mechanism exposes information such that all the cores are provided with jobs and the tools necessary to take advantage of that information and thus compete for resources in an uncoordinated manner. At the same time, the mechanism is equipped with adaptive control, allowing it to adjust the number of running tools dynamically when conflicts occur frequently. We justify this scheduling model and present the simulation results for synthetic and real-world HPC workloads, in which we compare our proposed model with two widely used scheduling models, i.e., multi-path monolithic and two-level scheduling. The proposed approach outperforms the other models in scheduling efficiency and scalability. Our results demonstrate that the adaptive optimistic control affords significant improvements for HPC workloads in the parallelism of the node-level scheduling model and performance.
G-HBase: A High Performance Geographical Database Based on HBase
Hong Van LE Atsuhiro TAKASU

PAPER

Publicized:
2018/01/18
Vol:
E101-D No:4
Page(s):
1053-1065
With the recent explosion of geographic data generated by smartphones, sensors, and satellites, a data store that can handle the massive volume of data and support computationally intensive spatial queries is becoming essential. Although key-value stores efficiently handle large-scale data, they are not equipped with effective functions for supporting geographic data. To solve this problem, in this paper, we present G-HBase, a high-performance geographical database based on HBase, a standard key-value store. To index geographic data, we first use Geohash as the rowkey in HBase. Then, we present a novel partitioning method, namely binary Geohash rectangle partitioning, to support spatial queries. Our extensive experiments on real datasets have demonstrated improved performance for k nearest neighbors and range queries in G-HBase when compared with SpatialHadoop, a state-of-the-art framework with native support for spatial data. We also observed that the performance of spatial join in G-HBase is on par with SpatialHadoop and outperforms the SJMR algorithm in HBase.
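For readers unfamiliar with Geohash, the following is a minimal sketch of the standard Geohash encoding that the abstract above refers to. It is not taken from the paper: the function name `geohash_encode` and its `precision` parameter are illustrative assumptions, and the abstract does not say how G-HBase lays out the rest of the rowkey.

```python
# Standard Geohash base-32 alphabet (digits plus letters, without a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a Geohash string.

    Hypothetical helper for illustration only; the paper's actual
    rowkey construction is not described in the abstract.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0        # accumulates 5 bits per output character
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Because each extra character only refines the same cell, a shorter Geohash is always a prefix of a longer one for the same point. In a store that keeps rows sorted by key, this tends to place nearby points in nearby rows, which is presumably why a Geohash is attractive as a rowkey.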
A Layout-Oriented Routing Method for Low-Latency HPC Networks
Ryuta KAWANO Hiroshi NAKAHARA Ikki FUJIWARA Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO

PAPER-Interconnection networks

Publicized:
2017/07/14
Vol:
E100-D No:12
Page(s):
2796-2807
End-to-end network latency has become an important issue for parallel applications on large-scale high performance computing (HPC) systems. It has been reported that randomly-connected inter-switch networks can lower the end-to-end network latency. This latency reduction is established in exchange for a large amount of routing information. That is, minimal routing on irregular networks is achieved by using routing tables for all destinations in the networks. In this work, a novel distributed routing method called LOREN (Layout-Oriented Routing with Entries for Neighbors) is proposed to achieve low latency with a small routing table for irregular networks whose link length is limited. The routing tables contain both physically and topologically nearby neighbor nodes to ensure livelock-freedom and a small number of hops between nodes. Experimental results show that LOREN reduces the average latencies by 5.8% and improves the network throughput by up to 62% compared with a conventional compact routing method. Moreover, the number of required routing table entries is reduced by up to 91%, which improves scalability and flexibility for implementation.
A Novel Channel Assignment Method to Ensure Deadlock-Freedom for Deterministic Routing
Ryuta KAWANO Hiroshi NAKAHARA Seiichi TADE Ikki FUJIWARA Hiroki MATSUTANI Michihiro KOIBUCHI Hideharu AMANO

PAPER-Computer System

Publicized:
2017/05/19
Vol:
E100-D No:8
Page(s):
1798-1806
Inter-switch networks for HPC systems and data centers can be improved by applying random shortcut topologies with a reduced number of hops. With minimal routing in such networks, however, deadlock-freedom is not guaranteed. Multiple Virtual Channels (VCs) are efficiently used to avoid this problem. However, previous works do not provide good trade-offs between the number of required VCs and the time and memory complexities of an algorithm. In this work, a novel and fast algorithm, named ACRO, is proposed to provide arbitrary routing functions with deadlock-freedom while consuming a small number of VCs. A heuristic approach to reduce VCs is achieved with a hash table, which improves the scalability of the algorithm compared with our previous work. Moreover, experimental results show that ACRO can reduce the average number of VCs by up to 63% when compared with a conventional algorithm that has the same time complexity. Furthermore, ACRO reduces the time complexity by a factor of O(|N|⋅log|N|) when compared with another conventional algorithm that requires almost the same number of VCs.
EDISON Science Gateway: A Cyber-Environment for Domain-Neutral Scientific Computing
Hoon RYU Jung-Lok YU Duseok JIN Jun-Hyung LEE Dukyun NAM Jongsuk LEE Kumwon CHO Hee-Jung BYUN Okhwan BYEON

PAPER-Scientific Application

Vol:
E97-D No:8
Page(s):
1953-1964
We discuss a new high performance computing service (HPCS) platform that has been developed to provide domain-neutral computing service with governmental support from the “EDucation-research Integration through Simulation On the Net” (EDISON) project. With a first focus on technical features, we not only present in-depth explanations of the implementation details, but also describe the strengths of the EDISON platform against the successful nanoHUB.org gateway. To validate the performance and utility of the platform, we provide benchmarking results for the resource virtualization framework, and demonstrate the stability and promptness of the EDISON platform in processing simulation requests by analyzing several statistical datasets obtained from a three-month trial service in the initial area of computational nanoelectronics. We firmly believe that this work provides a good opportunity for understanding the science gateway project ongoing for the first time in the Republic of Korea, and that the technical details presented here can serve as a useful guideline for any potential designs of HPCS platforms.
Digital Calibration and Correction Methods for CMOS Analog-to-Digital Converters (Open Access)
Shiro DOSHO

INVITED PAPER

Vol:
E95-C No:4
Page(s):
421-431
Along with the miniaturization of CMOS-LSIs, control methods for LSIs have been extensively developed. The predominant method is to digitize observed values as early as possible and to use digital control. Thus, many types of analog-to-digital converters (ADCs) have been developed, such as temperature, time, delay, and frequency converters. ADCs are the easiest circuits into which digital correction methods can be introduced because their outputs are digital. Various types of calibration methods have been developed, which have markedly improved the figures of merit by alleviating margins for device variations. These calibration and correction methods not only overcome a circuit's weak points but also give us the chance to develop entirely new circuit topologies and systems. In this paper, several digital calibration and correction methods for major analog-to-digital converters are described, such as pipelined ADCs, delta-sigma ADCs, and successive approximation ADCs.
An Energy-Efficient Full Adder Cell Using CNFET Technology
Mohammad Reza RESHADINEZHAD Mohammad Hossein MOAIYERI Kaivan NAVI

PAPER-Electronic Circuits

Vol:
E95-C No:4
Page(s):
744-751
The reduction in the gate length of current devices to 65 nm causes their I-V characteristics to depart from those of traditional MOSFETs. As a result, manufacturing of new efficient devices at the nanoscale is inevitable. The fundamental properties of metallic and semiconducting carbon nanotubes (CNTs) make them alternatives to conventional silicon-based devices. In this paper, an ultra-high-speed and energy-efficient full adder is proposed, using Carbon Nanotube Field Effect Transistors (CNFETs) at the nanoscale. Extensive simulation results using HSPICE are reported to show that the proposed adder consumes less power and is faster than the previous adders.
Several Types of Antennas Composed of Microwave Metamaterials (Open Access)
Tie Jun CUI Xiao-Yang ZHOU Xin Mi YANG Wei Xiang JIANG Qiang CHENG Hui Feng MA

INVITED PAPER

Vol:
E94-B No:5
Page(s):
1142-1152
We present a review of several types of microwave antennas made of metamaterials, including resonant electrically small antennas, metamaterial-substrate patch antennas, metamaterial flat-lens antennas, and Luneburg lens antennas. In particular, we propose a new type of conformal antenna using anisotropic zero-index metamaterials, which has high gain and low sidelobes. Numerical simulations and experimental results show that metamaterials have unique properties for designing new antennas with high performance.
Heuristic Sizing Methodology for Designing High-Performance CMOS Level Converters with Balanced Rise and Fall Delays
Jinn-Shyan WANG Yu-Juey CHANG Chingwei YEH

BRIEF PAPER-Electronic Circuits

Vol:
E93-C No:10
Page(s):
1540-1543
CMOS SoCs can reduce power consumption by adopting voltage scaling (VS) technologies, where a level converter (LC) is required between voltage domains to avoid dc current. However, the LC often induces a high delay penalty and usually results in non-balanced rise and fall delays. Therefore, the performance of the LC strongly affects the effectiveness of VS technologies. In this paper, a heuristic sizing methodology for designing a state-of-the-art LC is developed and proposed. Using the proposed methodology, we can design the LC to achieve high performance with balanced rise and fall delay times in a deterministic way.
A High Performance and Low Bandwidth Multi-Standard Motion Compensation Design for HD Video Decoder
Xianmin CHEN Peilin LIU Dajiang ZHOU Jiayi ZHU Xingguang PAN Satoshi GOTO

PAPER

Vol:
E93-C No:3
Page(s):
253-260
Motion compensation is widely used in many video coding standards. Due to its bandwidth requirement and complexity, motion compensation is one of the most challenging parts in the design of a high-definition video decoder. In this paper, we propose a high performance and low bandwidth motion compensation design, which supports the H.264/AVC, MPEG-1/2 and Chinese AVS standards. We introduce a 2-dimensional cache that can greatly reduce the external bandwidth requirement. Similarities among the 3 standards are also explored to reduce hardware cost. We also propose a block-pipelining strategy to conceal the long latency of external memory access. Experimental results show that our motion compensation design can reduce the bandwidth by 74% on average and can decode a 1920x1088@30 fps video stream in real time at 80 MHz.
A Conditional Isolation Technique for Low-Energy and High-Performance Wide Domino Gates
How-Rern LIN Wei-Hao CHIU Tsung-Yi WU

PAPER

Vol:
E92-C No:4
Page(s):
386-390
A new conditional isolation technique (CI-Domino) in domino logic is proposed for wide domino gates. This technique not only reduces the subthreshold and gate oxide leakage currents simultaneously without sacrificing circuit performance, but can also be used to shorten the evaluation time of the domino gate. Simulations on high fan-in domino OR gates with a 0.18 µm process technology show that the proposed technique reduces total static power by 36%, dynamic power by 49.14%, and delay time by 60.27% compared to the conventional domino gate. The proposed technique also improves leakage tolerance by about 48.14%.
Robust Noise Suppression Algorithm with the Kalman Filter Theory for White and Colored Disturbance
Nari TANABE Toshihiro FURUKAWA Shigeo TSUJII

PAPER-Digital Signal Processing

Vol:
E91-A No:3
Page(s):
818-829
We propose a noise suppression algorithm with the Kalman filter theory. The algorithm aims to achieve robust noise suppression for the additive white and colored disturbance from the canonical state space models with (i) a state equation composed of the speech signal and (ii) an observation equation composed of the speech signal and additive noise. The remarkable features of the proposed algorithm are (1) its applicability to adaptive white and colored noises, where babble noise is used as the additive colored noise, and (2) the realization of high-performance noise suppression without sacrificing the high quality of the speech signal, despite simple noise suppression using only the Kalman filter algorithm, whereas many conventional methods based on the Kalman filter theory usually perform noise suppression using both a parameter estimation algorithm for an AR (auto-regressive) system and the Kalman filter algorithm. We show the effectiveness of the proposed method, which utilizes the Kalman filter theory for the proposed canonical state space model with the colored driving source, using numerical results and subjective evaluation results.
Cache Efficient Radix Sort for String Sorting
Waihong NG Katsuhiko KAKEHI

PAPER-Algorithms and Data Structures

Vol:
E90-A No:2
Page(s):
457-466
In this paper, we propose CRadix sort, a new string sorting algorithm based on MSD radix sort. CRadix sort causes fewer cache misses than MSD radix sort by uniquely associating a small block of main memory, called the key buffer, with each key and temporarily storing a portion of each key in its corresponding key buffer. Running-time comparisons with other string sorting algorithms are provided to show the effectiveness of CRadix sort.
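As background for the abstract above, here is a minimal sketch of plain MSD (most-significant-digit-first) radix sort for strings, the baseline that CRadix sort builds on. It is an illustrative assumption, not the paper's algorithm: it does not implement the key buffers described in the abstract, and the name `msd_radix_sort` is hypothetical.

```python
def msd_radix_sort(keys, depth=0):
    """Sort strings by bucketing on the character at position `depth`.

    Baseline MSD radix sort only; the key-buffer optimization that
    CRadix sort adds is not modeled here.
    """
    if len(keys) <= 1:
        return list(keys)
    exhausted = []  # keys with no character at this depth sort first
    buckets = {}    # character -> keys sharing that character at `depth`
    for key in keys:
        if len(key) == depth:
            exhausted.append(key)
        else:
            buckets.setdefault(key[depth], []).append(key)
    result = exhausted
    # Visit buckets in character order and sort each on the next position.
    for ch in sorted(buckets):
        result.extend(msd_radix_sort(buckets[ch], depth + 1))
    return result
```

In a plain MSD radix sort like this one, every pass reads a character from each key wherever that key happens to live in memory. The abstract attributes CRadix sort's reduction in cache misses to keeping a portion of each key in a small per-key buffer.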
Algorithmic Concept Recognition to Support High Performance Code Reengineering
Beniamino DI MARTINO

PAPER-Software Support and Optimization Techniques

Vol:
E87-D No:7
Page(s):
1743-1750
Techniques for automatic program recognition at the algorithmic level could be of high interest for the area of Software Maintenance, in particular for knowledge-based reengineering, because the selection of suitable restructuring strategies is mainly driven by algorithmic features of the code. In this paper, an automated hierarchical concept parsing recognition technique and a formalism for the specification of algorithmic concepts are presented. Based on this technique, the design and development of ALCOR, a production-rule-based system for automatic recognition of algorithmic concepts within programs, aimed at supporting knowledge-based reengineering for high performance, is presented.
A Novel Timing-Driven Global Routing Algorithm Considering Coupling Effects for High Performance Circuit Design
Jingyu XU Xianlong HONG Tong JING Yici CAI Jun GU

PAPER-Place and Routing

Vol:
E86-A No:12
Page(s):
3158-3167
As CMOS technology enters the very deep submicron era, inter-wire coupling capacitance becomes the dominant part of load capacitance. The coupling effects have brought new challenges to routing algorithms in both delay estimation and optimization. In this paper, we propose a timing-driven global routing algorithm that considers coupling effects. Our two-phase algorithm, based on the timing-relax method, includes a heuristic Steiner tree algorithm to guarantee the timing performance of the initial solution and an optimization algorithm based on coupling-effect-transference. Experimental results are given to demonstrate the efficiency and accuracy of the algorithm.
The Development of the Earth Simulator
Shinichi HABATA Mitsuo YOKOKAWA Shigemune KITAWAKI

INVITED PAPER

Vol:
E86-D No:10
Page(s):
1947-1954
The Earth Simulator (ES), developed under the Japanese government's initiative "Earth Simulator project," is a highly parallel vector supercomputer system. In May 2002, the ES was proven to be the most powerful computer in the world by achieving 35.86 teraflops on the LINPACK benchmark and 26.58 teraflops for a global atmospheric circulation model with the spectral method. Three architectural features enabled these achievements: vector processors, shared memory, and a high-bandwidth non-blocking crossbar interconnection network. In this paper, an overview of the ES, the three architectural features, and the results of the performance evaluation are described, with particular attention to the hardware realization of the interconnection among 640 processor nodes.
Selective Multi-Threshold Technique for High-Performance and Low-Standby Applications
Kimiyoshi USAMI Naoyuki KAWABE Masayuki KOIZUMI Katsuhiro SETA Toshiyuki FURUSAWA

PAPER-Optimization of Power and Timing

Vol:
E85-A No:12
Page(s):
2667-2673
In portable applications such as W-CDMA cell phones, high performance and low standby leakage are both required. We propose an automated design technique to selectively use multi-threshold CMOS (MTCMOS) in a cell-by-cell fashion. MT cells consisting of low-Vth transistors and high-Vth sleep transistors are newly introduced. MT cells are assigned to critical paths to speed them up, while high-Vth cells are assigned to non-critical paths to reduce leakage. Compared to the conventional MTCMOS, the gate delay is not affected by the discharge patterns of other gates because there is no virtual ground to be shared. We applied this technique to a test chip of a DSP core for a W-CDMA baseband LSI. The worst path delay was improved by 14% over the single high-Vth design without increasing standby leakage, at a 10% area overhead.
Trends in High-Performance, Low-Power Cache Memory Architectures
Koji INOUE Vasily G. MOSHNYAGA Kazuaki MURAKAMI

PAPER-High-Performance Technologies

Vol:
E85-C No:2
Page(s):
304-314
One of the uncompromising requirements of portable computing is energy efficiency, because it directly affects battery life. On the other hand, portable computing will target more demanding applications, for example moving pictures, so higher performance is still required. Cache memories have been employed as one of the most important components of computer systems. In this paper, we briefly survey architectural techniques for high-performance, low-power cache memories.
