The search functionality is under construction.
The search functionality is under construction.

Keyword Search Result

[Keyword] High performance(30hit)

1-20hit(30hit)

  • Hybrid Electrical/Optical Switch Architectures for Training Distributed Deep Learning in Large-Scale

    Thao-Nguyen TRUONG  Ryousei TAKANO  

     
    PAPER-Information Network

      Pubricized:
    2021/04/23
      Vol:
    E104-D No:8
      Page(s):
    1332-1339

    Data parallelism is the dominant method used to train deep learning (DL) models on High-Performance Computing systems such as large-scale GPU clusters. When training a DL model on a large number of nodes, inter-node communication becomes bottle-neck due to its relatively higher latency and lower link bandwidth (than intra-node communication). Although some communication techniques have been proposed to cope with this problem, all of these approaches target to deal with the large message size issue while diminishing the effect of the limitation of the inter-node network. In this study, we investigate the benefit of increasing inter-node link bandwidth by using hybrid switching systems, i.e., Electrical Packet Switching and Optical Circuit Switching. We found that the typical data-transfer of synchronous data-parallelism training is long-lived and rarely changed that can be speed-up with optical switching. Simulation results on the Simgrid simulator show that our approach speed-up the training time of deep learning applications, especially in a large-scale manner.

  • A Generalized Theory Based on the Turn Model for Deadlock-Free Irregular Networks

    Ryuta KAWANO  Ryota YASUDO  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Computer System

      Pubricized:
    2019/10/08
      Vol:
    E103-D No:1
      Page(s):
    101-110

    Recently proposed irregular networks can reduce the latency for both on-chip and off-chip systems with a large number of computing nodes and thus can improve the performance of parallel applications. However, these networks usually suffer from deadlocks in routing packets when using a naive minimal path routing algorithm. To solve this problem, we focus attention on a lately proposed theory that generalizes the turn model to maintain the network performance with deadlock-freedom. The theorems remain a challenge of applying themselves to arbitrary topologies including fully irregular networks. In this paper, we advance the theorems to completely general ones. Moreover, we provide a feasible implementation of a deadlock-free routing method based on our advanced theorem. Experimental results show that the routing method based on our proposed theorem can improve the network throughput by up to 138 % compared to a conventional deterministic minimal routing method. Moreover, when utilized as the escape path in Duato's protocol, it can improve the throughput by up to 26.3 % compared with the conventional up*/down* routing.

  • Improving Per-Node Computing Efficiency by an Adaptive Lock-Free Scheduling Model

    Zhishuo ZHENG  Deyu QI  Naqin ZHOU  Xinyang WANG  Mincong YU  

     
    PAPER-Fundamentals of Information Systems

      Pubricized:
    2018/07/06
      Vol:
    E101-D No:10
      Page(s):
    2423-2435

    Job scheduling on many-core computers with tens or even hundreds of processing cores is one of the key technologies in High Performance Computing (HPC) systems. Despite many scheduling algorithms have been proposed, scheduling remains a challenge for executing highly effective jobs that are assigned in a single computing node with diverse scheduling objectives. On the other hand, the increasing scale and the need for rapid response to changing requirements are hard to meet with existing scheduling models in an HPC node. To address these issues, we propose a novel adaptive scheduling model that is applied to a single node with a many-core processor; this model solves the problems of scheduling efficiency and scalability through an adaptive optimistic control mechanism. This mechanism exposes information such that all the cores are provided with jobs and the tools necessary to take advantage of that information and thus compete for resources in an uncoordinated manner. At the same time, the mechanism is equipped with adaptive control, allowing it to adjust the number of running tools dynamically when frequent conflict happens. We justify this scheduling model and present the simulation results for synthetic and real-world HPC workloads, in which we compare our proposed model with two widely used scheduling models, i.e. multi-path monolithic and two-level scheduling. The proposed approach outperforms the other models in scheduling efficiency and scalability. Our results demonstrate that the adaptive optimistic control affords significant improvements for HPC workloads in the parallelism of the node-level scheduling model and performance.

  • G-HBase: A High Performance Geographical Database Based on HBase

    Hong Van LE  Atsuhiro TAKASU  

     
    PAPER

      Pubricized:
    2018/01/18
      Vol:
    E101-D No:4
      Page(s):
    1053-1065

    With the recent explosion of geographic data generated by smartphones, sensors, and satellites, a data storage that can handle the massive volume of data and support high-computational spatial queries is becoming essential. Although key-value stores efficiently handle large-scale data, they are not equipped with effective functions for supporting geographic data. To solve this problem, in this paper, we present G-HBase, a high-performance geographical database based on HBase, a standard key-value store. To index geographic data, we first use Geohash as the rowkey in HBase. Then, we present a novel partitioning method, namely binary Geohash rectangle partitioning, to support spatial queries. Our extensive experiments on real datasets have demonstrated an improved performance with k nearest neighbors and range query in G-HBase when compared with SpatialHadoop, a state-of-the-art framework with native support for spatial data. We also observed that performance of spatial join in G-HBase is on par with SpatialHadoop and outperforms SJMR algorithm in HBase.

  • A Layout-Oriented Routing Method for Low-Latency HPC Networks

    Ryuta KAWANO  Hiroshi NAKAHARA  Ikki FUJIWARA  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Interconnection networks

      Pubricized:
    2017/07/14
      Vol:
    E100-D No:12
      Page(s):
    2796-2807

    End-to-end network latency has become an important issue for parallel application on large-scale high performance computing (HPC) systems. It has been reported that randomly-connected inter-switch networks can lower the end-to-end network latency. This latency reduction is established in exchange for a large amount of routing information. That is, minimal routing on irregular networks is achieved by using routing tables for all destinations in the networks. In this work, a novel distributed routing method called LOREN (Layout-Oriented Routing with Entries for Neighbors) to achieve low-latency with a small routing table is proposed for irregular networks whose link length is limited. The routing tables contain both physically and topologically nearby neighbor nodes to ensure livelock-freedom and a small number of hops between nodes. Experimental results show that LOREN reduces the average latencies by 5.8% and improves the network throughput by up to 62% compared with a conventional compact routing method. Moreover, the number of required routing table entries is reduced by up to 91%, which improves scalability and flexibility for implementation.

  • A Novel Channel Assignment Method to Ensure Deadlock-Freedom for Deterministic Routing

    Ryuta KAWANO  Hiroshi NAKAHARA  Seiichi TADE  Ikki FUJIWARA  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Computer System

      Pubricized:
    2017/05/19
      Vol:
    E100-D No:8
      Page(s):
    1798-1806

    Inter-switch networks for HPC systems and data-centers can be improved by applying random shortcut topologies with a reduced number of hops. With minimal routing in such networks; however, deadlock-freedom is not guaranteed. Multiple Virtual Channels (VCs) are efficiently used to avoid this problem. However, previous works do not provide good trade-offs between the number of required VCs and the time and memory complexities of an algorithm. In this work, a novel and fast algorithm, named ACRO, is proposed to endorse the arbitrary routing functions with deadlock-freedom, as well as consuming a small number of VCs. A heuristic approach to reduce VCs is achieved with a hash table, which improves the scalability of the algorithm compared with our previous work. Moreover, experimental results show that ACRO can reduce the average number of VCs by up to 63% when compared with a conventional algorithm that has the same time complexity. Furthermore, ACRO reduces the time complexity by a factor of O(|N|⋅log|N|), when compared with another conventional algorithm that requires almost the same number of VCs.

  • EDISON Science Gateway: A Cyber-Environment for Domain-Neutral Scientific Computing

    Hoon RYU  Jung-Lok YU  Duseok JIN  Jun-Hyung LEE  Dukyun NAM  Jongsuk LEE  Kumwon CHO  Hee-Jung BYUN  Okhwan BYEON  

     
    PAPER-Scientific Application

      Vol:
    E97-D No:8
      Page(s):
    1953-1964

    We discuss a new high performance computing service (HPCS) platform that has been developed to provide domain-neutral computing service under the governmental support from “EDucation-research Integration through Simulation On the Net” (EDISON) project. With a first focus on technical features, we not only present in-depth explanations of the implementation details, but also describe the strengths of the EDISON platform against the successful nanoHUB.org gateway. To validate the performance and utility of the platform, we provide benchmarking results for the resource virtualization framework, and prove the stability and promptness of the EDISON platform in processing simulation requests by analyzing several statistical datasets obtained from a three-month trial service in the initiative area of computational nanoelectronics. We firmly believe that this work provides a good opportunity for understanding the science gateway project ongoing for the first time in Republic of Korea, and that the technical details presented here can be served as an useful guideline for any potential designs of HPCS platforms.

  • Digital Calibration and Correction Methods for CMOS Analog-to-Digital Converters Open Access

    Shiro DOSHO  

     
    INVITED PAPER

      Vol:
    E95-C No:4
      Page(s):
    421-431

    Along with the miniaturization of CMOS-LSIs, control methods for LSIs have been extensively developed. The most predominant method is to digitize observed values as early as possible and to use digital control. Thus, many types of analog-to-digital converters (ADCs) have been developed such as temperature, time, delay, and frequency converters. ADCs are the easiest circuits into which digital correction methods can be introduced because their outputs are digital. Various types of calibration method have been developed, which has markedly improved the figure of merits by alleviating margins for device variations. The above calibration and correction methods not only overcome a circuit's weak points but also give us the chance to develop quite new circuit topologies and systems. In this paper, several digital calibration and correction methods for major analog-to-digital converters are described, such as pipelined ADCs, delta-sigma ADCs, and successive approximation ADCs.

  • An Energy-Efficient Full Adder Cell Using CNFET Technology

    Mohammad Reza RESHADINEZHAD  Mohammad Hossein MOAIYERI  Kaivan NAVI  

     
    PAPER-Electronic Circuits

      Vol:
    E95-C No:4
      Page(s):
    744-751

    The reduction in the gate length of the current devices to 65 nm causes their I-V characteristics to depart from the traditional MOSFETs. As a result, manufacturing of new efficient devices in nanoscale is inevitable. The fundamental properties of the metallic and semi-conducting carbon Nanotubes (CNTs) make them alternatives to the conventional silicon-based devices. In this paper an ultra high-speed and energy-efficient full adder is proposed, using Carbon Nanotube Field Effect Transistor (CNFET) in nanoscale. Extensive simulation results using HSPICE are reported to show that the proposed adder consumes lower power, and is faster compared to the previous adders.

  • Several Types of Antennas Composed of Microwave Metamaterials Open Access

    Tie Jun CUI  Xiao-Yang ZHOU  Xin Mi YANG  Wei Xiang JIANG  Qiang CHENG  Hui Feng MA  

     
    INVITED PAPER

      Vol:
    E94-B No:5
      Page(s):
    1142-1152

    We present a review of several types of microwave antennas made of metamaterials, including the resonant electrically small antennas, metamaterial-substrate patch antennas, metamaterial flat-lens antennas, and Luneburg lens antennas. In particular, we propose a new type of conformal antennas using anisotropic zero-index metamaterials, which have high gains and low sidelobes. Numerical simulations and experimental results show that metamaterials have unique properties to design new antennas with high performance.

  • Heuristic Sizing Methodology for Designing High-Performance CMOS Level Converters with Balanced Rise and Fall Delays

    Jinn-Shyan WANG  Yu-Juey CHANG  Chingwei YEH  

     
    BRIEF PAPER-Electronic Circuits

      Vol:
    E93-C No:10
      Page(s):
    1540-1543

    CMOS SoCs can reduce power consumption by adopting voltage scaling (VS) technologies, where the level converter (LC) is required between voltage domains to avoid dc current. However, the LC often induces high delay penalty and usually results in non-balanced rise and fall delays. Therefore, the performance of the LC strongly affects the effectiveness of VS technologies. In this paper, heuristic sizing methodology for designing a state-of-the-art LC is developed and proposed. Using the proposed methodology, we can design the LC to achieve high performance with balanced rise and fall delay times in a deterministic way.

  • A High Performance and Low Bandwidth Multi-Standard Motion Compensation Design for HD Video Decoder

    Xianmin CHEN  Peilin LIU  Dajiang ZHOU  Jiayi ZHU  Xingguang PAN  Satoshi GOTO  

     
    PAPER

      Vol:
    E93-C No:3
      Page(s):
    253-260

    Motion compensation is widely used in many video coding standards. Due to its bandwidth requirement and complexity, motion compensation is one of the most challenging parts in the design of high definition video decoder. In this paper, we propose a high performance and low bandwidth motion compensation design, which supports H.264/AVC, MPEG-1/2 and Chinese AVS standards. We introduce a 2-Dimensional cache that can greatly reduce the external bandwidth requirement. Similarities among the 3 standards are also explored to reduce hardware cost. We also propose a block-pipelining strategy to conceal the long latency of external memory access. Experimental results show that our motion compensation design can reduce the bandwidth by 74% in average and it can real-time decode 1920x1088@30 fps video stream at 80 MHz.

  • A Conditional Isolation Technique for Low-Energy and High-Performance Wide Domino Gates

    How-Rern LIN  Wei-Hao CHIU  Tsung-Yi WU  

     
    PAPER

      Vol:
    E92-C No:4
      Page(s):
    386-390

    A new conditional isolation technique (CI-Domino) in domino logic is proposed for wide domino gates. This technique can not only reduce the subthreshold and gate oxide leakage currents simultaneously without sacrificing circuit performance, but also it can be utilized to speed up the evaluation time of domino gate. Simulations on high fan-in domino OR gates with 0.18 µm process technology show that the proposed technique achieves reduction on total static power by 36%, dynamic power by 49.14%, and delay time by 60.27% compared to the conventional domino gate. Meanwhile, the proposed technique also gains about 48.14% improvement on leakage tolerance.

  • Robust Noise Suppression Algorithm with the Kalman Filter Theory for White and Colored Disturbance

    Nari TANABE  Toshihiro FURUKAWA  Shigeo TSUJII  

     
    PAPER-Digital Signal Processing

      Vol:
    E91-A No:3
      Page(s):
    818-829

    We propose a noise suppression algorithm with the Kalman filter theory. The algorithm aims to achieve robust noise suppression for the additive white and colored disturbance from the canonical state space models with (i) a state equation composed of the speech signal and (ii) an observation equation composed of the speech signal and additive noise. The remarkable features of the proposed algorithm are (1) applied to adaptive white and colored noises where the additive colored noise uses babble noise, (2) realization of high performance noise suppression without sacrificing high quality of the speech signal despite simple noise suppression using only the Kalman filter algorithm, while many conventional methods based on the Kalman filter theory usually perform the noise suppression using the parameter estimation algorithm of AR (auto-regressive) system and the Kalman filter algorithm. We show the effectiveness of the proposed method, which utilizes the Kalman filter theory for the proposed canonical state space model with the colored driving source, using numerical results and subjective evaluation results.

  • Cache Efficient Radix Sort for String Sorting

    Waihong NG  Katsuhiko KAKEHI  

     
    PAPER-Algorithms and Data Structures

      Vol:
    E90-A No:2
      Page(s):
    457-466

    In this paper, we propose CRadix sort, a new string sorting algorithm based on MSD radix sort. CRadix sort causes fewer cache misses than MSD radix sort by uniquely associating a small block of main memory called the key buffer to each key and temporarily storing a portion of each key into its corresponding key buffer. Experimental results in running time comparisons with other string sorting algorithms are provided for showing the effectiveness of CRadix sort.

  • Algorithmic Concept Recognition to Support High Performance Code Reengineering

    Beniamino DI MARTINO  

     
    PAPER-Software Support and Optimization Techniques

      Vol:
    E87-D No:7
      Page(s):
    1743-1750

    Techniques for automatic program recognition, at the algorithmic level, could be of high interest for the area of Software Maintenance, in particular for knowledge based reengineering, because the selection of suitable restructuring strategies is mainly driven by algorithmic features of the code. In this paper an automated hierarchical concept parsing recognition technique, and a formalism for the specification of algorithmic concepts, is presented. Based on this technique, the design and development of ALCOR, a production rule based system for automatic recognition of algorithmic concepts within programs, aimed at support of knowledge based reengineering for high performance, is presented.

  • A Novel Timing-Driven Global Routing Algorithm Considering Coupling Effects for High Performance Circuit Design

    Jingyu XU  Xianlong HONG  Tong JING  Yici CAI  Jun GU  

     
    PAPER-Place and Routing

      Vol:
    E86-A No:12
      Page(s):
    3158-3167

    As the CMOS technology enters the very deep submicron era, inter-wire coupling capacitance becomes the dominant part of load capacitance. The coupling effects have brought new challenges to routing algorithms on both delay estimation and optimization. In this paper, we propose a timing-driven global routing algorithm with consideration of coupling effects. Our two-phase algorithm based on timing-relax method includes a heuristic Steiner tree algorithm to guarantee the timing performance of the initial solution and an optimization algorithm based on coupling-effect-transference. Experimental results are given to demonstrate the efficiency and accuracy of the algorithm.

  • The Development of the Earth Simulator

    Shinichi HABATA  Mitsuo YOKOKAWA  Shigemune KITAWAKI  

     
    INVITED PAPER

      Vol:
    E86-D No:10
      Page(s):
    1947-1954

    The Earth Simulator (ES), developed by the Japanese government's initiative "Earth Simulator project," is a highly parallel vector supercomputer system. In May 2002, the ES was proven to be the most powerful computer in the world by achieving 35.86 teraflops on the LINPACK benchmark and 26.58 teraflops for a global atmospheric circulation model with the spectral method. Three architectural features enabled these great achievements; vector processor, shared-memory and high-bandwidth non-blocking interconnection crossbar network. In this paper, an overview of the ES, the three architectural features and the result of performance evaluation are described particularly with its hardware realization of the interconnection among 640 processor nodes.

  • Selective Multi-Threshold Technique for High-Performance and Low-Standby Applications

    Kimiyoshi USAMI  Naoyuki KAWABE  Masayuki KOIZUMI  Katsuhiro SETA  Toshiyuki FURUSAWA  

     
    PAPER-Optimization of Power and Timing

      Vol:
    E85-A No:12
      Page(s):
    2667-2673

    In portable applications such as W-CDMA cell phones, high performance and low standby leakage are both required. We propose an automated design technique to selectively use multi-threshold CMOS (MTCMOS) in a cell-by-cell fashion. MT cells consisting of low-Vth transistors and high-Vth sleep transistors are newly introduced. MT cells are assigned to critical paths to speed up, while High-Vth cells are assigned to non-critical paths to reduce leakage. Compared to the conventional MTCMOS, the gate delay is not affected by the discharge patterns of other gates because there is no virtual ground to be shared. We applied this technique to a test chip of a DSP core for W-CDMA baseband LSI. The worst path-delay was improved by 14% over the single high-Vth design without increasing standby leakage at 10% area overhead.

  • Trends in High-Performance, Low-Power Cache Memory Architectures

    Koji INOUE  Vasily G. MOSHNYAGA  Kazuaki MURAKAMI  

     
    PAPER-High-Performance Technologies

      Vol:
    E85-C No:2
      Page(s):
    304-314

    One of uncompromising requirements from portable computing is energy efficiency, because that affects directly the battery life. On the other hand, portable computing will target more demanding applications, for example moving pictures, so that higher performance is still required. Cache memories have been employed as one of the most important components of computer systems. In this paper, we briefly survey architectural techniques for high performance, low power cache memories.

1-20hit(30hit)