The search functionality is under construction.

Author Search Result

[Author] Masaaki KONDO(12hit)

1-12hit
  • Efficient and Precise Profiling, Modeling and Management on Power and Performance for Power Constrained HPC Systems

    Yuan HE  Yasutaka WADA  Wenchao LUO  Ryuichi SAKAMOTO  Guanqin PAN  Thang CAO  Masaaki KONDO  

     
    PAPER

      Pubricized:
    2020/12/01
      Vol:
    E104-C No:6
      Page(s):
    237-246

    Due to the slowdown of Moore's Law, power limitation has been one of the most critical issues for current and future HPC systems. To more efficiently utilize HPC systems when power budgets or deadlines are given, it is very desirable to accurately estimate the performance or power consumption of applications before conducting their tuned production runs on any specific systems. In order to ease such estimations, we showcase a straight-forward and yet effective method, based on the enhanced power management framework and DSL we developed, to help HPC users to clarify the performance and power relationships of their applications. This method demonstrates an easy process of profiling, modeling and management on both performance and power of HPC systems and applications. In our evaluations, only a few (up to 3) profiled runs are necessary before very precise models of HPC applications can be obtained through this method (and algorithm), which has dramatically improved the efficiency of and lowered the difficulty in utilizing HPC systems under limited power budgets.

  • A Fine-Grained Power Gating Control on Linux Monitoring Power Consumption of Processor Functional Units

    Atsushi KOSHIBA  Motoki WADA  Ryuichi SAKAMOTO  Mikiko SATO  Tsubasa KOSAKA  Kimiyoshi USAMI  Hideharu AMANO  Masaaki KONDO  Hiroshi NAKAMURA  Mitaro NAMIKI  

     
    PAPER

      Vol:
    E98-C No:7
      Page(s):
    559-568

    The authors have been researching on reducing the power consumption of microprocessors, and developed a low-power processor called “Geyser” by applying power gating (PG) function to the individual functional units of the processor. PG function on Geyser reduces the power consumption of functional units by shutting off the power voltage of idle units. However, the energy overhead of switching the supply voltage for units on and off causes power increases. The amount of the energy overhead varies with the behavior of each functional unit which is influenced by running application, and also with the core temperature. It is therefore necessary to switch the PG function itself on or off according to the state of the processor at runtime to reduce power consumption more effectively. In this paper, the authors propose a PG control method to take the power overhead into account by the operating system (OS). In the proposed method, for achieving much power reduction, the OS calculates the power consumption of each functional unit periodically and inhibits the PG function of the unit whose energy overhead is judged too high. The method was implemented in the Linux process scheduler and evaluated. The results show that the average power consumption of the functional units is reduced by up to 17.2%.

  • A Prototype System for Many-Core Architecture SMYLEref with FPGA Evaluation Boards

    Son-Truong NGUYEN  Masaaki KONDO  Tomoya HIRAO  Koji INOUE  

     
    PAPER-Architecture

      Vol:
    E96-D No:8
      Page(s):
    1645-1653

    Nowadays, the trend of developing micro-processor with hundreds of cores brings a promising prospect for embedded systems. Realizing a high performance and low power many-core processor is becoming a primary technical challenge. Generally, three major issues required to be resolved includes: 1) realizing efficient massively parallel processing, 2) reducing dynamic power consumption, and 3) improving software productivity. To deal with these issues, we propose a solution to use many low-performance but small and very low-power cores to obtain very high performance, and develop a referential many-core architecture and a program development environment. This paper introduces a many-core architecture named SMYLEref and its prototype system with off-the-shelf FPGA evaluation boards. The initial evaluation results of several SPLASH2 benchmark programs conducted on our developed 128-core platform are also presented and discussed in this paper.

  • Fine-Grained Run-Tume Power Gating through Co-optimization of Circuit, Architecture, and System Software Design Open Access

    Hiroshi NAKAMURA  Weihan WANG  Yuya OHTA  Kimiyoshi USAMI  Hideharu AMANO  Masaaki KONDO  Mitaro NAMIKI  

     
    INVITED PAPER

      Vol:
    E96-C No:4
      Page(s):
    404-412

    Power consumption has recently emerged as a first class design constraint in system LSI designs. Specially, leakage power has occupied a large part of the total power consumption. Therefore, reduction of leakage power is indispensable for efficient design of high-performance system LSIs. Since 2006, we have carried out a research project called “Innovative Power Control for Ultra Low-Power and High-Performance System LSIs”, supported by Japan Science and Technology Agency as a CREST research program. One of the major objectives of this project is reducing the leakage power consumption of system LSIs by innovative power control through tight cooperation and co-optimization of circuit technology, architecture, and system software designs. In this project, we focused on power gating as a circuit technique for reducing leakage power. Temporal granularity is one of the most important issue in power gating. Thus, we have developed a series of Geysers as proof-of-concept CPUs which provide several mechanisms of fine-grained run-time power gating. In this paper, we describe their concept and design, and explain why co-optimization of different design layers are important. Then, three kinds of power gating implementations and their evaluation are presented from the view point of power saving and temporal granularity.

  • An Interactive and Reductive Graph Processing Library for Edge Computing in Smart Society

    Jun ZHOU  Masaaki KONDO  

     
    PAPER

      Pubricized:
    2022/11/07
      Vol:
    E106-D No:3
      Page(s):
    319-327

    Due to the limitations of cloud computing on latency, bandwidth and data confidentiality, edge computing has emerged as a novel location-aware paradigm to provide them with more processing capacity to improve the computing performance and quality of service (QoS) in several typical domains of human activity in smart society, such as social networks, medical diagnosis, telecommunications, recommendation systems, internal threat detection, transports, Internet of Things (IoT), etc. These application domains often handle a vast collection of entities with various relationships, which can be naturally represented by the graph data structure. Graph processing is a powerful tool to model and optimize complex problems in which the graph-based data is involved. In view of the relatively insufficient resource provisioning of the portable terminals, in this paper, for the first time to our knowledge, we propose an interactive and reductive graph processing library (GPL) for edge computing in smart society at low overhead. Experimental evaluation is conducted to indicate that the proposed GPL is more user-friendly and highly competitive compared with other established systems, such as igraph, NetworKit and NetworkX, based on different graph datasets over a variety of popular algorithms.

  • Evaluation of Performance and Power Consumption on Supercomputer Fugaku Using SPEC HPC Benchmarks

    Yuetsu KODAMA  Masaaki KONDO  Mitsuhisa SATO  

     
    PAPER

      Pubricized:
    2022/12/12
      Vol:
    E106-C No:6
      Page(s):
    303-311

    The supercomputer, “Fugaku”, which ranked number one in multiple supercomputing lists, including the Top500 in June 2020, has various power control features, such as (1) an eco mode that utilizes only one of two floating-point pipelines while decreasing the power supply to the chip; (2) a boost mode that increases clock frequency; and (3) a core retention feature that turns unused cores to the low-power state. By orchestrating these power-performance features while considering the characteristics of running applications, we can potentially gain even better system-level energy efficiency. In this paper, we report on the performance and power consumption of Fugaku using SPEC HPC benchmarks. Consequently, we confirmed that it is possible to reduce the energy by about 17% while improving the performance by about 2% from the normal mode by combining boost mode and eco mode.

  • Adaptive Lossy Data Compression Extended Architecture for Memory Bandwidth Conservation in SpMV

    Siyi HU  Makiko ITO  Takahide YOSHIKAWA  Yuan HE  Hiroshi NAKAMURA  Masaaki KONDO  

     
    PAPER

      Pubricized:
    2023/07/20
      Vol:
    E106-D No:12
      Page(s):
    2015-2025

    Widely adopted by machine learning and graph processing applications nowadays, sparse matrix-Vector multiplication (SpMV) is a very popular algorithm in linear algebra. This is especially the case for fully-connected MLP layers, which dominate many SpMV computations and play a substantial role in diverse services. As a consequence, a large fraction of data center cycles is spent on SpMV kernels. Meanwhile, despite having efficient storage options against sparsity (such as CSR or CSC), SpMV kernels still suffer from the problem of limited memory bandwidth during data transferring because of the memory hierarchy of modern computing systems. In more detail, we find that both integer and floating-point data used in SpMV kernels are handled plainly without any necessary pre-processing. Therefore, we believe bandwidth conservation techniques, such as data compression, may dramatically help SpMV kernels when data is transferred between the main memory and the Last Level Cache (LLC). Furthermore, we also observe that convergence conditions in some typical scientific computation benchmarks (based on SpMV kernels) will not be degraded when adopting lower precision floating-point data. Based on these findings, in this work, we propose a simple yet effective data compression scheme that can be extended to general purpose computing architectures or HPC systems preferably. When it is adopted, a best-case speedup of 1.92x is made. Besides, evaluations with both the CG kernel and the PageRank algorithm indicate that our proposal introduces negligible overhead on both the convergence speed and the accuracy of final results.

  • An Operating System Guided Fine-Grained Power Gating Control Based on Runtime Characteristics of Applications

    Atsushi KOSHIBA  Mikiko SATO  Kimiyoshi USAMI  Hideharu AMANO  Ryuichi SAKAMOTO  Masaaki KONDO  Hiroshi NAKAMURA  Mitaro NAMIKI  

     
    PAPER

      Vol:
    E99-C No:8
      Page(s):
    926-935

    Fine-grained power gating (FGPG) is a power-saving technique by switching off circuit blocks while the blocks are idle. Although FGPG can reduce power consumption without compromising computational performance, switching the power supply on and off causes energy overhead. To prevent power increase caused by the energy overhead, in our prior research we proposed an FGPG control method of the operating system(OS) based on pre-analyzing applications' power usage. However, modern computing systems have a wide variety of use cases and run many types of application; this makes it difficult to analyze the behavior of all these applications in advance. This paper therefore proposes a new FGPG control method without profiling application programs in advance. In the new proposed method, the OS monitors a circuit's idle interval periodically while application programs are running. The OS enables FGPG only if the interval time is long enough to reduce the power consumption. The experimental results in this paper show that the proposed method reduces power consumption by 9.8% on average and up to 17.2% at 25°C. The results also show that the proposed method achieves almost the same power-saving efficiency as the previous profile-based method.

  • Reducing Memory System Energy by Software-Controlled On-Chip Memory

    Masaaki KONDO  Hiroshi NAKAMURA  

     
    PAPER-Architecture and Algorithms

      Vol:
    E86-C No:4
      Page(s):
    580-588

    In recent computer systems, a large portion of energy is consumed by on-chip cache accesses and data movement between cache and off-chip main memory. Reducing these memory system energy is indispensable for future microprocessors because power and thermal issues certainly become a key factor of limiting processor performance. In this paper, we discuss and evaluate how our architecture called SCIMA contributes to energy saving. SCIMA integrates software-controllable memory (SCM) into processor chip. SCIMA can save total memory system energy by using SCM under the support of compiler. The evaluation results reveal that SCIMA can reduce 5-50% of memory system energy and still faster than conventional cache based architecture.

  • Design Method of High Performance and Low Power Functional Units Considering Delay Variations

    Kouichi WATANABE  Masashi IMAI  Masaaki KONDO  Hiroshi NAKAMURA  Takashi NANYA  

     
    PAPER-Circuit Synthesis

      Vol:
    E89-A No:12
      Page(s):
    3519-3528

    As VLSI technology advances, delay variations will become more serious. Delay-insensitive asynchronous dual-rail circuits tolerate any delay variation, but their energy consumption is more than double that of the single-rail circuits because signal transitions occur every cycle in all bits regardless of the input bit pattern. However, in functional units, a significant number of input bits may not change from the previous input in many cases. In such a situation, calculation of these bits is not required. Thus, we propose a method, called unflip-bits control, makes use of the above situation, to reduce energy consumption. We evaluate the energy consumption and performance penalty for the method using HSPICE and the verilog-XL simulator, and compare the method with the conventional dual-rail circuit and a synchronous circuit. Our evaluation results reveal that the proposed asynchronous dual-rail circuit has a 12-60% lower energy consumption compared with a conventional asynchronous dual-rail circuit.

  • A Runtime Optimization Selection Framework to Realize Energy Efficient Networks-on-Chip

    Yuan HE  Masaaki KONDO  Takashi NAKADA  Hiroshi SASAKI  Shinobu MIWA  Hiroshi NAKAMURA  

     
    PAPER-Architecture

      Pubricized:
    2016/08/24
      Vol:
    E99-D No:12
      Page(s):
    2881-2890

    Networks-on-Chip (or NoCs, for short) play important roles in modern and future multi-core processors as they are highly related to both performance and power consumption of the entire chip. Up to date, many optimization techniques have been developed to improve NoC's bandwidth, latency and power consumption. But a clear answer to how energy efficiency is affected with these optimization techniques is yet to be found since each of these optimization techniques comes with its own benefits and overheads while there are also too many of them. Thus, here comes the problem of when and how such optimization techniques should be applied. In order to solve this problem, we build a runtime framework to throttle these optimization techniques based on concise performance and energy models. With the help of this framework, we can successfully establish adaptive selections over multiple optimization techniques to further improve performance or energy efficiency of the network at runtime.

  • Evaluation of Checkpointing Mechanism on SCore Cluster System

    Masaaki KONDO  Takuro HAYASHIDA  Masashi IMAI  Hiroshi NAKAMURA  Takashi NANYA  Atsushi HORI  

     
    PAPER-Dependable Software

      Vol:
    E86-D No:12
      Page(s):
    2553-2562

    Cluster systems are getting widely used because of good performance / cost ratio. However, their reliability has not been well discussed in practical environment so far. As the number of commodity components in a cluster system gets increased, it is indispensable to support reliability by system software. SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides checkpointing and rollback-recovery mechanism for high availability. In this paper, we analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore quantitively. The experimental results reveal that the required time for checkpointing scales very well in respect to the number of computing nodes. However, the required time is quite long due to the low effective network bandwidth. Based on the results, we modify SCore and successfully make checkpointing and recovery 1.8 2.8 times and 3.7 5.0 times faster respectively. This is very helpful for cluster systems to achieve high performance and high availability.