Yuan HE Yasutaka WADA Wenchao LUO Ryuichi SAKAMOTO Guanqin PAN Thang CAO Masaaki KONDO
Due to the slowdown of Moore's Law, power limitation has been one of the most critical issues for current and future HPC systems. To more efficiently utilize HPC systems when power budgets or deadlines are given, it is very desirable to accurately estimate the performance or power consumption of applications before conducting their tuned production runs on any specific systems. In order to ease such estimations, we showcase a straight-forward and yet effective method, based on the enhanced power management framework and DSL we developed, to help HPC users to clarify the performance and power relationships of their applications. This method demonstrates an easy process of profiling, modeling and management on both performance and power of HPC systems and applications. In our evaluations, only a few (up to 3) profiled runs are necessary before very precise models of HPC applications can be obtained through this method (and algorithm), which has dramatically improved the efficiency of and lowered the difficulty in utilizing HPC systems under limited power budgets.
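The idea of building a model from only a few profiled runs can be illustrated with a minimal sketch. The abstract does not state the model form, so the form used here, T(f) = a + b / f for runtime at CPU frequency f, is an assumption, as are the function names and the three sample measurements.

```python
def fit_time_model(runs):
    """Least-squares fit of T(f) = a + b / f.

    `runs` is a list of (frequency_ghz, runtime_seconds) pairs.
    The model form is an assumption for illustration only.
    """
    xs = [1.0 / f for f, _ in runs]   # regress runtime on 1/f
    ys = [t for _, t in runs]
    n = len(runs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                     # frequency-dependent part
    a = mean_y - b * mean_x           # frequency-independent part
    return a, b


def predict_time(model, f):
    """Predicted runtime in seconds at frequency f (GHz)."""
    a, b = model
    return a + b / f


# Three hypothetical profiled runs: (CPU frequency in GHz, runtime in seconds).
profiled = [(1.0, 130.0), (2.0, 70.0), (3.0, 50.0)]
model = fit_time_model(profiled)
print(predict_time(model, 2.5))
```

With a fitted model of this kind, runtime at an unprofiled frequency can be estimated before a production run; a power model could be fitted the same way from measured power samples.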
Atsushi KOSHIBA Motoki WADA Ryuichi SAKAMOTO Mikiko SATO Tsubasa KOSAKA Kimiyoshi USAMI Hideharu AMANO Masaaki KONDO Hiroshi NAKAMURA Mitaro NAMIKI
The authors have been researching ways to reduce the power consumption of microprocessors, and have developed a low-power processor called “Geyser” by applying a power gating (PG) function to the individual functional units of the processor. The PG function on Geyser reduces the power consumption of functional units by shutting off the supply voltage of idle units. However, the energy overhead of switching the supply voltage for units on and off can increase power consumption. The amount of the energy overhead varies with the behavior of each functional unit, which is influenced by the running application, and also with the core temperature. It is therefore necessary to switch the PG function itself on or off according to the state of the processor at runtime to reduce power consumption more effectively. In this paper, the authors propose a PG control method in which the operating system (OS) takes the power overhead into account. In the proposed method, to achieve greater power reduction, the OS periodically calculates the power consumption of each functional unit and inhibits the PG function of any unit whose energy overhead is judged too high. The method was implemented in the Linux process scheduler and evaluated. The results show that the average power consumption of the functional units is reduced by up to 17.2%.
Son-Truong NGUYEN Masaaki KONDO Tomoya HIRAO Koji INOUE
Nowadays, the trend of developing microprocessors with hundreds of cores brings a promising prospect for embedded systems. Realizing a high-performance and low-power many-core processor is becoming a primary technical challenge. Generally, three major issues need to be resolved: 1) realizing efficient massively parallel processing, 2) reducing dynamic power consumption, and 3) improving software productivity. To deal with these issues, we propose a solution that uses many low-performance but small and very low-power cores to obtain very high performance, and we develop a referential many-core architecture and a program development environment. This paper introduces a many-core architecture named SMYLEref and its prototype system built with off-the-shelf FPGA evaluation boards. The initial evaluation results of several SPLASH2 benchmark programs run on our 128-core platform are also presented and discussed in this paper.
Hiroshi NAKAMURA Weihan WANG Yuya OHTA Kimiyoshi USAMI Hideharu AMANO Masaaki KONDO Mitaro NAMIKI
Power consumption has recently emerged as a first-class design constraint in system LSI designs. In particular, leakage power has come to occupy a large part of the total power consumption. Therefore, reduction of leakage power is indispensable for efficient design of high-performance system LSIs. Since 2006, we have carried out a research project called “Innovative Power Control for Ultra Low-Power and High-Performance System LSIs”, supported by the Japan Science and Technology Agency as a CREST research program. One of the major objectives of this project is reducing the leakage power consumption of system LSIs by innovative power control through tight cooperation and co-optimization of circuit technology, architecture, and system software designs. In this project, we focused on power gating as a circuit technique for reducing leakage power. Temporal granularity is one of the most important issues in power gating. Thus, we have developed a series of Geysers as proof-of-concept CPUs which provide several mechanisms of fine-grained run-time power gating. In this paper, we describe their concept and design, and explain why co-optimization of different design layers is important. Then, three kinds of power gating implementations and their evaluation are presented from the viewpoint of power saving and temporal granularity.
Due to the limitations of cloud computing in latency, bandwidth and data confidentiality, edge computing has emerged as a novel location-aware paradigm that provides more processing capacity to improve the computing performance and quality of service (QoS) in several typical domains of human activity in smart society, such as social networks, medical diagnosis, telecommunications, recommendation systems, internal threat detection, transport, and the Internet of Things (IoT). These application domains often handle a vast collection of entities with various relationships, which can be naturally represented by the graph data structure. Graph processing is a powerful tool to model and optimize complex problems that involve graph-based data. In view of the relatively insufficient resource provisioning of portable terminals, in this paper, for the first time to our knowledge, we propose an interactive and reductive graph processing library (GPL) for edge computing in smart society at low overhead. Experimental evaluation on different graph datasets over a variety of popular algorithms indicates that the proposed GPL is more user-friendly than, and highly competitive with, other established systems such as igraph, NetworKit and NetworkX.
Yuetsu KODAMA Masaaki KONDO Mitsuhisa SATO
The supercomputer, “Fugaku”, which ranked number one in multiple supercomputing lists, including the Top500 in June 2020, has various power control features, such as (1) an eco mode that utilizes only one of two floating-point pipelines while decreasing the power supply to the chip; (2) a boost mode that increases clock frequency; and (3) a core retention feature that turns unused cores to the low-power state. By orchestrating these power-performance features while considering the characteristics of running applications, we can potentially gain even better system-level energy efficiency. In this paper, we report on the performance and power consumption of Fugaku using SPEC HPC benchmarks. Consequently, we confirmed that it is possible to reduce the energy by about 17% while improving the performance by about 2% from the normal mode by combining boost mode and eco mode.
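The two reported figures can be related through energy = average power × runtime. The sketch below assumes that "improving the performance by about 2%" means a speedup of 1.02 (runtime divided by 1.02); that interpretation and the function name are assumptions, not statements from the abstract.

```python
def relative_power(energy_ratio, speedup):
    """Ratio of average power (new / baseline).

    From E = P * T and T_new = T_base / speedup:
    P_new / P_base = (E_new / E_base) * speedup.
    """
    return energy_ratio * speedup


# About 17% less energy and about 2% higher performance than normal mode.
ratio = relative_power(energy_ratio=0.83, speedup=1.02)
print(f"average power relative to normal mode: {ratio:.4f}")
```

Under that assumption, the average power in the combined boost and eco configuration would be roughly 85% of the normal-mode value.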
Siyi HU Makiko ITO Takahide YOSHIKAWA Yuan HE Hiroshi NAKAMURA Masaaki KONDO
Widely adopted by machine learning and graph processing applications nowadays, sparse matrix-vector multiplication (SpMV) is a very popular algorithm in linear algebra. This is especially the case for fully-connected MLP layers, which dominate many SpMV computations and play a substantial role in diverse services. As a consequence, a large fraction of data center cycles is spent on SpMV kernels. Meanwhile, despite having efficient storage formats for sparse data (such as CSR or CSC), SpMV kernels still suffer from limited memory bandwidth during data transfer because of the memory hierarchy of modern computing systems. In more detail, we find that both integer and floating-point data used in SpMV kernels are handled plainly, without any pre-processing. Therefore, we believe bandwidth conservation techniques, such as data compression, may dramatically help SpMV kernels when data is transferred between the main memory and the Last Level Cache (LLC). Furthermore, we also observe that convergence conditions in some typical scientific computation benchmarks (based on SpMV kernels) are not degraded when lower-precision floating-point data is adopted. Based on these findings, in this work, we propose a simple yet effective data compression scheme that can be extended to general-purpose computing architectures or, preferably, HPC systems. When it is adopted, a best-case speedup of 1.92x is achieved. Besides, evaluations with both the CG kernel and the PageRank algorithm indicate that our proposal introduces negligible overhead on both the convergence speed and the accuracy of final results.
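For readers unfamiliar with the CSR format mentioned above, a minimal SpMV sketch follows. It shows only the standard CSR kernel, not the compression scheme proposed in the paper; the array names and the sample matrix are illustrative.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row (floating-point data)
    col_idx : column index of each nonzero (integer data)
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # One float and one integer are streamed from memory per nonzero.
            y[row] += values[k] * x[col_idx[k]]
    return y


# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
x = [1.0, 2.0, 3.0]
y = csr_spmv(values, col_idx, row_ptr, x)
print(y)
```

Each nonzero costs one multiply-add but requires loading a value and an index, which is why the kernel tends to be limited by memory bandwidth and why compressing the integer and floating-point streams is attractive.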
Atsushi KOSHIBA Mikiko SATO Kimiyoshi USAMI Hideharu AMANO Ryuichi SAKAMOTO Masaaki KONDO Hiroshi NAKAMURA Mitaro NAMIKI
Fine-grained power gating (FGPG) is a power-saving technique that switches off circuit blocks while the blocks are idle. Although FGPG can reduce power consumption without compromising computational performance, switching the power supply on and off causes energy overhead. To prevent the power increase caused by this energy overhead, in our prior research we proposed an FGPG control method for the operating system (OS) based on pre-analyzing applications' power usage. However, modern computing systems have a wide variety of use cases and run many types of applications; this makes it difficult to analyze the behavior of all these applications in advance. This paper therefore proposes a new FGPG control method that does not require profiling application programs in advance. In the newly proposed method, the OS periodically monitors a circuit's idle interval while application programs are running. The OS enables FGPG only if the interval time is long enough to reduce the power consumption. The experimental results in this paper show that the proposed method reduces power consumption by 9.8% on average and up to 17.2% at 25°C. The results also show that the proposed method achieves almost the same power-saving efficiency as the previous profile-based method.
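The decision rule described above, enabling gating only when idle intervals are long enough, can be sketched with a break-even calculation. The abstract does not give the actual criterion, so the use of the mean idle interval, the units, and all names and numbers below are assumptions.

```python
def break_even_cycles(overhead_energy_nj, leakage_saved_nj_per_cycle):
    """Idle length at which the leakage saved equals the switching overhead."""
    return overhead_energy_nj / leakage_saved_nj_per_cycle


def should_enable_pg(idle_intervals, overhead_energy_nj, leakage_saved_nj_per_cycle):
    """Enable gating only if the mean observed idle interval (in cycles)
    exceeds the break-even point. Using the mean is an assumed policy."""
    if not idle_intervals:
        return False
    mean_idle = sum(idle_intervals) / len(idle_intervals)
    return mean_idle > break_even_cycles(overhead_energy_nj,
                                         leakage_saved_nj_per_cycle)


# Hypothetical unit: 10 nJ per off/on switch, 0.5 nJ of leakage saved per
# gated cycle, so the break-even point is 20 idle cycles.
print(should_enable_pg([5, 8, 12], 10.0, 0.5))    # short idle periods
print(should_enable_pg([40, 60, 25], 10.0, 0.5))  # long idle periods
```

Because leakage depends on temperature, the per-cycle saving (and hence the break-even point) would in practice have to be re-evaluated as the core temperature changes.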
Masaaki KONDO Hiroshi NAKAMURA
In recent computer systems, a large portion of energy is consumed by on-chip cache accesses and data movement between cache and off-chip main memory. Reducing this memory system energy is indispensable for future microprocessors because power and thermal issues are certain to become key factors limiting processor performance. In this paper, we discuss and evaluate how our architecture called SCIMA contributes to energy saving. SCIMA integrates software-controllable memory (SCM) into the processor chip. SCIMA can save total memory system energy by using SCM with compiler support. The evaluation results reveal that SCIMA can reduce memory system energy by 5-50% and is still faster than a conventional cache-based architecture.
Kouichi WATANABE Masashi IMAI Masaaki KONDO Hiroshi NAKAMURA Takashi NANYA
As VLSI technology advances, delay variations will become more serious. Delay-insensitive asynchronous dual-rail circuits tolerate any delay variation, but their energy consumption is more than double that of single-rail circuits because signal transitions occur every cycle in all bits regardless of the input bit pattern. However, in functional units, a significant number of input bits often do not change from the previous input. In such a situation, calculation of these bits is not required. Thus, we propose a method, called unflip-bits control, that exploits this situation to reduce energy consumption. We evaluate the energy consumption and performance penalty of the method using HSPICE and the Verilog-XL simulator, and compare the method with the conventional dual-rail circuit and a synchronous circuit. Our evaluation results reveal that the proposed asynchronous dual-rail circuit has 12-60% lower energy consumption than a conventional asynchronous dual-rail circuit.
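The observation behind unflip-bits control, that many input bits repeat from one operation to the next, can be measured with a small sketch. This only counts unchanged bits in a stream of operands; it does not model the circuit itself, and the function names and sample inputs are hypothetical.

```python
def changed_bits(prev, curr, width):
    """Number of bit positions that differ between two width-bit inputs."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")


def unchanged_fraction(inputs, width):
    """Fraction of bits that stay the same across consecutive inputs."""
    pairs = list(zip(inputs, inputs[1:]))
    if not pairs:
        return 0.0
    changed = sum(changed_bits(p, c, width) for p, c in pairs)
    return 1.0 - changed / (width * len(pairs))


# Hypothetical 4-bit operand stream.
stream = [0b0000, 0b0001, 0b0011]
print(unchanged_fraction(stream, 4))
```

A high unchanged fraction indicates how many dual-rail bit transitions could in principle be skipped for a given workload.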
Yuan HE Masaaki KONDO Takashi NAKADA Hiroshi SASAKI Shinobu MIWA Hiroshi NAKAMURA
Networks-on-Chip (or NoCs, for short) play important roles in modern and future multi-core processors as they are highly related to both performance and power consumption of the entire chip. To date, many optimization techniques have been developed to improve NoC bandwidth, latency and power consumption. But a clear answer to how energy efficiency is affected by these optimization techniques is yet to be found, since each of them comes with its own benefits and overheads and there are too many of them. This raises the problem of when and how such optimization techniques should be applied. In order to solve this problem, we build a runtime framework to throttle these optimization techniques based on concise performance and energy models. With the help of this framework, we can successfully establish adaptive selections over multiple optimization techniques to further improve performance or energy efficiency of the network at runtime.
Masaaki KONDO Takuro HAYASHIDA Masashi IMAI Hiroshi NAKAMURA Takashi NANYA Atsushi HORI
Cluster systems are becoming widely used because of their good performance/cost ratio. However, their reliability has not been well discussed in practical environments so far. As the number of commodity components in a cluster system increases, it is indispensable to support reliability through system software. SCore cluster system software is a parallel programming environment for High Performance Computing (HPC). SCore provides a checkpointing and rollback-recovery mechanism for high availability. In this paper, we analyze and evaluate the checkpointing and rollback-recovery mechanisms of SCore quantitatively. The experimental results reveal that the required time for checkpointing scales very well with respect to the number of computing nodes. However, the required time is quite long due to the low effective network bandwidth. Based on the results, we modify SCore and successfully make checkpointing and recovery 1.8-2.8 times and 3.7-5.0 times faster, respectively. This is very helpful for cluster systems to achieve high performance and high availability.