1-7hit |
Yuetsu KODAMA Masaaki KONDO Mitsuhisa SATO
The supercomputer, “Fugaku”, which ranked number one in multiple supercomputing lists, including the Top500 in June 2020, has various power control features, such as (1) an eco mode that utilizes only one of two floating-point pipelines while decreasing the power supply to the chip; (2) a boost mode that increases clock frequency; and (3) a core retention feature that turns unused cores to the low-power state. By orchestrating these power-performance features while considering the characteristics of running applications, we can potentially gain even better system-level energy efficiency. In this paper, we report on the performance and power consumption of Fugaku using SPEC HPC benchmarks. Consequently, we confirmed that it is possible to reduce the energy by about 17% while improving the performance by about 2% from the normal mode by combining boost mode and eco mode.
Due to the rapid development of different processors, e.g., x86 and Sunway, software porting between different platforms is becoming more frequent. However, the migrated software's execution efficiency on the target platform is different from that of the source platform, and most of the previous studies have investigated the improvement of the efficiency from the hardware perspective. To the best of our knowledge, this is the first paper to exclusively focus on studying what software factors can result in performance change after software migration. To perform our study, we used SonarQube to detect and measure five software factors, namely Duplicated Lines (DL), Code Smells Density (CSD), Big Functions (BF), Cyclomatic Complexity (CC), and Complex Functions (CF), from 13 selected projects of SPEC CPU2006 benchmark suite. Then, we measured the change of software performance by calculating the acceleration ratio of execution time before (x86) and after (Sunway) software migration. Finally, we performed a multiple linear regression model to analyze the relationship between the software performance change and the software factors. The results indicate that the performance change of software migration from the x86 platform to the Sunway platform is mainly affected by three software factors, i.e., Code Smell Density (CSD), Cyclomatic Complexity (CC), and Complex Functions (CF). The findings can benefit both researchers and practitioners.
Antoine BOSSARD Keiichi KANEKO
Extending the very popular tori interconnection networks[1]-[3], Torus-Connected Cycles (TCC) have been proposed as a novel network topology for massively parallel systems [5]. Here, the set-to-set disjoint paths routing problem in a TCC is solved. In a TCC(k,n), it is proved that paths of lengths at most kn2+2n can be selected in O(kn2) time.
Naofumi TAKAGI Kazuaki MURAKAMI Akira FUJIMAKI Nobuyuki YOSHIKAWA Koji INOUE Hiroaki HONDA
We propose a desk-side supercomputer with large-scale reconfigurable data-paths (LSRDPs) using superconducting rapid single-flux-quantum (RSFQ) circuits. It has several sets of computing unit which consists of a general-purpose microprocessor, an LSRDP and a memory. An LSRDP consists of a lot of, e.g., a few thousand, floating-point units (FPUs) and operand routing networks (ORNs) which connect the FPUs. We reconfigure the LSRDP to fit a computation, i.e., a group of floating-point operations, which appears in a 'for' loop of numerical programs by setting the route in ORNs before the execution of the loop. We propose to implement the LSRDPs by RSFQ circuits. The processors and the memories can be implemented by semiconductor technology. We expect that a 10 TFLOPS supercomputer, as well as a refrigerating engine, will be housed in a desk-side rack, using a near-future RSFQ process technology, such as 0.35 µm process.
Shinichi HABATA Mitsuo YOKOKAWA Shigemune KITAWAKI
The Earth Simulator (ES), developed by the Japanese government's initiative "Earth Simulator project," is a highly parallel vector supercomputer system. In May 2002, the ES was proven to be the most powerful computer in the world by achieving 35.86 teraflops on the LINPACK benchmark and 26.58 teraflops for a global atmospheric circulation model with the spectral method. Three architectural features enabled these great achievements; vector processor, shared-memory and high-bandwidth non-blocking interconnection crossbar network. In this paper, an overview of the ES, the three architectural features and the result of performance evaluation are described particularly with its hardware realization of the interconnection among 640 processor nodes.
Tadayuki SAKAKIBARA Katsuyoshi KITAI Tadaaki ISOBE Shigeko YAZAWA Teruo TANAKA Yoshiko TAMAKI Yasuhiro INAGAMI
We propose an instruction-based variable priority scheme (IBVPS) which achieves high sustained memory throughput on a TCMP type vector supercomputer. Generally, there are two approaches to arbitrating interprocessor memory access conflict: request level priority control and fixed priority control. Each approach, however, affects performance in its own way: In the case of request level priority control, mutual obstruction causes a performance degradation, and in the case of fixed priority control, memory bank monopoly causes a performance degradation. Mutual obstruction refers to the interference among access requests coming from different instructions; memory bank monopoly refers to the un-interrupted accessing of the same memory bank by a series of higher priority instructions. The strategy of the instruction-based variable priority scheme consists in: (a) generally changing the priority assignment of all load/store pipelines at the end of any instruction running in the system, and (b) changing the priority assignment of all load/store pipelines more than once in the middle of an access instruction with a stride greater than 1 or an indirect access instruction which may monopolize some memory banks for an extended period of time. This strategy reduces mutual obstruction because the priority assignment is reshuffled for the entire group of load/store pipelines at a time. it also reduces memory bank monopoly because the opportunity for memory access is made equal among different instructions by changing the priority assignment at the end of an instruction. Moreover, it prevents the memory bank monopoly by a memory access instruction with a stride greater than 1 or an indirect access instruction, by changing the priority assignment more frequently. Consequently, high sustained memory throughput is achieved on TCMP type vector supercomputers.
Tadayuki SAKAKIBARA Katsuyoshi KITAI Tadaaki ISOBE Shigeko YAZAWA Teruo TANAKA Yasuhiro INAGAMI Yoshiko TAMAKI
We present a scalable parallel memory architecture with a skew scheme by which permanent-concentration-free strides, if any, do not depend on the number of ways in parallel memory interleaving. The permanent-concentration is a kind of memory access conflict. With conventional skew schemes, permanent-concentration-free strides depended on the number of banks (or bank groups) in parallel memory (=number of ways in parallel memory interleaving). We analyze two kinds of cause of conflicts: permanent-concentration occurs when memory access requests concentrate in limited number of banks (or bank groups) in parallel memory, and transient-concentration, when memory access requests transiently concentrate in some banks (or bank groups) in parallel memory. We have identified permanent-concentration-free strides, which are independent of the number of banks (or bank groups) in parallel memory, by solving two concentrations separately. The strategy is to increase the size of address block of shifting address assignment to the parallel memory in order to reduce permanent-concentrations, and make the size of the buffer for each banks (or bank groups) in the parallel memory match the size of address block of shifting in order to absorb transient-concentrations. The skew scheme uses the same size of address block of shifting address assignment for memory systems for different numbers of banks (or bank groups) in parallel memory. As a result, scalability for permanent-concentration-free strides is achieved independent of the number of banks (or bank groups) in parallel memory.