The search functionality is under construction.
The search functionality is under construction.

Author Search Result

[Author] Koji INOUE(27hit)

1-20hit(27hit)

  • Rapid Design Space Exploration of a Reconfigurable Instruction-Set Processor

    Farhad MEHDIPOUR  Hamid NOORI  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER-Embedded, Real-Time and Reconfigurable Systems

      Vol:
    E92-A No:12
      Page(s):
    3182-3192

    Multitude parameters in the design process of a reconfigurable instruction-set processor (RISP) may lead to a large design space and remarkable complexity. Quantitative design approach uses the data collected from applications to satisfy design constraints and optimize the design goals while considering the applications' characteristics; however it highly depends on designer observations and analyses. Exploring design space can be considered as an effective technique to find a proper balance among various design parameters. Indeed, this approach would be computationally expensive when the performance evaluation of the design points is accomplished based on the synthesis-and-simulation technique. A combined analytical and simulation-based model (CAnSO**) is proposed and validated for performance evaluation of a typical RISP. The proposed model consists of an analytical core that incorporates statistics collected from cycle-accurate simulation to make a reasonable evaluation and provide a valuable insight. CAnSO has clear speed advantages and therefore it can be used for easing a cumbersome design space exploration of a reconfigurable RISP processor and quick performance evaluation of slightly modified architectures.

  • Towards Ultra-High-Speed Cryogenic Single-Flux-Quantum Computing Open Access

    Koki ISHIDA  Masamitsu TANAKA  Takatsugu ONO  Koji INOUE  

     
    INVITED PAPER

      Vol:
    E101-C No:5
      Page(s):
    359-369

    CMOS microprocessors are limited in their capacity for clock speed improvement because of increasing computing power, i.e., they face a power-wall problem. Single-flux-quantum (SFQ) circuits offer a solution with their ultra-fast-speed and ultra-low-power natures. This paper introduces our contributions towards ultra-high-speed cryogenic SFQ computing. The first step is to design SFQ microprocessors. From qualitatively and quantitatively evaluating past-designed SFQ microprocessors, we have found that revisiting the architecture of SFQ microprocessors and on-chip caches is the first critical challenge. On the basis of cross-layer discussions and analysis, we came to the conclusion that a bit-parallel gate-level pipeline architecture is the best solution for SFQ designs. This paper summarizes our current research results targeting SFQ microprocessors and on-chip cache architectures.

  • Trends in High-Performance, Low-Power Cache Memory Architectures

    Koji INOUE  Vasily G. MOSHNYAGA  Kazuaki MURAKAMI  

     
    PAPER-High-Performance Technologies

      Vol:
    E85-C No:2
      Page(s):
    304-314

    One of uncompromising requirements from portable computing is energy efficiency, because that affects directly the battery life. On the other hand, portable computing will target more demanding applications, for example moving pictures, so that higher performance is still required. Cache memories have been employed as one of the most important components of computer systems. In this paper, we briefly survey architectural techniques for high performance, low power cache memories.

  • NSIM: An Interconnection Network Simulator for Extreme-Scale Parallel Computers

    Hideki MIWA  Ryutaro SUSUKITA  Hidetomo SHIBAMURA  Tomoya HIRAO  Jun MAKI  Makoto YOSHIDA  Takayuki KANDO  Yuichiro AJIMA  Ikuo MIYOSHI  Toshiyuki SHIMIZU  Yuji OINAGA  Hisashige ANDO  Yuichi INADOMI  Koji INOUE  Mutsumi AOYAGI  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E94-D No:12
      Page(s):
    2298-2308

    In the near future, interconnection networks of massively parallel computer systems will connect more than a hundred thousands of computing nodes. The performance evaluation of the interconnection networks can provide real insights to help the development of efficient communication library. Hence, to evaluate the performance of such interconnection networks, simulation tools capable of modeling the networks with sufficient details, supporting a user-friendly interface to describe communication patterns, providing the users with enough performance information, completing simulations within a reasonable time, are a real necessity. This paper introduces a novel interconnection network simulator NSIM, for the evaluation of the performance of extreme-scale interconnection networks. The simulator implements a simplified simulation model so as to run faster without any loss of accuracy. Unlike the existing simulators, NSIM is built on the execution-driven simulation approach. The simulator also provides a MPI-compatible programming interface. Thus, the simulator can emulate parallel program execution and correctly simulate point-to-point and collective communications that are dynamically changed by network congestion. The experimental results in this paper showed sufficient accuracy of this simulator by comparing the simulator and the real machine. We also confirmed that the simulator is capable of evaluating ultra large-scale interconnection networks, consumes smaller memory area, and runs faster than the existing simulator. This paper also introduces a simulation service built on a cloud environment. Without installing NSIM, users can simulate interconnection networks with various configurations by using a web browser.

  • Omitting Cache Look-up for High-Performance, Low-Power Microprocessors

    Koji INOUE  Vasily G. MOSHNYAGA  Kazuaki MURAKAMI  

     
    PAPER-Low-Power Technologies

      Vol:
    E85-C No:2
      Page(s):
    279-287

    In this paper, we propose a novel architecture for low-power direct-mapped instruction caches, called "history-based tag-comparison (HBTC) cache. " The cache attempts to reuse tag-comparison results for avoiding unnecessary tag checks. Execution footprints are recorded into an extended BTB (Branch Target Buffer). In our evaluation, it is observed that the energy for tag comparison can be reduced by more than 90% in many applications.

  • A Linear Manifold Color Descriptor for Medicine Package Recognition

    Kenjiro SUGIMOTO  Koji INOUE  Yoshimitsu KUROKI  Sei-ichiro KAMATA  

     
    PAPER-Image Processing

      Vol:
    E95-D No:5
      Page(s):
    1264-1271

    This paper presents a color-based method for medicine package recognition, called a linear manifold color descriptor (LMCD). It describes a color distribution (a set of color pixels) of a color package image as a linear manifold (an affine subspace) in the color space, and recognizes an anonymous package by linear manifold matching. Mainly due to low dimensionality of color spaces, LMCD can provide more compact description and faster computation than description styles based on histogram and dominant-color. This paper also proposes distance-based dissimilarities for linear manifold matching. Specially designed for color distribution matching, the proposed dissimilarities are theoretically appropriate more than J-divergence and canonical angles. Experiments on medicine package recognition validates that LMCD outperforms competitors including MPEG-7 color descriptors in terms of description size, computational cost and recognition rate.

  • Quantitative Evaluation of State-Preserving Leakage Reduction Algorithm for L1 Data Caches

    Reiko KOMIYA  Koji INOUE  Vasily G. MOSHNYAGA  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E88-A No:4
      Page(s):
    862-868

    As the transistor feature sizes and threshold voltages reduce, leakage energy consumption has become an inevitable issue for high-performance microprocessor designs. Since on-chip caches are major contributors of the leakage, a number of researchers have proposed efficient leakage reduction techniques. However, it is still not clear that 1) what kind of algorithm can be considered and 2) how much they have impact on energy and performance. To answer these questions, we explore run-time cache management algorithm, and evaluate the energy-performance efficiency for several alternatives.

  • Static Mapping of Multiple Data-Parallel Applications on Embedded Many-Core SoCs

    Junya KAIDA  Yuko HARA-AZUMI  Takuji HIEDA  Ittetsu TANIGUCHI  Hiroyuki TOMIYAMA  Koji INOUE  

     
    LETTER-Computer System

      Vol:
    E96-D No:10
      Page(s):
    2268-2271

    This paper studies the static mapping of multiple applications on embedded many-core SoCs. The mapping techniques proposed in this paper take into account both inter-application and intra-application parallelism in order to fully utilize the potential parallelism of the many-core architecture. Two approaches are proposed for static mapping: one approach is based on integer linear programming and the other is based on a greedy algorithm. Experiments show the effectiveness of the proposed techniques.

  • Dynamically Variable Line-Size Cache Architecture for Merged DRAM/Logic LSIs

    Koji INOUE  Koji KAI  Kazuaki MURAKAMI  

     
    PAPER-Computer System Element

      Vol:
    E83-D No:5
      Page(s):
    1048-1057

    This paper proposes a novel cache architecture suitable for merged DRAM/logic LSIs, which is called "dynamically variable line-size cache (D-VLS cache). " The D-VLS cache can optimize its line-size according to the characteristic of programs, and attempts to improve the performance by exploiting the high on-chip memory bandwidth on merged DRAM/logic LSIs appropriately. In our evaluation, it is observed that an average memory-access time improvement achieved by a direct-mapped D-VLS cache is about 20% compared to a conventional direct-mapped cache with fixed 32-byte lines. This performance improvement is better than that of a doubled-size conventional direct-mapped cache.

  • High Bandwidth, Variable Line-Size Cache Architecture for Merged DRAM/Logic LSIs

    Koji INOUE  Koji KAI  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E81-C No:9
      Page(s):
    1438-1447

    Merged DRAM/logic LSIs could provide high on-chip memory bandwidth by interconnecting logic portions and DRAM with wider on-chip buses. For merged DRAM/logic LSIs with the memory hierarchy including cache memory, we can exploit such high on-chip memory bandwidth by means of replacing a whole cache line (or cache block) at a time on cache misses. This approach tends to increase the cache-line size if we attempt to improve the attainable memory bandwidth. Larger cache lines, however, might worsen the system performance if programs running on the LSIs do not have enough spatial locality of references and cache misses frequently take place. This paper describes a novel cache architecture suitable for merged DRAM/logic LSIs, called variable line-size cache or VLS cache, for resolving the above-mentioned dilemma. The VLS cache can make good use of the high on-chip memory bandwidth by means of larger cache lines and, at the same time, alleviate the negative effects of larger cache-line size by partitioning each large cache line into multiple sub-lines and allowing every sub-line to work as an independent cache line. The number of sub-lines involved when a cache replacement occurs can be determined depending on the characteristics of programs. This paper also evaluates the cost/performance improvements attainable by the VLS cache and compares it with those of conventional cache architectures. As a result, it is observed that a VLS cache reduces the average memory-access time by 16. 4% while it increases the hardware cost by only 13%, compared to a conventional direct-mapped cache with fixed 32-byte lines.

  • Evaluating Energy-Efficiency of DRAM Channel Interleaving Schemes for Multithreaded Programs

    Satoshi IMAMURA  Yuichiro YASUI  Koji INOUE  Takatsugu ONO  Hiroshi SASAKI  Katsuki FUJISAWA  

     
    PAPER-Computer System

      Pubricized:
    2018/06/08
      Vol:
    E101-D No:9
      Page(s):
    2247-2257

    The power consumption of server platforms has been increasing as the amount of hardware resources equipped on them is increased. Especially, the capacity of DRAM continues to grow, and it is not rare that DRAM consumes higher power than processors on modern servers. Therefore, a reduction in the DRAM energy consumption is a critical challenge to reduce the system-level energy consumption. Although it is well known that improving row buffer locality(RBL) and bank-level parallelism (BLP) is effective to reduce the DRAM energy consumption, our preliminary evaluation on a real server demonstrates that RBL is generally low across 15 multithreaded benchmarks. In this paper, we investigate the memory access patterns of these benchmarks using a simulator and observe that cache line-grained channel interleaving schemes, which are widely applied to modern servers including multiple memory channels, hurt the RBL each of the benchmarks potentially possesses. In order to address this problem, we focus on a row-grained channel interleaving scheme and compare it with three cache line-grained schemes. Our evaluation shows that it reduces the DRAM energy consumption by 16.7%, 12.3%, and 5.5% on average (up to 34.7%, 28.2%, and 12.0%) compared to the other schemes, respectively.

  • A High-Performance/Low-Power On-Chip Memory-Path Architecture with Variable Cache-Line Size

    Koji INOUE  Koji KAI  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E83-C No:11
      Page(s):
    1716-1723

    This paper proposes an on-chip memory-path architecture employing the dynamically variable line-size (D-VLS) cache for high performance and low energy consumption. The D-VLS cache exploits the high on-chip memory bandwidth attainable on merged DRAM/logic LSIs by replacing a whole large cache line in one cycle. At the same time, it attempts to avoid frequent evictions by decreasing the cache-line size when programs have poor spatial locality. Activating only on-chip DRAM subarrays corresponding to a replaced cache-line size produces a significant energy reduction. In our simulation, it is observed that our proposed on-chip memory-path architecture, which employs a direct-mapped D-VLS cache, improves the ED (Energy Delay) product by more than 75% over a conventional memory-path model.

  • A High-Performance and Low-Power Cache Architecture with Speculative Way-Selection

    Koji INOUE  Tohru ISHIHARA  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E83-C No:2
      Page(s):
    186-194

    This paper proposes a new approach to achieving high performance and low energy consumption for set-associative caches. The cache, called way-predicting set-associative cache, speculatively selects a single way, which is likely to contain the data desired by the processor, from the set designated by a memory address, before it starts a normal cache access. By accessing only the single way predicted, instead of accessing all the ways in a set, energy consumption can be reduced. In order for the way-predicting cache to perform well, accuracy of way prediction is important. This paper shows that the accuracy of an MRU (most recently used)-based way prediction is higher than 90% for most of the benchmark programs. The proposed way-predicting cache improves the ED (energy-delay) product by 60-70% compared to the conventional set-associative cache.

  • Reducing On-Chip DRAM Energy via Data Transfer Size Optimization

    Takatsugu ONO  Koji INOUE  Kazuaki MURAKAMI  Kenji YOSHIDA  

     
    PAPER

      Vol:
    E92-C No:4
      Page(s):
    433-443

    This paper proposes a software-controllable variable line-size (SC-VLS) cache architecture for low power embedded systems. High bandwidth between logic and a DRAM is realized by means of advanced integrated technology. System-in-Silicon is one of the architectural frameworks to realize the high bandwidth. An ASIC and a specific SRAM are mounted onto a silicon interposer. Each chip is connected to the silicon interposer by eutectic solder bumps. In the framework, it is important to reduce the DRAM energy consumption. The specific DRAM needs a small cache memory to improve the performance. We exploit the cache to reduce the DRAM energy consumption. During application program executions, an adequate cache line size which produces the lowest cache miss ratio is varied because the amount of spatial locality of memory references changes. If we employ a large cache line size, we can expect the effect of prefetching. However, the DRAM energy consumption is larger than a small line size because of the huge number of banks are accessed. The SC-VLS cache is able to change a line size to an adequate one at runtime with a small area and power overheads. We analyze the adequate line size and insert line size change instructions at the beginning of each function of a target program before executing the program. In our evaluation, it is observed that the SC-VLS cache reduces the DRAM energy consumption up to 88%, compared to a conventional cache with fixed 256 B lines.

  • A Next-Generation Enterprise Server System with Advanced Cache Coherence Chips

    Mariko SAKAMOTO  Akira KATSUNO  Go SUGIZAKI  Toshio YOSHIDA  Aiichiro INOUE  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER-VLSI Architecture for Communication/Server Systems

      Vol:
    E90-C No:10
      Page(s):
    1972-1982

    Broadcast and synchronization techniques are used for cache coherence control in conventional larger scale snoop-based SMP systems. The penalty for synchronization is directly proportional to system size. Meanwhile, advances in LSI technology now enable placing a memory controller on a CPU die. The latency to access directly linked memory is drastically reduced by an on-die controller. Developing an enterprise server system with these CPUs allows us an opportunity to achieve higher performance. Though the penalty of synchronization is counted whenever a cache miss occurs, it is necessary to improve the coherence method to receive the full benefit of this effect. In this paper, we demonstrate a coherence directory organization that fits into DSM enterprise server systems. Originally, a directory-based method was adopted in high performance computing systems because of its huge scalability in comparison with snoop-based method. Though directory capacity miss and long directory access latency are the major problems of this method, the relaxed scalability requirement of enterprise servers is advantageous to us to solve these problems along with an advanced LSI technology. Our proposed directory solves both problems by implementing a full bit vector level map of the coherence directory on an LSI chip. Our experimental results validate that a system controlled by our proposed directory can surpass a snoop-based system in performance even without applying data localization optimization to an online transaction processing (OLTP) workload.

  • Adaptive Mode Control for Low-Power Caches Based on Way-Prediction Accuracy

    Hidekazu TANAKA  Koji INOUE  

     
    PAPER-Low Power Methodology

      Vol:
    E88-A No:12
      Page(s):
    3274-3281

    This paper proposes a novel cache architecture for low power consumption, called "Adaptive Way-Predicting Cache (AWP cache)." The AWP cache has multi-operation modes and dynamically adapts the operation mode based on the accuracy of way-prediction results. A confidence counter for way prediction is implemented to each cache set. In order to analyze the effectiveness of the AWP cache, we perform a SRAM design using 0.18 µm CMOS technology and cycle-accurate processor simulations. As the results, for a benchmark program (179.art), it is observed that a performance-aware AWP cache reduces the 49% of performance overhead caused by an original way-predicting cache to 17%. Furthermore, a energy-aware AWP cache achieves 73% of energy reduction, whereas that obtained from the original way-predicting scheme is only 38%, compared to an non-optimized conventional cache. For the consideration of energy-performance efficiency, we see that the energy-aware AWP cache produces better results; the energy-delay product of conventional organization is reduced to only 35% in average which is 6% better than the original way-predicting scheme.

  • A Reconfigurable Functional Unit with Conditional Execution for Multi-Exit Custom Instructions

    Hamid NOORI  Farhad MEHDIPOUR  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    497-508

    Encapsulating critical computation subgraphs as application-specific instruction set extensions is an effective technique to enhance the performance of embedded processors. However, the addition of custom functional units to the base processor is required to support the execution of these custom instructions. Although automated tools have been developed to reduce the long design time needed to produce a new extensible processor for each application, short time-to-market, significant non-recurring engineering and design costs are issues. To address these concerns, we introduce an adaptive extensible processor in which custom instructions are generated and added after chip-fabrication. To support this feature, custom functional units (CFUs) are replaced by a reconfigurable functional unit (RFU). The proposed RFU is based on a matrix of functional units which is multi-cycle with the capability of conditional execution. A quantitative approach is utilized to propose an efficient architecture for the RFU and fix its constraints. To generate more effective custom instructions, they are extended over basic blocks and hence, multiple exits custom instructions are proposed. Conditional execution has been added to the RFU to support the multi-exit feature of custom instructions. Experimental results show that multi-exit custom instructions enhance the performance by an average of 67% compared to custom instructions limited to one basic block. A maximum speedup of 4.7, compared to a general embedded processor, and an average speedup of 1.85 was achieved on MiBench benchmark suite.

  • Critical Path Based Microarchitectural Bottleneck Analysis for Out-of-Order Execution

    Teruo TANIMOTO  Takatsugu ONO  Koji INOUE  

     
    PAPER

      Vol:
    E102-A No:6
      Page(s):
    758-766

    Correctly understanding microarchitectural bottlenecks is important to optimize performance and energy of OoO (Out-of-Order) processors. Although CPI (Cycles Per Instruction) stack has been utilized for this purpose, it stacks architectural events heuristically by counting how many times the events occur, and the order of stacking affects the result, which may be misleading. It is because CPI stack does not consider the execution path of dynamic instructions. Critical path analysis (CPA) is a well-known method to identify the critical execution path of dynamic instruction execution on OoO processors. The critical path consists of the sequence of events that determines the execution time of a program on a certain processor. We develop a novel representation of CPCI stack (Cycles Per Critical Instruction stack), which is CPI stack based on CPA. The main challenge in constructing CPCI stack is how to analyze a large number of paths because CPA often results in numerous critical paths. In this paper, we show that there are more than ten to the tenth power critical paths in the execution of only one thousand instructions in 35 benchmarks out of 48 from SPEC CPU2006. Then, we propose a statistical method to analyze all the critical paths and show a case study using the benchmarks.

  • Temperature-Aware Configurable Cache to Reduce Energy in Embedded Systems

    Hamid NOORI  Maziar GOUDARZI  Koji INOUE  Kazuaki MURAKAMI  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    418-431

    Energy consumption is a major concern in embedded computing systems. Several studies have shown that cache memories account for 40% or more of the total energy consumed in these systems. Active power used to be the primary contributor to total power dissipation of CMOS designs, but with the technology scaling, the share of leakage in total power consumption of digital systems continues to grow. Moreover, temperature is another factor that exponentially increases the leakage current. In this paper, we show the effect of temperature on the optimal (minimum-energy-consuming) cache configuration for low energy embedded systems. Our results show that for a given application and technology, the optimal cache size moves toward smaller caches at higher temperatures, due to the larger leakage. Consequently, a Temperature-Aware Configurable Cache (TACC) is an effective way to save energy in finer technologies when the embedded system is used in different temperatures. Our results show that using a TACC, up to 61% energy can be saved for instruction cache and 77% for data cache compared to a configurable cache that has been configured for only the corner-case temperature (100). Furthermore, the TACC also enhances the performance by up to 28% for the instruction cache and up to 17% for the data cache.

  • A Prototype System for Many-Core Architecture SMYLEref with FPGA Evaluation Boards

    Son-Truong NGUYEN  Masaaki KONDO  Tomoya HIRAO  Koji INOUE  

     
    PAPER-Architecture

      Vol:
    E96-D No:8
      Page(s):
    1645-1653

    Nowadays, the trend of developing micro-processor with hundreds of cores brings a promising prospect for embedded systems. Realizing a high performance and low power many-core processor is becoming a primary technical challenge. Generally, three major issues required to be resolved includes: 1) realizing efficient massively parallel processing, 2) reducing dynamic power consumption, and 3) improving software productivity. To deal with these issues, we propose a solution to use many low-performance but small and very low-power cores to obtain very high performance, and develop a referential many-core architecture and a program development environment. This paper introduces a many-core architecture named SMYLEref and its prototype system with off-the-shelf FPGA evaluation boards. The initial evaluation results of several SPLASH2 benchmark programs conducted on our developed 128-core platform are also presented and discussed in this paper.

1-20hit(27hit)