IEICE global.ieice.org Site

Author Search Result

[Author] Hiroshi NAKAMURA(24hit)

1-20hit(24hit)

Evaluation of a New Power-Gating Scheme Utilizing Data Retentiveness on Caches
Kyundong KIM Seidai TAKEDA Shinobu MIWA Hiroshi NAKAMURA

PAPER-Logic Synthesis, Test and Verification

Vol:
E95-A No:12
Page(s):
2301-2308
Caches are among the most leakage-consuming components in modern processors because of their massive number of transistors. To reduce the leakage power of caches, several techniques using power gating (PG) have been proposed. Despite its high leakage saving, a side effect of PG for caches is the loss of data during sleep. If useful data is lost in sleep mode, it must be fetched again from a lower-level memory. This consumes a considerable amount of energy, which offsets the leakage saving. This paper proposes a new PG scheme that considers the data retentiveness of SRAM. After entering sleep mode, the data of an SRAM cell is not lost immediately and remains usable provided its validity is checked. Therefore, we utilize the data retentiveness of SRAM to avoid the energy overhead of data recovery, which creates further opportunities for leakage saving. To check availability, we introduce simple hardware whose overhead is negligible. Our experimental results show that utilizing data retentiveness saves up to 32.42% more leakage than conventional PG.
A Fine-Grained Power Gating Control on Linux Monitoring Power Consumption of Processor Functional Units
Atsushi KOSHIBA Motoki WADA Ryuichi SAKAMOTO Mikiko SATO Tsubasa KOSAKA Kimiyoshi USAMI Hideharu AMANO Masaaki KONDO Hiroshi NAKAMURA Mitaro NAMIKI

PAPER

Vol:
E98-C No:7
Page(s):
559-568
The authors have been researching ways to reduce the power consumption of microprocessors, and developed a low-power processor called “Geyser” by applying a power gating (PG) function to the individual functional units of the processor. The PG function on Geyser reduces the power consumption of functional units by shutting off the supply voltage of idle units. However, the energy overhead of switching the supply voltage for units on and off causes power increases. The amount of energy overhead varies with the behavior of each functional unit, which is influenced by the running application, and also with the core temperature. It is therefore necessary to switch the PG function itself on or off according to the state of the processor at runtime to reduce power consumption more effectively. In this paper, the authors propose a PG control method in which the operating system (OS) takes the power overhead into account. In the proposed method, to achieve greater power reduction, the OS periodically calculates the power consumption of each functional unit and inhibits the PG function of any unit whose energy overhead is judged too high. The method was implemented in the Linux process scheduler and evaluated. The results show that the average power consumption of the functional units is reduced by up to 17.2%.
Fine-Grained Run-Time Power Gating through Co-optimization of Circuit, Architecture, and System Software Design Open Access
Hiroshi NAKAMURA Weihan WANG Yuya OHTA Kimiyoshi USAMI Hideharu AMANO Masaaki KONDO Mitaro NAMIKI

INVITED PAPER

Vol:
E96-C No:4
Page(s):
404-412
Power consumption has recently emerged as a first-class design constraint in system LSI designs. In particular, leakage power has come to occupy a large part of the total power consumption. Therefore, reduction of leakage power is indispensable for efficient design of high-performance system LSIs. Since 2006, we have carried out a research project called “Innovative Power Control for Ultra Low-Power and High-Performance System LSIs”, supported by the Japan Science and Technology Agency as a CREST research program. One of the major objectives of this project is reducing the leakage power consumption of system LSIs by innovative power control through tight cooperation and co-optimization of circuit technology, architecture, and system software designs. In this project, we focused on power gating as a circuit technique for reducing leakage power. Temporal granularity is one of the most important issues in power gating. Thus, we have developed a series of Geysers as proof-of-concept CPUs which provide several mechanisms of fine-grained run-time power gating. In this paper, we describe their concept and design, and explain why co-optimization of different design layers is important. Then, three kinds of power gating implementations and their evaluation are presented from the viewpoint of power saving and temporal granularity.
An Energy-Efficient Task Scheduling for Near-Realtime Systems with Execution Time Variation
Takashi NAKADA Tomoki HATANAKA Hiroshi UEKI Masanori HAYASHIKOSHI Toru SHIMIZU Hiroshi NAKAMURA

PAPER-Software System

Publicized:
2017/06/26
Vol:
E100-D No:10
Page(s):
2493-2504
Improving energy efficiency is critical for embedded systems in our rapidly evolving information society. Near real-time data processing tasks, such as multimedia streaming applications, share a common property: their deadline periods are longer than their input intervals due to buffering. In general, executing tasks at lower performance is more energy efficient. On the other hand, higher performance is necessary for huge tasks to meet their deadlines. To minimize energy consumption while strictly meeting deadlines, adaptive task scheduling including dynamic performance mode selection is very important. In this work, we propose an energy-efficient slack-based task scheduling algorithm for such tasks that adapts to task size variations and applies DVFS with the help of statistical analysis. We confirmed that our proposal can further reduce energy consumption when compared to oracle frame-based scheduling.
FOREWORD
Kenkichi HIRADE Hiroshi SUZUKI Hideichi SASAOKA Hiroshi NAKAMURA Yukitsuna FURUYA

FOREWORD

Vol:
E77-B No:5
Page(s):
533-534
Adaptive Lossy Data Compression Extended Architecture for Memory Bandwidth Conservation in SpMV
Siyi HU Makiko ITO Takahide YOSHIKAWA Yuan HE Hiroshi NAKAMURA Masaaki KONDO

PAPER

Publicized:
2023/07/20
Vol:
E106-D No:12
Page(s):
2015-2025
Widely adopted by machine learning and graph processing applications nowadays, sparse matrix-vector multiplication (SpMV) is a very popular algorithm in linear algebra. This is especially the case for fully-connected MLP layers, which dominate many SpMV computations and play a substantial role in diverse services. As a consequence, a large fraction of data center cycles is spent on SpMV kernels. Meanwhile, despite having efficient storage options against sparsity (such as CSR or CSC), SpMV kernels still suffer from limited memory bandwidth during data transfer because of the memory hierarchy of modern computing systems. In more detail, we find that both integer and floating-point data used in SpMV kernels are handled plainly without any necessary pre-processing. Therefore, we believe bandwidth conservation techniques, such as data compression, may dramatically help SpMV kernels when data is transferred between the main memory and the Last Level Cache (LLC). Furthermore, we also observe that convergence conditions in some typical scientific computation benchmarks (based on SpMV kernels) will not be degraded when adopting lower precision floating-point data. Based on these findings, in this work, we propose a simple yet effective data compression scheme that can be extended to general purpose computing architectures or, preferably, HPC systems. When it is adopted, a best-case speedup of 1.92x is achieved. In addition, evaluations with both the CG kernel and the PageRank algorithm indicate that our proposal introduces negligible overhead on both the convergence speed and the accuracy of final results.
Mobile Service Control Point for Intelligent and Multimedia Mobile Communications
Hiroshi NAKAMURA Kenichi KIMURA Akihisa NAKAJIMA

PAPER

Vol:
E77-B No:9
Page(s):
1089-1095
To provide personal, intelligent, and multimedia services through a mobile communications network, a Mobile Service Control Point (M-SCP) was developed, which performs both the location register and service control functions. The M-SCP was constructed on a common platform to allow quick introduction of new services. Software techniques to reduce the frequency of process-switching, assign the highest priority to real-time tasks, and operate a multiple-CPU structure provide faster real-time processing. This is confirmed by computer simulation and research in the field.
An Operating System Guided Fine-Grained Power Gating Control Based on Runtime Characteristics of Applications
Atsushi KOSHIBA Mikiko SATO Kimiyoshi USAMI Hideharu AMANO Ryuichi SAKAMOTO Masaaki KONDO Hiroshi NAKAMURA Mitaro NAMIKI

PAPER

Vol:
E99-C No:8
Page(s):
926-935
Fine-grained power gating (FGPG) is a power-saving technique that switches off circuit blocks while the blocks are idle. Although FGPG can reduce power consumption without compromising computational performance, switching the power supply on and off causes energy overhead. To prevent the power increase caused by this energy overhead, in our prior research we proposed an FGPG control method for the operating system (OS) based on pre-analyzing applications' power usage. However, modern computing systems have a wide variety of use cases and run many types of applications; this makes it difficult to analyze the behavior of all these applications in advance. This paper therefore proposes a new FGPG control method that does not profile application programs in advance. In the newly proposed method, the OS periodically monitors a circuit's idle interval while application programs are running. The OS enables FGPG only if the interval time is long enough to reduce the power consumption. The experimental results in this paper show that the proposed method reduces power consumption by 9.8% on average and up to 17.2% at 25°C. The results also show that the proposed method achieves almost the same power-saving efficiency as the previous profile-based method.
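The enable rule described in the abstract above (turn FGPG on only when the observed idle interval is long enough to pay back the switching overhead) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the function name, the use of a mean over sampled intervals, and the `break_even_cycles` threshold are all hypothetical and do not come from the abstract.

```python
def should_enable_power_gating(idle_intervals_cycles, break_even_cycles):
    """Decide whether to enable power gating for one circuit block.

    idle_intervals_cycles: idle interval lengths (in cycles) sampled by the
        OS during the last monitoring period (hypothetical input format).
    break_even_cycles: assumed minimum idle length at which the leakage
        saved outweighs the on/off switching energy (hypothetical parameter).
    """
    if not idle_intervals_cycles:
        # No idle periods observed: gating cannot save anything.
        return False
    # Assumption: summarize the period by the mean idle interval.
    mean_idle = sum(idle_intervals_cycles) / len(idle_intervals_cycles)
    return mean_idle > break_even_cycles
```

For instance, with an assumed break-even point of 50 cycles, a block whose sampled idle intervals are 100, 120 and 90 cycles would be gated, while one with intervals of 10 and 20 cycles would not.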
A New 90 MBPS 68 APSK Modem with Honeycomb Constellation for Digital Radio Relay Systems
Hiroshi NAKAMURA Noboru IIZUKA Eisuke FUKUDA Morihiko MINOWA Yoshimasa DAIDO Sadao TAKENAKA

PAPER-Radio Communication

Vol:
E71-E No:6
Page(s):
591-599
This paper describes the first realization of the APSK system with a honeycomb constellation (HC) for high capacity digital radio links. A partial Gray coding method to improve receiver sensitivity is also described. It is shown that the 68 APSK modulation can increase receiver sensitivity by 0.7 dB compared with 64 QAM. The feasibility of the system with the HC using the present state of the art is confirmed by theoretical estimation of tolerances for modem impairments. Techniques necessary to realize a system with the HC are also described. Using these techniques, the 68 APSK modem with the HC was experimentally fabricated. The modem we fabricated has a transmission capacity of 90 Mbps within the FCC-authorized bandwidth of 20 MHz in the 4 GHz band. The BER performance and signature for multipath fading have been measured. It is confirmed that the signature of the 68 APSK is almost the same as that of the conventional 64 QAM. The signature improved by using a 5-tap transversal equalizer corresponds to an outage of 2 seconds per year per hop.
Reducing Memory System Energy by Software-Controlled On-Chip Memory
Masaaki KONDO Hiroshi NAKAMURA

PAPER-Architecture and Algorithms

Vol:
E86-C No:4
Page(s):
580-588
In recent computer systems, a large portion of energy is consumed by on-chip cache accesses and data movement between cache and off-chip main memory. Reducing this memory system energy is indispensable for future microprocessors because power and thermal issues will certainly become a key factor limiting processor performance. In this paper, we discuss and evaluate how our architecture called SCIMA contributes to energy saving. SCIMA integrates software-controllable memory (SCM) into the processor chip. SCIMA can save total memory system energy by using SCM with compiler support. The evaluation results reveal that SCIMA can reduce memory system energy by 5-50% and is still faster than a conventional cache-based architecture.
A Double-Level-Vth Select Gate Array Architecture for Multilevel NAND Flash Memories
Ken TAKEUCHI Tomoharu TANAKA Hiroshi NAKAMURA

PAPER-Memory

Vol:
E79-C No:7
Page(s):
1013-1020
In multilevel flash memories, the threshold voltages of the memory cells should be controlled precisely. This paper describes how in a conventional NAND flash memory, the threshold voltages of the memory cells fluctuate due to array noise during the bit-by-bit program verify operation, and as a result, the threshold voltage distribution becomes wider. This paper describes a new array architecture, "A double-level-Vth select gate array architecture", which eliminates the array noise and also reduces the cell area. The array noise is mainly caused by interbitline capacitive coupling noise and by the high resistance of the diffused source-line. The threshold voltage fluctuation can be as much as 0.7 V in a conventional array. In the proposed array, bitlines are alternately selected, and the unselected bitlines are used as low resistance source-lines. Moreover, the unselected bitlines form a shield between the neighboring selected bitlines. As a result, the array noise is strongly suppressed. The threshold voltage fluctuation is estimated to be as small as 0.03 V in the proposed array, and reliable operation of a multilevel NAND flash memory can be realized.
Synthesis of Serial Local Clock Controllers for Asynchronous Circuit Design
Nattha SRETASEREEKUL Hiroshi SAITO Euiseok KIM Metehan OZCAN Masashi IMAI Hiroshi NAKAMURA Takashi NANYA

PAPER-IP Design

Vol:
E86-A No:12
Page(s):
3028-3037
Asynchronous controllers effectively control highly concurrent datapath operations for high speed. Signal Transition Graphs (STGs) can effectively represent these concurrent events. However, highly concurrent STGs cause the state explosion problem in asynchronous synthesis tools. Many small but highly concurrent STGs cannot be synthesized to obtain control circuits. Moreover, STGs also lead to some control-time overhead of the four-phase handshake protocol. In this paper, we propose a method for deriving serial control nodes from Control Data Flow Graphs (CDFGs) such that the concurrency of datapath operations is still preserved. The STGs derived from the serialized control nodes are serial STGs, which are simpler to synthesize than the concurrent STGs. We also propose an implementation using these serialized controllers to generate local clocks at any necessary times. The implementation results in very small control-time overhead. The experimental results show that the number of synthesis states is proportional to the number of control signals, and that circuits with satisfactorily small control-time overhead are obtained.
An Energy-Efficient Task Scheduling for Near Real-Time Systems on Heterogeneous Multicore Processors
Takashi NAKADA Hiroyuki YANAGIHASHI Kunimaro IMAI Hiroshi UEKI Takashi TSUCHIYA Masanori HAYASHIKOSHI Hiroshi NAKAMURA

PAPER-Software System

Publicized:
2019/11/01
Vol:
E103-D No:2
Page(s):
329-338
Near real-time periodic tasks, which are popular in multimedia streaming applications, have deadline periods that are longer than the input intervals thanks to buffering. For such applications, conventional frame-based scheduling methods cannot realize optimal scheduling due to their shortsighted deadline assumptions. To realize globally energy-efficient execution of these applications, we propose a novel task scheduling algorithm that takes advantage of the long deadline period. We confirm that our approach can exploit the longer deadline period and reduce the average power consumption by up to 18%.
Power Penalty of Multilevel QAM Modem Caused by Two Simultaneously Existing Impairments
Yoshimasa DAIDO Sadao TAKENAKA Hiroshi NAKAMURA

PAPER-Radio Wave and Satellite Communication

Vol:
E70-E No:7
Page(s):
628-633
Since power penalty of multilevel QAM system is caused by many simultaneously existing impairments, it is important to know whether the sum of the penalties caused by each impairment coincides with the power penalty caused by the actual channel condition or not. To examine the combined effect caused by plural impairments, theoretical power penalty is estimated in detail when any two impairments among the typical seven exist simultaneously. The excess penalty is defined as deviation of the power penalty from the sum of penalties caused by each impairment. Calculation of the excess penalty shows that there are four categories for combinations of impairments. It is shown that there is a typical category of combinations which includes more than half of all possible combinations. Calculated excess penalties are very close for all combinations within this category. A simple algebraic equation is given to approximate the excess penalty for the category. The excess penalties of other categories will be shown and their characteristics will be discussed in detail.
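The definition of excess penalty given in the abstract above can be written compactly as below. The symbols are assumptions introduced here for illustration and are not taken from the paper: P_i and P_j denote the power penalties caused by impairments i and j acting alone, and P_{ij} the power penalty when both exist simultaneously.

```latex
\Delta P_{ij} = P_{ij} - \left( P_i + P_j \right)
```

Under this notation, \Delta P_{ij} = 0 would mean that the individual penalties simply add, and a nonzero value measures how far the combined effect departs from that sum.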
Power Efficient High-Level Modulation for High-Capacity Digital Radio Systems
Hiroshi NAKAMURA Yoshimasa DAIDO

PAPER-Radio Communication

Vol:
E72-E No:5
Page(s):
633-640
This paper describes a theoretical estimation of the power efficiency improvement obtained by adopting amplitude and phase modulation (APSK) with a honeycomb constellation (HC) instead of multilevel QAM modulation. Nonlinear distortion caused by a power amplifier is considered in the estimation, and the nonlinearity of the amplifier is approximated by third and fifth order nonlinearity. To eliminate the difficulty in carrier reconstruction, a pilot carrier injection method is assumed for the APSK with the HC. However, injecting the carrier reduces the power efficiency improvement, so the dependence of power efficiency on the injected carrier level is estimated theoretically. The SNR of the recovered carrier as a function of the pilot carrier level is also estimated experimentally. From these two estimations, an optimum pilot carrier level is determined for a 64 APSK system with the HC. The possibility of reducing the maximum available power of the amplifier by 2.0 dB is confirmed at the optimum pilot carrier level, which corresponds to an offset of 1/4 data space in the constellation. At the optimized level, the SNR of the recovered carrier is 40 dB, which guarantees satisfactory operation of the system.
Area-Efficient Microarchitecture for Reinforcement of Turbo Mode
Shinobu MIWA Takara INOUE Hiroshi NAKAMURA

PAPER-Computer System

Vol:
E97-D No:5
Page(s):
1196-1210
Turbo mode, which accelerates many applications without major changes to existing systems, is widely used in commercial processors. Since the duration or strength of turbo mode depends on the peak temperature of a processor chip, reducing the peak temperature can reinforce turbo mode. This paper shows that adding a small amount of hardware allows microprocessors to reduce the peak temperature drastically and thereby reinforce turbo mode successfully. Our approach is to identify a few small units that become heat sources in a processor and to duplicate them appropriately to reduce their power density. By duplicating only these limited units and using the copies evenly, the processor can show significant performance improvement while remaining area-efficient. The experimental results show that the proposed method achieves up to 14.5% performance improvement in exchange for a 2.8% area increase.
Design Method of High Performance and Low Power Functional Units Considering Delay Variations
Kouichi WATANABE Masashi IMAI Masaaki KONDO Hiroshi NAKAMURA Takashi NANYA

PAPER-Circuit Synthesis

Vol:
E89-A No:12
Page(s):
3519-3528
As VLSI technology advances, delay variations will become more serious. Delay-insensitive asynchronous dual-rail circuits tolerate any delay variation, but their energy consumption is more than double that of single-rail circuits because signal transitions occur every cycle in all bits regardless of the input bit pattern. However, in functional units, a significant number of input bits may not change from the previous input in many cases. In such a situation, calculation of these bits is not required. Thus, we propose a method, called unflip-bits control, that exploits this situation to reduce energy consumption. We evaluate the energy consumption and performance penalty of the method using HSPICE and the Verilog-XL simulator, and compare the method with the conventional dual-rail circuit and a synchronous circuit. Our evaluation results reveal that the proposed asynchronous dual-rail circuit has 12-60% lower energy consumption than a conventional asynchronous dual-rail circuit.
A 256 QAM Digital Radio System with a Low Rolloff Factor of 20% for Attaining 6.75 bps/Hz
Hiroshi NAKAMURA Eisuke FUKUDA Noburu IIZUKA Yoshimasa DAIDO Sadao TAKENAKA

PAPER-Radio Communication

Vol:
E71-E No:1
Page(s):
43-50
This paper describes a newly developed 4 GHz 135 Mbps 256 QAM system with a rolloff factor of 20%, which can attain a spectrum efficiency of 6.75 bps/Hz. The key techniques are theoretically investigated to realize this system. It was predicted theoretically that the simultaneous incorporation of 7-tap transversal equalizers (TEQL) and a recursive slope equalizer (SEQL) would be required as a countermeasure for multipath fading. The 256 QAM system was designed considering the results of the theoretical investigation. Excellent BER performance was obtained with the aid of forward error correction and pilot carrier injection. Since remarkable improvement in the signature was obtained by the simultaneous use of TEQL and SEQL, the 256 QAM system with a very low rolloff factor is promising.
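As a quick consistency check on the figures quoted in the abstract above, the channel bandwidth implied by the stated bit rate and spectrum efficiency can be worked out as follows. The symbols R (bit rate), \eta (spectrum efficiency) and B (bandwidth) are notation assumed here; the abstract itself does not state the bandwidth.

```latex
B = \frac{R}{\eta}
  = \frac{135 \times 10^{6}\ \text{bit/s}}{6.75\ \text{bit/s/Hz}}
  = 20 \times 10^{6}\ \text{Hz}
  = 20\ \text{MHz}
```

This agrees with the 20 MHz bandwidth in the 4 GHz band cited in the 68 APSK modem abstract earlier in this listing, although whether the two systems target the same channel plan is not stated here.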
A Runtime Optimization Selection Framework to Realize Energy Efficient Networks-on-Chip
Yuan HE Masaaki KONDO Takashi NAKADA Hiroshi SASAKI Shinobu MIWA Hiroshi NAKAMURA

PAPER-Architecture

Publicized:
2016/08/24
Vol:
E99-D No:12
Page(s):
2881-2890
Networks-on-Chip (or NoCs, for short) play important roles in modern and future multi-core processors as they are closely tied to both the performance and the power consumption of the entire chip. To date, many optimization techniques have been developed to improve the bandwidth, latency and power consumption of NoCs. However, a clear answer to how these optimization techniques affect energy efficiency is yet to be found, since each technique comes with its own benefits and overheads and there are too many of them. This raises the problem of when and how such optimization techniques should be applied. To solve this problem, we build a runtime framework to throttle these optimization techniques based on concise performance and energy models. With the help of this framework, we can successfully establish adaptive selection among multiple optimization techniques to further improve the performance or energy efficiency of the network at runtime.
Sleep Transistor Sizing Method Using Accurate Delay Estimation Considering Input Vector Pattern and Non-linear Current Model
Seidai TAKEDA Kyundong KIM Hiroshi NAKAMURA Kimiyoshi USAMI

PAPER-Physical Level Design

Vol:
E94-A No:12
Page(s):
2499-2509
In the era beyond deep sub-micron, Power Gating (PG) is one of the most effective techniques to reduce the leakage power of circuits. The most important issue in PG circuit design is how to decide the width of the sleep transistor. A smaller total sleep transistor width provides smaller leakage power in standby mode; however, insufficient sleep transistor insertion causes significant performance degradation. In this paper, we present an accurate and fast gate-level delay estimation method for PG circuits and a novel sleep transistor sizing method utilizing our delay estimation for module-based PG circuits. This method achieves high accuracy within acceptable computation time by utilizing accurate discharge current estimation based on delayed logic simulations with limited input vector patterns and by realizing precise current characteristics for logic gates and sleep transistors. Experimental results show that our delay estimation successfully achieves high accuracy and avoids the overestimation and underestimation seen in conventional methods. Also, our sleep transistor sizing method successfully reduces the width of sleep transistors by 40% on average when compared to conventional methods, within an acceptable computation time.
