This letter introduces an innovation for the heterogeneous storage architecture of AI chips, specifically the integration of six-transistor (6T) and eight-transistor (8T) cells into a hybrid SRAM. Traditional approaches to reducing SRAM power consumption typically lower the operating voltage, which often substantially diminishes the recognition rate of neural networks. The design detailed in this letter instead combines the strengths of both SRAM cell types: it operates at a voltage lower than that of conventional SRAM, thereby significantly reducing the power consumption of neural networks without compromising performance.
Tadayoshi ENOMOTO Nobuaki KOBAYASHI
We developed a self-controllable voltage level (SVL) circuit and applied it to a single-power-supply, six-transistor complementary metal-oxide-semiconductor static random-access memory (SRAM) to not only improve both write and read performance but also achieve low standby power with data retention (holding) capability. The SVL circuit comprises only three MOSFETs (i.e., pull-up, pull-down and bypass MOSFETs) and adaptively generates both the optimal memory cell voltage and the optimal word line voltage depending on the mode of operation (i.e., write, read or hold). The write margin (VWM) and read margin (VRM) of the developed (dvlp) SRAM at a supply voltage (VDD) of 1V were 0.470V and 0.1923V, respectively; these values are 1.309 and 2.093 times the VWM and VRM of the conventional (conv) SRAM. At a large threshold voltage (Vt) variability (=+6σ), the minimum power supply voltage (VMin) for the write operation of the conv SRAM was 0.37V, whereas it decreased to 0.22V for the dvlp SRAM. VMin for the read operation of the conv SRAM was 1.05V at a large Vt variability (=-6σ), but the dvlp SRAM lowered it to 0.41V. These results show that the SVL circuit extends the operating voltage range for both write and read operations to lower voltages. The dvlp SRAM also reduces the standby power consumption (PST) while retaining data: the measured PST of the 2k-bit, 90-nm dvlp SRAM was only 0.957µW at VDD=1.0V, which is 9.46% of the PST of the conv SRAM (10.12µW). The Si area overhead of the SVL circuits was only 1.383% of the dvlp SRAM area.
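As a quick sanity check on the figures quoted above, the short snippet below back-calculates the conventional-SRAM margins from the reported improvement factors and recomputes the standby-power ratio; it is purely illustrative arithmetic over numbers taken from the abstract, not part of the authors' work.

```python
# Illustrative check of the ratios reported above; all numbers are taken from the abstract,
# and the conventional-SRAM margins are back-calculated, so they are approximate.
vwm_dvlp, vrm_dvlp = 0.470, 0.1923        # write/read margins of the dvlp SRAM at VDD = 1V
vwm_gain, vrm_gain = 1.309, 2.093         # reported improvement factors over the conv SRAM

vwm_conv = vwm_dvlp / vwm_gain            # roughly 0.36V
vrm_conv = vrm_dvlp / vrm_gain            # roughly 0.09V
print(f"conv SRAM margins: VWM ~ {vwm_conv:.3f}V, VRM ~ {vrm_conv:.3f}V")

pst_dvlp, pst_conv = 0.957, 10.12         # standby power in microwatts at VDD = 1.0V
print(f"standby power ratio: {pst_dvlp / pst_conv:.2%}")   # ~9.46%
```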
Yuki OKABE Daisuke KANEMOTO Osamu MAIDA Tetsuya HIROSE
We propose a sampling method that incorporates a normally distributed sampling series for EEG measurements using compressed sensing. We confirmed that the ADC sampling count and the amount of wirelessly transmitted data can be reduced by 11% while maintaining a reconstruction accuracy similar to that of the conventional method.
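To make the measurement step concrete, a minimal compressed-sensing sketch follows; the generic normally distributed measurement matrix and the frame length are assumptions for illustration and may differ from the paper's exact sampling-series construction.

```python
import numpy as np

# Minimal compressed-sensing measurement sketch (illustrative only): acquire M < N values of
# an EEG frame through a normally distributed matrix. The reconstruction stage (e.g., OMP)
# that would run at the receiver is omitted here.
rng = np.random.default_rng(0)
N = 512                        # samples per EEG frame at the full ADC rate (assumed)
M = int(round(N * 0.89))       # roughly 11% fewer ADC samples / transmitted values
x = rng.standard_normal(N)     # stand-in for one EEG frame (real EEG is sparse in some basis)
Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # normally distributed sensing matrix
y = Phi @ x                    # compressed measurements to be transmitted wirelessly
print(x.size, "->", y.size)    # 512 -> 456
```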
Yasutaka MATSUDA Ryota SHIOYA Hideki ANDO
The high energy consumption of current processors causes several problems, including a limited clock frequency, short battery lifetime, and reduced device reliability. It is therefore important to reduce the energy consumption of the processor. Among the resources in a processor, the issue queue (IQ) is a large consumer of energy, much of which is consumed by the wakeup logic. Within the wakeup logic, the tag comparison that checks source operand readiness consumes a significant amount of energy. This paper proposes an energy reduction scheme for tag comparison, called double-stage tag comparison. This scheme first compares the lower bits of the tag and then, only if these match, compares the higher bits. Because the energy consumption of tag comparison is roughly proportional to the total number of bits compared, energy is saved by reducing this number. However, this sequential comparison increases the delay of the IQ, thereby increasing the clock cycle time. Although this can be avoided by allocating an extra cycle to the issue operation, doing so in turn degrades the IPC. To avoid IPC degradation, we reconfigure a small number of IQ entries, in which the several oldest instructions (those most likely to have an adverse effect on performance) reside, to single-stage tag comparison. Our evaluation results for the SPEC2017 benchmark programs show that the double-stage tag comparison achieves on average a 21% reduction in the energy consumed by the wakeup logic (15% when including the overhead) with only 3.0% performance degradation.
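A behavioral sketch of the double-stage comparison is given below; the tag width and the low/high split point are assumptions, and the returned bit count is only a stand-in for the energy model described above, not the authors' wakeup-logic RTL.

```python
# Behavioral sketch of double-stage tag comparison: the low-order bits are compared first,
# and the high-order bits are compared only on a match, so most mismatching tags cost only
# LOW_BITS comparisons.
TAG_BITS = 8    # assumed tag width
LOW_BITS = 3    # assumed first-stage width

def double_stage_compare(tag_a: int, tag_b: int):
    low_mask = (1 << LOW_BITS) - 1
    bits_compared = LOW_BITS
    if (tag_a & low_mask) != (tag_b & low_mask):
        return False, bits_compared               # filtered out by the first stage
    bits_compared += TAG_BITS - LOW_BITS          # second stage: remaining high-order bits
    return tag_a == tag_b, bits_compared

print(double_stage_compare(0b10110101, 0b00100110))   # low bits differ: only 3 bits compared
print(double_stage_compare(0b10110101, 0b10110101))   # full match: all 8 bits compared
```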
Akira KITAYAMA Goichi ONO Tadashi KISHIMOTO Hiroaki ITO Naohiro KOHMU
Reducing power consumption is crucial for edge devices that use convolutional neural networks (CNNs). The zero-skipping approach for CNNs is a widely known processing technique offering relatively low power consumption and high speed: it stops multiplication and accumulation (MAC) when the product of the input data and weight is zero. However, this technique requires large logic circuits with around 5% overhead, and the average rate of MAC stopping is only approximately 30%. In this paper, we propose a precise zero-skipping method that uses the input data and simple logic circuits to stop multipliers and accumulators. We also propose an active data-skipping method that further reduces power consumption at the cost of slightly degraded recognition accuracy; in this method, each multiplier and accumulator is also stopped for small input values (e.g., 1 or 2). We implemented the single-shot multibox detector 500 (SSD500) network model on a Xilinx ZU9 and applied the proposed techniques. We verified that operations were stopped at a rate of 49.1%, recognition accuracy was degraded by only 0.29%, power consumption was reduced from 9.2 to 4.4 W (-52.3%), and circuit overhead was reduced from 5.1 to 2.7% (-45.9%). The proposed techniques are thus effective for lowering the power consumption of CNN-based edge devices such as FPGAs.
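The toy MAC loop below illustrates the zero-skipping and active data-skipping decisions for a single lane; the threshold value, example data, and skip counting are illustrative assumptions rather than the authors' circuit.

```python
# Behavioral sketch of zero-skipping plus small-value ("active data") skipping in one MAC lane.
def conv_mac(inputs, weights, skip_threshold=0):
    acc, skipped = 0, 0
    for x, w in zip(inputs, weights):
        # Skip the multiply-accumulate when the product would be zero (zero-skipping) or when
        # the input magnitude is at or below a small threshold such as 1 or 2 (data skipping).
        if w == 0 or abs(x) <= skip_threshold:
            skipped += 1
            continue
        acc += x * w
    return acc, skipped

print(conv_mac([0, 1, 2, 7, -3, 0], [4, -1, 2, 3, 5, 6]))                    # plain zero-skipping
print(conv_mac([0, 1, 2, 7, -3, 0], [4, -1, 2, 3, 5, 6], skip_threshold=2))  # also drops 1 and 2
```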
Satoshi IMAMURA Eiji YOSHIDA Kazuichi OE
Emerging solid state drives (SSDs) based on next-generation memory technologies have recently been released on the market. In this work, we call them low-latency SSDs because their device latency is an order of magnitude lower than that of conventional NAND flash SSDs. Although low-latency SSDs can drastically reduce the I/O latency perceived by an application, the overhead of OS processing included in the I/O latency becomes noticeable because of the very low device latency. Since the OS processing is executed on a CPU core, its operating frequency should be maximized to reduce the OS overhead. However, a higher core frequency leads to higher CPU power consumption during I/O accesses to low-latency SSDs. We therefore propose the device utilization-aware DVFS (DU-DVFS) technique, which periodically monitors the utilization of a target block device and applies dynamic voltage and frequency scaling (DVFS) to CPU cores executing I/O-intensive processes only when the block device is fully utilized. In that case, DU-DVFS can reduce CPU power consumption without hurting performance because the delay of OS processing incurred by decreasing the core frequency can be hidden. Our evaluation with 28 I/O-intensive workloads on a real server containing an Intel® Optane™ SSD demonstrates that DU-DVFS reduces CPU power consumption by 41.4% on average (up to 53.8%) with negligible performance degradation compared with a standard DVFS governor on Linux. Moreover, the evaluation with multiprogrammed workloads composed of I/O-intensive and non-I/O-intensive programs shows that DU-DVFS is also effective for them, because it applies DVFS only to the CPU cores executing I/O-intensive processes.
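A conceptual sketch of the DU-DVFS control loop is shown below; the utilization is derived from the io_ticks field of /proc/diskstats on Linux, while the device name, monitoring period, threshold, and set_core_frequency() hook are hypothetical placeholders rather than the authors' implementation.

```python
import time

def io_ticks(device="nvme0n1"):
    """Milliseconds the device has spent doing I/O (the io_ticks field of /proc/diskstats)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[12])
    raise ValueError(f"device {device} not found")

def set_core_frequency(level):
    # Hypothetical hook: a real implementation would drive the cpufreq interface for the
    # cores that run the I/O-intensive processes.
    print(f"[DVFS] core frequency -> {level}")

INTERVAL = 0.1          # monitoring period in seconds (assumed)
UTIL_THRESHOLD = 0.95   # treat the device as fully utilized above this level (assumed)

prev = io_ticks()
for _ in range(50):     # bounded loop for illustration; a daemon would run indefinitely
    time.sleep(INTERVAL)
    cur = io_ticks()
    utilization = (cur - prev) / (INTERVAL * 1000.0)
    prev = cur
    # Lower the frequency only while the device is saturated, so the added OS-processing
    # delay stays hidden behind the device latency; otherwise run the cores at full speed.
    set_core_frequency("low" if utilization >= UTIL_THRESHOLD else "high")
```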
Hitoshi TAKESHITA Keiichi MATSUMOTO Hiroshi HASEGAWA Ken-ichi SATO Emmanuel Le Taillandier de GABORY
We realize a multicore erbium-doped fiber amplifier (MC-EDFA) with an average optical gain improvement of 2dB by recycling the residual 0.98μm pump light from the MC-EDF output. Eight-channel-per-core wavelength-division-multiplexed (WDM) Nyquist PM-16QAM optical signal amplification is demonstrated over a 40-minute period. Furthermore, we demonstrate the stability of the proposed MC-EDFA by using it to amplify a Nyquist PM-16QAM signal and evaluating the resulting Q-factor variation. Our scheme contributes to reducing the total power consumption of MC-EDFAs in spatial division multiplexing (SDM)/WDM networks by up to 33.5%.
Kota MUROI Hayato MASHIKO Yukihide KOHIRA
As process technology advances, chip yield is reduced by timing violations caused by delay variations of gates and wires introduced in fabrication. Post-silicon delay tuning, which inserts programmable delay elements (PDEs) into clock trees before fabrication and adjusts the delays of the PDEs after fabrication to recover from timing violations, is a promising approach to improving yield. Although post-silicon delay tuning improves yield, it increases circuit area and power consumption because of the inserted PDEs. In this paper, the PDE structure is taken into consideration to reduce circuit area and power consumption. Moreover, a delay selection algorithm and a clustering method, in which several PDEs are merged into one PDE that is shared by multiple registers, are proposed to further reduce circuit area and power consumption. In computational experiments, the proposed method reduced both circuit area and power consumption in comparison with an existing method.
Takuya HABARA Keiichi MIZUTANI Hiroshi HARADA
In this paper, we propose an IEEE 802.15.10-based layer 2 routing (L2R) method with a load-balancing algorithm that considers fairness in terms of the cumulative number of packets sent by each terminal, in order to resolve the packet concentration problem in IEEE 802.15.4-based low-power wireless smart utility network (Wi-SUN) systems. The proposed method uses the accumulated sending count of each terminal as a weight in calculating each path quality metric (PQM) to decide multi-hop routes with load balancing in the network. Computer simulation of a mesh network with 256 terminals shows that the proposed routing method improves the maximum sending ratio (MSR), defined as the ratio of the maximum sending count to the average sending count in the network, by 56% with no degradation of the end-to-end communication success ratio (E2E-SR). The proposed algorithm is also evaluated experimentally using actual Wi-SUN modules, where it improves the MSR by 84% with 70 terminals. The computer simulations and experiments demonstrate the effectiveness of the proposed method in terms of load balancing.
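The snippet below illustrates the two quantities discussed above: a sending-count-weighted path quality metric and the MSR. The weighting formula is a hypothetical stand-in (not the IEEE 802.15.10 PQM), while the MSR follows the definition given in the abstract.

```python
# Illustration of load-balanced next-hop scoring and of the MSR metric defined above.
def weighted_pqm(link_quality, accumulated_sendings, alpha=0.01):
    # Hypothetical weighting: a terminal that has already relayed many packets becomes a less
    # attractive next hop, which spreads traffic across the mesh.
    return link_quality / (1.0 + alpha * accumulated_sendings)

def msr(sending_counts):
    # MSR = maximum sending count / average sending count over all terminals
    return max(sending_counts) / (sum(sending_counts) / len(sending_counts))

counts = [120, 80, 75, 65, 60]                        # accumulated transmissions per terminal
print(f"MSR = {msr(counts):.2f}")                     # closer to 1.0 means better balance
print(weighted_pqm(0.9, counts[0]), weighted_pqm(0.9, counts[-1]))   # busy relay scores lower
```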
Haruki MARUOKA Masashi HIFUMI Jun FURUTA Kazutoshi KOBAYASHI
We propose a radiation-hardened flip-flop (FF) with stacked transistors based on the low-power Adaptive Coupling Flip-Flop (ACFF) in a 65 nm FDSOI process. The slave latch of the ACFF is much weaker against soft errors than the master latch. We therefore design several FFs with stacked transistors in the master or slave latches to mitigate soft errors, and investigate their radiation hardness by α-particle and neutron irradiation tests. The proposed FFs have higher radiation hardness than a conventional DFF and the ACFF. Neutron and α-particle irradiation tests revealed no error in the proposed AC Slave-Stacked FF (AC_SS FF), which has stacked transistors only in the slave latch. We also investigate the radiation hardness of the proposed FFs by heavy-ion irradiation; the proposed FFs maintain higher radiation hardness than the conventional DFF up to 40 MeV-cm2/mg. Stacked inverters become more sensitive to soft errors as the tilt angle increases. The AC_SS FF achieves higher radiation hardness than the ACFF with performance equivalent to that of the ACFF.
Toshihiro KATASHITA Masakazu HIOKI Yohei HORI Hanpei KOIKE
Field-programmable gate array (FPGA) devices are applied to accelerating specific calculations and reducing power consumption in a wide range of areas. One of the challenges for FPGAs is reducing static power to improve their power efficiency. We propose a method involving fine-grained reconfiguration of the body biases of logic and net resources to reduce the static power of FPGA devices. In addition, we developed an FPGA device called the Flex Power FPGA with SOTB technology and demonstrated its power reduction function with a 32-bit counter circuit. In this paper, we describe the construction of an experimental platform for precisely evaluating the power consumption and maximum operating frequency of the device under various operating voltages and body biases with various practical circuits. Using this platform, we evaluate the Flex Power FPGA chip at operating voltages of 0.5-1.0 V and body biases of 0.0-0.5 V, using a 32-bit adder, a 16-bit multiplier, and an SBOX circuit for AES cryptography. In this evaluation, we operate the chip with a uniform body bias voltage so that all of the logic resources are driven at the same threshold voltage, and we demonstrate the advantage of the Flex Power FPGA by comparing its performance with non-reconfigurable biasing.
Xuechun WANG Yuan JI Wendong CHEN Feng RAN Aiying GUO
Hardware implementations of neural networks usually have high computational complexity that increases exponentially with circuit size, leading to more uncertain and unreliable circuit performance. This letter presents a novel radial basis function (RBF) neural network based on parallel fault-tolerant stochastic computing, in which numbers are converted from the deterministic domain to the probabilistic domain. The Gaussian RBF of the middle-layer neurons is implemented using a stochastic structure that significantly reduces the required hardware resources. Our experimental results from two pattern recognition tests (the Thomas gestures and the MIT faces) show that the stochastic design maintains equivalent performance when the stream length is set to 10Kbits. The stochastic hidden neuron uses only 1.2% of the hardware resources of the CORDIC algorithm. Furthermore, the proposed algorithm is very flexible in trading off computing accuracy, power consumption and chip area.
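A minimal sketch of the underlying stochastic-computing primitive follows: values are encoded as random bitstreams and multiplied with a bitwise AND. The paper's Gaussian RBF neuron is built from richer stochastic structures, so this only shows the encode/multiply/decode idea and the role of stream length.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(p, length):
    """Unipolar stochastic encoding: a bitstream whose fraction of 1s equals p in [0, 1]."""
    return (rng.random(length) < p).astype(np.uint8)

def decode(stream):
    return stream.mean()

a, b, length = 0.6, 0.7, 10_000        # ~10 Kbit streams, matching the length quoted above
product = decode(encode(a, length) & encode(b, length))   # AND of independent streams
print(f"stochastic product ~ {product:.3f} (exact {a * b:.3f})")
```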
Shunsuke YAGAI Masato OGUCHI Miyuki NAKANO Saneyasu YAMAGUCHI
In data centers, large numbers of computers run simultaneously and consume an enormous amount of energy. Several studies addressing this issue have been published; an energy-efficient storage management method that cooperates with applications is one effective approach. In such a method, data and storage devices are managed with application support, and the power consumption of storage devices is significantly decreased. However, existing studies do not take virtualized environments into account, even though many data-intensive applications now run in virtualized environments such as cloud computing environments. In this paper, we focus on a virtualized environment wherein multiple virtual machines run on a physical computer and a data-intensive application runs on each virtual machine, and we discuss methods for reducing storage device power consumption with application support. First, we propose two storage management methods using application information. One method optimizes the inter-HDD file layout: it removes frequently accessed files from a certain HDD so that the HDD can be switched to power-off mode, and, to balance loads and reduce seek distances, it separates heavily accessed files and consolidates the files of a virtual machine with low access frequency. The other method optimizes the intra-HDD file layout in addition to performing the inter-HDD optimization, placing frequently accessed files near each other. Second, we present experimental results demonstrating that the proposed methods can create HDD access intervals long enough for power-off mode to be used, and thereby reduce the power consumption of storage devices.
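As a toy illustration of the inter-HDD layout idea (not the authors' algorithm), the sketch below offloads frequently accessed files from one disk onto the least-loaded remaining disks, so the disk left with only cold data sees long idle intervals and can enter power-off mode; the file table and the threshold are invented example data.

```python
# Toy inter-HDD layout optimization: move hot files off one HDD so it can be powered off.
files = {                          # file name -> (current HDD, accesses per hour); example data
    "a.db": ("hdd0", 500), "b.db": ("hdd0", 20),
    "c.db": ("hdd1", 300), "d.db": ("hdd1", 5),
    "e.db": ("hdd2", 450),
}

def offload_hot_files(files, victim, hot_threshold=50):
    layout = dict(files)
    load = {hdd: 0 for hdd, _ in layout.values()}
    for hdd, freq in layout.values():
        load[hdd] += freq
    for name, (hdd, freq) in layout.items():
        if hdd == victim and freq >= hot_threshold:
            load[hdd] -= freq
            target = min((d for d in load if d != victim), key=load.get)   # least-loaded disk
            layout[name] = (target, freq)
            load[target] += freq
    return layout

# hdd1 keeps only rarely accessed data afterwards, giving it long power-off intervals.
print(offload_hot_files(files, victim="hdd1"))
```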
Zhi-Ming LIN Po-Yu KUO Zhong-Cheng SU
The mixer is a crucial circuit block in a WiMAX system receiver. The performance of a mixer depends on three specifications: conversion gain, linearity and noise figure. Many mixers have recently been proposed for UWB and wideband systems; however, they either cannot achieve the high conversion gain required for a WiMAX system or suffer from high power consumption. In this paper, a folded mixer with a high conversion gain is designed for a 2-11GHz WiMAX system, producing a 20MHz IF output signal. Simulation results show that the proposed folded mixer achieves a conversion gain of 18.9 to 21.5dB over the full bandwidth, an IIP3 of 0.2 to 4.4dBm, and an NF of 13.5 to 17.6dB. The folded mixer is designed in TSMC 0.18µm CMOS technology, and its core power consumption is 11.8mW.
Ji-Hoon CHOI Oh-Young LEE Myong-Young LEE Kyung-Jin KANG Jong-Ok KIM
With the appearance of large OLED panels, the OLED TV industry has experienced significant growth. However, this technology is still in the early stages of commercialization, and some technical challenges remain to be overcome. During the development phase of a product, power consumption is one of the most important considerations. To reduce power consumption in OLED displays, we propose a method based on the just-noticeable difference (JND). JND refers to the minimum visibility threshold when visual content is altered and results from physiological and psychophysical phenomena in the human visual system (HVS). A JND model suitable for OLED displays is derived from numerous experiments on such displays. Using this JND model, power consumption can be reduced while minimizing perceptual image quality degradation.
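A conceptual sketch of JND-guided dimming follows. The paper derives its OLED-specific JND model from experiments, so the linear jnd() function below and the assumption that OLED power scales with emitted luminance are placeholders for illustration only.

```python
import numpy as np

def jnd(luminance):
    # Placeholder Weber-like model: brighter pixels tolerate a larger invisible change.
    return 2.0 + 0.02 * luminance

def dim_frame(frame):
    # Dim every pixel by at most its JND so the change stays below the visibility threshold.
    return np.clip(frame - jnd(frame), 0.0, 255.0)

frame = np.array([[10.0, 120.0], [200.0, 255.0]])
dimmed = dim_frame(frame)
saving = 1.0 - dimmed.sum() / frame.sum()      # OLED power scales roughly with emitted light
print(dimmed)
print(f"approximate power saving: {saving:.1%}")
```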
Balgeun YOO Seongjin LEE Youjip WON
SSDs consist of non-mechanical components (host interface, control core, DRAM, flash memory, etc.) whose integrated behavior is not well known, which makes an SSD appear as a black box to users. We analyzed the power consumption of four SSDs under standard I/O operations and found the following: (a) the power consumption of SSDs is not significantly lower than that of HDDs, and (b) all SSDs we tested had similar power consumption patterns, which we assume is a result of their internal parallelism. SSDs have a parallel architecture that connects flash memories by channel or by way, and this parallelism improves SSD performance if the relevant information is known to the file system. This paper proposes three characterization algorithms that infer the characteristics of an SSD, such as its internal parallelism, I/O unit, and page allocation scheme, by measuring its power consumption under variously sized workloads. These algorithms are applied to four real SSDs to find: (i) the internal parallelism, which determines whether I/Os are performed in a concurrent or an interleaved manner, (ii) the I/O unit size, which determines the maximum size that can be assigned to a flash memory, and (iii) the page allocation method, which maps the logical addresses of write operations requested from the host to the physical addresses of flash memory. We also developed a data sampling method to collect the power consumption patterns of each SSD consistently. Applying the three algorithms to the four real SSDs, we identified their flash memory configurations, I/O unit sizes, and page allocation schemes. We show that SSD performance can be improved by aligning the record size of the file system with the I/O unit of the SSD found by our algorithm: the Q Pro has an I/O unit of 32 KB, and by aligning the file system record size to 32 KB, performance increased by 201% and energy consumption decreased by 85% compared with a record size of 4 KB.
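The back-of-the-envelope sketch below suggests why aligning the file-system record size to the inferred I/O unit helps: small records force many partial-unit accesses inside the drive. The cost model is a simplifying assumption for illustration, not a measurement from the paper.

```python
import math

IO_UNIT = 32 * 1024     # I/O unit inferred for the "Q Pro" drive in the abstract

def internal_unit_accesses(total_bytes, record_size):
    """Rough count of internal I/O-unit accesses for a sequential write (simplified model:
    a record smaller than the I/O unit still occupies a whole unit internally)."""
    records = math.ceil(total_bytes / record_size)
    return records * max(1, math.ceil(record_size / IO_UNIT))

for record_size in (4 * 1024, 32 * 1024):
    print(record_size, internal_unit_accesses(1024 * 1024, record_size))
# Under this model, 4 KB records touch 8x more internal units than 32 KB records for 1 MB of data.
```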
Yung-Hao LAI Yang-Lang CHANG Jyh-Perng FANG Lena CHANG Hirokazu KOBAYASHI
Through-silicon vias (TSVs) allow dies to be stacked into multilayer structures and solve the connection problem between neighboring tiers in three-dimensional (3D) integrated circuit (IC) technology. Several studies have investigated placement and routing in 3D ICs, but few have focused on circuit partitioning for 3D stacking. With the scaling trend of CMOS technology, however, the area of I/O pads, power/ground (P/G) pads, and TSVs should not be neglected in 3D partitioning. In this paper, we propose an iterative layer-aware partitioning algorithm called EX-iLap, which takes the area of I/O pads, P/G pads, and TSVs into account for area balancing and for minimizing inter-tier interconnections in a 3D structure. Minimizing the number of TSVs reduces the total silicon die area, which is the main source of recurring costs during fabrication. Furthermore, estimates of the number of TSVs and the total area are imprecise if P/G TSVs are not taken into account; we therefore calculate the power consumption of each cell and estimate the number of P/G TSVs at each layer. Experimental results show that, by considering the power of interconnections and pads, our algorithm can reduce the area overhead by ~39% and the area standard deviation by ~69%, while increasing the number of TSVs by only 12%, compared with the same algorithm without considering the power of interconnections and pads.
Dynamic instruction window resizing (DIWR) is a scheme that effectively exploits both memory-level parallelism and instruction-level parallelism by configuring the instruction window size appropriately for exploiting each form of parallelism. Although a previous study has shown that the DIWR processor achieves a significant speedup, its power consumption has not been explored. Power consumption increases in DIWR because the instruction window resources are enlarged in memory-intensive phases. If the power consumption exceeds the power budget determined by certain requirements, the DIWR processor must save power, and thus the previously reported performance cannot be achieved. In this paper, we explore to what extent the DIWR processor can achieve improved performance for a given power budget, assuming that dynamic voltage and frequency scaling (DVFS) is introduced as the power saving technique. Evaluation results using the SPEC2006 benchmark programs show that the DIWR processor achieves a speedup over the conventional processor across a wide range of power budgets. At the most important power budget point, i.e., when the power a conventional processor consumes without any power constraint is supplied, DIWR achieves a 16% speedup.
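The sketch below shows the generic DVFS relation that underlies such an exploration, namely that dynamic power scales roughly as f·V² with the supply voltage scaling approximately linearly with frequency; the model and numbers are illustrative assumptions, not the paper's evaluation methodology.

```python
def relative_power(f_rel):
    v_rel = f_rel                 # assume the supply voltage scales linearly with frequency
    return f_rel * v_rel ** 2     # dynamic power ~ f * V^2, normalized to 1.0 at nominal f, V

def max_frequency_under_budget(power_budget):
    # With P ~ f^3 under the linear V-f assumption, the highest allowed frequency is the
    # cube root of the normalized power budget.
    return power_budget ** (1.0 / 3.0)

for budget in (1.0, 0.8, 0.5):
    f = max_frequency_under_budget(budget)
    print(f"budget {budget:.1f} -> f = {f:.2f}, check P = {relative_power(f):.2f}")
```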
Analog and digital collaborative design techniques for wireless SoCs are reviewed in this paper. In wireless SoCs, delicate analog performance, such as receiver sensitivity, is easily degraded by interference from digital circuit blocks. On the other hand, analog impairments such as distortion can be effectively compensated by digital assist techniques with low power consumption. In this paper, a sensitivity recovery technique based on analog and digital collaborative design, as well as digital assist techniques for achieving low-power, high-performance analog circuits, are presented. Such analog and digital collaborative design is indispensable for wireless SoCs.
Takaaki DEGUCHI Yoshiaki TANIGUCHI Go HASEGAWA Yutaka NAKAMURA Norimichi UKITA Kazuhiro MATSUDA Morito MATSUOKA
In this paper, we propose a workload assignment policy for reducing the power consumption of air conditioners in data centers. In the proposed policy, the temperatures of all server back-planes are equalized by moving workload from the servers with the highest temperatures to the servers with the lowest temperatures, which allows the temperature set points of the air conditioners to be raised and their power consumption to be reduced. To evaluate the proposed policy, we use a computational fluid dynamics simulator to obtain the airflow and air temperature in data centers, together with an air conditioner model based on experimental results from an actual data center. Through this evaluation, we show that air conditioner power consumption is reduced by 10.4% in a conventional data center. Moreover, in a tandem data center proposed by our research group, air conditioner power consumption is reduced by 53%, and the total power consumption of the whole data center is shown to be reduced by 23% by reusing the exhaust heat from the servers.
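A toy version of the equalization step is sketched below: workload is repeatedly moved from the server with the hottest back-plane to the one with the coolest. The per-move temperature change is an assumed constant used only to illustrate the policy, not a value from the paper's thermal model.

```python
DEG_PER_MOVE = 0.5    # assumed back-plane temperature change per unit of moved workload

def equalize(temperatures, max_moves=50):
    temps = list(temperatures)
    for _ in range(max_moves):
        hot = temps.index(max(temps))
        cold = temps.index(min(temps))
        if temps[hot] - temps[cold] <= DEG_PER_MOVE:
            break                               # back-planes are (nearly) equalized
        temps[hot] -= DEG_PER_MOVE              # move workload away from the hottest server
        temps[cold] += DEG_PER_MOVE             # ... onto the coolest server
    return temps

# Temperatures converge toward the mean, allowing the air-conditioner set points to be raised.
print(equalize([41.0, 38.5, 36.0, 33.5]))
```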