Akram BEN AHMED Hiroki MATSUTANI Michihiro KOIBUCHI Kimiyoshi USAMI Hideharu AMANO
In this paper, a multi-voltage (multi-Vdd) variable pipeline router is proposed to reduce the power consumption of Network-on-Chips (NoCs) designed for Chip Multi-processors (CMPs). The multi-Vdd variable pipeline router adjusts its pipeline depth (i.e., communication latency) and supply voltage level in response to the applied workload. Unlike in Dynamic Voltage and Frequency Scaling (DVFS) routers, the operating frequency remains the same for all routers throughout the CMP, which eliminates the need to synchronize neighboring routers working at different frequencies. Two types of router architectures are presented: a Coarse-Grained Variable Pipeline (CG-VP) router that changes the voltage supplied to the entire router, and a Fine-Grained Variable Pipeline (FG-VP) router that uses a finer power partition. The evaluation results show that the CG-VP and FG-VP routers achieve 22.9% and 35.3% power reductions on average, with 14% and 23% area overhead, respectively, in comparison with a baseline router without variable pipelines. Thanks to the look-ahead mechanism adopted to switch the supply voltage, the performance overhead is only 4.4%.
Lian ZENG Tieyuan PAN Xin JIANG Takahiro WATANABE
As semiconductor technology continues to develop, hundreds of cores will be deployed on a single die in future Chip-Multiprocessor (CMP) designs. Three-Dimensional Network-on-Chips (3D NoCs) have become an attractive solution that can provide high performance. An efficient and deadlock-free routing algorithm is critical to achieving high network-on-chip performance. Traditional methods based on deterministic routing and the turn model are deadlock-free, but they are unable to distribute the traffic load over the network. In this paper, we propose an efficient, adaptive and deadlock-free routing algorithm (EAR) based on a novel routing selection strategy for 3D NoCs, which can distribute the traffic load not only within layers but also across layers according to congestion information and path diversity. Simulation results show that the proposed method achieves a significant performance improvement over the other methods.
Po-Chiun HUANG Shin-Jie HUANG Po-Hsiang LAN
Distributed power delivery is increasingly common in SoC power systems because fine-grained power management needs separate power sources to adjust each voltage island dynamically. In addition, dedicated power sources for critical circuit blocks can achieve better signal integrity. To make full use of the power modules when they are redundant and idle, this work applies the cooperation concept to SoC power management. The key controller is a mixed-signal estimator that executes the intelligent procedures, such as swapping power modules in real time depending on their load and health condition, automatically configuring the power system with phase interleaving, and supporting all the peripheral functions. To demonstrate the proposed concept, a prototype chip for voltage down-conversion is implemented. This chip contains four switched-inductor converter modules to emulate the cooperative power network. Each module is small; therefore, its power efficiency is not optimal under heavy load. With cooperation between the power modules, the power efficiency is 88% for a 300mA load, which is 8.5% higher than that of single-module operation.
Jie JIAN Mingche LAI Liquan XIAO
With the development of silicon-based nano-photonics, the Optical Network on Chip (ONoC) is, due to its high bandwidth and low latency, becoming an important choice for future multi-core networks. As a key ONoC technology, the arbitration scheme should provide differential arbitration service with high throughput and low latency for the various types and priorities of traffic in CMPs. In this work, we propose a fast hierarchical arbitration scheme based on multi-level priority QoS. First, given multi-priority data buffer queues, the arbiters provide differential transmissions with fair service for all nodes and guarantee the max-transmit-delay and min-communication-bandwidth for all queues. Second, the arbiter adopts a transmit bound resource reservation scheme to reserve time slots for all nodes fairly, thereby achieving a throughput of 100%. Third, we propose fast arbitration with a layout of fast optical arbitration channels (FOACs) to reduce the arbitration period, thereby reducing the packet transmission delay. Simulation results show that with our hierarchical arbitration scheme, all nodes are allocated almost equal service access probability under various traffic patterns; thus, the min-communication-bandwidth and max-transmit-delay are guaranteed to be 5% and 80 cycles, respectively, under overload demands. This scheme improves throughput by 17% compared to FeatherWeight under a self-similar traffic pattern and decreases arbitration delay by 15% compared to 2-pass arbitration, incurring a total power overhead of 5%.
Pil-Ho LEE Yu-Jeong HWANG Han-Yeol LEE Hyun-Bae LEE Young-Chan JANG
An on-chip monitoring circuit using a sub-sampling scheme, which consists of a 6-bit flash analog-to-digital converter (ADC) and a 51-phase phase-locked loop (PLL)-based frequency synthesizer, is proposed to analyze the signal integrity of a single-ended 8-Gb/s octal data rate (ODR) chip-to-chip interface with a source synchronous clocking scheme.
Takashi TOKUDA Hiroaki TAKEHARA Toshihiko NODA Kiyotaka SASAGAWA Jun OHTA
On-chip neural interface devices based on CMOS image sensor technology are proposed and demonstrated. The devices were designed with optogenetics in bioscience as the target application. Multifunctional CMOS image sensors equipped with an addressable on-chip electrode array were integrated with a functional interface chip that contained embedded GaInN light emitting diodes (LEDs) and electrodes to create a neural interface. Detailed design information regarding the CMOS sensor chip and the functional interface chip, including the packaging structure and fabrication processes, is presented in this paper. The on-chip optical stimulation functionality was demonstrated in an in vitro experiment using neuron-like cells cultured on the proposed device.
Toshiyuki KIKKAWA Toru NAKURA Kunihiro ASADA
This paper proposes an on-chip PLL measurement method that uses a fully digital interface. To measure the PLL transfer function, we modulate the phase of the PLL input in triangular form using a Digital-to-Time Converter (DTC) and read out the response with a Time-to-Digital Converter (TDC). The combination of the DTC and TDC can obtain the transfer function of the PLL in both the magnitude domain and the phase domain. Since the DTC and TDC can be controlled and observed by digital signals, the measurement can be conducted without any high-speed analog signal. Moreover, since the DTC and TDC can be designed symmetrically, the measurement method is robust against Process, Voltage, and Temperature (PVT) variations. In addition, the use of the TDC enables measurement of the PLL lock range by changing the division ratio of the divider. The two time-domain circuits were designed in a 180nm CMOS process, and HSPICE simulation results demonstrate the measurement of the transfer function and lock range.
Naoya OIKAWA Jiro HIROKAWA Hiroshi NAKANO Yasutake HIRACHI Hiroshi ISONO Atsushi ISHII Makoto ANDO
For the realization of a high-efficiency antenna for 60GHz-band wireless personal area network, we propose placing a CMOS RF circuit and an antenna on opposing sides of a silicon chip. They are connected with low loss by a coaxial-line structure using a hole opening in the chip. Since the CMOS circuit is driven differentially, a differential-feed antenna is used. In this paper, we design and measure a differential-feed square patch antenna on a silicon chip. To enhance the radiation efficiency, it is placed on a 200µm thick resin layer. The calculated radiation efficiency of 79% includes the connection loss. A prototype antenna is measured in a reverberation chamber, and its radiation efficiency is estimated to be about 81±3%.
Rui WU Wei DENG Shinji SATO Takuichi HIRANO Ning LI Takeshi INOUE Hitoshi SAKANE Kenichi OKADA Akira MATSUZAWA
A 60-GHz CMOS transmitter with an on-chip antenna for high-speed short-range wireless interconnections is presented. The radiation gain of the on-chip antenna is doubled using a helium-3 ion irradiation technique. The transmitter core is composed of a resistive-feedback RF amplifier, a double-balanced passive mixer, and an injection-locked oscillator. The wideband and power-saving design of the transmitter core guarantees the low-power and high-data-rate characteristics. The transmitter, fabricated in a 65-nm CMOS process, achieves a 5-Gb/s data rate with an EVM of -12 dB for BPSK modulation at a distance of 1 mm. The whole transmitter consumes 17 mW from a 1.2-V supply and occupies a core area of 0.64 mm² including the on-chip antenna. The gain-enhanced antenna, together with the wideband and power-saving design of the transmitter, provides a low-power, low-cost, fully on-chip solution for short-range high-data-rate wireless communication.
As semiconductor devices scale into the deep sub-micron regime, reliability issues due to radiation-induced soft errors increase in on-chip memory systems. Neutron-induced soft errors transiently upset the information stored in multiple adjacent cells in these systems. Although single error correction and double error detection (SEC-DED) codes have been employed to protect on-chip memories from soft errors, they are not sufficient against multiple cell upsets (MCUs). SEC-DED and double adjacent error correction (SEC-DED-DAEC) codes have recently been proposed to address this problem. However, these codes do not resolve the mis-correction of double non-adjacent errors, because the syndromes for double non-adjacent errors are equal to those of double adjacent errors. The occurrence of this mis-correction in a critical memory region, such as one holding the operating system, may lead to system malfunction. To eliminate mis-correction, the proposed codes use a matrix in reversed colexicographic order so that the syndrome spaces for double adjacent and double non-adjacent errors are not shared. The proposed codes are implemented in a hardware description language and synthesized using a 32 nm technology library. The results show that there is no mis-correction in the proposed codes. In addition, the performance enhancement of the decoder is approximately 51.9% compared to double error correction codes for on-chip memories. The proposed SEC-DED-DAEC codes are suitable for protecting on-chip memory applications from MCU-type soft errors.
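The mis-correction problem described in the abstract above can be illustrated with a minimal Python sketch. It assumes a parity-check matrix whose columns are given as integers (one bit per check row), so that the syndrome of a double error is the XOR of the two affected columns. The function name `miscorrection_pairs` and both toy matrices are hypothetical and are not taken from the paper; in particular, the second matrix is not the reversed colexicographic construction, only an example in which the two syndrome sets happen to be disjoint.

```python
from itertools import combinations


def miscorrection_pairs(columns):
    """Return ((i, j), (k, k + 1)) pairs where a double non-adjacent error
    at bits i, j has the same syndrome as a double adjacent error at k, k + 1.

    `columns` holds the parity-check matrix columns as integers; the
    syndrome of a double error is the XOR of the two columns involved.
    """
    n = len(columns)
    # Syndromes a DAEC decoder would interpret as adjacent double errors.
    adjacent = [((k, k + 1), columns[k] ^ columns[k + 1]) for k in range(n - 1)]
    collisions = []
    for i, j in combinations(range(n), 2):
        if j - i < 2:
            continue  # skip adjacent pairs; only non-adjacent errors are tested
        syndrome = columns[i] ^ columns[j]
        for pair, adjacent_syndrome in adjacent:
            if syndrome == adjacent_syndrome:
                collisions.append(((i, j), pair))
    return collisions


# Hypothetical toy matrices (illustrative only, not from the paper).
SHARED = [1, 4, 2, 8, 11]        # adjacent syndromes are unique, but shared
DISJOINT = [1, 2, 4, 8, 16, 31]  # no non-adjacent syndrome matches an adjacent one

if __name__ == "__main__":
    print(miscorrection_pairs(SHARED))
    print(miscorrection_pairs(DISJOINT))
```

With the first toy matrix, an error at bits 0 and 2 produces the same syndrome as an adjacent error at bits 3 and 4, so a decoder that corrects adjacent double errors would flip the wrong bits. An empty result means the syndrome spaces are not shared.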
Akira MOCHIZUKI Hirokatsu SHIRAHAMA Yuma WATANABE Takahiro HANYU
An energy-efficient intra-chip communication link circuit with ternary current signaling is proposed for an asynchronous Network-on-Chip. The data signal encoded by an asynchronous three-state protocol is represented by a small-voltage-swing three-level intermediate signal, which reduces the transition delay and achieves energy-efficient data transfer. The three-level voltage is generated by using a combination of dynamically controlled current sources with a feedback-loop mechanism. Moreover, the proposed circuit contains a power-saving scheme in which the dynamically controlled transistors are also utilized. By cutting off the current paths when the data transfer on the communication link is inactive, the power dissipation can be greatly reduced. It is demonstrated that the average data-transfer speed is about 1.5 times that of a binary CMOS implementation in a 130nm CMOS technology at a supply voltage of 1.2V.
Shijun LIN Zhaoshan LIU Jianghong SHI Xiaofang WU
In this paper, we propose a scalable connection-based time division multiple access architecture for wireless NoC. In this architecture, only one-hop transmission is needed when a packet is transmitted from one wired subnet to another wired subnet, which improves the communication performance and cuts down the energy consumption. Furthermore, by carefully designing the central arbiter, the bandwidth of the wireless channel can be fully used. Simulation results show that compared with the traditional WCube wireless NoC architecture, the proposed architecture can greatly improve the network throughput, and cut down the transmission latency and energy consumption with a reasonable area overhead.
Tiebin WU Hengzhu LIU Botao ZHANG
This paper presents a novel test data compression scheme for SoCs based on block merging and compatibility. The technique exploits the properties of compatibility and inverse compatibility between consecutive blocks, consecutive merged blocks, and the two halves of the encoding merged block itself to encode the pre-computed test data. The decompression circuit is simple to implement and has the advantage of being test-independent. In addition, the proposed scheme is applicable to IP cores in SoCs since it compresses the test data without requiring any structural information about the circuit under test. Experimental results demonstrate that the proposed technique can achieve an average compression ratio of up to 68.02% with a significantly low test application time.
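The compatibility relations mentioned in the abstract above can be sketched in Python as follows. This is a minimal illustration under the usual convention for test cubes, which the abstract does not spell out: each block is a string over '0', '1' and 'X', where 'X' marks a don't-care bit. The function names are hypothetical, and the sketch covers only the compatibility checks and the merge step, not the paper's encoding.

```python
def compatible(a, b):
    """Two blocks are compatible if every position agrees or has a don't-care."""
    return len(a) == len(b) and all(
        x == y or "X" in (x, y) for x, y in zip(a, b)
    )


def inverse_compatible(a, b):
    """Two blocks are inverse compatible if every position differs or has a don't-care."""
    return len(a) == len(b) and all(
        x != y or "X" in (x, y) for x, y in zip(a, b)
    )


def merge(a, b):
    """Merge two compatible blocks, keeping each specified bit.

    A position stays 'X' only when it is a don't-care in both blocks.
    """
    if not compatible(a, b):
        raise ValueError("blocks are not compatible")
    return "".join(y if x == "X" else x for x, y in zip(a, b))


if __name__ == "__main__":
    print(compatible("1X0X", "1100"))          # blocks agree where both are specified
    print(inverse_compatible("1X0X", "0X11"))  # blocks differ where both are specified
    print(merge("1X0X", "X1XX"))
```

Merging compatible blocks leaves fewer distinct blocks to encode, which is where the compression comes from in schemes of this kind.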
Naoya ONIZAWA Akira MOCHIZUKI Hirokatsu SHIRAHAMA Masashi IMAI Tomohiro YONEDA Takahiro HANYU
This paper introduces a partially parallel inter-chip link architecture for asynchronous multi-chip Network-on-Chips (NoCs). Multi-chip NoCs that operate as one large NoC have recently been proposed for very large systems, such as automotive applications. Inter-chip links are key elements in realizing high-performance multi-chip NoCs with a limited number of I/Os. The proposed asynchronous link, based on level-encoded dual-rail (LEDR) encoding, transmits several bits in parallel, which are received by detecting the phase information of the LEDR signals at each serial link. It employs burst-mode data transmission that eliminates the per-bit handshake for high-speed operation, but this elimination may cause data-transmission errors due to cross-talk and power-supply noise. To trigger data retransmission, errors are detected from the embedded phase information; error-detection codes are not used. The throughput is theoretically modelled and is optimized by considering the bit-error rate (BER) of the link. Using delay parameters estimated for a 0.13 µm CMOS technology, a throughput of 8.82 Gbps is achieved using 10 I/Os, which is 90.5% higher than that of a link using 9 I/Os without an error-detection method operating under a negligibly low BER (< 10^-20).
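The LEDR encoding named in the abstract above can be sketched in Python. This is a minimal software model of a single LEDR link under the standard definition of the code: each symbol is a (data, parity) rail pair whose XOR gives the phase, and the phase alternates between even (0) and odd (1) on successive symbols, so exactly one rail toggles per symbol. The function names, the idle state (0, 1), and the phase check are assumptions for illustration; they do not reproduce the paper's circuit or its retransmission protocol.

```python
IDLE = (0, 1)  # assumed idle rail state (odd phase), so the first symbol has even phase


def ledr_encode(bits):
    """Encode bits as (data, parity) rail pairs with alternating phase."""
    symbols = []
    for k, bit in enumerate(bits):
        phase = k % 2  # even, odd, even, ...
        symbols.append((bit, bit ^ phase))
    return symbols


def phase_errors(symbols):
    """Return indices of symbols whose phase (data XOR parity) is not the expected one."""
    return [k for k, (data, parity) in enumerate(symbols) if data ^ parity != k % 2]


def ledr_decode(symbols):
    """Recover the bits; the data rail carries the value directly."""
    return [data for data, _ in symbols]


def rail_toggles(symbols):
    """Count how many rails change at each symbol, starting from the idle state."""
    previous = IDLE
    counts = []
    for symbol in symbols:
        counts.append(sum(a != b for a, b in zip(previous, symbol)))
        previous = symbol
    return counts


if __name__ == "__main__":
    encoded = ledr_encode([1, 0, 0, 1])
    print(encoded)
    print(rail_toggles(encoded))
    print(phase_errors([(1, 1), (0, 0)]))  # second symbol repeats the even phase
```

Because the phase is embedded in every symbol, a symbol that arrives with the wrong phase can be flagged without a separate error-detection code, which is the property the abstract relies on for triggering retransmission.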
Takashi MIYAMORI Hui XU Hiroyuki USUI Soichiro HOSODA Toru SANO Kazumasa YAMAMOTO Takeshi KODAKA Nobuhiro NONOGAKI Nau OZAKI Jun TANABE
New media processing applications such as image recognition and AR (augmented reality) have become practical on embedded systems for automotive, digital-consumer and mobile products. Many-core processors have been proposed to realize much higher performance than multi-core processors. We have developed a low-power many-core SoC for multimedia applications in 40nm CMOS technology. Within a 210 mm² die, two 32-core clusters are integrated with dynamically reconfigurable processors, hardware accelerators, 2-channel DDR3 I/Fs, and other peripherals. The processor cores in each cluster share a 2MB L2 cache connected through a tree-based Network-on-Chip (NoC). Its total peak performance exceeds 1.5TOPS (Tera Operations Per Second). High scalability and low power consumption are accomplished by parallelized software for multimedia applications. In the case of face detection, the performance scales up to 64 cores and the SoC consumes only 2.21W. Moreover, it can execute 1080p 48fps H.264 decoding at about 520mW using 28 cores and 4K2K 15fps super resolution at about 770mW using 32 cores in one cluster. By exploiting parallelism with low-power processor cores, the many-core SoC provides several tens of times better energy efficiency than a high-performance desktop quad-core processor.
Wenpo ZHANG Kazuteru NAMBA Hideo ITO
In recent VLSIs, small-delay defects, which are hard to detect by traditional delay fault testing, can bring about serious issues such as short lifetime. To detect small-delay defects, on-chip delay measurement which measures the delay time of paths in the circuit under test (CUT) was proposed. However, this approach incurs high test cost because it uses scan design, which brings about long test application time due to scan shift operation. Our solution is a test application time reduction method for testing using the on-chip path delay measurement. The testing with on-chip path delay measurement does not require capture operations, unlike the conventional delay testing. Specifically, FFs keep the transition pattern of the test pattern pair sensitizing a path under measurement (PUM) (denoted as p) even after the measurement of p. The proposed method uses this characteristic. The proposed method reduces scan shift time and test data volume using test pattern merging. Evaluation results on ISCAS89 benchmark circuits indicate that the proposed method reduces the test application time by 6.89∼62.67% and test data volume by 46.39∼74.86%.
Huaxi GU Zheng CHEN Yintang YANG Hui DING
Optical Network-on-Chip (ONoC) is a promising emerging technology, which can solve the bottlenecks faced by electrical on-chip interconnection. However, the existing proposals of ONoC are mostly built on fixed topologies, which are not flexible enough to support various applications. To make full use of the limited resource and provide a more efficient approach for resource allocation, RONoC (Reconfigurable Optical Network-on-Chip) is proposed in this letter. The topology can be reconfigured to meet the requirement of different applications. An 8×8 nonblocking router is also designed, together with the communication mechanism. The simulation results show that the saturation load of RONoC is 2 times better than mesh, and the energy consumption is 25% lower than mesh.
Ahmadou Dit Adi CISSE Michihiro KOIBUCHI Masato YOSHIMI Hidetsugu IRIE Tsutomu YOSHINAGA
Silicon photonics Network-on-Chips (NoCs) have emerged as an attractive solution to alleviate the high power consumption of traditional electronic interconnects. In this paper, we propose a fully optical ring NoC that combines static and dynamic wavelength allocation communication mechanisms. A different wavelength-channel is statically allocated to each destination node for lightweight communication. Contention among simultaneous communication requests from multiple source nodes to the same destination is resolved by token-based arbitration for the particular wavelength-channel. For heavy-load communication, a multiwavelength-channel is available; a source node requests it at execution time from a special node that manages the dynamic allocation of the shared multiwavelength-channel among all nodes. We combine these static and dynamic communication mechanisms in the same network and introduce selection techniques based on message size and congestion information. Using a photonic NoC simulator based on PhoenixSim, we evaluate our architecture under uniform random, neighbor, and hotspot traffic patterns. Simulation results show that our proposed fully optical ring NoC achieves good performance by selecting between the static and dynamic channels with these techniques. We also show that our architecture can reduce the energy consumption necessary for arbitration by more than half compared to hybrid photonic ring and mesh NoCs. A comparison with several previous works in terms of architecture hardware cost shows that our architecture can be an attractive, cost-performance-efficient interconnection infrastructure for future SoCs and CMPs.
Zhen ZHANG Shouyi YIN Leibo LIU Shaojun WEI
TSV-interconnected 3D chips face problems such as high cost, low yield and large power dissipation. We propose a wireless 3D on-chip-network architecture for application-specific SoC design, using inductive-coupling interconnects instead of TSVs for inter-layer communication. The primary design challenge of an inductive-coupling 3D SoC is allocating wireless links in the 3D on-chip network effectively. We develop a design flow that fully exploits the design space opened up by wireless links while providing flexible tradeoffs for the user to choose from. Experimental results show that our design brings substantial improvements over a uniform design and the Sunfloor algorithm in latency (5% to 20%) and power consumption (10% to 45%).
Jun ASANO Jiro HIROKAWA Hiroshi NAKANO Yasutake HIRACHI Hiroshi ISONO Atsushi ISHII Makoto ANDO
As a first step towards the realization of high-efficiency on-chip antennas for 60GHz-band wireless personal area networks, this paper proposes the fabrication of a patch antenna placed on a 200µm thick dielectric resin and fed through a hole in a silicon chip. Despite the large tan δ of the adopted material (0.015 at 50GHz), the thick resin reduces the conductor loss at the radiating element, and a radiation efficiency of 78%, which includes the connecting loss from the bottom, is predicted by simulation. This calculated value is verified in the millimeter-wave band by experiments in a reverberation chamber. Six stirrers are installed, one on each wall of the chamber, to create a statistical Rayleigh environment. The manufactured prototype antenna with a test jig demonstrates a radiation efficiency of 75% in the reverberation chamber. This agrees well with the simulated value of 76%, while the statistical measurement uncertainty of our handmade reverberation chamber is calculated as ±0.14dB.
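The agreement claimed in the abstract above can be checked with a short Python calculation. It assumes that the ±0.14dB uncertainty applies to the radiation efficiency expressed as a power ratio in decibels, which the abstract does not state explicitly. The function name `efficiency_db` is hypothetical.

```python
import math


def efficiency_db(efficiency):
    """Express a radiation efficiency (a power ratio between 0 and 1) in dB."""
    return 10.0 * math.log10(efficiency)


measured_db = efficiency_db(0.75)    # efficiency measured in the reverberation chamber
simulated_db = efficiency_db(0.76)   # simulated value quoted for comparison
gap_db = abs(measured_db - simulated_db)
uncertainty_db = 0.14                # stated statistical measurement uncertainty

if __name__ == "__main__":
    print(f"measured:  {measured_db:.3f} dB")
    print(f"simulated: {simulated_db:.3f} dB")
    print(f"gap:       {gap_db:.3f} dB (uncertainty ±{uncertainty_db} dB)")
```

Under this assumption, the gap between 75% and 76% is about 0.06 dB, which lies within the stated ±0.14dB uncertainty.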