Yansheng WANG Leibo LIU Shouyi YIN Min ZHU Peng CAO Jun YANG Shaojun WEI
An RCP (Reconfigurable Computing Processor) is intended to fill the gap between an ASIC and a GPP (General Purpose Processor): it achieves much higher energy efficiency than a GPP while being much more flexible than an ASIC. In this paper, an on-chip data memory organization called LIBODM (LIfetime Based On-chip Data Memory) is proposed to reduce both the data reference delay and the on-chip data memory size in an RCP. In LIBODM, data allocation is based on data dependency. Data with low data dependency are stored off-chip to save storage cost, while data with high data dependency are stored on-chip to reduce the reference delay. In addition, on-chip data are classified into two types according to their lifetime. Short-lifetime data are preferably stored in a FIFO, which naturally increases the reuse ratio of memory space. Long-lifetime data are preferably stored in RAM so that they can be referenced several times. LIBODM has been verified in a CGRA (Coarse Grained Reconfigurable Architecture) called RPU (Reconfigurable Processing Unit), and two RPUs have been integrated in an RCP, REMUS_HP (High Performance version of Reconfigurable MUlti-media System), targeted at video decoding. Thanks to LIBODM, REMUS_HP achieves high performance even though its on-chip data memory is small. At the same performance level, the on-chip data memory size of REMUS_HP is only 23.9% and 14.8% of that of XPP and ADRES, respectively. REMUS_HP is implemented on 48.9 mm² of silicon in TSMC 65 nm technology. Simulation shows that 1920 × 1088 at 30 fps can be achieved for H.264 high-profile decoding at a working frequency of 200 MHz. Compared with the high performance version of XPP, performance is improved by 150% and energy efficiency by 17.59×.
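The placement rule described in this abstract (dependency decides on-chip versus off-chip, lifetime decides FIFO versus RAM) can be sketched as follows. This is only an illustration: the `DataBlock` fields, the numeric measures of dependency and lifetime, and both thresholds are assumptions, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DataBlock:
    name: str
    dependency: int  # hypothetical measure: number of operations that depend on this data
    lifetime: int    # hypothetical measure: cycles between first and last reference


def place(block, dep_threshold=2, lifetime_threshold=16):
    """Return 'off-chip', 'fifo' or 'ram' for a data block.

    Both thresholds are illustrative assumptions, not parameters from the paper.
    """
    if block.dependency < dep_threshold:
        return "off-chip"  # low dependency: save on-chip storage
    if block.lifetime <= lifetime_threshold:
        return "fifo"      # short lifetime: FIFO space is reused naturally
    return "ram"           # long lifetime: kept in RAM for repeated references


if __name__ == "__main__":
    for blk in (DataBlock("bitstream", 1, 400),
                DataBlock("residual", 4, 8),
                DataBlock("reference_pixels", 6, 120)):
        print(blk.name, "->", place(blk))
```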
We propose a fault diagnosis and reconfiguration method based on the Pair and Swap scheme to improve the reliability and the MTTF (Mean Time To Failure) of network-on-chip based multiple processor systems in which each processor core has its own private memory. In the proposed scheme, two identical copies of a given task are executed on a pair of processor cores and the results are compared repeatedly in order to detect processor faults. If a mismatch reveals a fault, the faulty core is identified and isolated using TMR (Triple Modular Redundancy), and the system is reconfigured using the redundant processor cores. We propose that each task be quadruplicated and statically assigned to private memories so that each memory holds only two different tasks. We evaluate the reliability of the proposed quadruplicated task allocation scheme from the viewpoint of MTTF. The MTTF of the proposed scheme is over 4.3 times longer than that of the duplicated task allocation scheme.
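The detect-then-identify flow in this abstract (compare a pair, then bring in a third core to vote) can be sketched as below. The fault model, the core identifiers, and the function names are hypothetical; the paper's actual diagnosis and reconfiguration procedure is not given in the abstract.

```python
def run_on(core, task, faulty_cores):
    """Run task on a core; a faulty core corrupts the result (assumed fault model)."""
    result = task()
    # Core-specific corruption so that two faulty cores never agree by accident.
    return result + core + 1 if core in faulty_cores else result


def diagnose(pair, spare, task, faulty_cores):
    """Compare a pair of cores; on mismatch, use a third core as a TMR-style voter."""
    a, b = pair
    ra = run_on(a, task, faulty_cores)
    rb = run_on(b, task, faulty_cores)
    if ra == rb:
        return ("ok", None)
    rs = run_on(spare, task, faulty_cores)
    if rs == ra:
        return ("faulty", b)   # b disagrees with the majority
    if rs == rb:
        return ("faulty", a)   # a disagrees with the majority
    return ("unresolved", None)  # no majority: more than one fault


if __name__ == "__main__":
    print(diagnose((0, 1), 2, lambda: 42, faulty_cores={1}))
```

Once a core is reported faulty, a real system would isolate it and remap its task copies to a redundant core; that step is omitted here.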
This paper investigates the potential to improve fault-detection coverage by means of on-chip redundancy. The international standard on functional safety, IEC 61508 Ed. 2.0 Part 2 Annex E.3, prescribes that the upper bound of βIC (the ratio of common cause failures (CCFs) to all failures) is 0.25 in order to satisfy the upper bound on the frequency of dangerous failure of the safety function for SIL (Safety Integrity Level) 3. In contrast, this paper argues that βIC does not necessarily have to be less than 0.25 for SIL 3, and that the upper bound of βIC can be determined from the failure rate λ and the CCF detection coverage. In other words, the upper bound on the frequency of dangerous failure for SIL 3 can also be satisfied with βIC higher than 0.25 if the failure rate λ is lower than 400 FIT. Moreover, the paper shows that on-chip redundancy has the potential to satisfy the SIL 4 requirement: the upper bound on the frequency of dangerous failure for SIL 4 can be satisfied with feasible ranges of βIC, λ and CCF coverage, which can be realized by redundant code.
Satoshi TAKAYA Yoji BANDO Toru OHKAWA Toshiharu TAKARAMOTO Toshio YAMADA Masaaki SOUDA Shigetaka KUMASHIRO Tohru MOGAMI Makoto NAGATA
The response of differential pairs to low-frequency substrate voltage variation is captured in combined transistor and substrate network models. The model generation is regularized for variations in transistor geometry, including channel sizes, fingering and folding, and for the placement of guard bands. The expansion of the models for full-chip substrate noise analysis is also discussed. The substrate sensitivity of differential pairs is evaluated through on-chip substrate coupling measurements in a 90 nm CMOS technology with more than 64 different geometries and operating conditions. The trends and strengths of substrate sensitivity are shown to be consistent between simulation and measurement.
Ramesh K. POKHAREL Xin LIU Dayang A.A. MAT Ruibing DONG Haruichi KANAYA Keiji YOSHIDA
This paper presents the design of a second-order and a fourth-order bandpass filter (BPF) for 60 GHz millimeter-wave applications in 0.18 µm CMOS technology. The proposed on-chip BPFs employ a folded open-loop structure designed on pattern ground shields. The folded structure and the use of multiple transmission zeros in the stopband give the BPFs a compact size and high selectivity. Moreover, the pattern ground shields slow down the guided waves, which enables a further reduction in the physical length of the resonator and, in turn, improves the insertion loss. Very good agreement between the electromagnetic (EM) simulations and the measurement results has been achieved. The second-order BPF has a center frequency of 57.5 GHz, an insertion loss of 2.77 dB, a bandwidth of 14 GHz, a return loss less than 27.5 dB and a chip size of 650 µm × 810 µm (including bonding pads), while the fourth-order BPF has a center frequency of 57 GHz, an insertion loss of 3.06 dB, a bandwidth of 12 GHz, a return loss less than 30 dB and a chip size of 905 µm × 810 µm (including bonding pads).
Chizu MATSUMOTO Yuichi HAMAMURA Michinobu NAKAO Kaname YAMASAKI Yoshikazu SAITO Shun'ichi KANEKO
Repairing embedded memories (e-memories) on an advanced system-on-chip (SoC) product is a key technique used to improve product yield. However, increasing the die area of SoC products equipped with various types of e-memories on the die is an issue. A fuse scheme can be used to resolve this issue. However, several fuse schemes that have been proposed to decrease the die area result in an increased repair time. Therefore, in this paper, we propose a novel fuse scheme that decreases both die area and repair time. Moreover, our approach is applied to a 65 nm SoC product. The results indicate that the proposed fuse scheme effectively decreases the die area and repair time of advanced SoC products.
Yasumichi TAKAI Masanori HASHIMOTO Takao ONOYE
This paper investigates power gating implementations that mitigate power supply noise. We focus on the body connection of power-gated circuits, and examine the amount of power supply noise induced by power-on rush current and the contribution of a power-gated circuit as a decoupling capacitance during the sleep mode. To determine the best implementation, we designed and fabricated a test chip in a 65 nm process. Experimental results from measurement and simulation reveal that the power-gated circuit with a body-tied structure in triple-well is the best implementation with respect to three points: power supply noise due to rush current, the contribution of decoupling capacitance during the sleep mode, and the leakage reduction achieved by power gating.
Mohammad Taghi TEIMOORI Ali JAHANIAN Adel DOKHANCHI
Microwave interconnects have recently been proposed to break down long wires in large integrated circuits. In this paper, the use of coplanar waveguide RF interconnects in FPGAs is explored to improve performance and reduce routing congestion. We propose a new FPGA architecture consisting of both metal wires and RF receivers/transmitters, together with an algorithm to route the proposed FPGA. Experimental results show that the number of used routing tracks and the routing congestion are reduced by 23.8% and 7.06%, respectively, and that the performance of the attempted benchmarks is improved by about 33% using this technique. These benefits are obtained at a reasonable cost in area and power consumption, which is negligible for large and complex circuits.
Seungju LEE Masao YANAGISAWA Nozomu TOGAWA
Network-on-chip (NoC) architectures have emerged as a promising solution to the lack of scalability in multi-processor systems-on-chips (MPSoCs). With the explosive growth in the usage of multimedia applications, NoCs are expected to serve as multimedia servers supporting multi-class services. In this paper, we propose a configuration algorithm for a hybrid bus-NoC architecture together with simulation results. Our target architecture is a hybrid bus-NoC architecture called busmesh NoC (BMNoC), which is a generalized version of a hybrid NoC with local buses. In our BMNoC configuration algorithm, cores that have a heavy communication volume between them are mapped to a cluster node (CN) and connected by a local bus. CNs communicate with each other via edge switches (ESes) and mesh routers (MRs). With this hierarchical communication network, our proposed algorithm can improve latency compared with conventional methods. Results for several realistic applications demonstrate both better performance than earlier studies and the feasibility of our proposed algorithm.
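The idea of mapping heavily communicating cores into the same cluster node can be illustrated with a simple greedy merge over core pairs sorted by communication volume. This is a generic heuristic chosen for illustration, not the BMNoC configuration algorithm itself; the core names, volumes, and the cluster size limit are made up.

```python
def cluster_cores(cores, volumes, max_cluster_size):
    """Greedily merge the most heavily communicating cores into shared clusters.

    volumes maps (core_a, core_b) -> communication volume (hypothetical units).
    """
    cluster = {c: {c} for c in cores}  # each core starts in its own cluster
    ranked = sorted(volumes.items(), key=lambda kv: kv[1], reverse=True)
    for (a, b), _volume in ranked:
        ca, cb = cluster[a], cluster[b]
        if ca is cb or len(ca) + len(cb) > max_cluster_size:
            continue  # already together, or the local bus would be too crowded
        merged = ca | cb
        for c in merged:
            cluster[c] = merged
    unique = {id(s): s for s in cluster.values()}
    return sorted(sorted(s) for s in unique.values())


if __name__ == "__main__":
    cores = ["cpu", "dsp", "mem", "io"]
    volumes = {("cpu", "mem"): 90, ("dsp", "mem"): 60,
               ("dsp", "io"): 40, ("cpu", "io"): 5}
    print(cluster_cores(cores, volumes, max_cluster_size=2))
```

Each resulting cluster would correspond to one CN with a local bus; traffic between clusters would then go through the ESes and MRs.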
Hao XIAO Tsuyoshi ISSHIKI Arif Ullah KHAN Dongju LI Hiroaki KUNIEDA Yuko NAKASE Sadahiro KIMURA
Ultra-wideband (UWB) technology has attracted much attention recently due to its high data rate and low emission power. Its media access control (MAC) protocol, WiMedia MAC, promises many facilities for high-speed and high-quality wireless communication. However, these benefits involve a large computational load, which makes it challenging for traditional uniprocessor-based implementations to provide the required performance. At the same time, the constrained cost and power budget makes commercial multiprocessor solutions unrealistic. In this paper, a low-cost and energy-efficient multiprocessor system-on-chip (MPSoC), which addresses system design, software migration and hardware architecture together, is presented for the implementation of the UWB MAC layer. Experimental results show that the proposed MPSoC, based on four simple RISC processors and a shared-memory infrastructure, achieves up to 45% performance improvement and 65% power saving, but takes 15% less area than the uniprocessor implementation.
This paper presents a single-cycle shared output buffered router for Networks-on-Chip. In the output ports, each input port always has an output virtual channel (VC), which can be exchanged by a VC swapper. Its critical path is only 24 logic gates, and it reduces area overhead by 9.4% compared with the classical router.
Naoki MASUNAGA Koichi ISHIDA Takayasu SAKURAI Makoto TAKAMIYA
This paper presents a new type of electromagnetic interference (EMI) measurement system. An EMI Camera LSI (EMcam) with a 12 × 4 on-chip 250 × 50 µm² loop antenna matrix in 65 nm CMOS is developed. EMcam achieves both 2D electric scanning and 60 µm-level spatial precision. The down-conversion architecture increases the bandwidth of EMcam and enables the measurement of EMI spectrum up to 3.3 GHz. The shared IF-block scheme is proposed to relax both the increase of power and area penalty, which are inherent issues of the matrix measurement. The power and the area are reduced by 74% and 73%, respectively. EMI measurement with the smallest 32 × 12 µm² antenna to date is also demonstrated.
Naoya ONIZAWA Atsushi MATSUMOTO Takahiro HANYU
We have developed a long-range asynchronous on-chip data-transmission link based on multiple-valued single-track signaling for a highly reliable asynchronous Network-on-Chip. In the proposed signaling, 1-bit data with control information is represented by using a one-digit multi-level signal, so serial data can be transmitted asynchronously using only a single wire. The small number of wires alleviates the routing complexity of wiring long-range interconnects. The use of current-mode signaling makes it possible to transmit data at high speed without buffers or repeaters over a long interconnect wire because of the low-voltage swing of signaling, and it leads to low-latency data transmission. We achieve a latency of 0.45 ns, a throughput of 1.25 Gbps, and energy dissipation of 0.58 pJ/bit with a 10-mm interconnect wire under a 0.13 µm CMOS technology. This represents an 85% decrease in latency, a 150% increase in throughput, and a 90% decrease in energy dissipation compared to a conventional serial asynchronous data-transmission link.
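As a quick reading aid, the percentages in this abstract can be inverted to see what they imply for the conventional link. The baseline figures below are derived from the rounded percentages only (reading "150% increase" as 2.5 times the baseline); they are not reported in the abstract and are therefore approximate.

```python
def baseline_from_decrease(value, percent):
    """Baseline b such that value is `percent` % lower than b."""
    return value / (1 - percent / 100)


def baseline_from_increase(value, percent):
    """Baseline b such that value is `percent` % higher than b."""
    return value / (1 + percent / 100)


if __name__ == "__main__":
    latency = baseline_from_decrease(0.45, 85)      # ns
    throughput = baseline_from_increase(1.25, 150)  # Gbps
    energy = baseline_from_decrease(0.58, 90)       # pJ/bit
    print(f"implied conventional latency:    {latency:.2f} ns")
    print(f"implied conventional throughput: {throughput:.2f} Gbps")
    print(f"implied conventional energy:     {energy:.2f} pJ/bit")
```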
Yohei NAKATA Hiroshi KAWAGUCHI Masahiko YOSHIMOTO
As process technology is scaled down, a typical system on a chip (SoC) becomes denser. In scaled process technology, process variation becomes greater and increasingly affects the SoC circuits. Moreover, the process variation strongly affects network-on-chips (NoCs) that have a synchronous network across the chip. Therefore, its network frequency is degraded. We propose a process-variation-adaptive NoC with a variation-adaptive variable-cycle router (VAVCR). The proposed VAVCR can configure its cycle latency adaptively on a processor core basis, corresponding to the process variation. It can increase the network frequency, which is limited by the process variation in a conventional router. Furthermore, we propose a variable-cycle pipeline adaptive routing (VCPAR) method with VAVCR; the proposed VCPAR can reduce packet latency and has tolerance to network congestion. The total execution time reduction of the proposed VAVCR with VCPAR is 15.7%, on average, for five task graphs.
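The trade-off behind a variable-cycle router can be sketched numerically: with a single-cycle router, the slowest router sets the network clock, whereas letting slow routers take extra cycles allows a shorter clock period. The per-router delays and the chosen clock period below are invented for illustration, and the model ignores everything else in the VAVCR design.

```python
import math


def router_cycles(delays_ns, clock_period_ns):
    """Cycles each router needs so that its critical-path delay fits the clock."""
    return [math.ceil(d / clock_period_ns) for d in delays_ns]


if __name__ == "__main__":
    # Hypothetical per-router critical-path delays under process variation (ns).
    delays = [1.0, 1.1, 1.8]

    # Conventional: every router is single-cycle, so the slowest one sets the clock.
    conventional_period = max(delays)
    print(conventional_period, router_cycles(delays, conventional_period))

    # Variation-adaptive: a shorter clock, with only the slow router taking two cycles.
    adaptive_period = 1.2
    print(adaptive_period, router_cycles(delays, adaptive_period))
```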
Shouyi YIN Yang HU Zhen ZHANG Leibo LIU Shaojun WEI
A hybrid wired/wireless on-chip network is a promising communication architecture for multi-/many-core SoCs. For application-specific SoC design, it is important to design a dedicated on-chip network architecture that matches the characteristics of the application. In this paper, we propose a heuristic wireless link allocation algorithm for creating a hybrid on-chip network architecture. The algorithm can eliminate performance bottlenecks by replacing multi-hop wired paths with high-bandwidth, single-hop, long-range wireless links. The simulation results show that the hybrid on-chip network designed by our algorithm significantly improves performance in terms of both communication delay and energy consumption.
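One simple way to picture wireless link allocation is a greedy rule that gives the limited wireless links to the flows that save the most volume-weighted hops. The sketch below assumes a 2D mesh with Manhattan-distance wired routes, one dedicated wireless link per selected flow, and a fixed link budget; it is a stand-in heuristic for illustration, not the algorithm proposed in the paper.

```python
def hops(src, dst):
    """Wired hop count between two mesh coordinates (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def allocate_wireless_links(flows, budget):
    """Pick up to `budget` flows whose wired paths cost the most weighted hops.

    flows maps (src, dst) -> traffic volume; src and dst are (x, y) tuples.
    """
    gains = {pair: vol * (hops(*pair) - 1)  # hops saved by a single-hop link
             for pair, vol in flows.items() if hops(*pair) > 1}
    ranked = sorted(gains, key=gains.get, reverse=True)
    return ranked[:budget]


def weighted_hops(flows, wireless):
    """Total volume-weighted hop count with the given wireless links in place."""
    return sum(vol * (1 if pair in wireless else hops(*pair))
               for pair, vol in flows.items())


if __name__ == "__main__":
    flows = {((0, 0), (3, 3)): 10, ((0, 0), (0, 1)): 100, ((1, 0), (3, 0)): 30}
    links = allocate_wireless_links(flows, budget=1)
    print(links, weighted_hops(flows, []), weighted_hops(flows, links))
```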
Chaochao FENG Zhonghai LU Axel JANTSCH Minxuan ZHANG Xianju YANG
In this paper, we propose three Deflection-Routing-based Multicast (DRM) schemes for a bufferless NoC. The DRM scheme without packet replication (DRM_noPR) sends a multicast packet through a non-deterministic path. The DRM schemes with adaptive packet replication (DRM_PR_src and DRM_PR_all) replicate multicast packets at the source or at intermediate nodes, according to the destination position and the state of the output ports, to reduce the average multicast latency. We also provide fault-tolerance support in these schemes through a reinforcement-learning-based method that reconfigures the routing table to tolerate permanently faulty links in the network. Simulation results show that, under three synthetic traffic patterns, the DRM_PR_all scheme achieves on average 41%, 43% and 37% less latency than the DRM_noPR scheme and 27%, 29% and 25% less latency than the DRM_PR_src scheme, respectively. In addition, all three fault-tolerant DRM schemes show acceptable performance degradation at various link fault rates without any packet loss.
This paper proposes an all-digital process variability monitor based on a shared structure of a buffer ring and a ring oscillator. The proposed circuit monitors the PMOS and NMOS process variabilities independently, using the count of a single pulse that propagates around the ring in buffer ring mode and the oscillation period in ring oscillator mode. With this shared-ring structure, we reduce the occupied area by about 40% compared with the conventional circuit, without loss of process variability monitoring properties. The proposed shared-ring circuit has been fabricated in a 65 nm CMOS process, and measurement results from two different wafer lots show the feasibility of the proposed process variability monitoring scheme.
Yoji BANDO Satoshi TAKAYA Toru OHKAWA Toshiharu TAKARAMOTO Toshio YAMADA Masaaki SOUDA Shigetaka KUMASHIRO Tohru MOGAMI Makoto NAGATA
In-place AC measurements of the signal gain and substrate sensitivity of the differential pair transistors of an analog amplifier are combined with DC characterization of the threshold voltage (Vth) of the same transistors. An on-chip continuous-time waveform monitoring technique enables in-place matrix measurements of differential pair transistors with a variety of channel sizes and geometries, allowing wide experimental coverage of the dependency of substrate noise response on transistor-level physical layout. A prototype test structure uses a 90-nm CMOS technology and demonstrates the geometry-dependent variation of the substrate sensitivity of transistors in operation.
Nguyen Ngoc MAI KHANH Masahiro SASAKI Kunihiro ASADA
This paper presents a 65-nm CMOS 8-antenna array transmitter operating in the 117–130-GHz range for short-range and portable millimeter-wave (mm-wave) active imaging applications. Each antenna element is a new on-chip antenna located on the top metal. Using an on-chip transformer, the pulse output of each resistor-less mm-wave pulse generator (PG) is sent to its integrated antenna. To adjust pulse delays for pulse beam-forming, a 7-bit digitally programmable delay circuit (DPDC) is added to each PG. Moreover, in order to dynamically adjust pulse delays among the eight SW's outputs, we implemented an on-chip jitter and relative skew measuring circuit with a 20-bit digital output. It yields the cumulative distribution (CDF) and probability density (PDF) functions from which the DPDCs' input codes are decided so as to align the eight antennas' output pulses. Two measured radiation peaks after relative skew alignment are obtained at (θ; φ) angles of (-56; 0) and (+57; 0). Measurement results show that the beam-forming angles of the fully integrated antenna array can be adjusted by digital input codes and by the on-chip skew adjustment circuit for active imaging applications.
Tsuyoshi IWAGAKI Eiri TAKEDA Mineo KANEKO
This paper proposes a test scheduling method for stuck-at faults in a CHAIN interconnect, which is an asynchronous on-chip interconnect architecture, with scan ability. Special data transfer which is permitted only during test, is exploited to realize a more flexible test schedule than that of a conventional approach. Integer linear programming (ILP) models considering such special data transfer are developed according to the types of modules under test in a CHAIN interconnect. The obtained models are processed by using an ILP solver. This framework can not only obtain optimal test schedules but also easily introduce additional constraints such as a test power budget. Experimental results using benchmark circuits show that the proposed method can reduce test application time compared to that achieved by the conventional method.