Shuping ZHANG Jinjia ZHOU Dajiang ZHOU Shinji KIMURA Satoshi GOTO
Motion estimation (ME) is a key encoding component of almost all modern video coding standards. ME contributes significantly to video coding efficiency, but it also consumes the most power of any component in a video encoder. In this paper, an ME processor with a 3D stacked memory architecture is proposed to reduce memory and core power consumption. First, a memory die is designed and stacked with the ME die. By adding face-to-face (F2F) pads and through-silicon-via (TSV) definitions, 2D electronic design automation (EDA) tools can be extended to support the proposed 3D stacking architecture. Moreover, a special memory controller is applied to control data transmission and timing between the memory die and the ME processor die. Finally, a 3D physical design is completed for the entire system. This design includes TSV/F2F placement, floor plan optimization, and power network generation. Compared to 2D technology, the number of input/output (IO) pins is reduced by 77%. After optimizing the floor plan of the processor die and memory die, the routing wire lengths are reduced by 13.4% and 50%, respectively. The stacked static random access memory contributes the most to the power reduction in this work. The simulation results show that the design can support real-time 720p @ 60fps encoding at 8MHz using less than 65mW of power, which is much better than state-of-the-art ME processors.
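To put the 720p @ 60fps at 8MHz figure in perspective, the following minimal sketch estimates the clock-cycle budget per block. It assumes 16x16-pixel macroblocks, which the abstract does not state; the function names are hypothetical and used only for illustration.

```python
def macroblocks_per_second(width, height, fps, block=16):
    """Number of block x block macroblocks to process per second.

    The 16x16 block size is an assumption, not taken from the abstract.
    """
    blocks_per_frame = (width // block) * (height // block)
    return blocks_per_frame * fps


def cycles_per_macroblock(clock_hz, width, height, fps, block=16):
    """Average clock cycles available for each macroblock."""
    return clock_hz / macroblocks_per_second(width, height, fps, block)


if __name__ == "__main__":
    rate = macroblocks_per_second(1280, 720, 60)
    budget = cycles_per_macroblock(8_000_000, 1280, 720, 60)
    print(f"{rate} macroblocks/s, about {budget:.1f} cycles per macroblock")
```

Under this assumption the processor has only a few tens of cycles per macroblock, which suggests why wide, low-latency access to the stacked memory matters for the design.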
Naruki KURATA Ryota SHIOYA Masahiro GOSHIMA Shuichi SAKAI
To eliminate CAMs from the load/store queues, several techniques have been proposed that detect memory access order violations with hash filters composed of RAMs. This paper proposes a technique with parallel counting Bloom filters (PCBF). A Bloom filter has an extremely low false positive rate owing to its multiple hash functions. Although some existing studies claim to use Bloom filters, none of them mention multiple hash functions. This paper also addresses the problem arising from the variety of access sizes of load/store instructions. The evaluation results show that our technique, with only 2720-bit Bloom filters, achieves a relative IPC of 99.0% while the area and power consumption are greatly reduced to 14.3% and 22.0%, respectively, of those of a conventional model with CAMs. The filter is much smaller than typical branch predictors.
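The role of multiple hash functions can be illustrated with a generic counting Bloom filter. This is a minimal software sketch, not the paper's PCBF hardware: the class name, the bank sizes, the use of one counter bank per hash function, and the salted BLAKE2 hashes are all assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Generic counting Bloom filter with one counter bank per hash function."""

    def __init__(self, num_banks=4, counters_per_bank=64):
        self.num_banks = num_banks
        self.counters_per_bank = counters_per_bank
        self.banks = [[0] * counters_per_bank for _ in range(num_banks)]

    def _indexes(self, address):
        # Salted hashes stand in for independent hardware hash functions.
        for bank in range(self.num_banks):
            data = f"{bank}:{address}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            yield bank, int.from_bytes(digest, "big") % self.counters_per_bank

    def insert(self, address):
        """Record an in-flight access to this address."""
        for bank, idx in self._indexes(address):
            self.banks[bank][idx] += 1

    def remove(self, address):
        """Forget a previously inserted address (counters make deletion possible)."""
        for bank, idx in self._indexes(address):
            if self.banks[bank][idx] > 0:
                self.banks[bank][idx] -= 1

    def may_contain(self, address):
        """False means definitely absent; True means possibly present."""
        return all(self.banks[bank][idx] > 0 for bank, idx in self._indexes(address))
```

A query reports a possible match only when every bank has a nonzero counter, so a false positive requires a collision in all banks at once; this is why using several hash functions lowers the false positive rate.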
Koichiro ISHIBASHI Nobuyuki SUGII Shiro KAMOHARA Kimiyoshi USAMI Hideharu AMANO Kazutoshi KOBAYASHI Cong-Kha PHAM
A 32-bit CPU that can operate for more than 15 years on a 220 mAh Li battery, or indefinitely with an indoor-light energy harvester, is presented. The CPU was fabricated using 65nm SOTB (Silicon on Thin Buried oxide) CMOS technology, in which the gate length is 60nm and the BOX layer thickness is 10nm. The threshold voltage was designed to be as low as 0.19V so that the CPU operates in the above-threshold region even at supply voltages as low as 0.22V. A large reverse body bias of up to -2.5V can be applied to the bodies of SOTB devices without increasing the gate-induced drain leakage current, which reduces the sleep current of the CPU. The CPU operated at 14MHz and 0.35V with the lowest energy of 13.4 pJ/cycle. A sleep current of 0.14µA at 0.35V was obtained with a body bias voltage of -2.5V. These characteristics are suitable for new applications such as energy harvesting sensor network systems and long-lasting wearable computers.
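The battery-life claim can be checked with simple arithmetic. The sketch below is an illustration only: it assumes an ideal battery (no self-discharge or conversion losses), and the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def average_current_budget_uA(capacity_mAh, years):
    """Average current (in µA) that drains the given capacity in the given time.

    Assumes an ideal battery with no self-discharge or conversion loss.
    """
    return capacity_mAh * 1000.0 / (years * HOURS_PER_YEAR)


def active_power_uW(energy_pJ_per_cycle, freq_MHz):
    """Active power in µW: pJ/cycle times Mcycles/s gives µW."""
    return energy_pJ_per_cycle * freq_MHz


if __name__ == "__main__":
    budget = average_current_budget_uA(220, 15)
    power = active_power_uW(13.4, 14)
    print(f"average current budget: {budget:.2f} uA")
    print(f"active power at 14 MHz: {power:.1f} uW")
```

Under these assumptions the 15-year budget is well under 2 µA on average, far below the active current implied by 13.4 pJ/cycle at 14MHz, so the reported 0.14µA sleep current is what makes such a lifetime plausible for a mostly sleeping system.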
Gong CHEN Yu ZHANG Qing DONG Ming-Yu LI Shigetoshi NAKATAKE
As semiconductor manufacturing processes scale down, the leakage current of CMOS circuits is becoming a dominant contributor to power dissipation. This paper provides an efficient leakage current reduction (LCR) technique for low-power and low-frequency circuit designs in terms of design rules and layout parameters related to layout-dependent effects. We address the LCR technique for both analog and digital circuits, and present a design case in which the LCR technique is applied to a successive-approximation-register (SAR) analog-to-digital converter (ADC), which typically employs analog and digital transistors. In post-layout simulation results obtained with HSPICE, an SAR-ADC with the LCR technique achieves a total power consumption of 38.6 nW. Compared with the design without the LCR technique, we attain a reduction of about 30% in total energy.
Norihiro KAMAE Akira TSUCHIYA Hidetoshi ONODERA
A forward/reverse body bias generator (BBG) that operates over a wide supply range is proposed. Fine-grained body biasing (FGBB) is effective for reducing variability and increasing energy efficiency in digital LSIs. Since FGBB requires a number of BBGs to be implemented, a simple design is preferred. We propose a BBG that uses charge pumps for reverse body bias and operates over a wide supply range from 0.5 V to 1.2 V. The layout of the BBG was designed in a cell-based flow with an AES core and fabricated in a 65 nm CMOS process. The area of the AES core is 0.22 mm² and the area overhead of the BBG is 2.3%. A demonstration with the AES core shows successful operation at supply voltages from 0.5 V to 1.2 V, which enables a reduction in power dissipation of, for example, 17% at 400 MHz operation.
Teerachot SIRIBURANON Wei DENG Kenichi OKADA Akira MATSUZAWA
This paper presents a constant-current-controlled class-C VCO using a self-adjusting replica bias circuit. The proposed class-C VCO is better suited to real-life applications because it maintains a constant current, which makes its phase noise performance more robust to variation in the gate bias of the cross-coupled pair compared with a traditional approach, without an amplitude modulation issue. The proposed VCO is implemented in a 180 nm CMOS process. It achieves a tuning range of 4.8 to 4.9 GHz with a phase noise of -121 dBc/Hz at 1 MHz offset. The power consumption of the core oscillators is 4.8 mW and an FoM of -189 dBc/Hz is achieved.
In this correspondence, a generic method of constructing optimal p^2-ary low correlation zone (LCZ) sequence sets is proposed. First, p^2-ary column sequence sets are constructed; then p^2-ary LCZ sequence sets with parameters (p^n-1, p^m-1, (p^n-1)/(p^m-1), 1) are constructed by using the column sequences and an interleaving technique. The resultant p^2-ary LCZ sequence sets are optimal with respect to the Tang-Fan-Matsufuji bound.
Yu KASHIMA Takashi ISHIO Shogo ETSUDA Katsuro INOUE
To understand the behavior of a program, developers often need to read source code fragments in various modules. System-dependence-graph-based (SDG) program slicing is a good candidate for supporting the investigation of data-flow paths among modules, as an SDG is capable of showing the data dependence of focused program elements. However, this technique has two problems. First, constructing an SDG requires heavyweight analysis, so SDG is not suitable for daily use. Second, the results of SDG-based program slicing are difficult to visualize, as they contain many vertices. In this research, we propose variable data-flow graphs (VDFG) for use in program slicing techniques. In contrast to an SDG, a VDFG is created by lightweight analysis because several approximations are used. Furthermore, we propose using the fractal value to visualize a VDFG-based program slice in order to reduce the graph complexity for visualization purposes. We performed three experiments that demonstrate the accuracy of VDFG program slicing with the fractal value, the size of a visualized program slice, and the effectiveness of our tool for source code reading.
Chengsong WANG Xiaoguang MAO Yan LEI Peng ZHANG
In recent years, hybrid typestate analysis has been proposed to eliminate unnecessary monitoring instrumentation for runtime monitors at compile time. Nop-shadows Analysis (NSA) is one of these hybrid typestate analyses. Before generating residual monitors, NSA performs a data-flow analysis that is intra-procedural, flow-sensitive, and partially context-sensitive to improve runtime performance. Although NSA is precise, there are some cases in which it has little effect. In this paper, we propose three optimizations to further improve the precision of NSA. The first two optimizations try to filter interferential states of objects when determining whether a monitoring instrumentation is necessary. The third optimization refines the inter-procedural data-flow analysis induced by method invocations. We have integrated our optimizations into Clara and conducted extensive experiments on the DaCapo benchmark. The experimental results demonstrate that our first two optimizations can remove further unnecessary instrumentation after the original NSA in more than half of the cases, without significant overhead. In addition, all the instrumentation can be removed in two cases, which implies that the programs satisfy the typestate property and need no runtime monitoring. It comes as a surprise to us that the third optimization is effective in only 8.7% of the cases. Finally, we analyze the experimental results and discuss the reasons why our optimizations fail to further eliminate unnecessary instrumentation in some special situations.
Keishi TSUBAKI Tetsuya HIROSE Nobutaka KUROKI Masahiro NUMA
This paper proposes an ultra-low power fully on-chip CMOS relaxation oscillator (ROSC) for a real-time clock application. The proposed ROSC employs a circuit that compensates for comparator non-idealities caused by offset voltage and delay time. The ROSC can generate a stable 32-kHz oscillation clock frequency without increasing power dissipation by using a low reference voltage and employing a novel compensation architecture for comparators. Measurement results in a 0.18-µm CMOS process demonstrated that the circuit can generate a stable clock frequency of 32.55 kHz with low power dissipation of 472 nW at a 1.8-V power supply. The measured line regulation and temperature coefficient were 1.1%/V and 120 ppm/°C, respectively.
Jing WANG Satoshi NAGATA Lan CHEN Huiling JIANG
Coordinated multi-point (CoMP) transmission and reception is a promising technique for interference mitigation in cellular systems. The scheduling algorithm for CoMP has a significant impact on the network processing complexity and performance. Performing exhaustive search permits centralized scheduling and thus the optimal global solution; however, it incurs a high level of computational complexity and may be impractical or lead to high cost as well as network instability. In order to provide a more realistic scheduling method while balancing performance and complexity, we propose a low complexity centralized scheduling scheme that adaptively selects users for single-cell transmission or different CoMP scheme transmission to maximize the system weighted sum capacity. We evaluate the computational complexity and system-level simulation performance in this paper. Compared to the optimal scheduling method with exhaustive search, the proposed scheme has a much lower complexity level and achieves near optimal performance.
Solving the syndrome key equation is one of the important processes in the decoding of Reed-Solomon codes. This paper proposes a low power key equation solver (KES) architecture in which the power consumption is reduced by decreasing the required number of multiplications without degrading the decoding throughput and latency. The proposed method employs a smaller number of multipliers than a conventional low power KES architecture. The critical path in the proposed KES circuit is minimized so that operation at a high clock frequency is possible. A low power folded KES architecture is also proposed to further reduce the hardware complexity by executing folded operations in a pipelined manner with a slight increase in decoding latency.
An LIU Maoyin CHEN Donghua ZHOU
Robust crater recognition is a research focus in deep space exploration missions, and sparse representation methods can achieve desirable robustness and accuracy. To cope with the destruction and noise caused by complex topography and varied illumination in planetary images, a robust crater recognition approach is proposed based on dictionary learning with a low-rank error correction model in a sparse representation framework. In this approach, a compact and discriminative dictionary is learned from all the training images. A low-rank error correction term is introduced into the dictionary learning to deal with gross error and corruption. Experimental results on crater images show that the proposed method achieves competitive performance in both recognition accuracy and efficiency.
Juha PETÄJÄJÄRVI Heikki KARVONEN Konstantin MIKHAYLOV Aarno PÄRSSINEN Matti HÄMÄLÄINEN Jari IINATTI
This paper discusses the prospects of using a wake-up receiver (WUR) in wireless body area network (WBAN) applications with event-driven data transfers. First, we compare the energy efficiency of a WUR-based WBAN with that of an IEEE 802.15.6 compliant WBAN based on a duty-cycled medium access control protocol. Then, we review the architectures of state-of-the-art WURs and discuss their suitability for WBANs. The presented results clearly show that the architecture based on radio frequency envelope detection features the lowest power consumption at the cost of sensitivity. The other architectures are capable of providing better sensitivity, but consume more power. Finally, we propose a design modification that enables a WUR to receive control commands in addition to wake-up signals. The presented results reveal that this feature does not require complex modifications of the current architectures, yet improves energy efficiency and latency for transfers of small data blocks.
Katsuhiko UEDA Zuiko RIKUHASHI Kentaro HAYASHI Hiroomi HIKAWA
It is important to reduce the power consumption of complementary metal oxide semiconductor (CMOS) logic circuits, especially those used in mobile devices. A CMOS logic circuit consists of metal-oxide-semiconductor field-effect transistors (MOSFETs), which consume electrical power dynamically when they charge and discharge load capacitance that is connected to their output. Load capacitance mainly exists in wiring or buses, and transitions between logic 0 and logic 1 cause these charges and discharges. Many methods have been proposed to reduce these transitions. One novel method (called segmentation coding) has recently been proposed that reduces power consumption of CMOS buses carrying band-limited signals, such as audio data. It improves performance by employing dedicated encoders for the upper and lower bits of transmitted data, in which the transition characteristics of band-limited signals are utilized. However, it uses a conventional majority voting circuit in the encoder for lower bits, and the circuit uses many adders to count the number of 1s to calculate the Hamming distance between the transmitted data. This paper proposes segmentation coding with pseudo-majority voting. The proposed pseudo-majority voting circuit counts the number of 1s with fewer circuit resources than the conventional circuit by further utilizing the transition characteristics of band-limited signals. The effectiveness of the proposed method was demonstrated through computer simulations and experiments.
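The idea of reducing bus transitions with a majority vote on the Hamming distance can be illustrated with classic bus-invert coding. This sketch shows that well-known generic technique, not the segmentation coding or the pseudo-majority voting circuit proposed in the paper; the function names, the 8-bit width, and the all-zero initial bus state are assumptions.

```python
def hamming_distance(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def bus_invert_encode(words, width=8):
    """Encode words so that at most width // 2 data lines toggle per transfer.

    Returns a list of (bus_value, invert_flag) pairs. The bus is assumed to
    start at all zeros.
    """
    mask = (1 << width) - 1
    prev = 0
    encoded = []
    for word in words:
        word &= mask
        # Majority vote: invert when more than half of the lines would toggle.
        invert = 1 if hamming_distance(prev, word) > width // 2 else 0
        if invert:
            word ^= mask
        encoded.append((word, invert))
        prev = word
    return encoded


def bus_invert_decode(encoded, width=8):
    """Recover the original words from (bus_value, invert_flag) pairs."""
    mask = (1 << width) - 1
    return [value ^ mask if invert else value for value, invert in encoded]


def count_transitions(values):
    """Total bit toggles on a bus that starts at zero and carries values in order."""
    total, prev = 0, 0
    for value in values:
        total += hamming_distance(prev, value)
        prev = value
    return total
```

For the alternating sequence 0x00, 0xFF, 0x00, 0xFF, the raw bus toggles 24 times, whereas the encoded data lines never toggle and only the single invert line switches. The vote itself needs a count of 1s, which is the part that the abstract says costs many adders in a conventional circuit.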
Seong-Mun KIM Hyon-Young CHOI Youn-Hee HAN Sung-Gi MIN
In this paper, Proxy Mobile IPv6 (PMIPv6), which is a network-based mobility management protocol, is adapted to the OpenFlow architecture. Mobility-related signaling is generally performed by network entities on behalf of a mobile node, but in standard PMIPv6, the control and data packets are delivered and processed over the same network entities, which prevents the separation of the control and the data planes. In addition, IP tunneling inherent to PMIPv6 imposes excessive overhead for the network entities. In order to adapt PMIPv6 to the OpenFlow architecture, the mobility management function is separated from the PMIPv6 components, and components are reconstructed to take advantage of the offerings of the OpenFlow architecture. The components configure the flow table of the switches located in a path, which comprise the OpenFlow controller. Mobility-related signaling can then be performed at the dedicated secure channel, and all of the data packets can be sent normally in accordance with the flow table of the OpenFlow switches. Consequently, the proposed scheme eliminates IP tunneling when user traffic is forwarded and separates the data and the control planes. The performance analysis revealed that the proposed scheme can outperform PMIPv6 in terms of the signaling cost, packet delivery cost, and handover latency.
Wanbin REN Shengjun XUE Hongxu ZHI Guofu ZHAI
This paper presents the electrical contact behavior of Au-plated material under extremely low making and breaking velocities, measured with our newly designed test rig. The fundamental phenomena in the curves of contact voltage and contact force versus piezoactuator displacement were investigated under a load current of 1A and a velocity of 50 nm/s. From the repetitive experimental results, we found that the adhesion phenomena during the unloading process are closely correlated with the initial contact stage in the loading process. Furthermore, a mathematical model related to the variation of contact force during loading is built, and the physical mechanism of adhesion and the principal factors for gold-plated materials are discussed. Finally, the physical process of the molten bridge in the absence of mechanical contact is also analyzed in detail.
Akihiro SATOH Yutaka NAKAMURA Takeshi IKENAGA
A dictionary attack against SSH is a common security threat. Many methods rely on network traffic to detect SSH dictionary attacks because the connections of remote login, file transfer, and TCP/IP forwarding are visibly distinct from those of attacks. However, these methods incorrectly judge the connections of automated operation tasks as those of attacks due to their mutual similarities. In this paper, we propose a new approach to identify user authentication methods on SSH connections and to remove connections that employ non-keystroke based authentication. This approach is based on two perspectives: (1) an SSH dictionary attack targets a host that provides keystroke based authentication; and (2) automated tasks through SSH need to support non-keystroke based authentication. Keystroke based authentication relies on a character string that is input by a human; in contrast, non-keystroke based authentication relies on information other than a character string. We evaluated the effectiveness of our approach through experiments on real network traffic at the edges in four campus networks, and the experimental results showed that our approach provides high identification accuracy with only a few errors.
Ichiro TOYOSHIMA Shingo YAMAGUCHI Yuki MURAKAMI
A workflow net (WF-net for short) is a Petri net that represents a workflow. There are two important subclasses of WF-nets: extended free choice (EFC for short) and well-structured (WS for short). It is known that most actual workflows can be modeled as EFC WF-nets, and that acyclic WS is a subclass of acyclic EFC but has more analysis methods. A sound acyclic EFC WF-net may be transformed into an acyclic WS WF-net without changing the observable behavior of the net. Such a transformation is called refactoring. In this paper, we tackle a problem, named the acyclic EFC WF-net refactorizability problem, that decides whether a given sound acyclic EFC WF-net is refactorable to an acyclic WS WF-net. We give two sufficient conditions for the problem, and construct refactoring procedures based on the conditions. Furthermore, we apply the procedures to a sample workflow, and confirm the usefulness of the procedures for enhancing the readability and the analysis power of acyclic EFC WF-nets.
Stewart DENHOLM Hiroaki INOUE Takashi TAKENAKA Tobias BECKER Wayne LUK
Financial exchanges provide market data feeds to update their members about changes in the market. Feed messages are often used in time-critical automated trading applications, and two identical feeds (A and B feeds) are provided in order to reduce message loss. A key challenge is to support A/B line arbitration efficiently to compensate for missing packets, while offering flexibility for various operational modes such as prioritising for low latency or for high data reliability. This paper presents a reconfigurable acceleration approach for A/B arbitration operating at the network level, capable of supporting any messaging protocol. Two modes of operation are provided simultaneously: one prioritising low latency, and one prioritising high reliability with three dynamically configurable windowing methods. We also present a model for message feed processing latencies that is useful for evaluating scalability in future applications. We outline a new low latency, high throughput architecture and demonstrate a cycle-accurate testing framework to measure the actual latency of packets within the FPGA. We implement and compare the performance of the NASDAQ TotalView-ITCH, OPRA and ARCA market data feed protocols using a Xilinx Virtex-6 FPGA. For high reliability messages we achieve latencies of 42ns for TotalView-ITCH and 36.75ns for OPRA and ARCA. 6ns and 5.25ns are obtained for low latency messages. The most resource intensive protocol, TotalView-ITCH, is also implemented in a Xilinx Virtex-5 FPGA within a network interface card; it is used to validate our approach with real market data. We offer latencies 10 times lower than an FPGA-based commercial design and 4.1 times lower than the hardware-accelerated IBM PowerEN processor, with throughputs more than double the required 10Gbps line rate.
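The basic idea of A/B line arbitration can be sketched in software as follows. This is a simplified illustration of a low-latency style of arbitration, not the paper's FPGA design: the function name, the `(feed, sequence_number, payload)` tuple format, and the assumption that sequence numbers start at 1 are all hypothetical.

```python
def arbitrate_low_latency(arrivals, first_seq=1):
    """Forward the first copy of each new message and drop late duplicates.

    arrivals: iterable of (feed, seq, payload) tuples in arrival order, where
    feed is "A" or "B". A message is forwarded as soon as its sequence number
    is at or beyond the next expected one; anything older is discarded.
    """
    expected = first_seq
    forwarded = []
    for feed, seq, payload in arrivals:
        if seq >= expected:
            forwarded.append((feed, seq, payload))
            expected = seq + 1
        # Otherwise the other feed has already delivered this message.
    return forwarded


if __name__ == "__main__":
    # Feed A loses message 2; feed B fills the gap.
    arrivals = [
        ("A", 1, "add order"),
        ("B", 1, "add order"),
        ("B", 2, "cancel order"),
        ("A", 3, "trade"),
        ("B", 3, "trade"),
    ]
    for feed, seq, payload in arbitrate_low_latency(arrivals):
        print(seq, feed, payload)
```

Because this sketch never waits, a message that arrives after a higher sequence number has been forwarded is dropped. A high-reliability mode would instead buffer out-of-order messages within some window before giving up on a gap, which is the trade-off between latency and reliability that the abstract describes.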