Nobutaro SHIBATA Mayumi WATANABE Takako ISHIHARA
Multiport SRAMs are frequently installed in network and/or telecommunication VLSIs to implement smart functions. This paper presents a high-speed, low-power dual-port (i.e., 1W+1R two-port) SRAM macro customized for serial access operations. To reduce the power wasted by subthreshold leakage currents, the supply voltage for the 10T memory cells is lowered to 1 V, and a power switch is provided for every 64 word drivers. The switch is activated by look-ahead decoder-segment activation logic, so there is no penalty when selecting a wordline. The data I/O circuitry, with a new column-based configuration, makes it possible to hide the bitline precharge operation behind the sensing operation of the preceding read cycle; that is, we have reduced the read latency by half a clock cycle, resulting in a pure two-stage pipeline. The SRAM macro, installed in a 4K-entry × 33-bit FIFO memory fabricated with a 0.3-µm fully-depleted-SOI CMOS process, achieved 500-MHz operation under typical conditions (2- and 1-V power supplies, 25°C). The power consumption in standby was less than 1.0 mW, and that at a practical operating frequency of 400 MHz was in the range of 47-57 mW, depending on the bit-stream data pattern.
Xiangyu ZHANG Yangdong DENG Shuai MU
General-purpose computing on GPUs (GPGPU) has become a popular computing model for high-performance, data-intensive applications. Accordingly, there is a strong need for highly efficient data structures to ease the development of GPGPU applications. In this work, we propose an efficient concurrent queue data structure for GPU computing. The GPU-based, provably correct, lock-free FIFO queue allows a massive number of concurrent producers and consumers. Warp-centric enqueue and dequeue procedures are introduced to better match the underlying Single-Instruction, Multiple-Thread execution model of modern GPUs. The queue outperforms the best previous GPU queues by up to 40-fold. The correctness of the proposed queue operations is formally validated against the linearizability criterion.
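The abstract does not reproduce the algorithm itself. As a rough CPU-side illustration of how a lock-free FIFO can admit many concurrent producers and consumers, the following C++ sketch uses a ticket-claimed ring buffer with per-slot sequence numbers (a Vyukov-style bounded MPMC queue, not the paper's warp-centric GPU design):

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <memory>

// Bounded lock-free MPMC FIFO. Producers and consumers claim positions with
// CAS on tail_/head_; per-slot sequence numbers tell whether a slot is free,
// filled, or still being written, so no locks are needed.
template <typename T>
class LockFreeFifo {
public:
  explicit LockFreeFifo(size_t capacity)            // capacity: power of two
      : buf_(new Cell[capacity]), mask_(capacity - 1) {
    for (size_t i = 0; i <= mask_; ++i)
      buf_[i].seq.store(i, std::memory_order_relaxed);
  }

  bool enqueue(const T& v) {
    size_t pos = tail_.load(std::memory_order_relaxed);
    for (;;) {
      Cell& c = buf_[pos & mask_];
      intptr_t dif = (intptr_t)c.seq.load(std::memory_order_acquire) - (intptr_t)pos;
      if (dif == 0) {                               // slot free: try to claim it
        if (tail_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
          c.value = v;
          c.seq.store(pos + 1, std::memory_order_release);  // publish to consumers
          return true;
        }
      } else if (dif < 0) {
        return false;                               // queue is full
      } else {
        pos = tail_.load(std::memory_order_relaxed);
      }
    }
  }

  bool dequeue(T& out) {
    size_t pos = head_.load(std::memory_order_relaxed);
    for (;;) {
      Cell& c = buf_[pos & mask_];
      intptr_t dif = (intptr_t)c.seq.load(std::memory_order_acquire) - (intptr_t)(pos + 1);
      if (dif == 0) {                               // slot filled: try to claim it
        if (head_.compare_exchange_weak(pos, pos + 1, std::memory_order_relaxed)) {
          out = c.value;
          c.seq.store(pos + mask_ + 1, std::memory_order_release);  // recycle slot
          return true;
        }
      } else if (dif < 0) {
        return false;                               // queue is empty
      } else {
        pos = head_.load(std::memory_order_relaxed);
      }
    }
  }

private:
  struct Cell { std::atomic<size_t> seq{0}; T value{}; };
  std::unique_ptr<Cell[]> buf_;
  const size_t mask_;
  std::atomic<size_t> head_{0}, tail_{0};
};
```

A warp-centric variant would presumably let a whole warp claim a contiguous batch of slots with a single atomic operation so that its 32 threads proceed in lockstep; the sketch above shows only the per-thread idea.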
Kyosuke SANO Yuki YAMANASHI Nobuyuki YOSHIKAWA
We have been developing a superconducting time-of-flight mass spectrometry (TOF-MS) system, which utilizes a superconductive strip ion detector (SSID) and a single-flux-quantum (SFQ) multi-stop time-to-digital converter (TDC). The SFQ multi-stop TDC can measure the time intervals between multiple input signals and directly convert them into binary data. In this study, we designed and implemented 24-bit SFQ multi-stop TDCs with a 3×24-bit FIFO buffer using the AIST Nb standard process (STP2); their time resolution and dynamic range are 100 ps and 1.6 ms, respectively. The timing jitter of the TDC was investigated by comparing two types of TDCs: one uses an on-chip SFQ clock generator (CG) and the other uses a microwave oscillator at room temperature. We confirmed the correct operation of both TDCs and evaluated their timing jitter. The experimentally obtained timing jitter is about 40 ns and 700 ps for the TDCs with and without the on-chip SFQ CG, respectively, for a measured time interval of 50 µs, and it increases linearly with the measured time interval.
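The quoted dynamic range follows directly from the counter width and the time resolution: a 24-bit counter incremented every 100 ps wraps after

  2^24 × 100 ps = 16,777,216 × 100 ps ≈ 1.68 ms,

which matches the stated 1.6 ms.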
Hisashi IWAMOTO Yuji YANO Yasuto KURODA Koji YAMAMOTO Shingo ATA Kazunari INOUE
Network traffic keeps increasing due to the growing popularity of video streaming services. Routers and switches in wire-line networks must guarantee line rates as high as 20 Gbps as well as advanced quality of service (QoS). A hybrid SRAM and DRAM architecture was previously presented with the benefits of high speed and high density, but it requires complex memory management; as a result, it has hardly supported large numbers of queues, which are an effective approach to satisfying QoS requirements. This paper proposes an intelligent memory management unit (MMU) based on the hybrid architecture, in which over 16k queues are integrated. Evaluated on a system board, the MMU achieves zero packet loss under seamless, deterministic traffic with packet lengths from 64 Byte to 1.5 kByte. A notable feature of this paper's architecture is that it eliminates the need for premium memories; only low-cost commodity SRAMs and DRAMs are used. The intelligent MMU employs the head buffer architecture, which is suitable for supporting a large number of FIFO queues. An experimental board based on this architecture is embedded into a router system to evaluate the performance. Using 16k queues at 20 Gbps, zero packet loss is confirmed with 64-Byte to 1,500-Byte packet lengths.
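The abstract does not detail the head buffer mechanism. As a conceptual sketch (names and structure are ours, not the paper's design), the idea behind head-buffer queueing in hybrid memories is to keep only the first packets of each FIFO in fast SRAM so dequeues run at line rate, while the bulk of each queue sits in dense, low-cost DRAM:

```cpp
#include <cstdint>
#include <deque>
#include <queue>
#include <vector>

struct Packet { std::vector<uint8_t> bytes; };

// Behavioral model of one of the many FIFO queues managed by such an MMU:
// a small SRAM-resident "head" serves dequeues, a DRAM-resident "body"
// absorbs the backlog, and a refill path moves packets from body to head.
class HeadBufferQueue {
public:
  explicit HeadBufferQueue(size_t head_slots) : head_slots_(head_slots) {}

  void enqueue(Packet p) {
    if (head_.size() < head_slots_ && body_.empty())
      head_.push_back(std::move(p));     // fast path: straight into SRAM head
    else
      body_.push(std::move(p));          // overflow: queue body in DRAM
  }

  bool dequeue(Packet& out) {
    if (head_.empty()) return false;     // line-rate dequeue served from SRAM only
    out = std::move(head_.front());
    head_.pop_front();
    refill();
    return true;
  }

private:
  void refill() {                        // background DRAM-to-SRAM transfer
    while (head_.size() < head_slots_ && !body_.empty()) {
      head_.push_back(std::move(body_.front()));
      body_.pop();
    }
  }
  size_t head_slots_;
  std::deque<Packet> head_;              // models the small per-queue SRAM head
  std::queue<Packet> body_;              // models the DRAM-resident body
};
```

With this split, only the small heads need fast memory, which is why the scheme scales to 16k queues on commodity parts.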
Changwoo MIN Hyung Kook JUN Won Tae KIM Young Ik EOM
A concurrent FIFO queue is a widely used fundamental data structure for parallelizing software. In this letter, we introduce a novel concurrent FIFO queue algorithm for multicore architectures. We achieve better scalability by reducing contention among concurrent threads, and we improve performance by optimizing cache-line usage. Experimental results on a server with eight cores show that our algorithm outperforms state-of-the-art algorithms by a factor of two.
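The letter's algorithm is not reproduced here; the following single-producer/single-consumer C++ sketch only illustrates the two generic techniques the abstract names, assuming a 64-byte cache line: the indices are padded onto separate cache lines to avoid false sharing, and each thread keeps a stale local copy of the opposite index to reduce coherence traffic.

```cpp
#include <atomic>
#include <cstddef>

// SPSC ring buffer; capacity N must be a power of two.
template <typename T, size_t N>
class SpscQueue {
  static_assert((N & (N - 1)) == 0, "N must be a power of two");
public:
  bool push(const T& v) {                  // called by the producer only
    size_t t = tail_.load(std::memory_order_relaxed);
    if (t - head_cache_ == N) {            // looks full: refresh cached head
      head_cache_ = head_.load(std::memory_order_acquire);
      if (t - head_cache_ == N) return false;  // really full
    }
    buf_[t & (N - 1)] = v;
    tail_.store(t + 1, std::memory_order_release);
    return true;
  }
  bool pop(T& out) {                       // called by the consumer only
    size_t h = head_.load(std::memory_order_relaxed);
    if (h == tail_cache_) {                // looks empty: refresh cached tail
      tail_cache_ = tail_.load(std::memory_order_acquire);
      if (h == tail_cache_) return false;  // really empty
    }
    out = buf_[h & (N - 1)];
    head_.store(h + 1, std::memory_order_release);
    return true;
  }
private:
  alignas(64) std::atomic<size_t> head_{0};  // consumer index, own cache line
  alignas(64) std::atomic<size_t> tail_{0};  // producer index, own cache line
  alignas(64) size_t head_cache_ = 0;        // producer's stale copy of head_
  alignas(64) size_t tail_cache_ = 0;        // consumer's stale copy of tail_
  T buf_[N];
};
```

The cached copies mean each end touches the other's cache line only when the queue appears full or empty, rather than on every operation.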
The on-chip interconnection network (OCIN) is an integrated solution for system-on-chip (SoC) designs. The buffer architecture and size, however, dominate the performance of OCINs and affect the design of routers. This work analyzes different buffer architectures and uses a data-link two-level FIFO (first-in first-out) buffer architecture to implement high-performance routers. The concepts of shared buffers and multiple accesses for buffers are developed using the two-level FIFO buffer architecture. The proposed architecture increases the utilization of the storage elements via the centralized buffer organization, and it reduces the area and power consumption of routers while achieving the same performance as other buffer architectures. According to a cycle-accurate simulator, the proposed data-link two-level FIFO buffer realizes performance similar to that of conventional virtual channels while using 25% of the buffers. Consequently, in UMC 65-nm CMOS technology, the two-level FIFO buffer achieves about a 22% power reduction at performance similar to that of conventional virtual channels.
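As a rough behavioral model (ours, not the paper's RTL) of a two-level buffer with a centrally organized, shared second level: each input channel keeps a small private first-level FIFO and spills into the shared pool only when that FIFO is full, so the shared storage is multiplexed across channels instead of being statically partitioned.

```cpp
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

using Flit = uint64_t;

class TwoLevelBuffer {
public:
  TwoLevelBuffer(size_t channels, size_t l1_depth, size_t shared_depth)
      : l1_(channels), l2_(channels), l1_depth_(l1_depth),
        shared_depth_(shared_depth) {}

  bool write(size_t ch, Flit f) {
    if (l1_[ch].size() < l1_depth_) { l1_[ch].push(f); return true; }
    if (shared_used_ < shared_depth_) {   // spill to the centralized buffer
      l2_[ch].push(f); ++shared_used_; return true;
    }
    return false;                         // backpressure the upstream router
  }

  bool read(size_t ch, Flit& out) {
    if (l1_[ch].empty()) return false;
    out = l1_[ch].front(); l1_[ch].pop();
    if (!l2_[ch].empty()) {               // promote a spilled flit into level 1
      l1_[ch].push(l2_[ch].front()); l2_[ch].pop(); --shared_used_;
    }
    return true;
  }

private:
  std::vector<std::queue<Flit>> l1_;      // private first-level FIFOs
  std::vector<std::queue<Flit>> l2_;      // per-channel order within the shared pool
  size_t l1_depth_, shared_depth_;
  size_t shared_used_ = 0;
};
```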
Kazunori SHIMIZU Tatsuyuki ISHIKAWA Nozomu TOGAWA Takeshi IKENAGA Satoshi GOTO
In this paper, we propose a power-efficient LDPC decoder architecture based on an accelerated message-passing schedule. The proposed decoder architecture is characterized as follows: (i) partitioning the pipelined operation so that intermediate messages are not read and written simultaneously enables the accelerated message-passing schedule to be implemented with single-port SRAMs; (ii) FIFO-based buffering reduces the number of SRAM banks and words of the LDPC decoder based on the accelerated message-passing schedule. The proposed LDPC decoder keeps a single message for each non-zero entry of the parity-check matrix, as in the classical schedule, while achieving the accelerated message-passing schedule. Implementation results in 0.18-µm CMOS technology show that the proposed decoder architecture reduces the area of the LDPC decoder by 43% and the power dissipation by 29% compared to the conventional architecture based on the accelerated message-passing schedule.
Jeong-Gun LEE Suk-Jin KIM Jeong-A LEE Kiseon KIM
This paper presents a new asynchronous FIFO design that reduces forward latency in a linear structure. The operation mode of each cell can be reconfigured dynamically to either of two schemes, wave pipelining or handshaking, according to the data flow in the FIFO. Applying wave pipelining to the conventional self-timed FIFO reduces the overhead of handshaking as well as of latching control in each stage. Initial pre-layout simulations indicate about a twofold improvement in latency over a state-of-the-art asynchronous FIFO, while retaining its throughput.
Recently, the Guaranteed Frame Rate (GFR) service was proposed as a new ATM service category to support non-realtime data applications and to provide a minimum rate guarantee. To keep GFR as simple as possible while overcoming the defects of FIFO-based mechanisms, we propose a FIFO-based algorithm that extends DFBA to improve fairness and provide the minimum rate guarantee over a wider range of Minimum Cell Rates (MCRs). The key idea is to control the number of CLP=1 cells from connections occupying more buffer space than their fair share, even when the queue length is below the Low Buffer Occupancy (LBO) threshold.
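A hypothetical sketch of the kind of per-cell acceptance test the abstract describes (parameter names are ours, not the paper's, and the base DFBA policy is simplified): like DFBA, tagged (CLP=1) cells are dropped once the shared queue exceeds the LBO threshold, and the extension additionally polices CLP=1 cells of connections already holding more than their fair share even while the queue is below LBO.

```cpp
#include <cstddef>

// Per-VC state tracked by the buffer manager (simplified).
struct VcState { size_t buffered_cells; };

// Decide whether an arriving cell is admitted to the shared FIFO buffer.
bool accept_cell(bool clp1, size_t queue_len, size_t lbo,
                 const VcState& vc, size_t fair_share) {
  if (!clp1) return true;                // CLP=0 cells follow the base policy
  if (queue_len >= lbo) return false;    // DFBA-style rule: drop tagged cells
  if (vc.buffered_cells > fair_share)    // extension: police heavy VCs early
    return false;
  return true;
}
```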
Cinzia BERNARDESCHI Nicoletta De FRANCESCO Gigliola VAGLINI
In this work, we give a true concurrency semantics for FIFO-nets, which are Petri nets in which places behave as queues, tokens take values in a finite alphabet, and the firing of a transition depends on sequences over the alphabet. We introduce fn-processes to represent the concurrent behavior of a FIFO-net N during a sequence of transition firings. Fn-processes are modeled by a mapping from a simple FIFO-net without queue sharing and cycles, called a FIFO-occurrence net, to N. Moreover, the relation among the firings expressed by the FIFO-occurrence net is enriched by an ordering relation among the elements of the FIFO-occurrence net representing values entered into the same queue of N. We give a way to build fn-processes step by step in correspondence with a sequence of transition firings, and the fn-processes built operationally are exactly those defined abstractly. The FIFO-occurrence nets of fn-processes have some interesting properties; for example, such nets are always discrete, and consequently there is at least one transition sequence corresponding to each fn-process.
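The abstract fixes no notation; one common way to formalize the nets it describes (symbols ours, not the paper's) is N = (P, T, F) with arc labeling F : (P×T) ∪ (T×P) → A*, where A is the finite alphabet and each place holds a queue of values from A. A transition t is then enabled iff, for every input place p, the word F(p,t) is a prefix of the queue held by p; firing t removes those prefixes and appends F(t,p′) to the queue of each output place p′.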
Michiko INOUE Toshimitsu MASUZAWA Nobuki TOKURA
We consider linearizable implementations of shared FIFO queues and general deterministic objects on a distributed message-passing system which provides a real-time timer. The efficiency of an implementation is measured by the worst-case response time res_time(op) for each operation op of the implemented objects. We show the following results under the assumption that all message delays are in the range [d-u, d] for some constants d and u (0 ≤ u ≤ d). We first present an implementation of deterministic objects with res_time(op_a) = u for any ack-type operation op_a and res_time(op_v) = 2d for any val-type operation op_v, where an ack-type operation is an operation which always returns a unique response and a val-type operation is an operation which is not ack-type. We also consider an implementation of FIFO queues, which have two kinds of operations, enq(v) and deq. We show that, for any implementation of FIFO queues, (1) res_time(enq(v)) ≥ u(n-1)/n holds for some v, where n is the number of processes, and (2) res_time(deq) ≥ d + u/2 holds in the case of u ≤ (2/3)d.