Yoshiya KOMATSU Masanori HARIYAMA Michitaka KAMEYAMA
This paper presents a novel architecture of an asynchronous FPGA for handshake-component-based design. Handshake-component-based design is suitable for large-scale, complex asynchronous circuits because of its understandability. This paper proposes an area-efficient FPGA architecture that is suitable for handshake-component-based asynchronous circuits. Moreover, four-phase dual-rail encoding is employed to construct circuits robust to delay variation, because the data paths are programmable in an FPGA. An FPGA based on the proposed architecture is implemented in a 65 nm process. The evaluation results show that the proposed FPGA can implement handshake components efficiently.
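As a rough illustration of the four-phase dual-rail encoding mentioned above, the following Python sketch models one data transfer at the signal level. It is not the paper's circuit: the function names and the (true rail, false rail) ordering are assumptions chosen for the example. Each bit uses two wires, the all-zero pair acts as a spacer, and the receiver can detect data validity from the rails alone, which is why the encoding tolerates delay variation on the data paths.

```python
SPACER = (0, 0)  # both rails low: no data present


def encode_bit(bit):
    """Return the (true_rail, false_rail) codeword for one data bit."""
    return (1, 0) if bit else (0, 1)


def is_valid(word):
    """Completion detection: every rail pair has exactly one rail asserted."""
    return all(t ^ f for t, f in word)


def decode(word):
    """Recover the data bits from a valid dual-rail word."""
    if not is_valid(word):
        raise ValueError("word is not a valid dual-rail codeword")
    return [t for t, _ in word]


def four_phase_transfer(bits):
    """List the (phase, rails, ack) states of one four-phase handshake."""
    data = [encode_bit(b) for b in bits]
    idle = [SPACER] * len(bits)
    return [
        ("valid", data, 0),   # 1. sender drives a valid codeword
        ("valid", data, 1),   # 2. receiver detects completion, raises ack
        ("spacer", idle, 1),  # 3. sender returns all rails to the spacer
        ("spacer", idle, 0),  # 4. receiver lowers ack; next transfer may start
    ]


if __name__ == "__main__":
    for phase, rails, ack in four_phase_transfer([1, 0, 1]):
        print(phase, rails, "ack =", ack)
```

The return-to-zero steps (3 and 4) are what make the protocol four-phase; the receiver never needs a timing assumption, only the validity check.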
Masanori HARIYAMA Shigeo YAMADERA Michitaka KAMEYAMA
This paper presents a design method to minimize the energy consumption of both functional units (FUs) and the interconnection network between FUs. To reduce the complexity of the interconnection network, data transfers between FUs are classified according to the FU types of the operations in a data flow graph. The basic idea behind reducing the complexity of the interconnection network is that an interconnection resource can be shared among data transfers that have the same FU type at the source node and the same FU type at the destination node. Moreover, an efficient method based on a genetic algorithm is presented.
Hiromitsu KIMURA Takahiro HANYU Michitaka KAMEYAMA
A new logic-in-memory circuit is proposed for a fine-grain pipelined VLSI system. Dynamic-storage elements are distributed over a logic-circuit plane. A functional pass gate is a key component, where a linear summation and a threshold function are merged compactly using charge-storage and charge-coupling effects with a DRAM-cell-based circuit structure. The use of dynamic logic based on a pass-transistor network using functional pass gates makes it possible to realize any logic circuit compactly with small power dissipation. As a typical example, a 54-bit pipelined multiplier is implemented by using the proposed circuit technology. Its power dissipation and chip area are reduced to about 63 percent and 72 percent, respectively, in comparison with those of a corresponding binary CMOS implementation under 0.35-µm CMOS technology.
Hiromitsu KIMURA Takahiro HANYU Michitaka KAMEYAMA
This paper presents a multiple-valued logic-in-memory circuit with real-time programmability. The basic component, in which a dynamic storage function and a multiple-valued threshold function are merged, is implemented compactly by using charge storage and capacitive coupling with a DRAM-cell-based circuit structure under a 0.8-µm CMOS technology. A pass-transistor network using these basic components makes it possible to realize any multiple-valued-input binary-output logic circuit compactly. As a typical example, a fully parallel multiple-valued magnitude comparator is also implemented by using the proposed DRAM-cell-based pass-transistor network. Its execution time and power dissipation are reduced to about 11 percent and 29 percent, respectively, in comparison with those of a corresponding binary implementation. A prototype chip is also fabricated to confirm the basic operation of the proposed DRAM-cell-based logic-in-memory circuit.
Saneaki TAMAKI Michitaka KAMEYAMA
The design of high-speed digital circuits such as adders and multipliers is one of the most important issues in implementing high-performance VLSI systems. This paper proposes a new multiple-valued code assignment algorithm to implement locally computable combinational circuits for k-ary operations. By decomposing a given k-ary operation into unary operations, a code assignment algorithm for k-ary operations is developed. Partition theory, usually used in the design of sequential circuits, is effectively employed for optimal code assignment. Some examples are shown to demonstrate the usefulness of the proposed algorithm.
Masami NAKAJIMA Michitaka KAMEYAMA
To realize next-generation high-performance ULSI processors, it is very important to shorten the critical delay path, which is determined by a cascade chain of basic gates. In designing highly parallel digital operation circuits such as an adder and a multiplier, it is difficult to find the optimal code assignment in a non-linear digital system. On the other hand, the use of the linear concept in the digital system seems to be very attractive because analytical methods can be utilized. To meet the requirement, we propose a new design method of highly parallel linear digital circuits for unary operations using the concept of a cycle and a tree. In the linear digital circuit design, the analytical method can be developed using a representation matrix, so that the search procedure for optimal locally computable circuits becomes very simple. The evaluations demonstrate the usefulness of the circuit design algorithm.
Takahiro HANYU Yoshikazu YABE Michitaka KAMEYAMA
Toward the age of ultra-high-density digital ULSI systems, the development of new integrated circuits suitable for an ultimately fine geometry feature size will be an important issue. Resonant-tunneling (RT) diodes and transistors based on quantum effects in deep submicron geometry are key devices of this kind for next-generation ULSI systems. From this point of view, there has been considerable interest in RT diodes and transistors as functional devices for circuit applications. In particular, it has been recognized that RT functional devices with multiple peaks in the current-voltage (I-V) characteristic are inherently suitable for implementing multiple-valued circuits such as a multiple-state memory cell. However, very few other types of multiple-valued logic circuits using RT devices have been reported so far. In this paper, a new multiple-valued programmable logic array (MVPLA) based on RT devices is proposed for next-generation ULSI-oriented hardware implementation. The proposed MVPLA consists of 3 basic building blocks: a universal literal circuit, an AND circuit and a linear summation circuit. The universal literal circuit can be directly designed by combining RT diodes with one peak in the I-V characteristic, which is programmable by adjusting the width of the quantum well in each RT device. The other basic building blocks can also be designed easily using wired logic or current-mode wired summation. As a result, a high-density RT-diode-based MVPLA superior to the corresponding binary implementation can be realized. The device-model-based design method proposed in this paper is discussed using static characteristics of typical RT diode models.
Hasitha Muthumala WAIDYASOORIYA Daisuke OKUMURA Masanori HARIYAMA Michitaka KAMEYAMA
Heterogeneous multi-core processors are attractive for media processing applications because they can draw on the strengths of different cores to improve the overall performance. However, data transfer bottlenecks and limitations in task allocation due to accelerator-incompatible operations prevent us from gaining the full potential of heterogeneous multi-core processors. This paper presents a task allocation method based on algorithm transformation to increase the freedom of task allocation. We use approximation methods such as CORDIC algorithms to map the accelerator-incompatible operations to accelerator cores. According to the experimental results using HOG descriptor computation, the proposed task allocation method reduces the data transfer time by more than 82% and the total processing time by more than 79% compared to the conventional task allocation method.
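To make the role of CORDIC concrete, the sketch below shows the standard vectoring-mode CORDIC iteration in Python, which approximates a vector's magnitude and angle using only additions, comparisons and power-of-two scalings (shifts in hardware). The abstract does not state which operations are transformed; applying it to gradient magnitude and orientation, as needed in HOG computation, is an assumption made for this example, as are the function name and the iteration count.

```python
import math

ITERATIONS = 16  # assumed precision; each iteration adds roughly one bit
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(ITERATIONS)]
# Each micro-rotation stretches the vector by sqrt(1 + 2^(-2i)).
GAIN = math.prod(math.sqrt(1.0 + 4.0 ** -i) for i in range(ITERATIONS))


def cordic_vectoring(x, y):
    """Approximate (hypot(x, y), atan2(y, x)) with shift-and-add rotations."""
    z = 0.0
    # Pre-rotate by +/-90 degrees so the vector lies in the right half-plane.
    if x < 0:
        if y >= 0:
            x, y, z = y, -x, math.pi / 2
        else:
            x, y, z = -y, x, -math.pi / 2
    for i in range(ITERATIONS):
        shift = 2.0 ** -i  # a right shift by i bits in fixed-point hardware
        if y >= 0:
            # Rotate clockwise to drive y toward zero, accumulating the angle.
            x, y, z = x + y * shift, y - x * shift, z + ATAN_TABLE[i]
        else:
            x, y, z = x - y * shift, y + x * shift, z - ATAN_TABLE[i]
    # In hardware the gain correction is a single constant multiplication.
    return x / GAIN, z


if __name__ == "__main__":
    magnitude, angle = cordic_vectoring(3.0, 4.0)
    print(magnitude, angle)
```

Floating-point values are used here for readability; an accelerator implementation would typically use fixed-point words, so that the scalings become bit shifts and no multiplier or divider is needed inside the loop.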
Weisheng CHONG Masanori HARIYAMA Michitaka KAMEYAMA
A low-power field-programmable VLSI (FPVLSI) is presented to overcome the problem of large power consumption in field-programmable gate arrays (FPGAs). To reduce power consumption in routing networks, the FPVLSI consists of cells that are based on a bit-serial pipeline architecture which reduces routing block complexity. Moreover, a level-converter-less multiple-supply-voltage scheme using dynamic circuits is proposed, where the cells in non-critical paths use a low supply voltage for low power under a speed constraint. The FPVLSI is evaluated based on a 0.18-µm CMOS design rule. The power consumption of the FPVLSI using multiple supply voltages is reduced to 17% or less compared to that of the static-circuit-based FPVLSI using multiple supply voltages.
Takeshi KASUGA Michitaka KAMEYAMA Tatsuo HIGUCHI
Robust-fault tolerance is the property that a computational result remains nearly equal to the correct one when faults occur in a digital system. There are many cases where the safety of digital control systems can be maintained if this property is satisfied. In this paper, robust-fault-tolerant three-valued arithmetic modules such as an adder and a multiplier are proposed. Positive and negative integers are represented by the number of 1's and -1's, respectively. The design concept of the arithmetic modules is that a fault has only a linearly additive effect of small value on the final result. Each arithmetic module consists of identical submodules linearly connected, so that a multi-stage structure is formed to generate the final output from the last submodule. Between the input and output digits in the submodule, a simple functional relation is satisfied with respect to the number of 1's and -1's. Moreover, the output digit value depends on a very small portion of the submodules including the input digits. These properties keep the effect on the final result linearly additive and small even if multiple faults occur at the inputs and outputs of any gates in the submodules. Not only the direct three-valued representation but also the use of three-valued logic circuits is inherently suitable for efficient implementation of the arithmetic VLSI system. The robust-fault-tolerant three-valued arithmetic modules are evaluated with regard to chip size and speed using a standard CMOS design rule. As a result, it is made clear that the chip size can be greatly reduced.
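The number representation described above can be illustrated with a small Python sketch. It assumes the three digit values are -1, 0 and 1 and that an integer is read as the count of 1's minus the count of -1's; the function names and the fault model (one digit forced to a different value) are hypothetical, and the sketch covers only the representation, not the proposed adder or multiplier.

```python
def encode(n, width):
    """Represent integer n by |n| digits of 1 (n > 0) or -1 (n < 0), padded with 0."""
    if abs(n) > width:
        raise ValueError("value does not fit in the given number of digits")
    sign = 1 if n >= 0 else -1
    return [sign] * abs(n) + [0] * (width - abs(n))


def decode(digits):
    """Read the value as (number of 1's) minus (number of -1's)."""
    return digits.count(1) - digits.count(-1)


def worst_single_fault_error(digits):
    """Largest decoding error when any one digit is forced to another value."""
    correct = decode(digits)
    worst = 0
    for i, d in enumerate(digits):
        for faulty in (-1, 0, 1):
            if faulty != d:
                corrupted = digits[:i] + [faulty] + digits[i + 1:]
                worst = max(worst, abs(decode(corrupted) - correct))
    return worst


if __name__ == "__main__":
    word = encode(5, 8)
    print(word, decode(word), worst_single_fault_error(word))
```

Because every digit carries the same weight, a single faulty digit shifts the decoded value by at most 2, whereas a fault in the most significant bit of a weighted binary word changes the value by half the full range.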
Takahiro HANYU Michitaka KAMEYAMA Tatsuo HIGUCHI
Rapid advances in integrated circuit technology based on binary logic have made possible the fabrication of digital circuits or digital VLSI systems with not only a very large number of devices on a single chip or wafer, but also high-speed processing capability. However, the advance of processing speeds and improvement in cost/performance ratio based on conventional binary logic will not always continue unabated in submicron geometry. Submicron integrated circuits can handle multiple-valued signals at high speed rather than binary signals, especially at the data communication level, because of the reduced interconnections. The use of nonbinary logic or discrete-analog signal processing will not be out of the question if multiple-valued hardware algorithms are developed for fast parallel operations. Moreover, in VLSI or ULSI processors the delay time due to global communications between functional modules or chips, rather than that of each functional module itself, is the most important factor in determining the total performance. Locally computable hardware implementation and new parallel hardware algorithms natural to multiple-valued data representation and circuit technologies are the key to developing VLSI processors in submicron geometry. As a result, multiple-valued VLSI processors make it possible to improve the effective chip density together with the processing speed significantly. In this paper, we summarize several potential advantages of multiple-valued VLSI processors in submicron geometry, which stem from the great reduction of interconnections and the suitability for locally computable hardware implementation. We also demonstrate that some examples of special-purpose multiple-valued VLSI processors, namely a signed-digit arithmetic VLSI processor, a residue arithmetic VLSI processor and a matching VLSI processor, can achieve higher performance for real-world computing systems.
Makoto HONDA Michitaka KAMEYAMA Tatsuo HIGUCHI
The demand for high-speed image processing is obvious in many real-world computations such as robot vision. Not only high throughput but also small latency becomes an important factor of the performance, because of the requirement of frequent visual feedback. In this paper, a high-performance VLSI image processor based on the multiple-valued residue arithmetic circuit is proposed for such applications. Parallelism is hierarchically used to realize the high-performance VLSI image processor. First, a spatially parallel architecture that is different from a pipeline architecture is considered to reduce the latency. Secondly, residue number arithmetic is introduced. In the residue number arithmetic, data communication between the mod m_i arithmetic units is not necessary, so that multiple mod m_i arithmetic units can be completely separated onto different chips. Therefore, a number of mod m_i multiply adders can be implemented on a single VLSI chip based on the modulus-slice concept. Finally, each mod m_i arithmetic unit can be effectively implemented in a parallel structure using the concept of a pseudoprimitive root and the multiple-valued current-mode circuit technology. Thus, it is made clear that the thorough use of parallelism reduces the latency to 1/3 of that of the ordinary binary implementation.
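The independence of the modulus channels can be seen in a short Python sketch of residue number arithmetic. The moduli (7, 11, 13) and the function names are assumptions chosen for illustration and are not taken from the paper; the sketch shows only the arithmetic principle, not the pseudoprimitive-root or current-mode implementation.

```python
from math import prod

# Hypothetical pairwise-coprime moduli; the dynamic range is 7 * 11 * 13 = 1001.
MODULI = (7, 11, 13)


def to_residues(x):
    """Split an integer into one residue per modulus channel."""
    return tuple(x % m for m in MODULI)


def multiply_add(a, b, c):
    """Compute a * b + c channel by channel; no channel reads another's data."""
    return tuple((ai * bi + ci) % m for ai, bi, ci, m in zip(a, b, c, MODULI))


def from_residues(residues):
    """Reconstruct the integer with the Chinese remainder theorem."""
    full_range = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        partial = full_range // m
        # pow(partial, -1, m) is the modular inverse (Python 3.8 or later).
        total += r * partial * pow(partial, -1, m)
    return total % full_range


if __name__ == "__main__":
    result = multiply_add(to_residues(12), to_residues(34), to_residues(56))
    print(result, from_residues(result))
```

Only the final reconstruction combines the channels; every multiply-add step in between stays local to its modulus, which is what allows the units to sit on separate slices or chips without inter-unit wiring.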
Katsuhiko SHIMABUKURO Michitaka KAMEYAMA
An adder-based arithmetic VLSI processor using the SD number system is proposed for real-time computation applications such as intelligent robot systems. Especially in an intelligent robot control system, not only high throughput but also small latency is very important for responding quickly to sensor feedback, because the next input sample is obtained only after the robot actually moves. It is essential in the VLSI architecture for the intelligent robot system to make the latency as small as possible. The use of parallelism is an effective approach to reduce the latency. To meet the requirement, an architecture of a new multiple-valued arithmetic VLSI processor is developed. In the processor, addition and subtraction are performed by using a single adder-based processing element (PE). More complex basic arithmetic operations such as multiplication and division are performed by appropriate data communications between the adder-based PEs while preserving their parallelism. In the proposed architecture, fine-grain parallel processing at the adder-based PE level is realized, and all the PEs can be fully utilized for any parallel arithmetic operations according to an adder-based data dependency graph. As a result, the processing speed will be greatly increased in comparison with conventional parallel processors having different kinds of arithmetic PEs such as an adder, a multiplier and a divider. To realize the arithmetic VLSI processor using the adder-based PEs, we introduce the signed-digit (SD) number system for the parallel arithmetic operations because SD arithmetic has the advantage of modularity as well as parallelism. The multiple-valued bidirectional current-mode technology is also used for the implementation of the compact and high-speed adder-based PE, and for the reduction of the number of interconnections. It is demonstrated that these advantages of the multiple-valued technology are fully used for the implementation of the arithmetic VLSI processor. As a result, the latency of the proposed multiple-valued processor is reduced to 25% of that of the binary processor integrated in the same chip size.
Yasuhiro TAKEI Hasitha Muthumala WAIDYASOORIYA Masanori HARIYAMA Michitaka KAMEYAMA
Heterogeneous multi-core architectures with CPUs and accelerators attract much attention since they can achieve power-efficient computing in various areas from low-power embedded processing to high-performance computing. Since the optimal architecture differs from application to application, finding the most suitable accelerator is very important. In this paper, we propose an FPGA-based heterogeneous multi-core platform with custom accelerators for power-efficient computing. Using the proposed platform, we evaluate several applications and accelerators to identify key requirements of the applications and properties of the accelerators. Such an evaluation is very important for selecting and optimizing the most suitable accelerator according to the requirements of an application to achieve the best performance.
Katsuhiko SHIMABUKURO Michitaka KAMEYAMA Tatsuo HIGUCHI
It is well known that multiple-valued signed-digit (SD) arithmetic circuits have the attractive features of compactness and high-speed operation. However, both of these features have yet to be utilized fully. In this paper, we consider their application to a parallel-structure-based VLSI processor. A high-performance parallel-structure-based multiple-valued VLSI processor using the radix-2 SD number system is proposed. Its compactness allows higher parallelism under chip size limitations than ordinary binary arithmetic circuits. Moreover, the speed of a single arithmetic module is very high in the SD arithmetic circuits, so that we can take advantage of the high-speed operation in the parallel-structure-based VLSI processor chip. The multiple-valued bidirectional current-mode technology is used not only for high-speed, small-sized arithmetic circuits, but also for reducing the number of connections in the parallel-structure-based VLSI processor. The proposed processor is specially developed for real-time digital control, where the performance is evaluated by delay time. Performance estimation using SPICE simulation shows that the delay time of the proposed processor for matrix operations such as matrix multiplication is greatly reduced in comparison with a conventional binary processor.
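The high speed of SD arithmetic comes from carry-free addition. The Python sketch below implements the textbook carry-free addition rule for the radix-2 SD number system with digit set {-1, 0, 1}. It is a behavioural model under assumed names, not the current-mode circuit of the paper. Digits are stored least significant first.

```python
DIGITS = (-1, 0, 1)  # radix-2 signed-digit set


def sd_value(digits):
    """Integer value of an SD number given least significant digit first."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def sd_add(x, y):
    """Add two equal-length SD numbers; the result has one extra digit."""
    n = len(x)
    if len(y) != n:
        raise ValueError("operands must have the same number of digits")
    interim = [0] * n        # intermediate sum digit per position
    carry = [0] * (n + 1)    # carry[i] enters position i
    for i in range(n):
        p = x[i] + y[i]
        # If both lower digits are non-negative the incoming carry is 0 or 1,
        # otherwise it is 0 or -1; choose the interim digit to absorb it.
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if p == 2:
            carry[i + 1], interim[i] = 1, 0
        elif p == 1:
            carry[i + 1], interim[i] = (1, -1) if lower_nonneg else (0, 1)
        elif p == -1:
            carry[i + 1], interim[i] = (0, -1) if lower_nonneg else (-1, 1)
        elif p == -2:
            carry[i + 1], interim[i] = -1, 0
        # p == 0 keeps carry 0 and interim digit 0.
    # Each sum digit depends only on positions i and i-1 (and i-2 via the carry).
    return [interim[i] + carry[i] for i in range(n)] + [carry[n]]


if __name__ == "__main__":
    a, b = [1, -1, 1], [1, 1, 0]  # values 3 and 3
    s = sd_add(a, b)
    print(s, sd_value(s))
```

Because each output digit is determined by a fixed, small neighbourhood of input digits, the addition time does not grow with word length, in contrast to a ripple-carry binary adder.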
Hasitha Muthumala WAIDYASOORIYA Masanori HARIYAMA Michitaka KAMEYAMA
Accelerator cores in low-power embedded processors have multiple on-chip memory modules to increase the data access speed and to enable parallel data access. When large functional units such as multipliers and dividers are used for addressing, a large amount of power and chip area is consumed. Therefore, recent low-power processors use small functional units such as adders and counters to reduce the power and area. Such small functional units make it difficult to implement complex addressing patterns without duplicating data among multiple memory modules. The data duplication wastes memory capacity and increases the data transfer time significantly. This paper proposes a method to reduce the memory duplication for window-based image processing, which is widely used in many applications. Evaluations using an accelerator core show that the proposed method reduces the data amount and data transfer time by more than 50%.
Hasitha Muthumala WAIDYASOORIYA Yosuke OHBAYASHI Masanori HARIYAMA Michitaka KAMEYAMA
Accelerator cores in low-power heterogeneous processors have on-chip local memories to enable parallel data access. The capacities of the local memories are very small. Therefore, the data must be transferred from the global memory to the local memories many times. These data transfers greatly increase the total processing time. A memory allocation technique that increases data sharing is a good solution to this problem. However, when using reconfigurable cores, the data must be shared among multiple contexts. Conventional context partitioning methods only consider how to reuse limited hardware resources in different time slots. They do not consider data sharing. This paper proposes a context partitioning method to share both the hardware resources and the local memory data. According to the experimental results, the proposed method reduces the processing time by more than 87% compared to conventional context partitioning techniques.
Takahiro HANYU Manabu ARAKAKI Michitaka KAMEYAMA
This paper presents a 4-valued content-addressable memory (CAM) for fully parallel template-matching operations in real-time cellular logic image processing with fixed templates. A universal literal is essential to perform a multiple-valued template-matching operation. It is decomposed into a pair consisting of a threshold operation in a CAM cell and a logic-value conversion shared by the CAM cells in the same column of a CAM cellular array, which makes the CAM cell function simple. Since a threshold operation together with a 4-valued storage element can be designed by using a single floating-gate MOS transistor, a high-density 4-valued universal-literal CAM with a single-transistor cell can be implemented by using a multi-layer interconnection technology. It is demonstrated that the performance of the proposed CAM is much superior to that of conventional CAMs with the same function.
Hasitha Muthumala WAIDYASOORIYA Weisheng CHONG Masanori HARIYAMA Michitaka KAMEYAMA
Dynamically-programmable gate arrays (DPGAs) promise lower-cost implementations than conventional field-programmable gate arrays (FPGAs) since they efficiently reuse limited hardware resources in time. One of the typical DPGA architectures is a multi-context FPGA (MC-FPGA) that requires multiple memory bits per configuration bit to realize fast context switching. However, these additional memory bits cause significant overhead in area and power consumption. This paper presents a novel architecture of a switch element to reduce the required capacity of configuration memory. Our main idea is to exploit redundancy between different contexts by using a fine-grained switch element. The proposed MC-FPGA is designed in a 0.18 µm CMOS technology. Its maximum clock frequency and context switching frequency are measured to be 310 MHz and 272 MHz, respectively. Moreover, a novel CAD process that exploits the redundancy in configuration data is proposed to support the MC-FPGA architecture.