The search functionality is under construction.

Author Search Result

[Author] Michitaka KAMEYAMA(65hit)

41-60hit(65hit)

  • Low-Power Field-Programmable VLSI Using Multiple Supply Voltages

    Weisheng CHONG  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Low Power Methodology

      Vol:
    E88-A No:12
      Page(s):
    3298-3305

    A low-power field-programmable VLSI (FPVLSI) is presented to overcome the problem of large power consumption in field-programmable gate arrays (FPGAs). To reduce power consumption in routing networks, the FPVLSI consists of cells that are based on a bit-serial pipeline architecture which reduces routing block complexity. Moreover, a level-converter-less multiple-supply-voltage scheme using dynamic circuits is proposed, where the cells in non-critical paths use a low supply voltage for low power under a speed constraint. The FPVLSI is evaluated based on a 0.18-µm CMOS design rule. The power consumption of the FPVLSI using multiple supply voltages is reduced to 17% or less compared to that of the static-circuit-based FPVLSI using multiple supply voltages.

  • Design of Robust-Fault-Tolerant Multiple-Valued Arithmetic Circuits and Their Evaluation

    Takeshi KASUGA  Michitaka KAMEYAMA  Tatsuo HIGUCHI  

     
    PAPER

      Vol:
    E76-C No:3
      Page(s):
    428-435

    Robust-fault tolerance is a property that a computational result becomes nearly equal to the correct one at the occurrence of faults in digital system. There are many cases where the safety of digital control systems can be maintained if the property is satisfied. In this paper, robust-fault-tolerant three-valued arithmetic modules such as an adder and a multiplier are proposed. The positive and negative integers are represented by the number of 1's and 1's, respectively. The design concept of the arithmetic modules is that a fault makes linearly additive effect with a small value to the final result. Each arithmetic module consists of identical submodules linearly connected, so that multi-stage structure is formed to generate the final output from the last submodule. Between the input and output digits in the submodule some simple functional relation is satisfied with respect to the number of 1's and 1's. Moreover, the output digit value depends on very small portion of the submodules including the input digits. These properties make the linearly additive effect with a small value to the final result in the arithmetic modules even if multiple faults are occurred at the input and output of any gates in the submodules. Not only direct three-valued representation but also the use of three-valued logic circuits is inherently suitable for efficient implementation of the arithmetic VLSI system. The evaluation of the robust-fault-tolerant three-valued arithmetic modules is done with regard to the chip size and the speed using the standard CMOS design rule. As a result, it is made clear that the chip size can be greatly reduced.

  • Prospects of Multiple-Valued VLSI Processors

    Takahiro HANYU  Michitaka KAMEYAMA  Tatsuo HIGUCHI  

     
    INVITED PAPER

      Vol:
    E76-C No:3
      Page(s):
    383-392

    Rapid advances in integrated circuit technology based on binary logic have made possible the fabrication of digital circuits or digital VLSI systems with not only a very large number of devices on a single chip or wafer, but also high-speed processing capability. However, the advance of processing speeds and improvement in cost/performance ratio based on conventional binary logic will not always continue unabated in submicron geometry. Submicron integrated circuits can handle multiple-valued signals at high speed rather than binary signals, especially at data communication level because of the reduced interconnections. The use of nonbinary logic or discrete-analog signal processing will not be out of the question if the multiple-valued hardware algorithms are developed for fast parallel operations. Moreover, in VLSI or ULSI processors the delay time due to global communications between functional modules or chips instead of each functional module itself is the most important factors to determine the total performance. Locally computable hardware implementation and new parallel hardware algorithms natural to multiple-valued data representation and circuit technologies are the key properties to develop VLSI processors in submicron geometry. As a result, multiple-valued VLSI processors make it possible to improve the effective chip density together with the processing speed significantly. In this paper, we summarize several potential advantages of multiple-valued VLSI processors in submicron geometry due to great reduction of interconnection and due to the suitability to locally computable hardware implementation, and demonstrate that some examples of special-purpose multiple-valued VLSI processors, which are a signed-digit arithmetic VLSI processor, a residue arithmetic VLSI processor and a matching VLSI processor can achieve higher performance for real-world computing system.

  • Multiple-Valued VLSI Image Processor Based on Residue Arithmetic and Its Evaluation

    Makoto HONDA  Michitaka KAMEYAMA  Tatsuo HIGUCHI  

     
    PAPER

      Vol:
    E76-C No:3
      Page(s):
    455-462

    The demand for high-speed image processing is obvious in many real-world computations such as robot vision. Not only high throughput but also small latency becomes an important factor of the performance, because of the requirement of frequent visual feedback. In this paper, a high-performance VLSI image processor based on the multiple-valued residue arithmetic circuit is proposed for such applications. Parallelism is hierarchically used to realize the high-performance VLSI image processor. First, spatially parallel architecture that is different from pipeline architecture is considered to reduce the latency. Secondly, residue number arithmetic is introduced. In the residue number arithmetic, data communication between the mod mi arithmetic units is not necessary, so that multiple mod mi arithmetic units can be completely separated to different chips. Therefore, a number of mod mi multiply adders can be implemented on a single VLSI chip based on the modulus-slice concept. Finally, each mod mi arithmetic unit can be effectively implemented in parallel structure using the concept of a pseudoprimitive root and the multiple-valued current-mode circuit technology. Thus, it is made clear that the throughout use of parallelism makes the latency 1/3 in comparison with the ordinary binary implementation.

  • Architecture of a Parallel Multiple-Valued Arithmetic VLSI Processor Using Adder-Based Processing Elements

    Katsuhiko SHIMABUKURO  Michitaka KAMEYAMA  

     
    PAPER

      Vol:
    E76-C No:3
      Page(s):
    463-471

    An adder-based arithmetic VLSI processor using the SD number system is proposed for the applications of real-time computation such as intelligent robot system. Especially in the intelligent robot control system, not only high throughput but also small latency is a very important subject to make quick response for the sensor feedback situation, because the next input sample is obtained only after the robot actually moves. It is essential in the VLSI architecture for the intelligent robot system to make the latency as small as possible. The use of parallelism is an effective approach to reduce the latency. To meet the requirement, an architecture of a new multiple-valued arithmetic VLSI processor is developed. In the processor, addition and subtraction are performed by using the single adderbased processing element (PE). More complex basic arithmetic operations such as multiplication and division are performed by the appropriate data communications between the adder-based PEs with preserving their parallelism. In the proposed architecture, fine-grain parallel processing at the adder-based PE level is realized, and all the PEs can be fully utilized for any parallel arithmetic operations according to adder-based data dependency graph. As a result, the processing speed will be greatly increased in comparison with the conventional parallel processors having the different kinds of the arithmetic PEs such as an adder, a multiplier and a divider. To realize the arithmetic VLSI processor using the adder-based PEs, we introduce the signed-digit (SD) number system for the parallel arithmetic operations because the SD arithmetic has the advantage of modularity as well as parallelism. The multiple-valued bidirectional currentmode technology is also used for the implementation of the compact and high-speed adder-based PE, and the reduction of the number of the interconnections. It is demonstrated that these advantges of the multiple-valued technology are fully used for the implementation of the arithmetic VLSI processor. As a result, the latency of the proposed multiple-valued processor is reduced to 25% that of the binary processor integrated in the same chip size.

  • Evaluation of an FPGA-Based Heterogeneous Multicore Platform with SIMD/MIMD Custom Accelerators

    Yasuhiro TAKEI  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E96-A No:12
      Page(s):
    2576-2586

    Heterogeneous multi-core architectures with CPUs and accelerators attract many attentions since they can achieve power-efficient computing in various areas from low-power embedded processing to high-performance computing. Since the optimal architecture is different from application to application, finding the most suitable accelerator is very important. In this paper, we propose an FPGA-based heterogeneous multi-core platform with custom accelerators for power-efficient computing. Using the proposed platform, we evaluate several applications and accelerators to identify many key requirements of the applications and properties of the accelerators. Such an evaluation is very important to select and optimize the most suitable accelerator according to the requirements of an application to achieve the best performance.

  • Design of a Multiple-Valued VLSI Processor for Digital Control

    Katsuhiko SHIMABUKURO  Michitaka KAMEYAMA  Tatsuo HIGUCHI  

     
    PAPER-Computer Hardware and Design

      Vol:
    E75-D No:5
      Page(s):
    709-717

    It is well known that the multiple-valued signed-digit (SD) arithmetic circuits have the attractive features of compactness and high-speed operation. However, both of these features have yet to be utilized fully. In this paper, we consider the application of a parallel-structure-based VLSI processor. A high-performance parallel-structure-based multiple-valued VLSI processor using the radix-2 SD number system is proposed. Its compactness makes the parallelism high under chip size limitations in comparison with the ordinary binary arithmetic circuits. Moreover, the speed of the single arithmetic module is very high in the SD arithmetic circuits, so that we can take advantage of the high-speed operation in the parallel-structure-based VLSI processor chip. The multiple-valued bidirectional current-mode technology is used not only in high-speed small sized arithmetic circuits, but also in reducing the number of connections in the parallel-structure-based VLSI processor. The proposed processor is specially developed for real-time digital control, where the performance is evaluated by delay time. Performance estimation using SPICE simulators shows that the delay time of proposed processor for matrix operations such as matrix multiplication is greatly reduced in comparison with a conventional binary processor.

  • Memory Allocation for Window-Based Image Processing on Multiple Memory Modules with Simple Addressing Functions

    Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-VLSI Design Technology and CAD

      Vol:
    E94-A No:1
      Page(s):
    342-351

    Accelerator cores in low-power embedded processors have on-chip multiple memory modules to increase the data access speed and to enable parallel data access. When large functional units such as multipliers and dividers are used for addressing, a large power and chip area are consumed. Therefore, recent low-power processors use small functional units such as adders and counters to reduce the power and area. Such small functional units make it difficult to implement complex addressing patterns without duplicating data among multiple memory modules. The data duplication wastes the memory capacity and increases the data transfer time significantly. This paper proposes a method to reduce the memory duplication for window-based image processing, which is widely used in many applications. Evaluations using an accelerator core show that the proposed method reduces the data amount and data transfer time by more than 50%.

  • Memory-Access-Driven Context Partitioning for Window-Based Image Processing on Heterogeneous Multicore Processors

    Hasitha Muthumala WAIDYASOORIYA  Yosuke OHBAYASHI  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Design Methodology

      Vol:
    E95-D No:2
      Page(s):
    354-363

    Accelerator cores in low-power heterogeneous processors have on-chip local memories to enable parallel data access. The memory capacities of the local memories are very small. Therefore, the data should be transferred from the global memory to the local memories many times. These data transfers greatly increase the total processing time. Memory allocation technique to increase the data sharing is a good solution to this problem. However, when using reconfigurable cores, the data must be shared among multiple contexts. However, conventional context partitioning methods only consider how to reuse limited hardware resources in different time slots. They do not consider the data sharing. This paper proposes a context partitioning method to share both the hardware resources and the local memory data. According to the experimental results, the proposed method reduces the processing time by more than 87% compared to conventional context partitioning techniques.

  • Design and Evaluation of a 4-Valued Universal-Literal CAM for Cellular Logic Image Processing

    Takahiro HANYU  Manabu ARAKAKI  Michitaka KAMEYAMA  

     
    PAPER-Multiple-Valued Architectures

      Vol:
    E80-C No:7
      Page(s):
    948-955

    This paper presents a 4-valued content-addressable memory (CAM) for fully parallel template-matching operations in real-time cellular logic image processing with fixed templates. A universal literal is essential to perform a multiple-valued template-matching operation. It is decomposed of a pair of a threshold operation in a CAM cell and a logic-value conversion shared by CAM cells in the same column of a CAM cellular array, which makes a CAM cell function simple. Since a threshold operation together with a 4-valued storage element can be designed by using a single floating-gate MOS transistor, a high-density 4-valued universal-literal CAM with a single-transistor cell can be implemented by using a multi-layer interconnection technology. It is demonstrated that the performance of the proposed CAM is much superior to that of conventional CAMs under the same function.

  • Multi-Context FPGA Using Fine-Grained Interconnection Blocks and Its CAD Environment

    Hasitha Muthumala WAIDYASOORIYA  Weisheng CHONG  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    517-525

    Dynamically-programmable gate arrays (DPGAs) promise lower-cost implementations than conventional field-programmable gate arrays (FPGAs) since they efficiently reuse limited hardware resources in time. One of the typical DPGA architectures is a multi-context FPGA (MC-FPGA) that requires multiple memory bits per configuration bit to realize fast context switching. However, this additional memory bits cause significant overhead in area and power consumption. This paper presents novel architecture of a switch element to overcome the required capacity of configuration memory. Our main idea is to exploit redundancy between different contexts by using a fine-grained switch element. The proposed MC-FPGA is designed in a 0.18 µm CMOS technology. Its maximum clock frequency and the context switching frequency are measured to be 310 MHz and 272 MHz, respectively. Moreover, novel CAD process that exploits the redundancy in configuration data, is proposed to support the MC-FPGA architecture.

  • A Minimum-Latency Linear Array FFT Processor for Robotics

    Somchai KITTICHAIKOONKIT  Michitaka KAMEYAMA  

     
    PAPER-Speech Processing

      Vol:
    E76-D No:6
      Page(s):
    680-688

    In the applications of the fast Fourier transform (FFT) to real-world computation such as robot vision, high-speed processing with small latency is an important issue. In this paper, we propose a linear array processor for the minimum-latency FFT computation. The processor is constructed by identical butterfly elements (BE's). The key concept to minimize the latency is that each BE generates its output data immediately after its input data become available, with 100% utilization of its arithmetic unit. We also introduce the real-valued FFT to perform the complex-valued FFT. We utilize a double linear array structure so that the parallel processing can be realized without communication between the linear arrays. As a result, the hardware amount of a single BE is reduced to half that of conventional designs. The latency of the proposed FFT processor is greatly reduced in comparison with conventional linear array FFT processors.

  • Implementation of a Partially Reconfigurable Multi-Context FPGA Based on Asynchronous Architecture

    Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-Electronic Circuits

      Vol:
    E92-C No:4
      Page(s):
    539-549

    This paper presents a novel architecture to increase the hardware utilization in multi-context field programmable gate arrays (MC-FPGAs). Conventional MC-FPGAs use dedicated tracks to transfer context-ID bits. As a result, hardware utilization ratio decreases, since it is very difficult to map different contexts, area efficiently. It also increases the context switching power, area and static power of the context-ID tracks. The proposed MC-FPGA uses the same wires to transfer both data and context-ID bits from cell to cell. As a result, programs can be mapped area efficiently by partitioning them into different contexts. An asynchronous multi-context logic block architecture to increase the processing speed of the multiple contexts is also proposed. The proposed MC-FPGA is fabricated using 6-metal 1-poly CMOS design rules. The data and context-ID transfer delays are measured to be 2.03ns and 2.26ns respectively. We achieved 30% processing time reduction for the SAD based correspondance search algorithm.

  • Evaluation of Interconnect-Complexity-Aware Low-Power VLSI Design Using Multiple Supply and Threshold Voltages

    Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA  

     
    PAPER-High-Level Synthesis and System-Level Design

      Vol:
    E91-A No:12
      Page(s):
    3596-3606

    This paper presents a high-level synthesis approach to minimize the total power consumption in behavioral synthesis under time and area constraints. The proposed method has two stages, functional unit (FU) energy optimization and interconnect energy optimization. In the first stage, active and inactive energies of the FUs are optimized using a multiple supply and threshold voltage scheme. Genetic algorithm (GA) based simultaneous assignment of supply and threshold voltages and module selection is proposed. The proposed GA based searching method can be used in large size problems to find a near-optimal solution in a reasonable time. In the second stage, interconnects are simplified by increasing their sharing. This is done by exploiting similar data transfer patterns among FUs. The proposed method is evaluated for several benchmarks under 90 nm CMOS technology. The experimental results show that more than 40% of energy savings can be achieved by our proposed method.

  • Group Testing Based Detection of Web Service DDoS Attackers

    Dalia NASHAT  Xiaohong JIANG  Michitaka KAMEYAMA  

     
    PAPER

      Vol:
    E93-B No:5
      Page(s):
    1113-1121

    The Distributed Denial of Service attack (DDoS) is one of the major threats to network security that exhausts network bandwidth and resources. Recently, an efficient approach Live Baiting was proposed for detecting the identities of DDoS attackers in web service using low state overhead without requiring either the models of legitimate requests nor anomalous behavior. However, Live Baiting has two limitations. First, the detection algorithm adopted in Live Baiting starts with a suspects list containing all clients, which leads to a high false positive probability especially for large web service with a huge number of clients. Second, Live Baiting adopts a fixed threshold based on the expected number of requests in each bucket during the detection interval without the consideration of daily and weekly traffic variations. In order to address the above limitations, we first distinguish the clients activities (Active and Non-Active clients during the detection interval) in the detection process and then further propose a new adaptive threshold based on the Change Point Detection method, such that we can improve the false positive probability and avoid the dependence of detection on sites and access patterns. Extensive trace-driven simulation has been conducted on real Web trace to demonstrate the detection efficiency of the proposed scheme in comparison with the Live Baiting detection scheme.

  • Design of a Trinocular-Stereo-Vision VLSI Processor Based on Optimal Scheduling

    Masanori HARIYAMA  Naoto YOKOYAMA  Michitaka KAMEYAMA  

     
    PAPER

      Vol:
    E91-C No:4
      Page(s):
    479-486

    This paper presents a processor architecture for high-speed and reliable trinocular stereo matching based on adaptive window-size control of SAD (Sum of Absolute Differences) computation. To reduce its computational complexity, SADs are computed using images divided into non-overlapping regions, and the matching result is iteratively refined by reducing a window size. Window-parallel-and-pixel-parallel architecture is also proposed to achieve to fully exploit the potential parallelism of the algorithm. The architecture also reduces the complexity of an interconnection network between memory and functional units based on regularity of reference pixels. The stereo matching processor is designed in a 0.18 µm CMOS technology. The processing time is 83.2 µs@100 MHz. By using optimal scheduling, the increases in area and processing time is only 5% and 3% respectively compared to binocular stereo vision although the computational amount is double.

  • High-Level Synthesis of VLSI Processors for Intelligent Integrated Systems

    Yasuaki SAWANO  Bumchul KIM  Michitaka KAMEYAMA  

     
    PAPER

      Vol:
    E77-C No:7
      Page(s):
    1101-1107

    In intelligent integrated systems such as robotics for autonomous work, it is essential to respond to the change of the environment very quickly. Therefore, the development of special-purpose VLSI processors for intelligent integrated systems with small latency becomes an very important subject. In this paper, we present a scheduling algorithm for high-level synthesis. The input to the scheduler is a behavioral description which is viewed as a data flow graph (DFG). The scheduler minimizes the latency, which is the delay of the critical path in the DFG, and minimizes the number of functional units and buses by improving the utilization rates. By using an integer linear programming, the scheduler optimally assigns nodes and arcs in the DFG into steps.

  • Architecture of a Fine-Grain Field-Programmable VLSI Based on Multiple-Valued Source-Coupled Logic

    Md.Munirul HAQUE  Michitaka KAMEYAMA  

     
    PAPER

      Vol:
    E87-C No:11
      Page(s):
    1869-1875

    A novel Multiple-Valued Field-Programmable VLSI (MV-FPVLSI) architecture using the Multiple-Valued Source-Coupled Logic (MVSCL) is proposed to implement special-purpose processors. An MV-FPVLSI consists of identical cells, which are connected to 8-neighborhood ones. To reduce the complexity of the interconnection block between two cells in an MV-FPVLSI, a bit-serial fine-grain pipeline architecture is introduced which allows single-wire data transmission and as a result, the data-transmission delay becomes very small in comparison with that of a conventional FPGA. To reduce the number of switches in the interconnection block further, a cell, using multiple-valued source-coupled logic circuits, is proposed, where the input currents can be linearly summed just by wiring without using any active devices. Not only the data, but also the control signal can be superposed by linear summation. As a result, no input switch is required which contributes to smaller data transmission delay. Moreover, an arbitrary 2-input logic function can be generated by linear summation of the input currents and threshold operations using these reconfigurable MVSCL circuits. As the MVSCL circuit has high driving capability in comparison with that of an equivalent CMOS circuit, high-speed logic operation is also possible while maintaining low power.

  • Collision Detection VLSI Processor for Intelligent Vehicles Using a Hierarchically-Content-Addressable Memory

    Masanori HARIYAMA  Kazuhiro SASAKI  Michitaka KAMEYAMA  

     
    PAPER-Processors

      Vol:
    E82-C No:9
      Page(s):
    1722-1729

    High-speed collision detection is important to realize a highly-safe intelligent vehicle. In collision detection, high-computational power is required to perform matching operation between discrete points on surfaces of a vehicle and obstacles in real-world environment. To achieve the highest performance, a hierarchical matching scheme is proposed based on two representations: the coarse representation and the fine representation. A vehicle is represented as a set of rectangular solids in the fine representation (fine rectangular solids), and the coarse representation, which is also a set of rectangular solids, is produced by enlarging the fine representation. If collision occurs between an obstacle discrete point and a rectangular solid in the coarse representation (coarse rectangular solid), then it is sufficient to check the only fine rectangular solids contained in the coarse one. Consequently, checks for the other fine rectangular solids can be omitted. To perform the hierarchical matching operation in parallel, a hierarchically-content-addressable memory (HCAM) is proposed. Since there is no need to perform matching operation in parallel with fine rectangular solids contained in different coarse ones, the fine ones are mapped onto a matching unit. As a result, the number of matching units can be reduced without decreasing the performance. Under the condition of the same execution time, the area of the HCAM is reduced to 46.4% in comparison with that of the conventional CAM in which the hierarchical matching scheme is not used.

  • Logic-In-Control-Architecture-Based Reconfigurable VLSI Using Multiple-Valued Differential-Pair Circuits

    Nobuaki OKADA  Michitaka KAMEYAMA  

     
    PAPER-Application of Multiple-Valued VLSI

      Vol:
    E93-D No:8
      Page(s):
    2126-2133

    A fine-grain bit-serial multiple-valued reconfigurable VLSI based on logic-in-control architecture is proposed for effective use of the hardware resources. In logic-in-control architecture, the control circuits can be merged with the arithmetic/logic circuits, where the control and arithmetic/logic circuits are constructed by using one or multiple logic blocks. To implement the control circuit, only one state in a state transition diagram is allocated to one logic block, which leads to reduction of the complexity of interconnections between logic blocks. The fine-grain logic block is implemented based on multiple-valued current-mode circuit technology. In the fine-grain logic block, an arbitrary 3-variable binary function can be programmed by using one multiplexer and two universal literal circuits. Three-variable binary functions are used to implement the control circuit. Moreover, the hardware resources can be utilized to construct a bit-serial adder, because full-adder sum and carry can be realized by programming in the universal literal circuit. Therefore, the logic block can be effectively reconfigured for arithmetic/logic and control circuits. It is made clear that the hardware complexity of the control circuit in the proposed reconfigurable VLSI can be reduced in comparison with that of the control circuit based on a typically sequential circuit in the conventional FPGA and the fine-grain field-programmable VLSI reported until now.

41-60hit(65hit)