Author Search Result

[Author] Masanori HARIYAMA (29 hits)

Showing results 1-20 of 29

  • Architecture of an Asynchronous FPGA for Handshake-Component-Based Design

    Yoshiya KOMATSU  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-Architecture
    Vol: E96-D No:8  Page(s): 1632-1644

    This paper presents a novel architecture of an asynchronous FPGA for handshake-component-based design. Handshake-component-based design is suitable for large-scale, complex asynchronous circuits because of its understandability. This paper proposes an area-efficient FPGA architecture suited to handshake-component-based asynchronous circuits. Moreover, four-phase dual-rail encoding is employed to make the circuits robust to delay variation, since the data paths in an FPGA are programmable. An FPGA based on the proposed architecture is implemented in a 65 nm process. Evaluation results show that the proposed FPGA implements handshake components efficiently.
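
    For reference, the following is a minimal software model of four-phase dual-rail signalling, the encoding mentioned above: each bit travels on two wires, (1, 0) encodes 1, (0, 1) encodes 0, and (0, 0) is the spacer inserted between data tokens. It only illustrates the protocol and its completion detection; it is not a model of the paper's FPGA circuitry, and all names are hypothetical.

        # Minimal model of four-phase dual-rail encoding (illustrative only).
        # Each bit uses two wires: (1, 0) = 1, (0, 1) = 0, (0, 0) = spacer.
        SPACER = (0, 0)

        def encode(bit):
            """Encode one logical bit as a dual-rail pair."""
            return (1, 0) if bit else (0, 1)

        def decode(pair):
            """Decode a dual-rail pair; None while the wires hold the spacer."""
            if pair == (1, 0):
                return 1
            if pair == (0, 1):
                return 0
            return None  # spacer or invalid

        def data_complete(word):
            """Completion detection: every bit of the word holds valid data."""
            return all(decode(p) is not None for p in word)

        def four_phase_transfer(bits):
            """One token transfer: data phase followed by return-to-spacer phase."""
            word = [encode(b) for b in bits]       # phase 1: sender drives data
            assert data_complete(word)             # phase 2: receiver acknowledges
            word = [SPACER] * len(bits)            # phase 3: sender returns to spacer
            assert all(decode(p) is None for p in word)  # phase 4: acknowledge released
            return bits

        print(four_phase_transfer([1, 0, 1, 1]))   # -> [1, 0, 1, 1]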

  • Minimizing Energy Consumption Based on Dual-Supply-Voltage Assignment and Interconnection Simplification

    Masanori HARIYAMA  Shigeo YAMADERA  Michitaka KAMEYAMA

    PAPER
    Vol: E89-C No:11  Page(s): 1551-1558

    This paper presents a design method that minimizes the energy of both functional units (FUs) and the interconnection network between them. To reduce the complexity of the interconnection network, data transfers between FUs are classified according to the FU types of the operations in a data flow graph. The basic idea behind reducing the complexity of the interconnection network is that an interconnection resource can be shared among data transfers whose source nodes have the same FU type and whose destination nodes have the same FU type. Moreover, an efficient method based on a genetic algorithm is presented.
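
    As an illustration of the sharing criterion above, the hypothetical sketch below groups data transfers by the pair (source FU type, destination FU type); transfers in the same group are the candidates that could share one interconnection resource. The data-flow edges are invented and the code is only a schematic reading of the idea, not the paper's method.

        # Group data-flow-graph edges by (source FU type, destination FU type).
        # Transfers in the same group are candidates for sharing one interconnect
        # resource (illustrative sketch; names and data are hypothetical).
        from collections import defaultdict

        # (source operation, destination operation, source FU type, destination FU type)
        transfers = [
            ("mul1", "add1", "MUL", "ADD"),
            ("mul2", "add2", "MUL", "ADD"),
            ("add1", "mul3", "ADD", "MUL"),
            ("add2", "sub1", "ADD", "SUB"),
        ]

        groups = defaultdict(list)
        for src, dst, src_type, dst_type in transfers:
            groups[(src_type, dst_type)].append((src, dst))

        for (s_type, d_type), members in groups.items():
            # One shared bus/multiplexer per (source type, destination type) pair.
            print(f"{s_type} -> {d_type}: {len(members)} transfer(s) share one resource")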

  • Task Allocation with Algorithm Transformation for Reducing Data-Transfer Bottlenecks in Heterogeneous Multi-Core Processors: A Case Study of HOG Descriptor Computation

    Hasitha Muthumala WAIDYASOORIYA  Daisuke OKUMURA  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-High-Level Synthesis and System-Level Design
    Vol: E93-A No:12  Page(s): 2570-2580

    Heterogeneous multi-core processors are attractive for media processing applications because they can draw on the strengths of different cores to improve overall performance. However, data-transfer bottlenecks and restrictions on task allocation caused by accelerator-incompatible operations prevent us from exploiting the full potential of heterogeneous multi-core processors. This paper presents a task allocation method based on algorithm transformation that increases the freedom of task allocation. We use approximation methods such as CORDIC algorithms to map the accelerator-incompatible operations onto accelerator cores. According to experimental results using HOG descriptor computation, the proposed task allocation method reduces the data-transfer time by more than 82% and the total processing time by more than 79% compared to the conventional task allocation method.
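
    As a concrete example of the kind of CORDIC approximation mentioned above, the sketch below estimates a vector's magnitude and angle (the quantities needed for gradient-based descriptors such as HOG) using only shifts, adds, and a small arctangent table. It is a generic vectoring-mode CORDIC offered as an assumption-laden illustration, not the exact transformation used in the paper.

        # Generic vectoring-mode CORDIC: approximates magnitude and angle of (x, y)
        # with shift-and-add style iterations (illustrative; not the paper's mapping).
        import math

        def cordic_vectoring(x, y, iterations=16):
            angle = 0.0
            for i in range(iterations):
                d = 1 if y < 0 else -1                # rotate toward the x-axis
                x, y = x - d * (y / (1 << i)), y + d * (x / (1 << i))
                angle -= d * math.atan(1.0 / (1 << i))
            # Remove the known CORDIC gain accumulated over the iterations.
            gain = math.prod(math.sqrt(1 + 2.0 ** (-2 * i)) for i in range(iterations))
            return x / gain, angle                    # (magnitude, orientation in radians)

        mag, ang = cordic_vectoring(3.0, 4.0)
        print(round(mag, 4), round(math.degrees(ang), 2))   # ~5.0, ~53.13 degrees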

  • Low-Power Field-Programmable VLSI Using Multiple Supply Voltages

    Weisheng CHONG  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-Low Power Methodology
    Vol: E88-A No:12  Page(s): 3298-3305

    A low-power field-programmable VLSI (FPVLSI) is presented to overcome the problem of large power consumption in field-programmable gate arrays (FPGAs). To reduce the power consumption of the routing networks, the FPVLSI consists of cells based on a bit-serial pipeline architecture, which reduces routing-block complexity. Moreover, a level-converter-less multiple-supply-voltage scheme using dynamic circuits is proposed, in which cells on non-critical paths use a low supply voltage to save power under a speed constraint. The FPVLSI is evaluated based on a 0.18-µm CMOS design rule. The power consumption of the FPVLSI using multiple supply voltages is reduced to 17% or less of that of a static-circuit-based FPVLSI using multiple supply voltages.

  • Evaluation of an FPGA-Based Heterogeneous Multicore Platform with SIMD/MIMD Custom Accelerators

    Yasuhiro TAKEI  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-High-Level Synthesis and System-Level Design
    Vol: E96-A No:12  Page(s): 2576-2586

    Heterogeneous multi-core architectures with CPUs and accelerators attract much attention since they can achieve power-efficient computing in areas ranging from low-power embedded processing to high-performance computing. Since the optimal architecture differs from application to application, finding the most suitable accelerator is very important. In this paper, we propose an FPGA-based heterogeneous multi-core platform with custom accelerators for power-efficient computing. Using the proposed platform, we evaluate several applications and accelerators to identify key requirements of the applications and properties of the accelerators. Such an evaluation is essential for selecting and optimizing the most suitable accelerator for the requirements of an application so as to achieve the best performance.

  • Memory Allocation for Window-Based Image Processing on Multiple Memory Modules with Simple Addressing Functions

    Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-VLSI Design Technology and CAD
    Vol: E94-A No:1  Page(s): 342-351

    Accelerator cores in low-power embedded processors have multiple on-chip memory modules to increase the data-access speed and to enable parallel data access. When large functional units such as multipliers and dividers are used for addressing, considerable power and chip area are consumed. Therefore, recent low-power processors use small functional units such as adders and counters to reduce the power and area. Such small functional units make it difficult to implement complex addressing patterns without duplicating data among multiple memory modules. This data duplication wastes memory capacity and increases the data-transfer time significantly. This paper proposes a method to reduce data duplication for window-based image processing, which is widely used in many applications. Evaluations using an accelerator core show that the proposed method reduces the data amount and data-transfer time by more than 50%.
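
    The sketch below illustrates, under simplified assumptions, the kind of simple addressing that enables conflict-free window access: pixel (x, y) is stored in module (x mod N) + N*(y mod N), so every pixel of any N-by-N window falls in a distinct module and the in-module address needs only cheap arithmetic. This is a textbook-style allocation used for illustration, not necessarily the allocation proposed in the paper.

        # Conflict-free allocation for N x N window access using only modulo and
        # integer-divide arithmetic (illustrative assumption, not the paper's method).
        N = 3  # window size; N*N memory modules are assumed

        def module_of(x, y):
            """Memory module that stores pixel (x, y)."""
            return (x % N) + N * (y % N)

        def address_of(x, y, width):
            """In-module address of pixel (x, y) for an image of the given width."""
            return (y // N) * (width // N) + (x // N)

        # Any N x N window touches each module exactly once, so all N*N pixels
        # can be fetched in a single parallel access.
        def window_modules(x0, y0):
            return sorted(module_of(x0 + dx, y0 + dy) for dy in range(N) for dx in range(N))

        assert window_modules(5, 7) == list(range(N * N))
        print("window at (5, 7) reads modules:", window_modules(5, 7))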

  • Memory-Access-Driven Context Partitioning for Window-Based Image Processing on Heterogeneous Multicore Processors

    Hasitha Muthumala WAIDYASOORIYA  Yosuke OHBAYASHI  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-Design Methodology
    Vol: E95-D No:2  Page(s): 354-363

    Accelerator cores in low-power heterogeneous processors have on-chip local memories to enable parallel data access. The capacities of these local memories are very small, so data must be transferred from the global memory to the local memories many times, and these transfers greatly increase the total processing time. A memory allocation technique that increases data sharing is a good solution to this problem. When reconfigurable cores are used, however, the data must be shared among multiple contexts, whereas conventional context partitioning methods only consider how to reuse limited hardware resources in different time slots and do not consider data sharing. This paper proposes a context partitioning method that shares both the hardware resources and the local-memory data. According to the experimental results, the proposed method reduces the processing time by more than 87% compared to conventional context partitioning techniques.

  • Multi-Context FPGA Using Fine-Grained Interconnection Blocks and Its CAD Environment

    Hasitha Muthumala WAIDYASOORIYA  Weisheng CHONG  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER
    Vol: E91-C No:4  Page(s): 517-525

    Dynamically programmable gate arrays (DPGAs) promise lower-cost implementations than conventional field-programmable gate arrays (FPGAs) since they efficiently reuse limited hardware resources in time. One typical DPGA architecture is the multi-context FPGA (MC-FPGA), which requires multiple memory bits per configuration bit to realize fast context switching. However, these additional memory bits cause significant overhead in area and power consumption. This paper presents a novel switch-element architecture that reduces the required configuration-memory capacity. Our main idea is to exploit redundancy between different contexts by using a fine-grained switch element. The proposed MC-FPGA is designed in a 0.18 µm CMOS technology. Its maximum clock frequency and context-switching frequency are measured to be 310 MHz and 272 MHz, respectively. Moreover, a novel CAD process that exploits the redundancy in configuration data is proposed to support the MC-FPGA architecture.

  • Implementation of a Partially Reconfigurable Multi-Context FPGA Based on Asynchronous Architecture

    Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-Electronic Circuits
    Vol: E92-C No:4  Page(s): 539-549

    This paper presents a novel architecture that increases the hardware utilization of multi-context field-programmable gate arrays (MC-FPGAs). Conventional MC-FPGAs use dedicated tracks to transfer context-ID bits. As a result, the hardware utilization ratio decreases, since it is very difficult to map different contexts area-efficiently. The dedicated tracks also increase the context-switching power, as well as the area and static power of the context-ID tracks. The proposed MC-FPGA uses the same wires to transfer both data and context-ID bits from cell to cell. As a result, programs can be mapped area-efficiently by partitioning them into different contexts. An asynchronous multi-context logic block architecture that increases the processing speed across multiple contexts is also proposed. The proposed MC-FPGA is fabricated using 6-metal 1-poly CMOS design rules. The data and context-ID transfer delays are measured to be 2.03 ns and 2.26 ns, respectively. We achieved a 30% processing-time reduction for the SAD-based correspondence search algorithm.

  • Evaluation of Interconnect-Complexity-Aware Low-Power VLSI Design Using Multiple Supply and Threshold Voltages

    Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-High-Level Synthesis and System-Level Design
    Vol: E91-A No:12  Page(s): 3596-3606

    This paper presents a high-level synthesis approach that minimizes the total power consumption in behavioral synthesis under time and area constraints. The proposed method has two stages: functional-unit (FU) energy optimization and interconnect energy optimization. In the first stage, the active and inactive energies of the FUs are optimized using a multiple supply- and threshold-voltage scheme. A genetic algorithm (GA)-based method that simultaneously assigns supply and threshold voltages and selects modules is proposed. The proposed GA-based search can handle large problems and finds a near-optimal solution in a reasonable time. In the second stage, interconnects are simplified by increasing their sharing, which is done by exploiting similar data-transfer patterns among FUs. The proposed method is evaluated on several benchmarks in a 90 nm CMOS technology. The experimental results show that our method achieves energy savings of more than 40%.
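
    For readers unfamiliar with the first stage, the toy sketch below shows the general shape of a GA that assigns one (supply voltage, threshold voltage) option to each FU and penalizes assignments that violate a latency constraint. The energy and delay numbers are invented and the encoding is an assumption; only the search strategy is illustrated, not the paper's formulation.

        # Toy GA assigning a (Vdd, Vth) option to each functional unit (hypothetical
        # numbers; only the search strategy is illustrated, not the paper's model).
        import random

        OPTIONS = [(1.0, 1.0), (0.6, 1.4), (0.4, 1.9)]   # (relative energy, relative delay)
        NUM_FUS, LATENCY_LIMIT = 8, 12.0

        def fitness(chromosome):
            energy = sum(OPTIONS[g][0] for g in chromosome)
            delay = sum(OPTIONS[g][1] for g in chromosome)      # crude serial-path model
            penalty = 100.0 * max(0.0, delay - LATENCY_LIMIT)   # constraint violation
            return energy + penalty                              # lower is better

        def evolve(pop_size=30, generations=100):
            pop = [[random.randrange(len(OPTIONS)) for _ in range(NUM_FUS)]
                   for _ in range(pop_size)]
            for _ in range(generations):
                pop.sort(key=fitness)
                parents = pop[:pop_size // 2]                    # keep the best half
                children = []
                while len(children) < pop_size - len(parents):
                    a, b = random.sample(parents, 2)
                    cut = random.randrange(1, NUM_FUS)           # one-point crossover
                    child = a[:cut] + b[cut:]
                    if random.random() < 0.2:                    # mutation
                        child[random.randrange(NUM_FUS)] = random.randrange(len(OPTIONS))
                    children.append(child)
                pop = parents + children
            return min(pop, key=fitness)

        best = evolve()
        print("best assignment:", best, "fitness:", round(fitness(best), 2))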

  • Design of a Trinocular-Stereo-Vision VLSI Processor Based on Optimal Scheduling

    Masanori HARIYAMA  Naoto YOKOYAMA  Michitaka KAMEYAMA

    PAPER
    Vol: E91-C No:4  Page(s): 479-486

    This paper presents a processor architecture for high-speed, reliable trinocular stereo matching based on adaptive window-size control of SAD (Sum of Absolute Differences) computation. To reduce the computational complexity, SADs are computed on images divided into non-overlapping regions, and the matching result is iteratively refined by reducing the window size. A window-parallel and pixel-parallel architecture is also proposed to fully exploit the potential parallelism of the algorithm. The architecture also reduces the complexity of the interconnection network between memory and functional units by exploiting the regularity of reference pixels. The stereo matching processor is designed in a 0.18 µm CMOS technology. The processing time is 83.2 µs at 100 MHz. By using optimal scheduling, the increases in area and processing time are only 5% and 3%, respectively, compared to binocular stereo vision, although the computational amount is doubled.
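
    As a reference for the SAD-based matching described above, the sketch below scores candidate disparities with a large window and then refines the estimate with a smaller window, loosely mirroring the coarse-to-fine idea. The window sizes, search range, and image layout are assumptions, not the processor's actual dataflow.

        # SAD-based disparity search with a coarse-to-fine window (illustrative only;
        # window sizes, search range and image layout are assumptions).
        import numpy as np

        def sad(left, right, x, y, d, w):
            """Sum of absolute differences over a (2w+1)^2 window at disparity d."""
            lp = left[y - w:y + w + 1, x - w:x + w + 1].astype(np.int32)
            rp = right[y - w:y + w + 1, x - d - w:x - d + w + 1].astype(np.int32)
            return int(np.abs(lp - rp).sum())

        def match(left, right, x, y, max_d=16):
            # Coarse pass: large window over the whole disparity range.
            coarse = min(range(max_d), key=lambda d: sad(left, right, x, y, d, w=7))
            # Fine pass: smaller window, refined around the coarse estimate.
            cands = [d for d in range(coarse - 2, coarse + 3) if 0 <= d < max_d]
            return min(cands, key=lambda d: sad(left, right, x, y, d, w=3))

        rng = np.random.default_rng(0)
        left = rng.integers(0, 256, (64, 64), dtype=np.uint8)
        right = np.roll(left, -5, axis=1)         # synthetic disparity of 5 pixels
        print(match(left, right, x=40, y=32))     # expected: 5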

  • Collision Detection VLSI Processor for Intelligent Vehicles Using a Hierarchically-Content-Addressable Memory

    Masanori HARIYAMA  Kazuhiro SASAKI  Michitaka KAMEYAMA

    PAPER-Processors
    Vol: E82-C No:9  Page(s): 1722-1729

    High-speed collision detection is important for realizing a highly safe intelligent vehicle. In collision detection, high computational power is required to perform matching operations between discrete points on the surfaces of a vehicle and obstacles in a real-world environment. To achieve the highest performance, a hierarchical matching scheme is proposed based on two representations: a coarse representation and a fine representation. A vehicle is represented as a set of rectangular solids in the fine representation (fine rectangular solids), and the coarse representation, which is also a set of rectangular solids, is produced by enlarging the fine representation. If a collision occurs between an obstacle point and a rectangular solid in the coarse representation (coarse rectangular solid), then it is sufficient to check only the fine rectangular solids contained in that coarse one. Consequently, checks for the other fine rectangular solids can be omitted. To perform the hierarchical matching operation in parallel, a hierarchically-content-addressable memory (HCAM) is proposed. Since there is no need to perform matching in parallel among fine rectangular solids contained in different coarse ones, those fine solids are mapped onto a single matching unit. As a result, the number of matching units can be reduced without decreasing performance. Under the condition of the same execution time, the area of the HCAM is reduced to 46.4% of that of a conventional CAM in which the hierarchical matching scheme is not used.
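
    The sketch below restates the hierarchical test in plain code: an obstacle point is first checked against the coarse rectangular solids, and only the fine solids belonging to a hit coarse solid are examined. It is a sequential software rendering of the idea with made-up box data; the real design performs these comparisons in parallel inside the HCAM.

        # Hierarchical point-vs-box collision check (software illustration of the
        # coarse/fine idea; box data are made up).

        def inside(point, box):
            """Axis-aligned box given as ((xmin, ymin, zmin), (xmax, ymax, zmax))."""
            lo, hi = box
            return all(lo[i] <= point[i] <= hi[i] for i in range(3))

        # Each coarse solid encloses (an enlarged version of) its fine solids.
        hierarchy = [
            (((0, 0, 0), (4, 2, 2)),                       # coarse solid 0
             [((0, 0, 0), (2, 2, 2)), ((2, 0, 0), (4, 2, 2))]),
            (((4, 0, 0), (8, 2, 2)),                       # coarse solid 1
             [((4, 0, 0), (6, 2, 2)), ((6, 0, 0), (8, 2, 2))]),
        ]

        def collides(point):
            fine_checks = 0
            for coarse, fine_list in hierarchy:
                if inside(point, coarse):                  # coarse hit: check its fine solids
                    for fine in fine_list:
                        fine_checks += 1
                        if inside(point, fine):
                            return True, fine_checks
                # coarse miss: all of its fine solids are skipped
            return False, fine_checks

        print(collides((5.0, 1.0, 1.0)))   # (True, 1): only coarse solid 1's fine solids checked
        print(collides((9.0, 1.0, 1.0)))   # (False, 0): no fine checks at all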

  • A Switch Block Architecture for Multi-Context FPGAs Based on a Ferroelectric-Capacitor Functional Pass-Gate Using Multiple/Binary Valued Hybrid Signals

    Shota ISHIHARA  Noriaki IDOBATA  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-Application of Multiple-Valued VLSI
    Vol: E93-D No:8  Page(s): 2134-2144

    Dynamically programmable gate arrays (DPGAs) provide more area-efficient implementations than conventional field-programmable gate arrays (FPGAs). One typical DPGA architecture is the multi-context architecture; a DPGA based on it is the multi-context FPGA (MC-FPGA), which achieves fast switching between contexts. The problem of the conventional SRAM-based MC-FPGA is its large area and standby power dissipation caused by the large number of configuration memory bits. Moreover, since SRAM is volatile, it is difficult to apply power gating to the SRAM-based MC-FPGA for standby power reduction. This paper presents an area-efficient, nonvolatile multi-context switch block architecture for MC-FPGAs based on a ferroelectric-capacitor functional pass-gate that merges a multiple-valued threshold function and nonvolatile multiple-valued storage. A test chip for four contexts is fabricated in a 0.35 µm CMOS / 0.60 µm ferroelectric-capacitor process. The transistor count of the proposed multi-context switch block is reduced to 63% of that of the SRAM-based one.

  • Acceleration of Block Matching on a Low-Power Heterogeneous Multi-Core Processor Based on DTU Data-Transfer with Data Re-Allocation

    Yoshitaka HIRAMATSU  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Toru NOJIRI  Kunio UCHIYAMA  Michitaka KAMEYAMA

    PAPER-Integrated Electronics
    Vol: E95-C No:12  Page(s): 1872-1882

    The large data-transfer time among different cores is a major problem in heterogeneous multi-core processors. This paper presents a method to accelerate the data transfers by exploiting data-transfer units (DTUs) together with complex memory allocation. We used block matching, which is very common in image processing, to evaluate our technique. The proposed method reduces the data-transfer time by more than 42% compared to earlier works that use CPU-based data transfers. Moreover, the total processing time is only 15 ms for a VGA image with 16×16-pixel blocks.

  • Highly-Parallel Stereo Vision VLSI Processor Based on an Optimal Parallel Memory Access Scheme

    Masanori HARIYAMA  Seunghwan LEE  Michitaka KAMEYAMA

    PAPER-Integrated Electronics
    Vol: E84-C No:3  Page(s): 382-389

    In a real-time vision system, parallel memory access is essential for highly parallel image processing. The use of multiple memory modules is one efficient technique for parallel access: data stored in different memory modules can be accessed in parallel. This paper presents an optimal memory allocation methodology that maps data to be read in parallel onto different memory modules. Based on this methodology, a high-performance VLSI processor for three-dimensional instrumentation is proposed.

  • Design of High-Performance Asynchronous Pipeline Using Synchronizing Logic Gates

    Zhengfan XIA  Shota ISHIHARA  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-Integrated Electronics
    Vol: E95-C No:8  Page(s): 1434-1443

    This paper introduces a novel design method for an asynchronous pipeline based on dual-rail dynamic logic. The overhead of the handshake control logic is greatly reduced by constructing a reliable critical datapath, which gives the pipeline high throughput as well as low power consumption. Synchronizing logic gates (SLGs), which have no data-dependency problem, are used to construct the reliable critical datapath. The design targets latch-free, extremely fine-grain (gate-level) pipelines, where the depth of every pipeline stage is only one dual-rail dynamic logic gate. HSPICE simulation results in a 65 nm technology indicate that, for 4-bit-wide FIFOs, the proposed design increases the throughput by 120% and decreases the power consumption by 54% compared with PS0, a classic dual-rail asynchronous pipeline implementation style. Moreover, this method is applied to the design of an array-style multiplier; the proposed design reduces power by 37.9% compared to a classic synchronous design at a workload of 55%. A chip with a 4×4 multiplier function has been fabricated, and it works well at 2.16 G data-sets/s (post-layout simulation).

  • Data-Transfer-Aware Design of an FPGA-Based Heterogeneous Multicore Platform with Custom Accelerators

    Yasuhiro TAKEI  Hasitha Muthumala WAIDYASOORIYA  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-VLSI Design Technology and CAD
    Vol: E98-A No:12  Page(s): 2658-2669

    For an FPGA-based heterogeneous multicore platform, we present a design methodology that reduces the total processing time while taking data transfer into account. The reconfigurability of recent FPGAs with hard CPU cores allows us to realize a single-chip heterogeneous processor optimized for a given application. The major problem in designing such heterogeneous processors is the data transfer between CPU cores and accelerator cores. The total processing time including data transfers is modeled considering the overlap of computation time and data-transfer time, and the optimal design parameters are searched for.
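
    A minimal version of such a model, under the common double-buffering assumption that each block's transfer can overlap the previous block's computation, is sketched below. The timing numbers are invented and the formula is a generic pipeline bound, not necessarily the model derived in the paper.

        # Generic overlap model of computation and data transfer (double buffering).
        # Timing numbers are invented; this is a hedged sketch, not the paper's model.

        def total_time(num_blocks, t_transfer, t_compute):
            """Steady-state pipeline: per-block cost is max(transfer, compute),
            plus the first transfer and the last computation, which cannot overlap."""
            if num_blocks == 0:
                return 0.0
            steady = (num_blocks - 1) * max(t_transfer, t_compute)
            return t_transfer + steady + t_compute

        # Example: 100 blocks, 0.3 ms transfer vs 0.5 ms compute per block.
        print(total_time(100, t_transfer=0.3, t_compute=0.5))   # 0.3 + 99*0.5 + 0.5 = 50.3 ms
        # Without overlap the same workload would take 100 * (0.3 + 0.5) = 80 ms.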

  • Architecture of a Stereo Matching VLSI Processor Based on Hierarchically Parallel Memory Access

    Masanori HARIYAMA  Haruka SASAKI  Michitaka KAMEYAMA

    PAPER-Digital Circuits and Computer Arithmetic
    Vol: E88-D No:7  Page(s): 1486-1491

    This paper presents a VLSI processor for high-speed, reliable stereo matching based on adaptive window-size control of SAD (Sum of Absolute Differences) computation. To reduce the computational complexity, SADs are computed using multi-resolution images. Parallel memory access is essential for highly parallel image processing. For this purpose, the paper also presents an optimal memory allocation that minimizes the hardware amount under the condition of parallel memory access at the specified resolutions.

  • Memory Allocation for Multi-Resolution Image Processing

    Yasuhiro KOBAYASHI  Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER-VLSI Systems
    Vol: E91-D No:10  Page(s): 2386-2397

    Hierarchical approaches using multi-resolution images are well-known techniques for reducing the computational amount without degrading quality. One major issue in designing image processors is to design a memory system that supports parallel access with a simple interconnection network. The complexity of the interconnection network mainly depends on the memory allocation, which maps pixels onto memory modules and determines the required number of memory modules. This paper presents a memory allocation method that minimizes the number of memory modules for image processing using multi-resolution images. For an efficient search, the proposed method exploits the regularity of window-type image processing. A practical example demonstrates that the number of memory modules is reduced to less than 14% of that of conventional methods.

  • A Collision Detection Processor for Intelligent Vehicles

    Masanori HARIYAMA  Michitaka KAMEYAMA

    PAPER
    Vol: E76-C No:12  Page(s): 1804-1811

    Since careless driving causes terrible traffic accidents, autonomous collision avoidance is an important subject for vehicles. Real-time collision detection between a vehicle and obstacles will be a key target for next-generation car electronics systems. In collision detection, a large storage capacity is usually required to store the 3-D information on the obstacles located in a workspace. Moreover, high computational power is essential not only for coordinate transformation but also for the matching operation. In the proposed collision detection VLSI processor, the matching operation is drastically accelerated by using a content-addressable memory (CAM), which evaluates the magnitude relationships between an input word and all the stored words in parallel. A new obstacle representation based on a union of rectangular solids is also used to reduce the obstacle memory capacity, so that collision detection can be performed only by parallel magnitude comparison. A parallel architecture using several identical processor elements (PEs) is employed to perform the coordinate transformation at high speed based on the COordinate Rotation DIgital Computer (CORDIC) algorithm. The collision detection time becomes 5.2 ms using 20 PEs and five CAMs with a 42-kbit capacity.
