The search functionality is under construction.

Author Search Result

[Author] Tetsushi KOIDE(24hit)

1-20hit(24hit)

  • Integration Architecture of Content Addressable Memory and Massive-Parallel Memory-Embedded SIMD Matrix for Versatile Multimedia Processor

    Takeshi KUMAKI  Masakatsu ISHIZAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  Yasuto KURODA  Takayuki GYOHTEN  Hideyuki NODA  Katsumi DOSAKA  Kazutami ARIMOTO  Kazunori SAITO  

     
    PAPER

      Vol:
    E91-C No:9
      Page(s):
    1409-1418

    This paper presents an integration architecture of content addressable memory (CAM) and a massive-parallel memory-embedded SIMD matrix for constructing a versatile multimedia processor. The massive-parallel memory-embedded SIMD matrix has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. The SIMD matrix architecture is verified to be a better way for processing the repeated arithmetic operation types in multimedia applications. The proposed architecture, reported in this paper, exploits in addition CAM technology and enables therefore fast pipelined table-lookup coding operations. Since both arithmetic and table-lookup operations execute extremely fast, the proposed novel architecture can realize consequently efficient and versatile multimedia data processing. Evaluation results of the proposed CAM-enhanced massive-parallel SIMD matrix processor for the example of the frequently used JPEG image-compression application show that the necessary clock cycle number can be reduced by 86% in comparison to a conventional mobile DSP architecture. The determined performances in Mpixel/mm2 are factors 3.3 and 4.4 better than with a CAM-less massive-parallel memory-embedded SIMD matrix processor and a conventional mobile DSP, respectively.

  • Feasibility Study for Computer-Aided Diagnosis System with Navigation Function of Clear Region for Real-Time Endoscopic Video Image on Customizable Embedded DSP Cores

    Masayuki ODAGAWA  Tetsushi KOIDE  Toru TAMAKI  Shigeto YOSHIDA  Hiroshi MIENO  Shinji TANAKA  

     
    LETTER-VLSI Design Technology and CAD

      Pubricized:
    2021/07/08
      Vol:
    E105-A No:1
      Page(s):
    58-62

    This paper presents examination result of possibility for automatic unclear region detection in the CAD system for colorectal tumor with real time endoscopic video image. We confirmed that it is possible to realize the CAD system with navigation function of clear region which consists of unclear region detection by YOLO2 and classification by AlexNet and SVMs on customizable embedded DSP cores. Moreover, we confirmed the real time CAD system can be constructed by a low power ASIC using customizable embedded DSP cores.

  • An Iterative Improvement Circuit Partitioning Algorithm under Path Delay Constraints

    Jun'ichiro MINAMI  Tetsushi KOIDE  Shin'ichi WAKABAYASHI  

     
    PAPER-Layout Synthesis

      Vol:
    E83-A No:12
      Page(s):
    2569-2576

    This paper presents a timing-driven iterative improvement circuit partitioning algorithm under path delay constraints for the general delay model. The proposed algorithm is an extension of the Fiduccia & Mattheyses (FM) method so as to handle path delay constraints and consists of the clustering and iterative improvement phases. In the first phase, we reduce the size of a given circuit, with a new clustering algorithm to obtain a partition in a short computation time. Next, the iterative improvement phase based on the FM method is applied, and then a new path-based timing violation removal algorithm is also performed so as to remove all the timing violations. From experimental results for ISCAS89 benchmarks, we have demonstrated that the proposed algorithm can produce the partitions which mostly satisfy the timing constraints.

  • A K-Means-Based Multi-Prototype High-Speed Learning System with FPGA-Implemented Coprocessor for 1-NN Searching

    Fengwei AN  Tetsushi KOIDE  Hans Jürgen MATTAUSCH  

     
    PAPER-Biocybernetics, Neurocomputing

      Vol:
    E95-D No:9
      Page(s):
    2327-2338

    In this paper, we propose a hardware solution for overcoming the problem of high computational demands in a nearest neighbor (NN) based multi-prototype learning system. The multiple prototypes are obtained by a high-speed K-means clustering algorithm utilizing a concept of software-hardware cooperation that takes advantage of the flexibility of the software and the efficiency of the hardware. The one nearest neighbor (1-NN) classifier is used to recognize an object by searching for the nearest Euclidean distance among the prototypes. The major deficiency in conventional implementations for both K-means and 1-NN is the high computational demand of the nearest neighbor searching. This deficiency is resolved by an FPGA-implemented coprocessor that is a VLSI circuit for searching the nearest Euclidean distance. The coprocessor requires 12.9% logic elements and 58% block memory bits of an Altera Stratix III E110 FPGA device. The hardware communicates with the software by a PCI Express (4) local-bus-compatible interface. We benchmark our learning system against the popular case of handwritten digit recognition in which abundant previous works for comparison are available. In the case of the MNIST database, we could attain the most efficient accuracy rate of 97.91% with 930 prototypes, the learning speed of 1.310-4 s/sample and the classification speed of 3.9410-8 s/character.

  • Software-Based Parallel Cryptographic Solution with Massive-Parallel Memory-Embedded SIMD Matrix Architecture for Data-Storage Systems

    Takeshi KUMAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  Masaharu TAGAMI  Masakatsu ISHIZAKI  

     
    PAPER-Fundamentals of Information Systems

      Vol:
    E94-D No:9
      Page(s):
    1742-1754

    This paper presents a software-based parallel cryptographic solution with a massive-parallel memory-embedded SIMD matrix (MTX) for data-storage systems. MTX can have up to 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. Furthermore, a next-generation SIMD matrix called MX-2 has been developed by expanding processing-element capability of MTX from 2-bit to 4-bit processing. These SIMD matrix architectures are verified to be a better alternative for processing repeated-arithmetic and logical-operations in multimedia applications with low power consumption. Moreover, we have proposed combining Content Addressable Memory (CAM) technology with the massive-parallel memory-embedded SIMD matrix architecture to enable fast pipelined table-lookup coding. Since both arithmetic logical operation and table-lookup coding execute extremely fast on these architectures, efficient execution of encryption and decryption algorithms can be realized. Evaluation results of the CAM-less and CAM-enhanced massive-parallel SIMD matrix processor for the example of the Advanced Encryption Standard (AES), which is a widely-used cryptographic algorithm, show that a throughput of up to 2.19 Gbps becomes possible. This means that several standard data-storage transfer specifications, such as SD, CF (Compact Flash), USB (Universal Serial Bus) and SATA (Serial Advanced Technology Attachment) can be covered. Consequently, the massive-parallel SIMD matrix architecture is very suitable for private information protection in several data-storage media. A further advantage of the software based solution is the flexible update possibility of the implemented-cryptographic algorithm to a safer future algorithm. The massive-parallel memory-embedded SIMD matrix architecture (MTX and MX-2) is therefore a promising solution for integrated realization of real-time cryptographic algorithms with low power dissipation and small Si-area consumption.

  • An Optimal Channel Pin Assignment Algorithm for Hierarchical Building-Block Layout Design

    Tetsushi KOIDE  Shin'ichi WAKABAYASHI  Noriyoshi YOSHIDA  

     
    PAPER

      Vol:
    E76-A No:10
      Page(s):
    1636-1644

    This paper presents a linear time optimal algorithm to a channel pin assignment problem for hierarchical building-block layout design. The channel pin assignment problem is to determine positions of the pins of nets on the top and the bottom sides of a channel, which are partitioned into several intervals, and the pins are permutable within their associated intervals. The channel pin assignment problem has been shown NP-hard in general. We present a linear time optimal algorithm for an important special case of the problem, in which there is at most one pin of a net within each interval in the channel. The proposed algorithm is optimal in a sense that it can minimize both the channel density and the total wire length of the channel. We also disscuss how to apply our algorithm to the pin assignment in the L-shaped and staircase channels. Experimental results indicate that substantial reduction in both channel density and estimated total wire length can be obtained by permuting pins in each interval. Combining the proposed algorithm with a conventional channel router, results of channel routing also achieve large amount of reduction of the number of tracks, total wire length, and the number of vias.

  • Embedded Low-Power Dynamic TCAM Architecture with Transparently Scheduled Refresh

    Hideyuki NODA  Kazunari INOUE  Hans Jurgen MATTAUSCH  Tetsushi KOIDE  Katsumi DOSAKA  Kazutami ARIMOTO  Kazuyasu FUJISHIMA  Kenji ANAMI  Tsutomu YOSHIHARA  

     
    PAPER-Memory

      Vol:
    E88-C No:4
      Page(s):
    622-629

    This paper describes a dynamic TCAM architecture with planar complementary capacitors, transparently scheduled refresh (TSR), autonomous power management (APM) and address-input-free writing scheme. The complementary cell structure of the planar dynamic TCAM (PD-TCAM) allows small cell size of 4.79 µm2 in 130 nm CMOS technology, and realizes stable TCAM operation even with very small storage capacitance. Due to the TSR architecture, the PD-TCAM maintains functional compatibility with a conventional SRAM-based TCAM. The combined effects of the compact PD-TCAM array matrix and the APM technique result in up to 50% reduction of the total power consumption during search operation. In addition, an intelligent address-input-free writing scheme is also introduced to facilitate the PD-TCAM application for the user. Consequently the proposed architecture is quite attractive for realizing compact and low-power embedded TCAM macros for the design of system VLSI solutions in the field of networking applications.

  • A Timing-Driven Global Routing Algorithm with Pin Assignment, Block Reshaping, and Positioning for Building Block Layout

    Tetsushi KOIDE  Shin'ichi WAKABAYASHI  

     
    PAPER-Layout Optimization

      Vol:
    E81-A No:12
      Page(s):
    2476-2484

    This paper presents a timing-driven global routing algorithm based on coarse pin assignment, block reshaping, and positioning for VLSI building block layout. As opposed to conventional approaches, we combine pin assignment and global routing problems into one problem. The proposed algorithm determines global routes, coarse pin assignments, and block shapes and positions so as to minimize the chip area and total wire length of nets under the given timing constraints. It is based on an iterative improvement paradigm and performs rip-up and rerouting, block reshaping, and positioning in the manner of simulated evolution taking shapes of soft blocks and routing congestion into consideration until the solution is not further improved. The Elmore delay model is adopted for the interconnection delay model. Experimental results show the effectiveness of the proposed algorithm.

  • A Floorplanning Method with Topological Constraint Manipulation in VLSI Building Block Layout

    Tetsushi KOIDE  Yoshinori KATSURA  Katsumi YAMATANI  Shin'ichi WAKABAYASHI  Noriyoshi YOSHIDA  

     
    LETTER

      Vol:
    E77-A No:12
      Page(s):
    2053-2057

    This paper presents a heuristic floorplanning method that improves the method proposed by Vijayan and Tsay. It is based on tentative insertion of constraints, that intentionally produces redundant constraints to make it possible to search in a wide range of solution space. The proposed method can reduce the total area of blocks with the removal and insertion of constraints on the critical path in both horizontal and vertical constraint graphs. Experimental results for MCNC benchmarks showed that the quality of solutions of the proposed method is better than [7],[8] by about 15% on average, and even for the large number of blocks, the proposed method keeps the high quality of solutions.

  • Realization of K-Nearest-Matches Search Capability in Fully-Parallel Associative Memories

    Md. Anwarul ABEDIN  Yuki TANAKA  Ali AHMADI  Shogo SAKAKIBARA  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  

     
    LETTER-VLSI Design Technology and CAD

      Vol:
    E90-A No:6
      Page(s):
    1240-1243

    The realization of k-nearest-matches search capability in fully-parallel mixed digital-analog associative memories by a sequential autonomous search mode is reported. The proposed concept and circuit implementation can be applied with all types of distance measures such as Hamming, Manhattan or Euclidean distance search, and the k value can be freely selected during operation. A test chip for concept verification has been designed in 0.35 µm CMOS technology with two-poly, three-metal layers, realizes k-nearest-matches Euclidean distance search and consumes 5.12 mm2 of the chip area for 64 reference patterns each with 16 units of 5-bit.

  • A Performance-Driven Floorplanning Method with Interconnect Performance Estimation

    Shinya YAMASAKI  Shingo NAKAYA  Shin'ichi WAKABAYASHI  Tetsushi KOIDE  

     
    PAPER-Physical Design

      Vol:
    E85-A No:12
      Page(s):
    2775-2784

    In this paper, we propose a floorplanning method for VLSI building block layout. The proposed method produces a floorplan under the timing constraint for a given netlist. To evaluate the wiring delay, the proposed method estimates the global routing cost for each net with buffer insertion and wire sizing. The slicing structure is adopted to represent a floorplan, and the Elmore delay model is used to estimate the wiring delay. The proposed method is based on simulated annealing. To shorten the computation time, a table look-up method is adopted to calculate the wiring delay. Experimental results show that the proposed algorithm performs well for producing satisfactory floorplans for industrial data.

  • An Efficient Timing-Driven Global Routing Method for Standard Cell Layout

    Tetsushi KOIDE  Takeshi SUZUKI  Shin'ichi WAKABAYASHI  Noriyoshi YOSHIDA  

     
    PAPER-Lauout Synthesis

      Vol:
    E79-D No:10
      Page(s):
    1410-1418

    This paper presents a new timing-driven global routing method for standard cell layout. The proposed method can explicitly consider the timing constraint between two registers and minimize the channel density under the given timing constraint. In the proposed method, first, we determine the initial global routes. Next, we improve the global routes to satisfy the timing constraint between two registers as well as to minimize the channel density. Finally, for each cell row, the nets incident to terminals on the cell row are assigned to channels to minimize the channel density using 0-1 integer linear programming. We also show the experimental results of the proposed method implemented on an engineering workstation. Experimental results show that the proposed method is quite promising.

  • Classification with CNN features and SVM on Embedded DSP Core for Colorectal Magnified NBI Endoscopic Video Image

    Masayuki ODAGAWA  Takumi OKAMOTO  Tetsushi KOIDE  Toru TAMAKI  Shigeto YOSHIDA  Hiroshi MIENO  Shinji TANAKA  

     
    PAPER-VLSI Design Technology and CAD

      Pubricized:
    2021/07/21
      Vol:
    E105-A No:1
      Page(s):
    25-34

    In this paper, we present a classification method for a Computer-Aided Diagnosis (CAD) system in a colorectal magnified Narrow Band Imaging (NBI) endoscopy. In an endoscopic video image, color shift, blurring or reflection of light occurs in a lesion area, which affects the discrimination result by a computer. Therefore, in order to identify lesions with high robustness and stable classification to these images specific to video frame, we implement a CAD system for colorectal endoscopic images with the Convolutional Neural Network (CNN) feature and Support Vector Machine (SVM) classification on the embedded DSP core. To improve the robustness of CAD system, we construct the SVM learned by multiple image sizes data sets so as to adapt to the noise peculiar to the video image. We confirmed that the proposed method achieves higher robustness, stable, and high classification accuracy in the endoscopic video image. The proposed method also can cope with differences in resolution by old and new endoscopes and perform stably with respect to the input endoscopic video image.

  • A Reliability-Enhanced TCAM Architecture with Associated Embedded DRAM and ECC

    Hideyuki NODA  Katsumi DOSAKA  Hans Jurgen MATTAUSCH  Tetsushi KOIDE  Fukashi MORISHITA  Kazutami ARIMOTO  

     
    PAPER

      Vol:
    E89-C No:11
      Page(s):
    1612-1619

    This paper describes a novel TCAM architecture designed for enhancing the soft-error immunity. An associated embedded DRAM and ECC circuits are placed next to TCAM macro to implement a unique methodology of recovering upset bits due to soft errors. The proposed configuration allows an improvement of soft-error immunity by 6 orders of magnitude compared with the conventional TCAM. We also propose a novel testing methodology of the soft-error rate with a fast parallel multi-bit test. In addition, the proposed architecture resolves the critical problem of the look-up table maintenance of TCAM. The design techniques reported in this paper are especially attractive for realizing soft-error immune, high-performance TCAM chips.

  • A Hardware Implementation on Customizable Embedded DSP Core for Colorectal Tumor Classification with Endoscopic Video toward Real-Time Computer-Aided Diagnosais System

    Masayuki ODAGAWA  Takumi OKAMOTO  Tetsushi KOIDE  Toru TAMAKI  Bisser RAYTCHEV  Kazufumi KANEDA  Shigeto YOSHIDA  Hiroshi MIENO  Shinji TANAKA  Takayuki SUGAWARA  Hiroshi TOISHI  Masayuki TSUJI  Nobuo TAMBA  

     
    PAPER-VLSI Design Technology and CAD

      Pubricized:
    2020/10/06
      Vol:
    E104-A No:4
      Page(s):
    691-701

    In this paper, we present a hardware implementation of a colorectal cancer diagnosis support system using a colorectal endoscopic video image on customizable embedded DSP. In an endoscopic video image, color shift, blurring or reflection of light occurs in a lesion area, which affects the discrimination result by a computer. Therefore, in order to identify lesions with high robustness and stable classification to these images specific to video frame, we implement a computer-aided diagnosis (CAD) system for colorectal endoscopic images with Narrow Band Imaging (NBI) magnification with the Convolutional Neural Network (CNN) feature and Support Vector Machine (SVM) classification. Since CNN and SVM need to perform many multiplication and accumulation (MAC) operations, we implement the proposed hardware system on a customizable embedded DSP, which can realize at high speed MAC operations and parallel processing with Very Long Instruction Word (VLIW). Before implementing to the customizable embedded DSP, we profile and analyze processing cycles of the CAD system and optimize the bottlenecks. We show the effectiveness of the real-time diagnosis support system on the embedded system for endoscopic video images. The prototyped system demonstrated real-time processing on video frame rate (over 30fps @ 200MHz) and more than 90% accuracy.

  • A CAM-Based Signature-Matching Co-processor with Application-Driven Power-Reduction Features

    Kazunari INOUE  Hideyuki NODA  Kazutami ARIMOTO  Hans Jurgen MATTAUSCH  Tetsushi KOIDE  

     
    PAPER-Integrated Electronics

      Vol:
    E88-C No:6
      Page(s):
    1332-1342

    A signature-matching co-processor in 130 nm CMOS technology for application in the network-security field is presented. Two key search technologies, implemented with fully-parallel CAM-based search cores, enable the removal of misused packets from Giga-bit-per-second (G-bps) networks in real-time without disturbing the normal network traffic. The first technology is a thorough search through packet header as well as payload in byte-shifting manner and is capable of detecting viruses, even if they are hidden at an arbitrary position within the packet. A 1.125 Mbit ternary CAM, operated at the speed of 125 Mega-searches per second (M-sps), integrates the primary lookup table for thorough packet search. The second technology applies an additional relational search with programmable logical operations to detect recently appearing more complicated misused packets. A small 192-bit binary CAM operated at 31.25 M-sps is also included for this purpose. Power dissipation, being a major concern of CAM-based application-specific LSIs, is addressed in the light of the signature-matching application, which has a high probability of multiple matches and which doesn't require to mask individual bits of the search word. Consequently, two application-driven power-reduction methods are implemented, namely an improved pipelined search for efficiently reducing power even in the case of a large number of multiple matches, and a search-line encoding for cutting search-line related power dissipation. As a result the signature-matching co-processor features low power dissipation between 0.4 W and 1.1 W for the best case and the worst case search configurations, respectively.

  • 4-Port Unified Data/Instruction Cache Design with Distributed Crossbar and Interleaved Cache-Line Words

    Koh JOHGUCHI  Hans Jurgen MATTAUSCH  Tetsushi KOIDE  Tetsuo HIRONAKA  

     
    LETTER-Integrated Electronics

      Vol:
    E90-C No:11
      Page(s):
    2157-2160

    The presented unified data/instruction cache design uses multiple banks and features 4 ports, distributed crossbar, different word-length for data and instruction ports, interleaved cache-line words and synchronous access with hidden precharge. A 20.5 KByte storage capacity is integrated in 5-metal-layer CMOS logic technology with 200 nm minimum gate length and a 3.4 ns access-cycle time is achieved. The access bandwidth corresponds to 10 ports with standard word-length, while the cost in increased Si-area is only 25% in comparison to a 1-port cache.

  • Acceleration of DCT Processing with Massive-Parallel Memory-Embedded SIMD Matrix Processor

    Takeshi KUMAKI  Masakatsu ISHIZAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  Yasuto KURODA  Hideyuki NODA  Katsumi DOSAKA  Kazutami ARIMOTO  Kazunori SAITO  

     
    LETTER-Image Processing and Video Processing

      Vol:
    E90-D No:8
      Page(s):
    1312-1315

    This paper reports an efficient Discrete Cosine Transform (DCT) processing method for images using a massive-parallel memory-embedded SIMD matrix processor. The matrix-processing engine has 2,048 2-bit processing elements, which are connected by a flexible switching network, and supports 2-bit 2,048-way bit-serial and word-parallel operations with a single command. For compatibility with this matrix-processing architecture, the conventional DCT algorithm has been improved in arithmetic order and the vertical/horizontal-space 1 Dimensional (1D)-DCT processing has been further developed. Evaluation results of the matrix-engine-based DCT processing show that the necessary clock cycles per image block can be reduced by 87% in comprison to a conventional DSP architecture. The determined performances in MOPS and MOPS/mm2 are factors 8 and 5.6 better than with a conventional DSP, respectively.

  • Real-Time Huffman Encoder with Pipelined CAM-Based Data Path and Code-Word-Table Optimizer

    Takeshi KUMAKI  Yasuto KURODA  Masakatsu ISHIZAKI  Tetsushi KOIDE  Hans Jurgen MATTAUSCH  Hideyuki NODA  Katsumi DOSAKA  Kazutami ARIMOTO  Kazunori SAITO  

     
    PAPER-Image Processing and Video Processing

      Vol:
    E90-D No:1
      Page(s):
    334-345

    This paper presents a novel optimized real-time Huffman encoder using a pipelined data path based on CAM technology and a parallel code-word-table optimizer. The exploitation of CAM technology enables fast parallel search of the code word table. At the same time, the code word table is optimized according to the frequency of received input symbols and is up-dated in real-time. Since these two functions work in parallel, the proposed architecture realizes fast parallel encoding and keeps a constantly high compression ratio. Evaluation results for the JPEG application show that the proposed architecture can achieve up to 28% smaller encoded picture sizes than the conventional architectures. The obtained encoding time can be reduced by 95% in comparison to a conventional SRAM-based architecture, which is suitable even for the latest end-user-devices requiring fast frame-rates. Furthermore, the proposed architecture provides the only encoder that can simultaneously realize small compressed data size and fast processing speed.

  • A Graph Bisection Algorithm Based on Subgraph Migration

    Kazunori ISOMOTO  Yoshiyasu MIMASA  Shin'ichi WAKABAYASHI  Tetsushi KOIDE  Noriyoshi YOSHIDA  

     
    PAPER

      Vol:
    E77-A No:12
      Page(s):
    2039-2044

    The graph bisection problem is to partition a given graph into two subgraphs with equal size with minimizing the cutsize. This problem is NP-hard, and hence several heuristic algorithms have been proposed. Among them, the Kernighan-Lin algorithm and the Fiduccia-Mattheyses algorithm are well known, and widely used in practical applications. Since those algorithms are iterative improvement algorithms, in which the current solution is iteratively improved by interchanging a pair of two nodes belonging to different subgraphs, or moving one node from one subgraph to the other, those algorithms tend to fall into a local optimum. In this paper, we present a heuristic algorithm based on subgraph migration to avoid falling into a local optimum. In this algorithm, an initial solution is given, and it is improved by moving a subgraph, which is effective to reduce the cutsize. The algorithm repeats this operation until no further improvement can be achieved. Finally, the balance of the bisection is restored by moving nodes to get a final solution. Experimental results show that the proposed algorithm gets better solutions than the Kernighan-Lin and Fiduccia-Mattheyses algorithms.

1-20hit(24hit)