Author Search Result

[Author] Hideharu AMANO(66hit)

1-20hit(66hit)

  • Design and Implementation of RHiNET-2/NI0: A Reconfigurable Network Interface for Cluster Computing

    Tomonori YOKOYAMA  Naoyuki IZU  Jun-ichiro TSUCHIYA  Konosuke WATANABE  Hideharu AMANO  Tomohiro KUDOH  

     
    PAPER

    Vol: E86-D No:5  Page(s): 789-795

    A reconfigurable network interface called RHiNET-2/NI0 is developed for parallel processing of PCs distributed within one or more floors of a building. Two configurations can be selected according to the network requirement: the HS (High Speed) configuration, which provides only a high-speed primitive, and the DSM (Distributed Shared Memory) configuration, which supports sophisticated primitives. From the empirical evaluation, it appears that the HS configuration markedly improves the latency of data transfer compared with traditional network interfaces. On the other hand, the DSM configuration executes sophisticated primitives for distributed shared memory more than twice as fast as a software implementation.

  • A Fine-Grained Multicasting of Configuration Data for Coarse-Grained Reconfigurable Architectures

    Takuya KOJIMA  Hideharu AMANO  

     
    PAPER-Computer System

    Publicized: 2019/04/05  Vol: E102-D No:7  Page(s): 1247-1256

    A novel configuration data compression technique for coarse-grained reconfigurable architectures (CGRAs) is proposed. Reducing the size of configuration data of CGRAs shortens the reconfiguration time, especially when the communication bandwidth between a CGRA and a host CPU is limited. In addition, it saves energy consumption of the configuration cache and controller. The proposed technique is based on a multicast configuration technique called RoMultiC, which reduces the configuration time by multicasting the same data to multiple PEs (Processing Elements) with two bit-maps. Scheduling algorithms for optimizing the order of multicasting have been proposed. However, the multicasting is possible only if each PE has exactly the same configuration. In general, configuration data for CGRAs can be divided into fields, like the machine code formats of general-purpose CPUs. The proposed scheme confines multicasting to a part of the fields so that the possibility of multicasting to more PEs can be increased. This paper analyzes algorithms to find a configuration pattern which maximizes the number of multicasted PEs. We implemented the proposed scheme on CMA (Cool Mega Array), a straightforward CGRA, as a case study. Experimental results show that the proposed method achieves a configuration up to 40.0% smaller than a previous method for an image processing application. The exploration of the multicasted grain size reveals the effective grain size for each algorithm. Furthermore, since both the dynamic power consumption of the configuration controller and the configuration time are improved, it achieves a 50.1% reduction of the energy consumption for the configuration with a negligible area overhead.
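
The two-bit-map multicast mentioned in the abstract above can be pictured with a minimal sketch. The abstract only says that two bit-maps select the receiving PEs; treating them as one row bit-map and one column bit-map is an assumption here, and the names `multicast_targets` and `write_config` are hypothetical.

```python
from typing import Dict, List, Tuple

PE = Tuple[int, int]  # (row, column) position of a processing element


def multicast_targets(row_bits: List[int], col_bits: List[int]) -> List[PE]:
    """Select every PE whose row bit and column bit are both set.

    Assumption: the two bit-maps address rows and columns of the PE array.
    """
    return [(r, c)
            for r, row_on in enumerate(row_bits) if row_on
            for c, col_on in enumerate(col_bits) if col_on]


def write_config(array: Dict[PE, int], word: int,
                 row_bits: List[int], col_bits: List[int]) -> int:
    """Write one configuration word to all selected PEs in a single step."""
    targets = multicast_targets(row_bits, col_bits)
    for pe in targets:
        array[pe] = word
    return len(targets)


array: Dict[PE, int] = {}
# Rows 0 and 2, columns 1 and 2 are selected: four PEs share one transfer.
written = write_config(array, 0x2A, [1, 0, 1, 0], [0, 1, 1, 0])
```

Under this assumption, one transfer configures four PEs, which is why multicasting only pays off when the selected PEs need identical data.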

  • Pot: A General Purpose Monitor for Parallel Computers

    Yuso KANAMORI  Oki MINABE  Masaki WAKABAYASHI  Hideharu AMANO  

     
    PAPER

    Vol: E86-D No:10  Page(s): 2025-2033

    At the initial stage of developing parallel machines, a software monitor, which manages communication between host computers, program loading and debugging, is necessary. However, it is often a cumbersome job to develop such a monitoring system, especially when the target has a parallel architecture. To solve this problem, we developed an integrated monitor system called "Pot". "Pot" consists of a system that runs on the host computer and simple code on a target machine. In order to reduce the development costs, the program on a target machine is as simple as possible while "Pot" on the host computer itself provides various functions for system development.

  • An FPGA-Based Acceleration Method for Metabolic Simulation

    Yasunori OSANA  Tomonori FUKUSHIMA  Masato YOSHIMI  Hideharu AMANO  

     
    PAPER-Reconfigurable Systems

    Vol: E87-D No:8  Page(s): 2029-2037

    Computer simulation of cellular processes is one of the most important applications in bioinformatics. Since such simulators need huge computational resources, many biologists must use expensive PC/WS clusters. ReCSiP is an FPGA-based, reconfigurable accelerator which aims to realize an economical high-performance simulation environment on desktop computers. It can exploit fine-grain parallelism in the target applications by small hardware modules in the FPGA which work in parallel. As the first step to implement a simulator of cellular processes on ReCSiP, a solver to perform a basic simulation of metabolism was implemented. The throughput of the solver was about 29 times faster than the software on Intel's Pentium III operating at 1.13 GHz.

  • A Batcher-Double-Omega Network with Combining

    Kalidou GAYE  Hideharu AMANO  

     
    PAPER-Computer Networks

    Vol: E75-D No:3  Page(s): 307-314

    The Batcher banyan network is well known as a non-blocking switching fabric. However, it is conflict free only when there are no packets for the same destination. To cope with an arbitrary combination of packets, an additional network or special control sequence is required, which increases the hardware or degrades performance. A Batcher Double Omega network with Combining (BDOC) is an elegant solution to this problem. It consists of a Batcher sorter and two double-sized Omega networks. As in the Batcher banyan network, packets are sorted by the destination label in the Batcher sorter. In the first Omega network, called the distributer, a packet is routed by a tag corresponding to the sum of the label at the output of the Batcher sorter and the destination label. In the second (Inverse) Omega network, called the concentrator, the original destination label is used as the routing tag, and packets are routed without any conflict. The BDOC is useful as an interconnection network to connect processors and memory modules in a multiprocessor. Unlike conventional multistage interconnection networks for multiprocessors, packets are transferred in a serial and synchronized manner. The simple structure of the switching element enables high-speed operation, which reduces the latency caused by the serial communication. Using pipelined circuit switching, the address and data packets share the same control signal, and the structure of the switching element is much simplified. Moreover, packet combining, which avoids hot spot contention, is realized easily in the concentrator.
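
The distributer tag described in the abstract above (sorter output label plus destination label) can be illustrated with a small sketch. It assumes N packets with destinations in 0..N-1, uses Python's `sorted` as a stand-in for the Batcher sorter, and the name `distributer_tags` is hypothetical. It shows only the tag arithmetic, not the Omega network routing itself.

```python
from typing import List, Tuple


def distributer_tags(destinations: List[int]) -> List[Tuple[int, int]]:
    """Return (destination, tag) pairs in sorter output order.

    tag = sorter output position + destination label, as in the abstract.
    """
    ordered = sorted(destinations)  # stands in for the Batcher sorter
    return [(dest, pos + dest) for pos, dest in enumerate(ordered)]


# Eight packets, several of them aimed at the same destination (3).
packets = [3, 1, 3, 0, 3, 1, 2, 3]
routed = distributer_tags(packets)
tags = [tag for _, tag in routed]
```

Because the position strictly increases while the sorted destination never decreases, the tags come out strictly increasing, so packets for the same destination get distinct tags. The largest possible tag is 2N-2, which fits in a network with 2N outputs and is consistent with the double-sized Omega networks named in the abstract.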

  • A Dynamically Adaptive Hardware on Dynamically Reconfigurable Processor

    Hideharu AMANO  Akiya JOURAKU  Kenichiro ANJO  

     
    INVITED PAPER

    Vol: E86-B No:12  Page(s): 3385-3391

    A framework for a dynamically adaptive hardware mechanism on multicontext reconfigurable devices is proposed, and as an example, an adaptive switching fabric is implemented on NEC's novel reconfigurable device DRP (Dynamically Reconfigurable Processor). In this switch, contexts for the full crossbar and alternative hardware modules, which provide larger bandwidth but can treat only a limited pattern of packet inputs, are prepared. Using the quick context switching functionality, a context for the full crossbar is replaced by alternative contexts according to the packet input pattern. If the context corresponding to the requested alternative hardware modules is not inside the chip, it is loaded from outside the chip to currently unused context memory, then replaced with the full-size crossbar. If the traffic includes a lot of packets for specific destinations, a set of contexts frequently used in the traffic is gathered inside the chip like a working set stored in a cache. A 4×4 mesh network connected with the proposed adaptive switches is simulated, and it appears that the latency between nodes is improved by a factor of three when the traffic between four neighboring nodes is dominant.

  • A VLSI Switch for a Digital PBX

    Suhut Hasiholan PURBA  Hideharu AMANO  Yasuro SHOBATAKE  Hideo AISO  

     
    LETTER-Switching Systems and Communication Processing

    Vol: E69-E No:7  Page(s): 771-774

    In this letter, an economical VLSI switch, MWS, is proposed. MWS is constructed using a large amount of normal-speed RAM and provides sufficient switching capacity without high-speed devices or technology.

  • Wavelength Division Multiple Access Ring -- Virtual Topology on a Simple Ring Network --

    Xiaoshe DONG  Tomohiro KUDOH  Hideharu AMANO  

     
    PAPER-Computer Systems

    Vol: E81-D No:4  Page(s): 345-354

    In this paper, the Wavelength Division Multiple access (WDM) ring is proposed for interconnection in workstation clusters or parallel machines. This network consists of ring-connected routers, each of which selectively passes signals addressed in some particular wavelengths. Other wavelengths are first converted to electric signals and re-transmitted, addressed in different wavelengths. Wavelengths are assigned to divisors of the number of nodes in the system. Using the regular WDM ring with imaginary nodes, the diameter and average distance are reduced even if the number of nodes has few divisors. It provides better diameter and average distance than those of the uni-directional torus. Although the diameter and average distance are worse than those of ShuffleNet, the physical structure of the WDM ring is simple and the available number of nodes is flexible.

  • A Preemption Algorithm for a Multitasking Environment on Dynamically Reconfigurable Processors

    Vu Manh TUAN  Hideharu AMANO  

     
    PAPER-Computer Systems

    Vol: E91-D No:12  Page(s): 2793-2803

    Task preemption is a critical mechanism for building an effective multi-tasking environment on dynamically reconfigurable processors. When a task is preempted, its necessary state information must be correctly preserved in order for the task to be resumed later. Not only do coarse-grained Dynamically Reconfigurable Processing Array (DRPA) devices have different architectures using a variety of development tools, but the large amount of state data of hardware tasks executing on such devices is usually distributed over many different storage elements. To address these difficulties, this paper aims at studying a general method for capturing the state data of hardware tasks targeting coarse-grained DRPAs. Based on resource usage, algorithms for identifying preemption points and inserting preemption states subject to user-specified preemption latency are proposed. Moreover, a modification to automatically incorporate the proposed steps into the system design flow is also discussed. The performance degradation caused by additional preemption states is minimized by allowing preemption only at predefined points where demanded resources are small. The evaluation result using a model based on NEC Electronics' DRP-1 shows that the proposed method can produce preemption points satisfying a given preemption latency with reasonable hardware overhead (from 6% to 15%).

  • The RDT Router Chip: A Versatile Router for Supporting a Distributed Shared Memory

    Hiroaki NISHI  Ken-ichiro ANJO  Tomohiro KUDOH  Hideharu AMANO  

     
    PAPER-Interconnection Networks

    Vol: E80-D No:9  Page(s): 854-862

    JUMP-1 is currently under development by seven Japanese universities to establish techniques for building an efficient distributed shared memory on a massively parallel processor. It provides a coherent cache with a reduced hierarchical bit-map directory scheme to achieve cost-effective and high-performance management. Messages for the coherent cache are transferred through a fat tree on the RDT (Recursive Diagonal Torus) interconnection network. The RDT router supports versatile functions including multicast and acknowledge combining for the reduced hierarchical bit-map directory scheme. By using 0.5µm BiCMOS SOG technology, it can transfer all packets synchronized with a unique CPU clock (50MHz). Long coaxial cables (4m at maximum) are directly driven with the ECL interface of this chip. Using the dual port RAM, packet buffers allow a flit of the packet to be pushed and pulled simultaneously.

  • MINC: Multistage Interconnection Network with Cache Control Mechanism

    Toshihiro HANAWA  Takayuki KAMEI  Hideki YASUKAWA  Katsunobu NISHIMURA  Hideharu AMANO  

     
    PAPER-Interconnection Networks

    Vol: E80-D No:9  Page(s): 863-870

    A novel approach to the cache coherent Multistage Interconnection Network (MIN) called the MINC (MIN with Cache control mechanism) is proposed. In the MINC, the directory is located only on the shared memory using the Reduced Hierarchical Bit-map Directory schemes (RHBDs). In the RHBD, the bit-map directory is reduced and carried in the packet header for quick multicasting without accessing the directory in each hierarchy. In order to reduce unnecessary packets caused by compacting the bit map in the RHBD, a small cache called the pruning cache is introduced in the switching element. The simulation reveals that the pruning cache works most effectively when it is provided in every switching element of the first stage, and it reduces the congestion by more than 50% with only 4 entries. The MINC cache control chip with 16 inputs/outputs is implemented on the LPGA (Laser Programmable Gate Array), and works with a 66 MHz clock.

  • Architecture and Evaluation of a Third-Generation RHiNET Switch for High-Performance Parallel Computing

    Hiroaki NISHI  Shinji NISHIMURA  Katsuyoshi HARASAWA  Tomohiro KUDOH  Hideharu AMANO  

     
    PAPER

    Vol: E86-D No:10  Page(s): 1987-1995

    RHiNET-3/SW is the third-generation switch used in the RHiNET-3 system. It provides both low-latency processing and flexible connection due to its use of a credit-based flow-control mechanism, topology-free routing, and deadlock-free routing. The aggregate throughput of RHiNET-3/SW is 80 Gbps, and the latency is 140 ns. RHiNET-3/SW also provides a hop-by-hop retransmission mechanism. Simulation demonstrated that the effective throughput at a node in a 64-node torus RHiNET-3 system is equivalent to the effective throughput of a 64-bit 33-MHz PCI bus and that the performance of RHiNET-3/SW almost equals or exceeds the best performance of RHiNET-2/SW, the second-generation switch. Although credit-based flow control requires 26% more gates than rate-based flow control to manage the virtual channels (VCs), it requires less VC memory than rate-based flow control. Moreover, its use in a network system reduces latency and increases the maximum throughput compared to rate-based flow control.
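
As a generic illustration of the credit-based flow control named in the abstract above, the sketch below models a single link in which the sender may transmit only while it holds credits for free receiver buffer slots. This is a textbook-style simplification, not the RHiNET-3/SW design; the class `CreditLink` and its methods are hypothetical.

```python
from collections import deque


class CreditLink:
    """One sender/receiver pair with a fixed number of receiver buffer slots."""

    def __init__(self, buffer_slots: int) -> None:
        self.credits = buffer_slots   # sender's view of free receiver slots
        self.rx_buffer = deque()      # flits waiting at the receiver

    def send(self, flit) -> bool:
        """Transmit a flit if a credit is available; otherwise refuse."""
        if self.credits == 0:
            return False              # no free slot, so the sender must wait
        self.credits -= 1
        self.rx_buffer.append(flit)
        return True

    def drain(self):
        """Receiver forwards one flit and returns a credit to the sender."""
        flit = self.rx_buffer.popleft()
        self.credits += 1
        return flit


link = CreditLink(buffer_slots=2)
results = [link.send(f) for f in ("a", "b", "c")]  # third send is refused
first = link.drain()                               # frees one slot
retry = link.send("c")                             # now succeeds
```

Since a flit is sent only when a slot is known to be free, the receiver buffer can never overflow in this model.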

  • Proxy Responses by FPGA-Based Switch for MapReduce Stragglers

    Koya MITSUZUKA  Michihiro KOIBUCHI  Hideharu AMANO  Hiroki MATSUTANI  

     
    PAPER-Computer System

    Publicized: 2018/06/15  Vol: E101-D No:9  Page(s): 2258-2268

    In parallel processing applications, a few worker nodes called “stragglers”, which execute their tasks significantly slower than other tasks, increase the execution time of the job. In this paper, we propose a network-switch-based straggler handling system to mitigate the burden of the compute nodes. We also propose how to offload detecting stragglers and computing their results in the network switch with no additional communications between worker nodes. We introduce some approximate techniques for the proxy computation and response at the switch; thus our switch is called “ApproxSW.” As a result of a simulation experiment, the proposed approximation based on task similarity achieves the best accuracy in terms of quality of generated Map outputs. We also analyze how to suppress unnecessary proxy computation by the ApproxSW. We implement ApproxSW on a NetFPGA-SUME board that has four 10Gbit Ethernet (10GbE) interfaces and a Virtex-7 FPGA. Experimental results show that the ApproxSW functions do not degrade the original 10GbE switch performance.

  • Vertical Link On/Off Regulations for Inductive-Coupling Based Wireless 3-D NoCs

    Hao ZHANG  Hiroki MATSUTANI  Yasuhiro TAKE  Tadahiro KURODA  Hideharu AMANO  

     
    PAPER-Computer System

    Vol: E96-D No:12  Page(s): 2753-2764

    We propose low-power techniques for wireless three-dimensional Network-on-Chips (wireless 3-D NoCs), in which the connections among routers on the same chip are wired while the routers on different chips are connected wirelessly using inductive-coupling. The proposed low-power techniques stop the clock and power supplies to the transmitter of the wireless vertical links only when their utilizations are higher than the threshold. Meanwhile, the whole wireless vertical link will be shut down when the utilization is lower than the threshold in order to reduce the power consumption of wireless 3-D NoCs. This paper uses an on-demand method, in which the dormant data transmitter or the whole vertical link will be activated as long as a flit comes. Full-system many-core simulations using power parameters derived from a real chip implementation show that the proposed low-power techniques reduce the power consumption by 23.4%-29.3%, while the performance overhead is less than 2.4%.

  • Implementation of Data Driven Applications on a Multi-Context Reconfigurable Device

    Masaki UNO  Yuichiro SHIBATA  Hideharu AMANO  

     
    PAPER

    Vol: E86-D No:5  Page(s): 841-849

    WASMII is a virtual hardware system that executes dataflow algorithms using a dynamically reconfigurable multi-context device with a data driven control mechanism. Although the effectiveness of the system has been evaluated through simulations and using an emulator, implementation of WASMII was infeasible due to the unavailability of such a device. However, the first prototype of a practical dynamically reconfigurable multi-context device called DRL has been developed by NEC, and we developed a reconfigurable test bed using four sample DRL chips. On this board, we have implemented and executed some simple applications of the WASMII mechanism. Evaluation results show that the performance of the parallel implementation of WASMII is almost twice that of a PC with a CPU based on the corresponding technology.

  • Code Compression with Split Echo Instructions

    Iver STUBDAL  Arda KARADUMAN  Hideharu AMANO  

     
    PAPER-Fundamentals of Software and Theory of Programs

    Vol: E92-D No:9  Page(s): 1650-1656

    Code density is often a critical issue in embedded computers, since the memory size of embedded systems is strictly limited. Echo instructions have been proposed as a method for reducing code size. This paper presents a new type of echo instruction, split echo, and evaluates an implementation of both split echo and traditional echo instructions on a MIPS R3000 based processor. Evaluation results show that the memory requirement is reduced by 12% on average with a small additional hardware cost.

  • Traffic-Independent Multi-Path Routing for High-Throughput Data Center Networks

    Ryuta KAWANO  Ryota YASUDO  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Computer System

    Publicized: 2020/08/06  Vol: E103-D No:12  Page(s): 2471-2479

    Network throughput has become an important issue for big-data analysis on Warehouse-Scale Computing (WSC) systems. It has been reported that randomly-connected inter-switch networks can enlarge the network throughput. For irregular networks, a multi-path routing method called k-shortest path routing is conventionally utilized. However, it cannot efficiently exploit longer-than-shortest paths that would be detour paths to avoid bottlenecks. In this work, a novel routing method called k-optimized path routing to achieve high throughput is proposed for irregular networks. We introduce a heuristic to select detour paths that can avoid bottlenecks in the network to improve the average-case network throughput. Experimental results by network simulation show that the proposed k-optimized path routing can improve the saturation throughput by up to 18.2% compared to the conventional k-shortest path routing. Moreover, it can reduce the computation time required for optimization to 1/2760 at a minimum compared to our previously proposed method.
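
The conventional k-shortest path routing that the abstract above uses as its baseline can be sketched by brute force on a tiny irregular network. This is not the proposed k-optimized path routing; it enumerates all simple paths and keeps the k with the fewest hops, which is only practical for very small graphs. The functions `simple_paths` and `k_shortest_paths` and the example topology are hypothetical, and hop count is assumed as the path length.

```python
from typing import Dict, Iterator, List, Optional


def simple_paths(graph: Dict[str, List[str]], src: str, dst: str,
                 path: Optional[List[str]] = None) -> Iterator[List[str]]:
    """Yield every loop-free path from src to dst by depth-first search."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for nxt in graph[src]:
        if nxt not in path:  # never revisit a switch
            yield from simple_paths(graph, nxt, dst, path)


def k_shortest_paths(graph: Dict[str, List[str]], src: str, dst: str,
                     k: int) -> List[List[str]]:
    """Return the k paths with the fewest hops (ties broken by node names)."""
    ranked = sorted(simple_paths(graph, src, dst), key=lambda p: (len(p), p))
    return ranked[:k]


# A small four-switch network with one cross link between B and C.
graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
paths = k_shortest_paths(graph, "A", "D", 3)
```

With k = 3, the two 2-hop paths are chosen first and only one of the 3-hop detours is kept, purely by length. The selection ignores where bottlenecks lie, which is the limitation the abstract attributes to this baseline.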

  • A Routing Algorithm for Multihop WDM Ring

    Xiaoshe DONG  Tomohiro KUDOH  Hideharu AMANO  

     
    PAPER-Computer Networks

    Vol: E82-D No:2  Page(s): 422-430

    The Divisor-Skip Wavelength Division Multiplexing (DS-WDM) ring is an optical interconnection network for workstation clusters or parallel machines which can easily connect various numbers of nodes using wavelength division multiplexing techniques. However, the wavelength-ordered routing algorithm proposed for the DS-WDM ring requires complicated processes in each router. Here, a new routing algorithm called the comparing dimensional number routing algorithm for the DS-WDM ring is proposed and evaluated. Although the diameter and average distance are almost the same as those of traditional wavelength-ordered routing, the cost and latency are much reduced.

  • Fault Tolerance of the TBSF (Tandem Banyan Switching Fabrics) and PBSF (Piled Banyan Switching Fabrics)

    Akira FUNAHASHI  Toshihiro HANAWA  Hideharu AMANO  

     
    PAPER-Fault Diagnosis/Tolerance

    Vol: E79-D No:8  Page(s): 1180-1189

    Multistage Interconnection Networks (MINs) with multiple outlets are networks which can support higher bandwidth than that of nonblocking networks by passing multiple packets to the same destination. Fault recovery mechanisms are proposed for two such networks (TBSF/PBSF) that make the best use of their inherent fault tolerant capability. With these mechanisms, on-the-fly fault recovery is possible for multiple faults on switching elements. For link faults, the networks are reconfigured after fault diagnosis, and the network is available with some performance degradation. The bandwidth degradation under multiple faults on links/elements is analyzed with both theoretical models and simulation. Through the analysis, F-PBSF shows high fault tolerance under high traffic load and low reliability by using 3 or more banyan networks.

  • Reconfigurable Out-of-Order System for Fluid Dynamics Computation Using Unstructured Mesh

    Takayuki AKAMINE  Mohamad Sofian ABU TALIP  Yasunori OSANA  Naoyuki FUJITA  Hideharu AMANO  

     
    PAPER-Computer System

    Vol: E97-D No:5  Page(s): 1225-1234

    Computational fluid dynamics (CFD) is an important tool for designing aircraft components. FaSTAR (Fast Aerodynamics Routines) is one of the most recent CFD packages and has various subroutines. However, its irregular and complicated data structure makes it difficult to execute FaSTAR on parallel machines due to memory access problems. The use of a reconfigurable platform based on field programmable gate arrays (FPGAs) is a promising approach to accelerating memory-bottlenecked applications like FaSTAR. However, even with hardware execution, a large number of pipeline stalls can occur due to read-after-write (RAW) data hazards. Moreover, it is difficult to predict when such stalls will occur because of the unstructured mesh used in FaSTAR. To eliminate this problem, we developed an out-of-order mechanism for permuting the data order so as to prevent RAW hazards. It uses an execution monitor and a wait buffer. The former identifies the state of the computation units, and the latter temporarily stores data to be processed in the computation units. This out-of-order mechanism can be applied to various types of computations with data dependency by changing the number of execution monitors and wait buffers in accordance with the equations used in the target computation. An out-of-order system can be reconfigured by automatically changing the parameters. Application of the proposed mechanism to five subroutines in FaSTAR showed that its use reduces the number of stalls to less than 1% of that without the mechanism. The system was 2.6 times faster than in-order execution and 2.9 times faster than software execution on an Intel Core 2 Duo processor, with a reasonable amount of overhead.
