Author Search Results

[Author] Hiroki MATSUTANI (21 hits)

Showing 1-20 of 21 hits

  • In-GPU Cache for Acceleration of Anomaly Detection in Blockchain

    Shin MORISHIMA  Hiroki MATSUTANI  

     
    PAPER-Computer System

      Publicized: 2020/04/28
      Vol: E103-D No:8
      Page(s): 1814-1824

    Blockchain is a distributed ledger system composed of a P2P network and is used for a wide range of applications, such as international remittance, inter-individual transactions, and asset conservation. In Blockchain systems, tamper resistance is enhanced by the property that a transaction cannot be changed or deleted by anyone, including its creator. However, this property also means that an unintended transaction created by an operational mistake or secret-key theft cannot be corrected later. Because of this, once an illegal transaction such as a theft occurs, the damage expands. To suppress the damage, we need countermeasures such as detecting illegal transactions at high speed and correcting them before approval. However, high-speed anomaly detection in Blockchain is computationally heavy, because the detection process must be repeated with various feature quantities and the feature extractions become an overhead. In this paper, to accelerate anomaly detection, we propose caching the transaction information necessary for feature extraction in GPU device memory and performing both feature extraction and anomaly detection on the GPU. We also propose a conditional feature extraction method to reduce the computation cost of anomaly detection, and we employ anomaly detection using the K-means algorithm based on the conditional features. When the number of users is one million and the number of transactions is 100 million, our proposed method is 8.6 times faster than a CPU processing method and 2.6 times faster than a GPU processing method that does not perform feature extraction on the GPU. In addition, the conditional feature extraction method is 1.7 times faster than the unconditional method when the number of users satisfying a given condition is 200 thousand out of one million.
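
    To make the detection stage concrete, the following NumPy sketch clusters per-user feature vectors with K-means and scores each user by the distance to its assigned centroid; the actual system performs these steps in GPU device memory, and the feature definitions and condition used here are illustrative assumptions rather than the paper's exact ones.

    ```python
    import numpy as np

    def kmeans(X, k, iters=20, seed=0):
        # Plain K-means; the paper runs this stage inside the GPU.
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
            for c in range(k):
                if np.any(labels == c):
                    centers[c] = X[labels == c].mean(axis=0)
        return centers, labels

    def anomaly_scores(X, centers, labels):
        # Distance to the assigned centroid serves as the anomaly score.
        return np.linalg.norm(X - centers[labels], axis=1)

    def conditional_features(tx_cache, condition):
        # Conditional feature extraction: extract features only for users
        # whose cached transactions satisfy the given condition.
        users = [u for u in tx_cache if condition(tx_cache[u])]
        feats = np.array([[len(tx_cache[u]),                       # tx count
                           sum(t["amount"] for t in tx_cache[u])]  # total amount
                          for u in users], dtype=float)
        return users, feats
    ```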

  • A Port Combination Methodology for Application-Specific Networks-on-Chip on FPGAs

    Daihan WANG  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Reconfigurable System and Applications

      Vol: E90-D No:12
      Page(s): 1914-1922

    A temporal-correlation-based port combination algorithm that customizes the router design in a Network-on-Chip (NoC) is proposed for reconfigurable systems in order to minimize the required hardware amount. Given the traffic characteristics of the target application and the expected hardware reduction rate, the algorithm automatically generates a port combination plan for the network. Since the port combination technique largely preserves the topology, including the two-surface layout, it does not affect other design layers such as task mapping and scheduling. The algorithm is much more efficient than its counterpart without temporal correlation: for a multimedia stream processing application, it saves 55% of the hardware amount without performance degradation, while the algorithm without temporal correlation suffers a 30% performance loss.
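
    As a rough illustration of the idea (not the paper's exact algorithm), a greedy variant can merge the pair of ports whose busy cycles overlap least, since ports that are rarely busy at the same time can share one physical port cheaply; the traffic representation and cost function below are assumptions.

    ```python
    import itertools
    import numpy as np

    def temporal_overlap(a, b):
        # Ports that are seldom busy in the same cycle are cheap to combine;
        # use the summed per-cycle simultaneous load as the merge cost.
        return float(np.minimum(a, b).sum())

    def combine_ports(traffic, target_ports):
        # traffic: {port: per-cycle load vector}; greedily merge the pair
        # with the smallest temporal overlap until the port budget is met.
        ports = {p: np.asarray(v, dtype=float) for p, v in traffic.items()}
        plan = []
        while len(ports) > target_ports:
            (p, q), _ = min(((pair, temporal_overlap(ports[pair[0]], ports[pair[1]]))
                             for pair in itertools.combinations(ports, 2)),
                            key=lambda pc: pc[1])
            plan.append((p, q))
            ports[f"{p}+{q}"] = ports.pop(p) + ports.pop(q)
        return plan
    ```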

  • An Area-Efficient Recurrent Neural Network Core for Unsupervised Time-Series Anomaly Detection Open Access

    Takuya SAKUMA  Hiroki MATSUTANI  

     
    PAPER

      Publicized: 2020/12/15
      Vol: E104-C No:6
      Page(s): 247-256

    Since most sensor data depend on each other, time-series anomaly detection is one of the practical applications of IoT devices. Such tasks are handled by Recurrent Neural Networks (RNNs) with a feedback structure, such as Long Short-Term Memory. However, their learning phase based on Stochastic Gradient Descent (SGD) is computationally expensive for edge devices. This issue is often addressed by executing the learning on high-performance server machines, but that introduces a communication overhead and additional power consumption. On the other hand, the Recursive Least-Squares Echo State Network (RLS-ESN) is a simple RNN that can be trained at low cost using the least-squares method rather than SGD. In this paper, we propose an area-efficient hardware implementation of RLS-ESN for edge devices and adapt it to human activity anomaly detection as an example of interdependent time-series sensor data. The model is implemented in Verilog HDL, synthesized with a 45 nm process technology, and evaluated in terms of anomaly detection capability, hardware amount, and performance. The evaluation results demonstrate that the RLS-ESN core with a feedback structure is more robust to hyperparameters than an existing Online Sequential Extreme Learning Machine (OS-ELM) core, while consuming only 1.25 times the hardware amount and 1.11 times the latency of the OS-ELM core.
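
    The core update is standard enough to sketch: a fixed random reservoir plus a readout trained by recursive least squares, with the per-step prediction error usable as an anomaly score. A minimal NumPy version, with illustrative hyperparameter values:

    ```python
    import numpy as np

    class RLSESN:
        # Echo State Network whose readout is trained with Recursive Least
        # Squares (RLS) instead of SGD; the reservoir weights stay fixed.
        def __init__(self, n_in, n_res, n_out, rho=0.9, lam=0.999, seed=0):
            rng = np.random.default_rng(seed)
            self.Win = rng.uniform(-1, 1, (n_res, n_in))
            W = rng.uniform(-1, 1, (n_res, n_res))
            self.W = W * (rho / max(abs(np.linalg.eigvals(W))))  # set spectral radius
            self.Wout = np.zeros((n_out, n_res))
            self.P = np.eye(n_res) * 100.0  # inverse correlation matrix estimate
            self.x = np.zeros(n_res)
            self.lam = lam                  # forgetting factor

        def step(self, u):
            self.x = np.tanh(self.Win @ u + self.W @ self.x)
            return self.Wout @ self.x

        def train_step(self, u, d):
            y = self.step(u)
            e = d - y                                  # prediction error
            Px = self.P @ self.x
            g = Px / (self.lam + self.x @ Px)          # RLS gain vector
            self.P = (self.P - np.outer(g, Px)) / self.lam
            self.Wout += np.outer(e, g)
            return float(np.linalg.norm(e))            # usable as anomaly score
    ```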

  • An FPGA Acceleration and Optimization Techniques for 2D LiDAR SLAM Algorithm

    Keisuke SUGIURA  Hiroki MATSUTANI  

     
    PAPER-Computer System

      Publicized: 2021/03/04
      Vol: E104-D No:6
      Page(s): 789-800

    An efficient hardware implementation of Simultaneous Localization and Mapping (SLAM) methods is a necessity for mobile autonomous robots with limited computational resources. In this paper, we propose a resource-efficient FPGA implementation for accelerating scan matching computations, which are typically a major bottleneck in 2D LiDAR SLAM methods. Scan matching is the process of correcting a robot pose by aligning the latest LiDAR measurements with an occupancy grid map, which encodes information about the surrounding environment. We exploit the inherent parallelism of the Rao-Blackwellized Particle Filter (RBPF) based algorithm to perform scan matching computations for multiple particles in parallel. In the proposed design, several techniques are employed to reduce resource utilization and to achieve maximum throughput. Experimental results using benchmark datasets show that scan matching is accelerated by 5.31-8.75× and the overall throughput is improved by 3.72-5.10× without seriously degrading the quality of the final outputs. Furthermore, our proposed IP core requires only 44% of the total resources available on the TUL Pynq-Z2 FPGA board, thus facilitating the realization of SLAM applications on indoor mobile robots.
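
    A simplified software model of scan matching over an occupancy grid is sketched below: score a candidate pose by summing the occupancy values under the projected scan endpoints, then hill-climb over (x, y, θ). The FPGA design evaluates such scores for many particles in parallel; the grid conventions and step sizes here are assumptions.

    ```python
    import numpy as np

    def scan_score(grid, res, pose, scan):
        # Project scan endpoints (robot frame) into the grid using the
        # candidate pose and sum the occupancy values they land on.
        x, y, th = pose
        c, s = np.cos(th), np.sin(th)
        ix = ((x + scan[:, 0] * c - scan[:, 1] * s) / res).astype(int)
        iy = ((y + scan[:, 0] * s + scan[:, 1] * c) / res).astype(int)
        ok = (ix >= 0) & (ix < grid.shape[1]) & (iy >= 0) & (iy < grid.shape[0])
        return float(grid[iy[ok], ix[ok]].sum())

    def match_scan(grid, res, pose0, scan, step=(0.05, 0.05, 0.01)):
        # Greedy hill climbing over (x, y, theta); the FPGA design evaluates
        # candidate poses like these for many particles in parallel.
        best = np.array(pose0, dtype=float)
        best_s = scan_score(grid, res, best, scan)
        improved = True
        while improved:
            improved = False
            for i in range(3):
                for d in (-step[i], step[i]):
                    cand = best.copy()
                    cand[i] += d
                    sc = scan_score(grid, res, cand, scan)
                    if sc > best_s:
                        best, best_s, improved = cand, sc, True
        return best, best_s
    ```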

  • An FPGA-Based Optimizer Design for Distributed Deep Learning with Multiple GPUs

    Tomoya ITSUBO  Michihiro KOIBUCHI  Hideharu AMANO  Hiroki MATSUTANI  

     
    PAPER

      Publicized: 2021/07/01
      Vol: E104-D No:12
      Page(s): 2057-2067

    Since deep learning workloads perform a large number of matrix operations on training data, GPUs (Graphics Processing Units) are efficient especially for the training phase. A cluster of computers, each equipped with multiple GPUs, can significantly accelerate deep learning workloads. More specifically, a back-propagation algorithm following a gradient descent approach is used for training. Although the gradient computation is still a major bottleneck of training, gradient aggregation and optimization impose both communication and computation overheads, which should also be reduced to further shorten the training time. To address this issue, in this paper, multiple GPUs are interconnected with PCI Express (PCIe) over 10Gbit Ethernet (10GbE) technology. Since these remote GPUs are interconnected via network switches, gradient aggregation and optimizers (e.g., SGD, AdaGrad, Adam, and SMORMS3) are offloaded to FPGA-based 10GbE switches between the remote GPUs; thus, gradient aggregation and parameter optimization are completed in the network. The proposed FPGA-based 10GbE switches with the four optimizers are implemented on a NetFPGA-SUME board. Their resource utilization increases with the PEs for the optimizers, consuming up to 56% of the resources. Evaluation results using four remote GPUs connected via the proposed FPGA-based switch demonstrate that these optimizers are accelerated by up to 3.0x and 1.25x compared to CPU and GPU implementations, respectively. Also, the gradient aggregation throughput of the FPGA-based switch achieves up to 98.3% of the 10GbE line rate.
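
    Functionally, the switch-side pipeline can be modeled as "aggregate, then optimize". A minimal sketch with AdaGrad, one of the four offloaded optimizers (hyperparameters and data layout are illustrative):

    ```python
    import numpy as np

    class InNetworkOptimizer:
        # Software model of the switch-side pipeline: sum the gradients from
        # all workers, then run the optimizer on the aggregate in the network.
        def __init__(self, params, lr=0.01, eps=1e-8):
            self.w = params.astype(np.float32)
            self.h = np.zeros_like(self.w)   # AdaGrad accumulator
            self.lr, self.eps = lr, eps

        def round(self, worker_grads):
            g = np.sum(worker_grads, axis=0)                      # aggregation
            self.h += g * g                                       # AdaGrad
            self.w -= self.lr * g / (np.sqrt(self.h) + self.eps)  # update
            return self.w                    # broadcast back to the workers
    ```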

  • A Novel Channel Assignment Method to Ensure Deadlock-Freedom for Deterministic Routing

    Ryuta KAWANO  Hiroshi NAKAHARA  Seiichi TADE  Ikki FUJIWARA  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Computer System

      Publicized: 2017/05/19
      Vol: E100-D No:8
      Page(s): 1798-1806

    Inter-switch networks for HPC systems and data centers can be improved by applying random shortcut topologies with a reduced number of hops. With minimal routing in such networks, however, deadlock-freedom is not guaranteed. Multiple Virtual Channels (VCs) are commonly used to avoid this problem, but previous works do not provide good trade-offs between the number of required VCs and the time and memory complexities of the algorithm. In this work, a novel and fast algorithm, named ACRO, is proposed to make arbitrary routing functions deadlock-free while consuming a small number of VCs. A heuristic VC-reduction approach using a hash table improves the scalability of the algorithm compared with our previous work. Moreover, experimental results show that ACRO can reduce the average number of VCs by up to 63% compared with a conventional algorithm that has the same time complexity. Furthermore, ACRO reduces the time complexity by a factor of O(|N|⋅log|N|) compared with another conventional algorithm that requires almost the same number of VCs.
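
    The flavor of such channel-assignment algorithms can be sketched as follows: walk each path's channel-dependency edges and escalate to the next VC whenever an edge would close a cycle in the current VC's dependency graph. Since packets visit VCs in non-decreasing order and each per-VC graph stays acyclic, no deadlock cycle can form. This is a generic simplification, not ACRO itself (the hash-table-based VC reduction is omitted):

    ```python
    def assign_vcs(paths):
        # Assign each channel-to-channel dependency of each path to a virtual
        # channel (VC) layer whose dependency graph stays acyclic.
        layers = []    # one dependency graph (adjacency dict) per VC

        def creates_cycle(adj, u, v):
            # Would edge u -> v close a cycle? Search for u starting from v.
            stack, seen = [v], set()
            while stack:
                n = stack.pop()
                if n == u:
                    return True
                if n not in seen:
                    seen.add(n)
                    stack.extend(adj.get(n, ()))
            return False

        vc_of = {}    # (path index, hop index) -> VC number
        for pid, path in enumerate(paths):
            vc = 0    # VC numbers never decrease along a path
            for i in range(len(path) - 2):
                u = (path[i], path[i + 1])        # current channel
                v = (path[i + 1], path[i + 2])    # next channel
                while vc < len(layers) and creates_cycle(layers[vc], u, v):
                    vc += 1
                if vc == len(layers):
                    layers.append({})
                layers[vc].setdefault(u, set()).add(v)
                vc_of[(pid, i)] = vc
        return vc_of, len(layers)
    ```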

  • A Layout-Oriented Routing Method for Low-Latency HPC Networks

    Ryuta KAWANO  Hiroshi NAKAHARA  Ikki FUJIWARA  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Interconnection networks

      Publicized: 2017/07/14
      Vol: E100-D No:12
      Page(s): 2796-2807

    End-to-end network latency has become an important issue for parallel applications on large-scale high performance computing (HPC) systems. It has been reported that randomly-connected inter-switch networks can lower the end-to-end network latency, but this latency reduction comes at the cost of a large amount of routing information: minimal routing on irregular networks is achieved by using routing tables with entries for all destinations in the network. In this work, a novel distributed routing method called LOREN (Layout-Oriented Routing with Entries for Neighbors), which achieves low latency with small routing tables, is proposed for irregular networks whose link lengths are limited. The routing tables contain both physically and topologically nearby neighbor nodes to ensure livelock-freedom and a small number of hops between nodes. Experimental results show that LOREN reduces the average latency by 5.8% and improves the network throughput by up to 62% compared with a conventional compact routing method. Moreover, the number of required routing table entries is reduced by up to 91%, which improves scalability and flexibility for implementation.
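
    A table construction in this spirit might look like the sketch below: keep direct entries for topological neighbors plus precomputed next hops for physically nearby nodes. This is an illustrative reading of the idea, not LOREN's exact construction; the radius parameter and BFS next hop are assumptions, and the graph is assumed connected.

    ```python
    import math
    from collections import deque

    def next_hop(adj, src, dst):
        # First hop on a BFS shortest path (the graph is assumed connected).
        prev, q = {src: None}, deque([src])
        while q:
            n = q.popleft()
            if n == dst:
                break
            for m in adj[n]:
                if m not in prev:
                    prev[m] = n
                    q.append(m)
        n = dst
        while prev[n] != src:
            n = prev[n]
        return n

    def build_table(node, adj, pos, radius):
        # Entries for topological neighbors (direct links) plus physically
        # nearby nodes; link lengths are limited, so the table stays small.
        table = {n: n for n in adj[node]}
        for m in pos:
            if m != node and math.dist(pos[node], pos[m]) <= radius:
                table[m] = next_hop(adj, node, m)
        return table
    ```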

  • Novel Chip Stacking Methods to Extend Both Horizontally and Vertically for Many-Core Architectures with ThruChip Interface

    Hiroshi NAKAHARA  Tomoya OZAKI  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Architecture

      Publicized: 2016/08/24
      Vol: E99-D No:12
      Page(s): 2871-2880

    The recent increase in non-recurring engineering costs (design, mask, and test costs) has made large System-on-Chips (SoCs) difficult to develop, especially with advanced technology. We radically explore an approach for cheap and flexible chip stacking using the inductive-coupling ThruChip Interface (TCI). In order to connect a large number of small chips into a large-scale system, novel chip stacking methods called linear stacking and staggered stacking are proposed. They enable the system to be extended in the x and/or y dimensions, not only the z dimension. Here, a novel chip stacking layout and its deadlock-free routing design for both single-core and multi-core chips are shown. A 256-node network formed by the proposed stacking improves latency by 13.8% and the performance of the NAS Parallel Benchmarks by 5.4% on average compared to a 2D mesh.

  • An FPGA-Based Change-Point Detection for 10Gbps Packet Stream Open Access

    Takuma IWATA  Kohei NAKAMURA  Yuta TOKUSASHI  Hiroki MATSUTANI  

     
    PAPER-Computer System

      Publicized: 2019/07/23
      Vol: E102-D No:12
      Page(s): 2366-2376

    In statistical analysis and data mining, change-point detection, which identifies the times when the probability distribution of a time series changes, has been used for various purposes such as anomaly detection on network traffic and transaction data. However, the computation cost of a conventional AR (Auto-Regression) model based approach is too high to be feasible online. In this paper, an AR model based online change-point detection algorithm, called ChangeFinder, is implemented on an FPGA (Field Programmable Gate Array) based NIC (Network Interface Card). The proposed system computes the change-point score from time series data received over 10GbE (10Gbit Ethernet). More specifically, it computes the change-point score at the 10GbE NIC ahead of host applications, and it can find change-points on single or multiple streams using a context memory. This paper aims to reduce the host workload and improve change-point detection performance by offloading the ChangeFinder algorithm from the host to the NIC. As an evaluation, change-point detection on the FPGA NIC is compared in terms of throughput with a baseline software implementation and with versions enhanced by two network optimization techniques using DPDK and Netfilter. The result demonstrates a 16.8x improvement in change-point detection throughput compared to the baseline software implementation, which corresponds to the 10GbE line rate. Performance and area overheads when supporting multiple streams are also evaluated.
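
    ChangeFinder's two-stage structure (an online discounting AR model, smoothing, then a second AR stage over the smoothed scores) can be sketched in a simplified first-order form; the discounting rate and smoothing window below are illustrative:

    ```python
    import math

    class SDAR1:
        # Sequentially Discounting AR model of order 1 (simplified).
        def __init__(self, r=0.02):
            self.r = r
            self.mu = self.c0 = self.c1 = 0.0
            self.sigma = 1.0
            self.prev = 0.0

        def update(self, x):
            r = self.r
            self.mu = (1 - r) * self.mu + r * x
            self.c0 = (1 - r) * self.c0 + r * (x - self.mu) ** 2
            self.c1 = (1 - r) * self.c1 + r * (x - self.mu) * (self.prev - self.mu)
            a = self.c1 / self.c0 if self.c0 > 0 else 0.0    # AR(1) coefficient
            xhat = self.mu + a * (self.prev - self.mu)
            self.sigma = (1 - r) * self.sigma + r * (x - xhat) ** 2
            self.prev = x
            # Negative log-likelihood of x under the current model.
            return 0.5 * math.log(2 * math.pi * self.sigma) + (x - xhat) ** 2 / (2 * self.sigma)

    class ChangeFinder:
        # Two SDAR stages with smoothing in between.
        def __init__(self, r=0.02, smooth=5):
            self.s1, self.s2 = SDAR1(r), SDAR1(r)
            self.buf, self.smooth = [], smooth

        def score(self, x):
            self.buf = (self.buf + [self.s1.update(x)])[-self.smooth:]
            return self.s2.update(sum(self.buf) / len(self.buf))
    ```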

  • A Low-Cost Neural ODE with Depthwise Separable Convolution for Edge Domain Adaptation on FPGAs

    Hiroki KAWAKAMI  Hirohisa WATANABE  Keisuke SUGIURA  Hiroki MATSUTANI  

     
    PAPER-Computer System

      Publicized: 2023/04/05
      Vol: E106-D No:7
      Page(s): 1186-1197

    High-performance deep neural network (DNN)-based systems are in high demand in edge environments. Due to their high computational complexity, it is challenging to deploy DNNs on edge devices with strict limitations on computational resources. In this paper, we derive a compact yet highly accurate DNN model, termed dsODENet, by combining recently-proposed parameter reduction techniques: Neural ODE (Ordinary Differential Equation) and DSC (Depthwise Separable Convolution). Neural ODE exploits a similarity between ResNet and ODE, and shares most of the weight parameters among multiple layers, which greatly reduces memory consumption. We apply dsODENet to domain adaptation as a practical use case with image classification datasets. We also propose a resource-efficient FPGA-based design for dsODENet, where all the parameters and feature maps except for the pre- and post-processing layers can be mapped onto on-chip memories. It is implemented on a Xilinx ZCU104 board and evaluated in terms of domain adaptation accuracy, inference speed, FPGA resource utilization, and speedup rate compared to a software counterpart. The results demonstrate that dsODENet achieves comparable or slightly better domain adaptation accuracy than our baseline Neural ODE implementation, while the total parameter size without the pre- and post-processing layers is reduced by 54.2% to 79.8%. Our FPGA implementation accelerates the inference speed by 23.8 times.
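
    The parameter saving is easy to see in code: one depthwise-separable block is reused for every ODE solver step instead of instantiating a distinct block per layer. A minimal PyTorch sketch with a forward-Euler solver (layer sizes and step count are illustrative, not the paper's configuration):

    ```python
    import torch
    import torch.nn as nn

    class DSConv(nn.Module):
        # Depthwise separable convolution: depthwise 3x3 then pointwise 1x1.
        def __init__(self, ch):
            super().__init__()
            self.dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)
            self.pw = nn.Conv2d(ch, ch, 1)

        def forward(self, x):
            return self.pw(self.dw(x))

    class ODEBlock(nn.Module):
        # One set of DSC weights is reused for all solver steps, which is
        # where the parameter reduction over an N-block ResNet comes from.
        def __init__(self, ch, steps=4):
            super().__init__()
            self.f = nn.Sequential(DSConv(ch), nn.ReLU(), DSConv(ch))
            self.steps = steps

        def forward(self, x):
            h = 1.0 / self.steps
            for _ in range(self.steps):       # forward Euler: dx/dt = f(x)
                x = x + h * self.f(x)
            return x
    ```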

  • A Lightweight Reinforcement Learning Based Packet Routing Method Using Online Sequential Learning

    Kenji NEMOTO  Hiroki MATSUTANI  

     
    PAPER-Computer System

      Publicized: 2023/08/15
      Vol: E106-D No:11
      Page(s): 1796-1807

    Existing simple routing protocols (e.g., OSPF, RIP) have the disadvantages of being inflexible and prone to congestion due to the concentration of packets on particular routers. To address these issues, packet routing methods using machine learning have been proposed recently. Compared to the simple protocols, machine learning based methods can choose routing paths intelligently by learning efficient routes, but they come with a training time overhead. We thus focus on a lightweight machine learning algorithm, OS-ELM (Online Sequential Extreme Learning Machine), to reduce the training time. Although previous work on reinforcement learning using OS-ELM exists, it suffers from low learning accuracy. In this paper, we propose OS-ELM QN (Q-Network) with a prioritized experience replay buffer to improve the learning performance. It is compared to a deep reinforcement learning based packet routing method using a network simulator. Experimental results show that introducing the experience replay buffer improves the learning performance, and OS-ELM QN achieves a 2.33 times speedup over a DQN (Deep Q-Network) in terms of learning speed. Regarding packet transfer latency, OS-ELM QN is comparable or slightly inferior to the DQN, while both are better than OSPF in most cases since they can distribute congestion.
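
    A minimal sketch of the two ingredients, assuming a standard per-sample OS-ELM update and sampling proportional to TD error (the simulator integration and exploration policy are omitted):

    ```python
    import numpy as np

    class OSELMQ:
        # OS-ELM used as a Q-network: only the output weights beta are
        # trained, with a per-sample recursive least-squares style update.
        def __init__(self, n_in, n_hidden, n_actions, seed=0):
            rng = np.random.default_rng(seed)
            self.Win = rng.uniform(-1, 1, (n_in, n_hidden))
            self.b = rng.uniform(-1, 1, n_hidden)
            self.P = np.eye(n_hidden) * 10.0
            self.beta = np.zeros((n_hidden, n_actions))

        def q(self, x):
            return np.tanh(x @ self.Win + self.b) @ self.beta

        def update(self, x, target):
            h = np.tanh(x @ self.Win + self.b)
            Ph = self.P @ h
            self.P -= np.outer(Ph, Ph) / (1.0 + h @ Ph)   # Sherman-Morrison
            self.beta += np.outer(self.P @ h, target - h @ self.beta)

    class PrioritizedReplay:
        # Entries are sampled with probability proportional to their TD error.
        def __init__(self, size):
            self.size, self.data, self.prio = size, [], []

        def push(self, item, td_error):
            self.data = (self.data + [item])[-self.size:]
            self.prio = (self.prio + [abs(td_error) + 1e-6])[-self.size:]

        def sample(self, rng):
            p = np.array(self.prio)
            return self.data[rng.choice(len(self.data), p=p / p.sum())]
    ```

    A Q-learning step would form target = q(s).copy() with target[a] = r + γ·max q(s′), push the transition together with its TD error, and call update() on replayed samples.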

  • A Link Removal Methodology for Application-Specific Networks-on-Chip on FPGAs

    Daihan WANG  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-VLSI Systems

      Vol: E92-D No:4
      Page(s): 575-583

    The regular 2-D mesh topology has been utilized for most Network-on-Chips (NoCs) on FPGAs. The spatially biased traffic generated by some applications makes a customization method that removes links effective, since some links have low utilization. In this paper, a link removal strategy that customizes the NoC router is proposed for reconfigurable systems in order to minimize the required hardware amount. Based on pre-analyzed traffic information, links that carry little traffic are removed to reduce the hardware cost while maintaining adequate performance. Two policies are proposed to avoid deadlocks, and they outperform up*/down* routing, a representative deadlock-free routing algorithm for irregular topologies. In the case of the image recognition application susan, the proposed method saves 30% of the hardware amount without performance degradation.
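
    One plausible skeleton of such a strategy, ignoring the paper's deadlock-avoidance policies for brevity: remove the least-used links first and reject any removal that would disconnect the network. The traffic and budget representations are assumptions:

    ```python
    from collections import defaultdict

    def connected(nodes, links):
        # BFS connectivity check over the remaining bidirectional links.
        adj = defaultdict(set)
        for a, b in links:
            adj[a].add(b)
            adj[b].add(a)
        seen, stack = set(), [next(iter(nodes))]
        while stack:
            n = stack.pop()
            if n not in seen:
                seen.add(n)
                stack.extend(adj[n] - seen)
        return seen == set(nodes)

    def remove_links(nodes, links, traffic, n_remove):
        # Drop the links with the smallest pre-analyzed traffic first,
        # skipping any removal that would disconnect the NoC.
        kept, removed = set(links), 0
        for link in sorted(links, key=lambda l: traffic[l]):
            if removed >= n_remove:
                break
            if connected(nodes, kept - {link}):
                kept.discard(link)
                removed += 1
        return kept
    ```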

  • A Generalized Theory Based on the Turn Model for Deadlock-Free Irregular Networks

    Ryuta KAWANO  Ryota YASUDO  Hiroki MATSUTANI  Michihiro KOIBUCHI  Hideharu AMANO  

     
    PAPER-Computer System

      Publicized: 2019/10/08
      Vol: E103-D No:1
      Page(s): 101-110

    Recently proposed irregular networks can reduce the latency of both on-chip and off-chip systems with a large number of computing nodes and thus improve the performance of parallel applications. However, these networks usually suffer from deadlocks when packets are routed with a naive minimal-path routing algorithm. To solve this problem, we focus on a recently proposed theory that generalizes the turn model to maintain network performance while ensuring deadlock-freedom. Applying these theorems to arbitrary topologies, including fully irregular networks, has remained a challenge. In this paper, we advance the theorems to completely general ones. Moreover, we provide a feasible implementation of a deadlock-free routing method based on our advanced theorem. Experimental results show that the routing method based on our proposed theorem can improve the network throughput by up to 138% compared to a conventional deterministic minimal routing method. Moreover, when utilized as the escape path in Duato's protocol, it can improve the throughput by up to 26.3% compared with conventional up*/down* routing.
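
    The underlying check that turn-model-style theorems rely on is mechanical: a routing function is deadlock-free if the channel dependency graph induced by its permitted turns is acyclic. A generic sketch of that check:

    ```python
    def is_deadlock_free(channels, turns, prohibited):
        # Deadlock-free if the channel dependency graph induced by the
        # permitted turns (channel-in -> channel-out pairs) is acyclic.
        adj = {c: [] for c in channels}
        for cin, cout in turns:
            if (cin, cout) not in prohibited:
                adj[cin].append(cout)
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {c: WHITE for c in channels}

        def has_cycle(u):
            color[u] = GRAY
            for v in adj[u]:
                if color[v] == GRAY or (color[v] == WHITE and has_cycle(v)):
                    return True    # back edge found: dependency cycle
            color[u] = BLACK
            return False

        return not any(has_cycle(c) for c in channels if color[c] == WHITE)
    ```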

  • Multi-Voltage Variable Pipeline Routers with the Same Clock Frequency for Low-Power Network-on-Chips Systems

    Akram BEN AHMED  Hiroki MATSUTANI  Michihiro KOIBUCHI  Kimiyoshi USAMI  Hideharu AMANO  

     
    PAPER

      Vol: E99-C No:8
      Page(s): 909-917

    In this paper, the multi-voltage (multi-Vdd) variable pipeline router is proposed to reduce the power consumption of Networks-on-Chip (NoCs) designed for Chip Multi-Processors (CMPs). The multi-Vdd variable pipeline router adjusts its pipeline depth (i.e., communication latency) and supply voltage level in response to the applied workload. Unlike Dynamic Voltage and Frequency Scaling (DVFS) routers, the operating frequency remains the same for all routers throughout the CMP, omitting the need to synchronize neighboring routers working at different frequencies. Two router architectures are presented: a Coarse-Grained Variable Pipeline (CG-VP) router that changes the voltage supplied to the entire router, and a Fine-Grained Variable Pipeline (FG-VP) router that uses a finer power partitioning. The evaluation results show that the CG-VP and FG-VP routers achieve 22.9% and 35.3% power reductions on average with 14% and 23% area overheads, respectively, in comparison with a baseline router without variable pipelines. Thanks to the adopted look-ahead mechanism for switching the supply voltage, the performance overhead is only 4.4%.
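
    The control policy can be imagined as a simple hysteresis on observed traffic, switching between a fast high-Vdd short-pipeline mode and a slow low-Vdd long-pipeline mode at a fixed clock frequency. The sketch below is purely illustrative; the thresholds, mode names, and window metric are assumptions, not the paper's mechanism:

    ```python
    def select_mode(flits_in_window, high_th, low_th, current):
        # Look-ahead style control: enter the fast mode before congestion
        # builds up and fall back when traffic is light; the frequency is
        # unchanged, only Vdd and pipeline depth switch.
        if flits_in_window > high_th:
            return "HIGH_VDD_SHORT_PIPE"
        if flits_in_window < low_th:
            return "LOW_VDD_LONG_PIPE"
        return current    # hysteresis band: keep the current mode
    ```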

  • A Sequential Approach to Detect Drifts and Retrain Neural Networks on Resource-Limited Edge Devices Open Access

    Kazuki SUNAGA  Takeya YAMADA  Hiroki MATSUTANI  

     
    PAPER-Software System

      Publicized: 2024/02/09
      Vol: E107-D No:6
      Page(s): 741-750

    A practical issue of edge AI systems is that the data distributions of the training dataset and the deployed environment may differ due to noise and environmental changes over time. Such a phenomenon is known as concept drift, and this gap degrades the performance of edge AI systems and may introduce system failures. A practical way to address this gap is to retrain neural network models when concept drift is detected. However, since available compute resources are strictly limited on edge devices, in this paper we propose a fully sequential concept drift detection method that cooperates with an on-device sequential learning technique for neural networks. In this setting, both the neural network retraining and the proposed concept drift detection are performed only by sequential computation, to reduce computation cost and memory utilization. We use three datasets for experiments and compare the proposed approach with existing batch-based detection methods, as well as with a DNN-based approach without concept drift detection. The evaluation results show that the proposed method is capable of detecting each of the four concept drift types. They also show that, while accuracy decreases by up to 0.9% compared to the existing batch-based detection methods, our method reduces the memory size by 88.9%-96.4% and the execution time by 45.0%-87.6%. As a result, the combination of neural network retraining and the proposed concept drift detection method is demonstrated on a Raspberry Pi Pico, which has 264 kB of memory.
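
    In the fully sequential spirit of the paper (though not its exact detector), a drift check can be kept at O(1) time and memory per sample, e.g., by tracking an exponentially weighted mean and variance of the model's prediction error and flagging large deviations; all constants below are illustrative:

    ```python
    class SequentialDriftDetector:
        # O(1) time and memory per sample: track an exponentially weighted
        # mean/variance of the prediction error and flag large deviations.
        def __init__(self, alpha=0.01, k=3.0, warmup=30):
            self.alpha, self.k, self.warmup = alpha, k, warmup
            self.mu, self.var, self.n = 0.0, 1.0, 0

        def update(self, err):
            self.n += 1
            drift = (self.n > self.warmup
                     and abs(err - self.mu) > self.k * self.var ** 0.5)
            self.mu += self.alpha * (err - self.mu)
            self.var += self.alpha * ((err - self.mu) ** 2 - self.var)
            return drift    # True would trigger sequential retraining
    ```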

  • Federated Learning of Neural ODE Models with Different Iteration Counts Open Access

    Yuto HOSHINO  Hiroki KAWAKAMI  Hiroki MATSUTANI  

     
    PAPER-Artificial Intelligence, Data Mining

      Publicized: 2024/02/09
      Vol: E107-D No:6
      Page(s): 781-791

    Federated learning is a distributed machine learning approach in which clients train models locally with their own data and upload them to a server, so that the trained results are shared among them without uploading the raw data to the server. There are some challenges in federated learning, such as communication size reduction and client heterogeneity. The former mitigates communication overheads, and the latter allows clients to choose models appropriate to their available compute resources. To address these challenges, in this paper we utilize Neural ODE based models for federated learning. The proposed flexible federated learning approach can reduce the communication size while aggregating models with different iteration counts or depths. Our contribution is that we experimentally demonstrate that the proposed federated learning can aggregate models with different iteration counts or depths, and we compare it with a different federated learning approach in terms of accuracy. Furthermore, we show that our approach can reduce the communication size by up to 89.4% compared with a baseline ResNet model using the CIFAR-10 dataset.
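
    The aggregation itself reduces to plain FedAvg once one notes that a Neural ODE block reuses a single weight set no matter how many solver iterations a client unrolls, so clients of different "depths" expose identically shaped parameters. A minimal sketch (weighting by client data size is an assumption):

    ```python
    def fed_avg(client_params, client_sizes):
        # A Neural ODE block reuses one weight set regardless of how many
        # solver iterations a client unrolls, so clients with different
        # "depths" expose identically shaped parameters and FedAvg applies.
        total = sum(client_sizes)
        return {name: sum(p[name] * (s / total)
                          for p, s in zip(client_params, client_sizes))
                for name in client_params[0]}
    ```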

  • An Overflow/Underflow-Free Fixed-Point Bit-Width Optimization Method for OS-ELM Digital Circuit Open Access

    Mineto TSUKADA  Hiroki MATSUTANI  

     
    PAPER

      Publicized: 2021/09/17
      Vol: E105-A No:3
      Page(s): 437-447

    There has been increasing demand for real-time training on resource-limited IoT devices such as smart sensors, which realizes standalone online adaptation for streaming data without data transfers to remote servers. OS-ELM (Online Sequential Extreme Learning Machine) has been one of the promising neural-network-based online algorithms for on-chip learning, because it can perform online training at low computational cost and is easy to implement as a digital circuit. Existing OS-ELM digital circuits employ a fixed-point data format, and the bit-widths are often manually tuned; however, this may cause overflow or underflow, which can lead to unexpected behavior of the circuit. For on-chip learning systems, an overflow/underflow-free design has a great impact, since online training is performed continuously and the ranges of intermediate variables change dynamically over time. In this paper, we propose an overflow/underflow-free bit-width optimization method for fixed-point digital circuits of OS-ELM. Experimental results show that our method realizes overflow/underflow-free OS-ELM digital circuits with 1.0x-1.5x the area cost of the baseline simulation method, where overflow or underflow can happen.
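
    One way to obtain overflow-free widths, shown below as a simplified sketch rather than the paper's method, is interval arithmetic: propagate worst-case value ranges through each operator and size the integer part to cover the resulting bounds.

    ```python
    import math

    def interval_dot(a_lo, a_hi, b_lo, b_hi, n):
        # Worst-case range of a length-n dot product with elementwise
        # bounds a in [a_lo, a_hi] and b in [b_lo, b_hi].
        corners = [a_lo * b_lo, a_lo * b_hi, a_hi * b_lo, a_hi * b_hi]
        return n * min(corners), n * max(corners)

    def bits_for_range(lo, hi, frac_bits):
        # Smallest signed fixed-point width that holds [lo, hi] with
        # frac_bits fractional bits: sign bit + magnitude bits in LSB units,
        # i.e. the variable is overflow-free by construction.
        m = max(abs(lo), abs(hi)) * (1 << frac_bits)   # magnitude in LSBs
        return 1 + max(1, math.ceil(math.log2(m + 1)))
    ```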

  • A Hardware-Based Caching System on FPGA NIC for Blockchain

    Yuma SAKAKIBARA  Shin MORISHIMA  Kohei NAKAMURA  Hiroki MATSUTANI  

     
    PAPER-Computer System

      Publicized: 2018/02/02
      Vol: E101-D No:5
      Page(s): 1350-1360

    Engineers and researchers have recently paid attention to Blockchain, a fault-tolerant distributed ledger that operates without administrators. Blockchain originally derives from cryptocurrency, but it can be applied to other industries. A transfer of a digital asset is called a transaction. Blockchain holds all transactions, so the total amount of Blockchain data increases as time proceeds. On the other hand, the number of Internet of Things (IoT) products has been increasing. It is difficult for IoT products to hold all Blockchain data because of their limited storage capacity, so they access Blockchain data via servers that hold it. However, if many IoT products access the Blockchain network via servers, the servers become overloaded; it is therefore useful to reduce their workloads and improve throughput. In this paper, we propose a caching technique using an FPGA-based (Field Programmable Gate Array) Network Interface Card (NIC) that possesses four 10 Gigabit Ethernet (10GbE) interfaces. The proposed system reduces server overloads because, on a cache hit, the FPGA NIC responds to requests from IoT products instead of the server. We implemented the proposed hardware cache on a NetFPGA-10G board to achieve high throughput. As an evaluation, we counted the number of requests processed by the server and by the FPGA NIC. As a result, throughput improved by 1.97 times on average on cache hits.
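
    Functionally, the NIC-side cache behaves like the sketch below: answer a request directly on a hit and fall through to the server otherwise. The eviction policy and key-value interface are illustrative assumptions:

    ```python
    class NicCache:
        # Software model of the NIC-side cache: serve a request from the
        # FPGA on a hit, otherwise fall through to the host server.
        def __init__(self, capacity):
            self.capacity, self.store = capacity, {}

        def handle(self, key, fetch_from_server):
            if key in self.store:
                return self.store[key], True              # served by the NIC
            value = fetch_from_server(key)                # served by the host
            if len(self.store) >= self.capacity:
                self.store.pop(next(iter(self.store)))    # FIFO eviction
            self.store[key] = value
            return value, False
    ```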

  • Proxy Responses by FPGA-Based Switch for MapReduce Stragglers

    Koya MITSUZUKA  Michihiro KOIBUCHI  Hideharu AMANO  Hiroki MATSUTANI  

     
    PAPER-Computer System

      Publicized: 2018/06/15
      Vol: E101-D No:9
      Page(s): 2258-2268

    In parallel processing applications, a few worker nodes called “stragglers”, which execute their tasks significantly more slowly than the others, increase the execution time of the whole job. In this paper, we propose a network-switch-based straggler handling system to mitigate the burden on the compute nodes. We also propose how to offload straggler detection and the computation of their results to the network switch with no additional communication between worker nodes. We introduce approximation techniques for the proxy computation and response at the switch; our switch is thus called “ApproxSW.” Simulation results show that the proposed approximation based on task similarity achieves the best accuracy in terms of the quality of the generated Map outputs. We also analyze how to suppress unnecessary proxy computation by the ApproxSW. We implemented ApproxSW on a NetFPGA-SUME board that has four 10Gbit Ethernet (10GbE) interfaces and a Virtex-7 FPGA. Experimental results show that the ApproxSW functions do not degrade the original 10GbE switch performance.
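
    The two offloaded functions can be sketched as follows: flag a running task as a straggler when it far exceeds the median completion time of finished tasks, and approximate its Map output with that of the most similar finished task. The threshold factor and the similarity callback are illustrative assumptions:

    ```python
    def find_stragglers(start, done, now, factor=2.0):
        # A running task is flagged when it has taken much longer than the
        # median completion time of the finished tasks.
        finished = sorted(done[t] - start[t] for t in done)
        if not finished:
            return []
        median = finished[len(finished) // 2]
        return [t for t in start
                if t not in done and now - start[t] > factor * median]

    def proxy_response(task, finished_outputs, similarity):
        # Approximate the straggler's Map output with the output of the
        # most similar finished task (the task-similarity approximation).
        best = max(finished_outputs, key=lambda t: similarity(task, t))
        return finished_outputs[best]
    ```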

  • Vertical Link On/Off Regulations for Inductive-Coupling Based Wireless 3-D NoCs

    Hao ZHANG  Hiroki MATSUTANI  Yasuhiro TAKE  Tadahiro KURODA  Hideharu AMANO  

     
    PAPER-Computer System

      Vol: E96-D No:12
      Page(s): 2753-2764

    We propose low-power techniques for wireless three-dimensional Networks-on-Chip (wireless 3-D NoCs), in which routers on the same chip are connected with wires while routers on different chips are connected wirelessly using inductive coupling. The proposed techniques stop the clock and power supplies to the transmitters of the wireless vertical links when their utilization is higher than a threshold, while the whole wireless vertical link is shut down when the utilization is lower than the threshold, in order to reduce the power consumption of wireless 3-D NoCs. This paper uses an on-demand method in which the dormant data transmitter or the whole vertical link is reactivated as soon as a flit arrives. Full-system many-core simulations using power parameters derived from a real chip implementation show that the proposed low-power techniques reduce the power consumption by 23.4%-29.3%, while the performance overhead is less than 2.4%.
