A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM

Yibo FAN; Leilei HUANG; Kewei CHEN; Xiaoyang ZENG

doi:10.1587/transele.2019ECP5008

Summary:

Neural networks have been among the most useful techniques for speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural network (RNN), has been widely implemented on CPUs and GPUs. However, those software implementations offer poor parallelism, while existing hardware implementations lack configurability. To fill this gap, this paper proposes a highly configurable 7.62 GOP/s hardware implementation for LSTM. To achieve this goal, the workflow is arranged to make the design compact and high-throughput; the structure is organized to make the design configurable; the data buffering and compression strategy is chosen to lower the bandwidth without increasing structural complexity; and the data type, the logistic sigmoid (σ) function and the hyperbolic tangent (tanh) function are optimized to balance hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on an XCZU6EG FPGA while using only 3K look-up tables (LUTs). Compared with an implementation on an Intel Xeon E5-2620 CPU @ 2.10 GHz, it achieves about a 90× speedup for small networks and a 25× speedup for large ones. Its resource consumption is also much lower than that of state-of-the-art works.
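
As a rough consistency check on the headline figures (an inference from the numbers in the summary, not a statement taken from the paper), dividing the reported throughput by the clock frequency gives the average number of operations completed per cycle:

```latex
\[
\frac{7.62 \times 10^{9}\ \text{op/s}}{238 \times 10^{6}\ \text{cycle/s}} \approx 32\ \text{op/cycle}
\]
```

If a multiply-accumulate is counted as two operations, which is a common convention but only an assumption here, this would correspond to about 16 multiply-accumulates per cycle.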

Publication: IEICE TRANSACTIONS on Electronics Vol.E103-C No.5 pp.263-273

Publication Date: 2020/05/01

Publicized: 2019/11/27

Online ISSN: 1745-1353

DOI: 10.1587/transele.2019ECP5008

Type of Manuscript: PAPER

Category: Integrated Electronics

Authors

Yibo FAN
  Fudan University
Leilei HUANG
  Fudan University
Kewei CHEN
  Fudan University
Xiaoyang ZENG
  Fudan University

Keywords

Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), hardware implementation

Cite this

Yibo FAN, Leilei HUANG, Kewei CHEN, Xiaoyang ZENG, "A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM" in IEICE TRANSACTIONS on Electronics, vol. E103-C, no. 5, pp. 263-273, May 2020, doi: 10.1587/transele.2019ECP5008.
Abstract: The neural network has been one of the most useful techniques in the area of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural networks (RNNs), has been widely implemented on CPUs and GPUs. However, those software implementations offer a poor parallelism while the existing hardware implementations lack in configurability. In order to make up for this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve the goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of structure; the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function is carefully optimized to balance the hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on XCZU6EG FPGA, which takes only 3K look-up table (LUT). Compared with the implementation on Intel Xeon E5-2620 CPU @ 2.10GHz, this work achieves about 90× speedup for small networks and 25× speed-up for large ones. The consumption of resources is also much less than that of the state-of-the-art works.
URL: https://global.ieice.org/en_transactions/electronics/10.1587/transele.2019ECP5008/_p

@ARTICLE{e103-c_5_263,
author={Yibo FAN and Leilei HUANG and Kewei CHEN and Xiaoyang ZENG},
journal={IEICE TRANSACTIONS on Electronics},
title={A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM},
year={2020},
volume={E103-C},
number={5},
pages={263--273},
abstract={The neural network has been one of the most useful techniques in the area of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural networks (RNNs), has been widely implemented on CPUs and GPUs. However, those software implementations offer a poor parallelism while the existing hardware implementations lack in configurability. In order to make up for this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve the goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of structure; the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function is carefully optimized to balance the hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on XCZU6EG FPGA, which takes only 3K look-up table (LUT). Compared with the implementation on Intel Xeon E5-2620 CPU @ 2.10GHz, this work achieves about 90× speedup for small networks and 25× speed-up for large ones. The consumption of resources is also much less than that of the state-of-the-art works.},
keywords={Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), hardware implementation},
doi={10.1587/transele.2019ECP5008},
ISSN={1745-1353},
month={May},
}

TY - JOUR
TI - A Highly Configurable 7.62GOP/s Hardware Implementation for LSTM
T2 - IEICE TRANSACTIONS on Electronics
SP - 263
EP - 273
AU - Yibo FAN
AU - Leilei HUANG
AU - Kewei CHEN
AU - Xiaoyang ZENG
PY - 2020
DO - 10.1587/transele.2019ECP5008
JO - IEICE TRANSACTIONS on Electronics
SN - 1745-1353
VL - E103-C
IS - 5
JA - IEICE TRANSACTIONS on Electronics
Y1 - May 2020
AB - The neural network has been one of the most useful techniques in the area of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural networks (RNNs), has been widely implemented on CPUs and GPUs. However, those software implementations offer a poor parallelism while the existing hardware implementations lack in configurability. In order to make up for this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve the goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of structure; the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function is carefully optimized to balance the hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on XCZU6EG FPGA, which takes only 3K look-up table (LUT). Compared with the implementation on Intel Xeon E5-2620 CPU @ 2.10GHz, this work achieves about 90× speedup for small networks and 25× speed-up for large ones. The consumption of resources is also much less than that of the state-of-the-art works.
ER -