Xincheng CAO Bin YAO Binqiang CHEN Wangpeng HE Suqin GUO Kun CHEN
Tool condition monitoring is one of the core tasks of intelligent manufacturing in digital workshop. This paper presents an intelligent recognize method of tool condition based on deep learning. First, the industrial microphone is used to collect the acoustic signal during machining; then, a central fractal decomposition algorithm is proposed to extract sensitive information; finally, the multi-scale convolutional recurrent neural network is used for deep feature extraction and pattern recognition. The multi-process milling experiments proved that the proposed method is superior to the existing methods, and the recognition accuracy reached 88%.
Manaya TOMIOKA Tsuneo KATO Akihiro TAMURA
A neural conversational model (NCM) based on an encoder-decoder recurrent neural network (RNN) with an attention mechanism learns different sequence-to-sequence mappings from what neural machine translation (NMT) learns even when based on the same technique. In the NCM, we confirmed that target-word-to-source-word mappings captured by the attention mechanism are not as clear and stationary as those for NMT. Considering that vector norms indicate a magnitude of information in the processing, we analyzed the inner workings of an encoder-decoder GRU-based NCM focusing on the norms of word embedding vectors and hidden vectors. First, we conducted correlation analyses on the norms of word embedding vectors with frequencies in the training set and with conditional entropies of a bi-gram language model to understand what is correlated with the norms in the encoder and decoder. Second, we conducted correlation analyses on norms of change in the hidden vector of the recurrent layer with their input vectors for the encoder and decoder, respectively. These analyses were done to understand how the magnitude of information propagates through the network. The analytical results suggested that the norms of the word embedding vectors are associated with their semantic information in the encoder, while those are associated with the predictability as a language model in the decoder. The analytical results further revealed how the norms propagate through the recurrent layer in the encoder and decoder.
This letter studies a biobjective optimization problem in binary associative memories characterized by ternary connection parameters. First, we introduce a condition of parameters that guarantees storage of any desired memories and suppression of oscillatory behavior. Second, we define a biobjective problem based on two objectives that evaluate uniform stability of desired memories and sparsity of connection parameters. Performing precise numerical analysis for typical examples, we have clarified existence of a trade-off between the two objectives.
Bima PRIHASTO Tzu-Chiang TAI Pao-Chi CHANG Jia-Ching WANG
The recurrent neural network (RNN) has been used in audio and speech processing, such as language translation and speech recognition. Although RNN-based architecture can be applied to speech synthesis, the long computing time is still the primary concern. This research proposes a fast gated recurrent neural network, a fast RNN-based architecture, for speech synthesis based on the minimal gated unit (MGU). Our architecture removes the unit state history from some equations in MGU. Our MGU-based architecture is about twice faster, with equally good sound quality than the other MGU-based architectures.
Suresh JAGANATHAN Sathya MADHUSUDHANAN
Online feeds are streamed continuously in batches with varied polarities at varying times. The system handling the online feeds must be trained to classify all the varying polarities occurring dynamically. The polarity classification system designed for the online feeds must address two significant challenges: i) stability-plasticity, ii) category-proliferation. The challenges faced in the polarity classification of online feeds can be addressed using the technique of incremental learning, which serves to learn new classes dynamically and also retains the previously learned knowledge. This paper proposes a new incremental learning methodology, ILOF (Incremental Learning of Online Feeds) to classify the feeds by adopting Deep Learning Techniques such as RNN (Recurrent Neural Networks) and LSTM (Long Short Term Memory) and also ELM (Extreme Learning Machine) for addressing the above stated problems. The proposed method creates a separate model for each batch using ELM and incrementally learns from the trained batches. The training of each batch avoids the retraining of old feeds, thus saving training time and memory space. The trained feeds can be discarded when new batch of feeds arrives. Experiments are carried out using the standard datasets comprising of long feeds (IMDB, Sentiment140) and short feeds (Twitter, WhatsApp, and Twitter airline sentiment) and the proposed method showed positive results in terms of better performance and accuracy.
Noriyuki TONAMI Keisuke IMOTO Ryosuke YAMANISHI Yoichi YAMASHITA
Sound event detection (SED) and acoustic scene classification (ASC) are important research topics in environmental sound analysis. Many research groups have addressed SED and ASC using neural-network-based methods, such as the convolutional neural network (CNN), recurrent neural network (RNN), and convolutional recurrent neural network (CRNN). The conventional methods address SED and ASC separately even though sound events and acoustic scenes are closely related to each other. For example, in the acoustic scene “office,” the sound events “mouse clicking” and “keyboard typing” are likely to occur. Therefore, it is expected that information on sound events and acoustic scenes will be of mutual aid for SED and ASC. In this paper, we propose multitask learning for joint analysis of sound events and acoustic scenes, in which the parts of the networks holding information on sound events and acoustic scenes in common are shared. Experimental results obtained using the TUT Sound Events 2016/2017 and TUT Acoustic Scenes 2016 datasets indicate that the proposed method improves the performance of SED and ASC by 1.31 and 1.80 percentage points in terms of the F-score, respectively, compared with the conventional CRNN-based method.
Recurrent neural networks (RNNs) have been proven effective for sequence-based tasks thanks to their capability to process temporal information. In real-world systems, deep RNNs are more widely used to solve complicated tasks such as large-scale speech recognition and machine translation. However, the implementation of deep RNNs on traditional hardware platforms is inefficient due to long-range temporal dependence and irregular computation patterns within RNNs. This inefficiency manifests itself in the proportional increase in the latency of RNN inference with respect to the number of layers of deep RNNs on CPUs and GPUs. Previous work has focused mostly on optimizing and accelerating individual RNN cells. To make deep RNN inference fast and efficient, we propose an accelerator based on a multi-FPGA platform called Flow-in-Cloud (FiC). In this work, we show that the parallelism provided by the multi-FPGA system can be taken advantage of to scale up the inference of deep RNNs, by partitioning a large model onto several FPGAs, so that the latency stays close to constant with respect to increasing number of RNN layers. For single-layer and four-layer RNNs, our implementation achieves 31x and 61x speedup compared with an Intel CPU.
A limited number of types of sound event occur in an acoustic scene and some sound events tend to co-occur in the scene; for example, the sound events “dishes” and “glass jingling” are likely to co-occur in the acoustic scene “cooking.” In this paper, we propose a method of sound event detection using graph Laplacian regularization with sound event co-occurrence taken into account. In the proposed method, the occurrences of sound events are expressed as a graph whose nodes indicate the frequencies of event occurrence and whose edges indicate the sound event co-occurrences. This graph representation is then utilized for the model training of sound event detection, which is optimized under an objective function with a regularization term considering the graph structure of sound event occurrence and co-occurrence. Evaluation experiments using the TUT Sound Events 2016 and 2017 detasets, and the TUT Acoustic Scenes 2016 dataset show that the proposed method improves the performance of sound event detection by 7.9 percentage points compared with the conventional CNN-BiGRU-based detection method in terms of the segment-based F1 score. In particular, the experimental results indicate that the proposed method enables the detection of co-occurring sound events more accurately than the conventional method.
Yibo FAN Leilei HUANG Kewei CHEN Xiaoyang ZENG
The neural network has been one of the most useful techniques in the area of speech recognition, language translation and image analysis in recent years. Long Short-Term Memory (LSTM), a popular type of recurrent neural networks (RNNs), has been widely implemented on CPUs and GPUs. However, those software implementations offer a poor parallelism while the existing hardware implementations lack in configurability. In order to make up for this gap, a highly configurable 7.62 GOP/s hardware implementation for LSTM is proposed in this paper. To achieve the goal, the work flow is carefully arranged to make the design compact and high-throughput; the structure is carefully organized to make the design configurable; the data buffering and compression strategy is carefully chosen to lower the bandwidth without increasing the complexity of structure; the data type, logistic sigmoid (σ) function and hyperbolic tangent (tanh) function is carefully optimized to balance the hardware cost and accuracy. This work achieves a performance of 7.62 GOP/s @ 238 MHz on XCZU6EG FPGA, which takes only 3K look-up table (LUT). Compared with the implementation on Intel Xeon E5-2620 CPU @ 2.10GHz, this work achieves about 90× speedup for small networks and 25× speed-up for large ones. The consumption of resources is also much less than that of the state-of-the-art works.
Tachanun KANGWANTRAKOOL Kobkrit VIRIYAYUDHAKORN Thanaruk THEERAMUNKONG
Most existing methods of effort estimations in software development are manual, labor-intensive and subjective, resulting in overestimation with bidding fail, and underestimation with money loss. This paper investigates effectiveness of sequence models on estimating development effort, in the form of man-months, from software project data. Four architectures; (1) Average word-vector with Multi-layer Perceptron (MLP), (2) Average word-vector with Support Vector Regression (SVR), (3) Gated Recurrent Unit (GRU) sequence model, and (4) Long short-term memory (LSTM) sequence model are compared in terms of man-months difference. The approach is evaluated using two datasets; ISEM (1,573 English software project descriptions) and ISBSG (9,100 software projects data), where the former is a raw text and the latter is a structured data table explained the characteristic of a software project. The LSTM sequence model achieves the lowest and the second lowest mean absolute errors, which are 0.705 and 14.077 man-months for ISEM and ISBSG datasets respectively. The MLP model achieves the lowest mean absolute errors which is 14.069 for ISBSG datasets.
Rui CHEN Ying TONG Ruiyu LIANG
Deep neural networks have achieved great success in visual tracking by learning a generic representation and leveraging large amounts of training data to improve performance. Most generic object trackers are trained from scratch online and do not benefit from a large number of videos available for offline training. We present a real-time generic object tracker capable of incorporating temporal information into its model, learning from many examples offline and quickly updating online. During the training process, the pre-trained weight of convolution layer is updated lagging behind, and the input video sequence length is gradually increased for fast convergence. Furthermore, only the hidden states in recurrent network are updated to guarantee the real-time tracking speed. The experimental results show that the proposed tracking method is capable of tracking objects at 150 fps with higher predicting overlap rate, and achieves more robustness in multiple benchmarks than state-of-the-art performance.
Andros TJANDRA Sakriani SAKTI Satoshi NAKAMURA
Recurrent Neural Network (RNN) has achieved many state-of-the-art performances on various complex tasks related to the temporal and sequential data. But most of these RNNs require much computational power and a huge number of parameters for both training and inference stage. Several tensor decomposition methods are included such as CANDECOMP/PARAFAC (CP), Tucker decomposition and Tensor Train (TT) to re-parameterize the Gated Recurrent Unit (GRU) RNN. First, we evaluate all tensor-based RNNs performance on sequence modeling tasks with a various number of parameters. Based on our experiment results, TT-GRU achieved the best results in a various number of parameters compared to other decomposition methods. Later, we evaluate our proposed TT-GRU with speech recognition task. We compressed the bidirectional GRU layers inside DeepSpeech2 architecture. Based on our experiment result, our proposed TT-format GRU are able to preserve the performance while reducing the number of GRU parameters significantly compared to the uncompressed GRU.
Ryo MASUMURA Taichi ASAMI Takanobu OBA Sumitaka SAKAUCHI Akinori ITO
This paper demonstrates latent word recurrent neural network language models (LW-RNN-LMs) for enhancing automatic speech recognition (ASR). LW-RNN-LMs are constructed so as to pick up advantages in both recurrent neural network language models (RNN-LMs) and latent word language models (LW-LMs). The RNN-LMs can capture long-range context information and offer strong performance, and the LW-LMs are robust for out-of-domain tasks based on the latent word space modeling. However, the RNN-LMs cannot explicitly capture hidden relationships behind observed words since a concept of a latent variable space is not present. In addition, the LW-LMs cannot take into account long-range relationships between latent words. Our idea is to combine RNN-LM and LW-LM so as to compensate individual disadvantages. The LW-RNN-LMs can support both a latent variable space modeling as well as LW-LMs and a long-range relationship modeling as well as RNN-LMs at the same time. From the viewpoint of RNN-LMs, LW-RNN-LM can be considered as a soft class RNN-LM with a vast latent variable space. In contrast, from the viewpoint of LW-LMs, LW-RNN-LM can be considered as an LW-LM that uses the RNN structure for latent variable modeling instead of an n-gram structure. This paper also details a parameter inference method and two kinds of implementation methods, an n-gram approximation and a Viterbi approximation, for introducing the LW-LM to ASR. Our experiments show effectiveness of LW-RNN-LMs on a perplexity evaluation for the Penn Treebank corpus and an ASR evaluation for Japanese spontaneous speech tasks.
Yingwei FU Kele XU Haibo MI Qiuqiang KONG Dezhi WANG Huaimin WANG Tie HONG
Sound event detection is intended to identify the sound events in audio recordings, which has widespread applications in real life. Recently, convolutional recurrent neural network (CRNN) models have achieved state-of-the-art performance in this task due to their capabilities in learning the representative features. However, the CRNN models are of high complexities with millions of parameters to be trained, which limits their usage for the mobile and embedded devices with limited computation resource. Model distillation is effective to distill the knowledge of a complex model to a smaller one, which can be deployed on the devices with limited computational power. In this letter, we propose a novel multi model-based distillation approach for sound event detection by making use of the knowledge from models of multiple teachers which are complementary in detecting sound events. Extensive experimental results demonstrated that our approach achieves a compression ratio about 50 times. In addition, better performance is obtained for the sound event detection task.
Se-Min LIM Jooyoung PARK Hyeong-Cheol OH
This study designs a low-cost portable device that functions as a coaching assistant system which can support table tennis practice. Although deep learning technology is a promising solution to realizing human activity recognition, we propose using cosine similarity in making inferences. Our experiments show that the cosine similarity based inference can be a good alternative to the deep learning based inference for the assistant system when resources are limited.
Zhenyu ZHAO Ming ZHU Yiqiang SHENG Jinlin WANG
To solve the low accuracy problem of the recommender system for long term users, in this paper, we propose a top-N-balanced sequential recommendation based on recurrent neural network. We postulated and verified that the interactions between users and items is time-dependent in the long term, but in the short term, it is time-independent. We balance the top-N recommendation and sequential recommendation to generate a better recommender list by improving the loss function and generation method. The experimental results demonstrate the effectiveness of our method. Compared with a state-of-the-art recommender algorithm, our method clearly improves the performance of the recommendation on hit rate. Besides the improvement of the basic performance, our method can also handle the cold start problem and supply new users with the same quality of service as the old users.
A fusion framework between CNN and RNN is proposed dedicatedly for air-writing recognition. By modeling the air-writing using both spatial and temporal features, the proposed network can learn more information than existing techniques. Performance of the proposed network is evaluated by using the alphabet and numeric datasets in the public database namely the 6DMG. Average accuracy of the proposed fusion network outperforms other techniques, i.e. 99.25% and 99.83% are observed in the alphabet gesture and the numeric gesture, respectively. Simplified structure of RNN is also proposed, which can attain about two folds speed-up of ordinary BLSTM network. It is also confirmed that only the distance between consecutive sampling points is enough to attain high recognition performance.
Maoxi LI Qingyu XIANG Zhiming CHEN Mingwen WANG
The-state-of-the-art neural quality estimation (QE) of machine translation model consists of two sub-networks that are tuned separately, a bidirectional recurrent neural network (RNN) encoder-decoder trained for neural machine translation, called the predictor, and an RNN trained for sentence-level QE tasks, called the estimator. We propose to combine the two sub-networks into a whole neural network, called the unified neural network. When training, the bidirectional RNN encoder-decoder are initialized and pre-trained with the bilingual parallel corpus, and then, the networks are trained jointly to minimize the mean absolute error over the QE training samples. Compared with the predictor and estimator approach, the use of a unified neural network helps to train the parameters of the neural networks that are more suitable for the QE task. Experimental results on the benchmark data set of the WMT17 sentence-level QE shared task show that the proposed unified neural network approach consistently outperforms the predictor and estimator approach and significantly outperforms the other baseline QE approaches.
Junfeng SHI Wenming MA Peng SONG
To learn the working situation of rod-pumped wells under ground, we always need to analyze dynamometer diagrams, which are generated by the load sensor and displacement sensor. Rod-pumped wells are usually located in the places with extreme weather, and these sensors are installed on some special oil equipments in the open air. As time goes by, sensors are prone to generating unstable and incorrect data. Unfortunately, load sensors are too expensive to frequently reinstall. Therefore, the resulting dynamometer diagrams sometimes cannot make an accurate diagnosis. Instead, as an absolutely necessary equipment of the rod-pumped well, the electric motor has much longer life and cannot be easily impacted by the weather. The electric power curve during a swabbing period can also reflect the working situation under ground, but is much harder to explain than the dynamometer diagram. This letter presented a novel deep learning architecture, which can transform the electric power curve into the dimensionless dynamometer diagram image. We conduct our experiments on a real-world dataset, and the results show that our method can get an impressive transformation accuracy.
Lei ZHANG Guoxing ZHANG Zhizheng LIANG Qingfu FAN Yadong LI
The traditional Markov prediction methods of the taxi destination rely only on the previous 2 to 3 GPS points. They negelect long-term dependencies within a taxi trajectory. We adopt a Recurrent Neural Network (RNN) to explore the long-term dependencies to predict the taxi destination as the multiple hidden layers of RNN can store these dependencies. However, the hidden layers of RNN are very sensitive to small perturbations to reduce the prediction accuracy when the amount of taxi trajectories is increasing. In order to improve the prediction accuracy of taxi destination and reduce the training time, we embed suprisal-driven zoneout (SDZ) to RNN, hence a taxi destination prediction method by regularized RNN with SDZ (TDPRS). SDZ can not only improve the robustness of TDPRS, but also reduce the training time by adopting partial update of parameters instead of a full update. Experiments with a Porto taxi trajectory data show that TDPRS improves the prediction accuracy by 12% compared to RNN prediction method in literature[4]. At the same time, the prediction time is reduced by 7%.