Yuka KO Katsuhito SUDOH Sakriani SAKTI Satoshi NAKAMURA
End-to-end speech translation (ST) translates source-language speech directly into the target language without the intermediate automatic speech recognition (ASR) output used in a cascade approach, and thus avoids error propagation from intermediate ASR results. Although recent attempts have applied multi-task learning with ASR as an auxiliary task to improve ST performance, they use cross-entropy (CE) loss against one-hot references in the ASR task, so the trained ST models do not consider possible ASR confusion. In this study, we propose a novel multi-task learning framework for end-to-end ST that uses an ASR-based loss against posterior distributions obtained from a pre-trained ASR model, called ASR posterior-based loss (ASR-PBL). The ASR-PBL method, which enables an ST model to reflect possible ASR confusion among competing hypotheses with similar pronunciations, can be applied to a strong multi-task ST baseline model with Hybrid CTC/Attention ASR task loss. In our experiments on the Fisher Spanish-to-English corpus, the proposed method achieved better BLEU results than the baseline that used standard CE loss.
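The difference between a CE loss against one-hot references and a loss against ASR posteriors can be sketched as follows. This is only a minimal illustration of cross-entropy with a soft target distribution, not the authors' implementation; the three-token vocabulary, the logits, and the posterior values are made-up assumptions.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(target_dist, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target_dist, log_probs) if t > 0.0)

# Hypothetical scores of the model being trained over a 3-token vocabulary.
student_logits = [2.0, 1.5, -1.0]

# One-hot reference: only the reference token carries probability mass.
one_hot = [1.0, 0.0, 0.0]

# Assumed posterior from a pre-trained ASR model: a similar-sounding
# competitor (token 1) keeps a sizeable share of the probability mass.
asr_posterior = [0.6, 0.35, 0.05]

hard_loss = cross_entropy(one_hot, student_logits)
soft_loss = cross_entropy(asr_posterior, student_logits)
print(f"one-hot CE: {hard_loss:.4f}  posterior-based CE: {soft_loss:.4f}")
```

With the one-hot target, the loss only rewards probability placed on the reference token; with the posterior target, the loss is smallest when the model's distribution matches the ASR posterior, including the mass on confusable tokens.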
In industry, automatic speech recognition has become a competitive feature for embedded products with limited hardware resources. In this work, we propose a tiny end-to-end speech recognition model that is lightweight and easily deployable on edge platforms. First, instead of sophisticated network structures such as recurrent neural networks and transformers, the proposed model mainly uses convolutional neural networks as its backbone. This ensures that our model is supported by most software development kits for embedded devices. Second, we adopt the basic unit of MobileNet-v3, which performs well in computer vision tasks, and integrate hidden-layer features at different scales, thus compressing the model to fewer than 1 M parameters while achieving an accuracy greater than that of some traditional models. Third, to further reduce CPU computation, we directly extract acoustic representations from 1-dimensional speech waveforms and use a self-supervised learning approach to encourage the convergence of the model. Finally, to cope with relatively weak hardware resources, we use a prefix beam search decoder that dynamically extends the search path with an optimized pruning strategy, together with an additional initialism language model that captures inter-word probabilities in advance and thus avoids premature pruning of correct words. In our experiments, across a number of evaluation categories, our end-to-end model outperformed several tiny speech recognition models for embedded devices from related work.
Peng FAN Xiyao HUA Yi LIN Bo YANG Jianwei ZHANG Wenyi GE Dongyue GUO
In this work, we propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems. The proposed model integrates the feature learning block, recurrent neural network (RNN), and connectionist temporal classification loss to build an end-to-end ASR model. Facing the complex environments of ATC speech, instead of the handcrafted features, a learning block is designed to extract informative features from raw waveforms for acoustic modeling. Both the SincNet and 1D convolution blocks are applied to process the raw waveforms, whose outputs are concatenated to the RNN layers for the temporal modeling. Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner, i.e., from waveform to text. Finally, the multilingual issue in the ATC domain is also considered to achieve the ASR task by constructing a combined vocabulary of Chinese characters and English letters. The proposed approach is validated on a multilingual real-world corpus (ATCSpeech), and the experimental results demonstrate that the proposed approach outperforms other baselines, achieving a 6.9% character error rate.
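The combined vocabulary of Chinese characters and English letters can be pictured with a small sketch. It is an illustration only: the two transcripts are invented, and reserving index 0 for the CTC blank and dropping whitespace are assumptions rather than details given in the abstract.

```python
def build_vocab(transcripts):
    """Build one symbol table covering every character in the transcripts."""
    symbols = sorted({ch for text in transcripts for ch in text if not ch.isspace()})
    vocab = {"<blank>": 0}  # assumption: index 0 is reserved for the CTC blank
    for ch in symbols:
        vocab[ch] = len(vocab)
    return vocab

def encode(text, vocab):
    """Map a transcript to label ids, ignoring whitespace (an assumption)."""
    return [vocab[ch] for ch in text if not ch.isspace()]

# Invented mixed-language transcripts, for illustration only.
transcripts = ["国航 climb", "air china 上升"]
vocab = build_vocab(transcripts)
labels = encode("国航 climb", vocab)
print(len(vocab), labels)
```

Because both scripts share one table, a single CTC output layer can emit Chinese characters and English letters without a separate language-identification step.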
Huaijin DENG Takehito UTSURO Akio KOBAYASHI Hiromitsu NISHIZAKI
There have been many previous studies on fluency evaluation of spontaneous speech. However, most of them focus on lexical cues, and little emphasis is placed on how diverse acoustic features and deep end-to-end models contribute to improving performance. In this paper, we describe a multi-layer neural network that uses not only lexical features extracted from transcriptions but also utterance-level acoustic features extracted from audio data. We also conduct experiments to investigate the performance of end-to-end approaches with mel-spectrogram input in this task. For speech fluency evaluation, we evaluate our proposed method on two binary classification tasks: fluent speech detection and disfluent speech detection. The evaluation uses speech segments of around 10 seconds each, annotated with one of three classes: “fluent,” “neutral,” and “disfluent.” Based on the two-way splits of these three classes, fluent speech detection is defined as binary classification of fluent vs. neutral and disfluent, while disfluent speech detection is defined as binary classification of fluent and neutral vs. disfluent. We then conduct experiments to compare the multi-layer neural network with diverse features against end-to-end models. For fluent speech detection, when comparing utterance-level disfluency-based, prosodic, and acoustic features with the multi-layer neural network, using only disfluency-based and prosodic features works better. More specifically, performance improves considerably when all acoustic features are removed from the full feature set, while it degrades considerably when filler-related features are removed. Overall, however, the end-to-end Transformer+VGGNet model with mel-spectrogram input achieves the best results.
For disfluent speech detection, the multi-layer neural network using disfluency-based, prosodic, and acoustic features without fillers achieves the best results. The end-to-end Transformer+VGGNet architecture also obtains high scores, but it is outperformed by the best multi-layer neural network results by a significant margin. Thus, unlike in fluent speech detection, disfluency-based and prosodic features other than fillers are still necessary in disfluent speech detection.
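The two binary tasks are derived from the same three-class annotation by two different two-way splits. A minimal sketch of that label mapping follows; the function names are hypothetical, and only the three class names come from the abstract.

```python
LABELS = ("fluent", "neutral", "disfluent")

def fluent_detection_target(label):
    """Fluent speech detection: fluent (1) vs. neutral and disfluent (0)."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return 1 if label == "fluent" else 0

def disfluent_detection_target(label):
    """Disfluent speech detection: disfluent (1) vs. fluent and neutral (0)."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return 1 if label == "disfluent" else 0

for label in LABELS:
    print(label, fluent_detection_target(label), disfluent_detection_target(label))
```

Note that "neutral" falls on the negative side in both tasks, so the two detectors are not simply complements of each other.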
Gang JIN Jingsheng ZHAI Jianguo WEI
In this paper, we propose an end-to-end two-branch feature attention network for single image dehazing. The network, which we call CAA-Net, consists of two branches: 1) A U-Net composed of a different-level feature fusion based on attention (FEPA) structure and residual dense blocks (RDBs). To make full use of all the hierarchical features of the image, we use the RDB, which contains densely connected layers and local feature fusion with local residual learning. We also propose a structure called FEPA, which retains the information of the shallow layers and transfers it to the deep layers. FEPA is composed of several feature attention modules (FPAs). An FPA combines local residual learning with a channel attention mechanism and a pixel attention mechanism, and can extract features from different channels and image pixels. 2) A network composed of several FEPA structures at different levels. This network lets the feature weights be learned adaptively from the FPAs and gives more weight to important features. The final output of CAA-Net is the combination of the prediction results of all branches. Experimental results show that the proposed CAA-Net surpasses previous state-of-the-art algorithms for single image dehazing.
Ya ZENG Li WAN Qiuhong LUO Mao CHEN
Traditional pipeline methods for task-oriented dialogue systems require each component to be designed individually, which is expensive. Existing memory-augmented end-to-end methods directly map inputs to outputs and achieve promising results. However, most existing end-to-end solutions store the dialogue history and knowledge base (KB) information in the same memory and represent KB information in the form of KB triples. This makes the memory reader's reasoning over the memory more difficult, which in turn makes it difficult for the system to retrieve the correct information from the memory to generate a response. Some methods introduce many manual annotations to strengthen reasoning. To reduce the use of manual annotations while strengthening reasoning, we propose a hierarchical memory model (HM2Seq) for task-oriented systems. HM2Seq uses a hierarchical memory to separate the dialogue history and KB information into two memories and stores the KB in KB rows; it then uses a memory rows pointer combined with an entity decoder to perform hierarchical reasoning over the memory. The experimental results on two publicly available task-oriented dialogue datasets confirm our hypothesis and show that HM2Seq outperforms the baselines.
Dongni HU Chengxin CHEN Pengyuan ZHANG Junfeng LI Yonghong YAN Qingwei ZHAO
Recently, automated recognition and analysis of human emotion has attracted increasing attention from multidisciplinary communities. However, it is challenging to utilize the emotional information simultaneously from multiple modalities. Previous studies have explored different fusion methods, but they mainly focused on either inter-modality interaction or intra-modality interaction. In this letter, we propose a novel two-stage fusion strategy named modality attention flow (MAF) to model the intra- and inter-modality interactions simultaneously in a unified end-to-end framework. Experimental results show that the proposed approach outperforms the widely used late fusion methods, and achieves even better performance when the number of stacked MAF blocks increases.
Kiyoshi KURIHARA Nobumasa SEIYAMA Tadashi KUMANO
This paper describes a method to control prosodic features using phonetic and prosodic symbols as input of attention-based sequence-to-sequence (seq2seq) acoustic modeling (AM) for neural text-to-speech (TTS). The method involves inserting a sequence of prosodic symbols between phonetic symbols that are then used to reproduce prosodic acoustic features, i.e. accents, pauses, accent breaks, and sentence endings, in several seq2seq AM methods. The proposed phonetic and prosodic labels have simple descriptions and a low production cost. By contrast, the labels of conventional statistical parametric speech synthesis methods are complicated, and the cost of time alignments such as aligning the boundaries of phonemes is high. The proposed method does not need the boundary positions of phonemes. We propose an automatic conversion method for conventional labels and show how to automatically reproduce pitch accents and phonemes. The results of objective and subjective evaluations show the effectiveness of our method.
LINE is currently the most popular messaging service in Japan. Communications using LINE are protected by an original encryption scheme called LINE Encryption, and the specifications of the client-to-server transport encryption protocol and the client-to-client end-to-end message encryption protocol are published in the Technical Whitepaper. Although a spoofing attack (i.e., a malicious client makes another client misidentify its peer) and a replay attack (i.e., a message in a session is sent again in another session by a man-in-the-middle adversary, and the receiver accepts these messages) on the end-to-end protocol have been shown, no formal security analysis of these protocols is known. In this paper, we present formal verification results on the secrecy of application data and the authenticity of the protocols of LINE Encryption (Version 1.0), obtained using the automated security verification tool ProVerif. In particular, since it is claimed that the transport protocol satisfies forward secrecy (i.e., even if the static private key is leaked, the security of application data is guaranteed), we verify forward secrecy for the client's data and for the server's data in the transport protocol, and we find an attack that breaks the secrecy of the client's application data. Moreover, we find the spoofing attack and the replay attack reported in previous papers.
Zheng FANG Tieyong CAO Jibin YANG Meng SUN
Salient region detection is a fundamental problem in computer vision and image processing. Deep learning models perform better than traditional approaches but suffer from large parameter counts and slow speeds. To handle these problems, in this paper we propose the multi-feature fusion network (MFFN), an efficient salient region detection architecture based on a convolutional neural network (CNN). A novel feature extraction structure is designed to obtain feature maps from the CNN. A fusion dense block is used to fuse all low-level and high-level feature maps to derive salient region results. MFFN is an end-to-end architecture that does not need any post-processing procedures. Experiments on benchmark datasets demonstrate that MFFN achieves state-of-the-art performance on salient region detection while requiring far fewer parameters and much less computation time. Ablation experiments demonstrate the effectiveness of each module in MFFN.
Kohei WATABE Shintaro HIRAKAWA Kenji NAKAGAWA
In this paper, a parallel flow monitoring technique that achieves accurate measurement of the end-to-end delay of networks is proposed. In network monitoring tasks, network researchers and practitioners usually monitor multiple probe flows to measure delays on multiple paths in parallel. However, when the conventional method measures the end-to-end delay on a path, it does not utilize information from any flow other than the one along that path. In parallel monitoring, the paths of flows generally share common parts. The proposed method utilizes information from flows on paths that share common parts to measure the delay on a path, by partially converting the observation results of one flow into those of another flow. We perform simulations to confirm that the observation results of 72 parallel active-measurement flows are appropriately converted between each other. When the 99th percentile of the end-to-end delay of each flow is measured, the accuracy of the proposed method is double that of the conventional method.
In parallel computing systems, the interconnection network forms the critical infrastructure that enables robust and scalable communication between hundreds of thousands of nodes. The traditional packet-switched network tends to suffer from long communication times when network congestion occurs. In this context, we explore the use of circuit switching (CS) to replace packet switches with custom hardware that supports circuit-based switching efficiently with low latency. In our target CS network, a certain amount of bandwidth is guaranteed for each communication pair so that the network latency is predictable when a limited number of node pairs exchange messages. Because the number of allocated time slots in every switch directly affects the end-to-end latency, we improve the slot utilization and develop a network topology generator that minimizes the number of time slots for target applications whose communication patterns are predictable. Through a quantitative discrete-event simulation, we show that, in a topology generated by our design methodology, the minimum necessary number of slots can be reduced to a small number while keeping the network cost 50% lower than that of standard torus topologies.
Wei WANG Weiguang LI Zhaoming CHEN Mingquan SHI
In general, effectively integrating the advantages of different trackers can improve overall performance. In this work, we study the integration of multiple correlation filter (CF) trackers and propose a novel but simple tracking integration method that combines different trackers at the filter level. Because their correlation filters and features vary, the tracking results of different CF trackers are not directly comparable for tracking integration. To tackle this, we propose a twofold CF that unifies these various response maps so that the results of different tracking algorithms can be compared, thereby boosting the tracking performance in the manner of ensemble learning. An experiment integrating two CF methods on the OTB datasets demonstrates that the proposed method is effective and promising.
Siye WANG Mingyao WANG Boyu JIA Yonghua LI Wenbo XU
In this paper, we investigate the capacity performance of an in-band full-duplex (IBFD) amplify-and-forward two-way relay system under the effect of residual loop-back-interference (LBI). In a two-way IBFD relay system, two IBFD nodes exchange data with each other via an IBFD relay. Both two-way relaying and IBFD one-way relaying could double the spectrum efficiency theoretically. However, due to imperfect channel estimation, the performance of two-way relaying is degraded by self-interference at the receiver. Moreover, the performance of the IBFD relaying is deteriorated by LBI between the transmit antenna and the receive antenna of the node. Different from the IBFD one-way relay scenario, the IBFD two-way relay system will suffer from an extra level of LBI at the destination receiver. We derive accurate approximations of the average end-to-end capacities for both the IBFD and half-duplex modes. We evaluate the impact of the LBI and channel estimation errors on system performance. Monte Carlo simulations verify the validity of analytical results. It can be shown that with certain signal-to-noise ratio values and effective interference cancellation techniques, the IBFD transmission is preferable in terms of capacity. The IBFD two-way relaying is an attractive technique for practical applications.
A statistical call admission control (CAC) allows more calls with on-off patterns to be accepted and a higher channel efficiency to be achieved. In this paper, we propose three statistical CACs for VoIP calls with silence suppression considering the priority of each VoIP call, where the call priority is determined by the call acceptance order in an IP-PBX. We analyse the packet loss rates in an IP-PBX under the proposed strategies and express the end-to-end QoS of a VoIP call as an R-factor in a VoIP service network. The performances of the proposed CACs are evaluated using the maximum allowable number of VoIP calls while satisfying the end-to-end QoS constraint, the average QoS of acceptable VoIP calls and the channel efficiency. The advantage of the proposed statistical CACs over the non-statistical CAC is verified in terms of these three performance metrics. The results indicate that a trade-off is possible in that the maximum allowable number of VoIP calls in an IP-PBX increases as the average QoS of acceptable VoIP calls is lowered according to the proposed statistical CAC used. Nevertheless, the results allow us to verify that the channel efficiencies are the same for all the statistical CACs considered.
Xuyang WANG Pengyuan ZHANG Qingwei ZHAO Jielin PAN Yonghong YAN
The introduction of deep neural networks (DNNs) has led to a significant improvement in automatic speech recognition (ASR) performance. However, the whole ASR system remains sophisticated because of its dependence on the hidden Markov model (HMM). Recently, a new end-to-end ASR framework, which utilizes recurrent neural networks (RNNs) to directly model context-independent targets with the connectionist temporal classification (CTC) objective function, has been proposed and achieves results comparable to those of the hybrid HMM/DNN system. In this paper, we investigate per-dimension learning rate methods, including ADAGRAD and ADADELTA, to improve the recognition performance of the end-to-end system, based on the fact that the blank symbol used in the CTC technique dominates the output and that these methods give frequent features small learning rates. Experimental results show that, when ADADELTA is used, a relative word error rate (WER) reduction of more than 4% and an absolute improvement of 5% in label accuracy on the training set are achieved, and fewer training epochs are needed.
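The statement that per-dimension methods give frequent features small learning rates can be illustrated with a toy ADAGRAD update. This is a sketch under stated assumptions, not the experimental setup of the paper: the two dimensions, the gradient pattern (dimension 0 active at every step, like a dominant blank symbol; dimension 1 active once every ten steps), and the base learning rate are all invented.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """One in-place ADAGRAD update with a separate step size per dimension."""
    for i, g in enumerate(grads):
        accum[i] += g * g  # running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)

def effective_rates(accum, lr=0.1, eps=1e-8):
    """Per-dimension step size implied by the accumulated squared gradients."""
    return [lr / (math.sqrt(a) + eps) for a in accum]

params = [0.0, 0.0]
accum = [0.0, 0.0]
for step in range(100):
    # Dimension 0 receives a gradient at every step (frequent feature);
    # dimension 1 receives one only every tenth step (rare feature).
    grads = [1.0, 1.0 if step % 10 == 0 else 0.0]
    adagrad_step(params, grads, accum)

rates = effective_rates(accum)
print(f"frequent: {rates[0]:.4f}  rare: {rates[1]:.4f}")
```

After 100 steps the frequent dimension has accumulated ten times more squared gradient than the rare one, so its effective step size is smaller by a factor of about the square root of ten. ADADELTA follows the same per-dimension idea but replaces the running sum with decaying averages.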
With shortest path bridging MAC (SPBM), shortest path trees are computed based on link metrics from each node to all other participating nodes. When an edge bridge receives a frame, it selects a path along which to forward the frame to its destination node from multiple shortest paths. Blocking ports are eliminated to allow full use of the network links. This approach is expected to use network resources efficiently and to simplify the operating procedure. However, there is only one multipath distribution point in the SPBM network. This type of network can be defined as an end-to-end multipath network. Edge bridges need to split flows to achieve the load balancing of the entire network. This paper proposes a rate-based path selection scheme that can be employed for end-to-end multipath networks including SPBM. The proposed scheme assumes that a path with a low average rate will be congested because the TCP flow rates decrease on a congested path. When a new flow arrives at an edge bridge, it selects the path with the highest average rate since this should provide the new flow with the highest rate. The performance of the proposed scheme is confirmed by computer simulations. The appropriate timeout value is estimated from the expected round trip time (RTT). If an appropriate timeout value is used, the proposed scheme can realize good load balancing. The proposed scheme improves the efficiency of link utilization and throughput fairness. The performance is not affected by differences in the RTT or traffic congestion outside the SPBM network.
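The path selection rule, in which a new flow is assigned to the path with the highest average flow rate, can be sketched as follows. The data layout, the function name, and the handling of a path with no active flows are assumptions made for illustration; the timeout mechanism mentioned in the abstract is not modeled.

```python
def select_path(path_flow_rates):
    """Return the id of the path whose current flows have the highest average rate.

    path_flow_rates maps a path id to the measured rates of the flows on it.
    Assumption: a path carrying no flows is treated as the best candidate.
    """
    if not path_flow_rates:
        raise ValueError("no candidate paths")

    def average(rates):
        return sum(rates) / len(rates) if rates else float("inf")

    return max(path_flow_rates, key=lambda path: average(path_flow_rates[path]))

# Hypothetical per-flow rates (e.g., in Mbit/s) on three equal-cost paths.
observed = {"A": [10.0, 20.0], "B": [40.0, 50.0], "C": [5.0]}
print(select_path(observed))
```

In this example the flows on path B average the highest rate, which the scheme takes as a sign that B is the least congested, so the new flow is sent along B.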
Kosuke SANADA Jin SHI Nobuyoshi KOMURO Hiroo SEKIYA
The string-topology multi-hop network is often selected as an object of analysis because it is one of the fundamental network topologies. The purpose of this paper is to establish an analytical expression for the end-to-end delay of IEEE 802.11 string-topology multi-hop networks. To obtain the analytical expression, the effects of frame collisions and of carrier sensing from other nodes under the non-saturated condition are derived for each node in the network. To express the properties under the non-saturated condition, a new parameter, the frame-existence probability, is defined. The end-to-end delay of a string-topology multi-hop network can be derived as the sum of the transmission delays in the network flow. The analytical predictions agree well with simulation results, which shows the validity of the obtained analytical expressions.
In our previous work [2], we proposed a new concept of utility functions for rate control in communication networks. Unlike conventional utility-based rate control, in which the utility function of each user is defined as a function of its transmitting data rate, in [2] we defined the utility function of each user as a function of not only its transmitting data rate but also its receiving data rate. The former is called a session-level utility function and the latter is called a user-level utility function. The user-level utility function reflects the satisfaction of a user engaged in two-way communication, which consists of transmitting and receiving sessions, better than the session-level utility function does, since the user's satisfaction depends on not only its transmitting session but also its receiving session. In [2], an algorithm that required each user to know the exact utility function of its correspondent was developed. However, in some cases this information might not be available, for reasons such as security and privacy, and in such cases the algorithm developed in [2] cannot be used. Hence, in this paper, we develop a new distributed algorithm that does not require each user to know the utility function of its correspondent. Numerical results show that our new algorithm converges to the same solution as the algorithm that requires the utility information of the correspondent.
For real-time services, such as VoIP and videoconferencing supplied through a multi-domain MPLS network, it is vital to guarantee end-to-end QoS of the inter-domain paths. Thus, it is important to allocate an appropriate QoS class to the inter-domain paths in each transit domain. Because each domain has its own policy for QoS class allocation, each domain must then allocate an appropriate QoS class adaptively based on the estimation of the QoS class allocation policies adopted in other domains. This paper proposes an adaptive method for acquiring a QoS class allocation policy through the use of reinforcement learning. This method learns the appropriate policy through experience in the actual QoS class allocation process. Thus, the method can adapt to a complex environment where the arrival of inter-domain path requests does not follow a simple Poisson process and where the various QoS class allocation policies are adopted in other domains. The proposed method updates the allocation policy whenever a QoS class is actually allocated to an inter-domain path. Moreover, some of the allocation policies often utilized in the real operational environment can be updated and refined more frequently. For these reasons, the proposed method is designed to adapt rapidly to variances in the surrounding environment. Simulation results verify that the proposed method can quickly adapt to variations in the arrival process of inter-domain path requests and the QoS class allocation policies in other domains.