Thi Thu HIEN NGUYEN Thai BINH NGUYEN Ngoc PHUONG PHAM Quoc TRUONG DO Tu LUC LE Chi MAI LUONG
Speech recognition is a technique that recognizes words and sentences in audio and converts them into text. With the advancement of deep learning technologies, speech recognition has achieved very satisfactory results, close to human ability. However, the recognition output still has limitations, such as missing punctuation and capitalization and non-standardized numerical data. Vietnamese also contains local words, homonyms, etc., which make the recognition output difficult for users to read and understand and hinder downstream tasks in Natural Language Processing (NLP). In this paper, we propose combining a transformer decoder with a conditional random field (CRF) to restore punctuation and capitalization in Vietnamese automatic speech recognition (ASR) output. By chunking input sentences and merging output sequences, the method can handle longer strings with greater accuracy. Experiments on a Vietnamese post-speech-recognition dataset show that the proposed method delivers the best results.
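The chunk-and-merge idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the chunk size, the overlap, or the merge rule, so the sketch assumes overlapping fixed-size windows and a merge rule that keeps, for each token, the label predicted where the token sits farthest from a chunk edge. The function names and `toy_tagger` (a stand-in for the transformer-CRF tagger) are hypothetical.

```python
from typing import Callable, List, Tuple

Tagger = Callable[[List[str]], List[str]]


def split_into_chunks(tokens: List[str], size: int, overlap: int) -> List[Tuple[int, List[str]]]:
    """Cover `tokens` with overlapping windows; return (start_index, chunk) pairs."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break
        start += step
    return chunks


def merge_labels(n_tokens: int, chunk_labels: List[Tuple[int, List[str]]]) -> List[str]:
    """Assumed merge rule: for each token, keep the label from the chunk
    in which the token is farthest from either chunk edge."""
    best_margin = [-1] * n_tokens
    merged = ["O"] * n_tokens
    for start, labels in chunk_labels:
        length = len(labels)
        for i, label in enumerate(labels):
            margin = min(i, length - 1 - i)
            if margin > best_margin[start + i]:
                best_margin[start + i] = margin
                merged[start + i] = label
    return merged


def restore(tokens: List[str], tagger: Tagger, size: int = 4, overlap: int = 2) -> List[str]:
    """Tag each chunk independently, then merge the per-chunk label sequences."""
    chunks = split_into_chunks(tokens, size, overlap)
    chunk_labels = [(start, tagger(chunk)) for start, chunk in chunks]
    return merge_labels(len(tokens), chunk_labels)


def toy_tagger(chunk: List[str]) -> List[str]:
    """Hypothetical stand-in for the real model: mark known names for capitalization."""
    names = {"hanoi", "vietnam"}
    return ["CAP" if token in names else "O" for token in chunk]
```

With a real tagger, tokens near a chunk edge see less context, which is why this sketch prefers predictions made near the middle of a chunk.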
Dongni HU Chengxin CHEN Pengyuan ZHANG Junfeng LI Yonghong YAN Qingwei ZHAO
Recently, automated recognition and analysis of human emotion has attracted increasing attention from multidisciplinary communities. However, it is challenging to utilize the emotional information simultaneously from multiple modalities. Previous studies have explored different fusion methods, but they mainly focused on either inter-modality interaction or intra-modality interaction. In this letter, we propose a novel two-stage fusion strategy named modality attention flow (MAF) to model the intra- and inter-modality interactions simultaneously in a unified end-to-end framework. Experimental results show that the proposed approach outperforms the widely used late fusion methods, and achieves even better performance when the number of stacked MAF blocks increases.
This paper proposes pulse-width modulated (PWM) signaling[1] to send clock and data over a pair of channels for an in-vehicle network in which a closed chain of point-to-point (P2P) interconnections between electronic control units (ECUs) has been established. To improve the detection speed and margin of the proposed receiver, we also propose a novel clock and data recovery (CDR) scheme with a 0.5 unit-interval (UI) tuning range and a PWM generator utilizing 10 equally spaced phases. The feasibility of the proposed system is demonstrated by successfully detecting 1.25 Gb/s data delivered via 3 ECUs and inter-channels in 180 nm CMOS technology. Compared to a previous study, the proposed system achieves better efficiency in terms of power, cost, and reliability.
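How PWM signaling can carry clock and data together may be illustrated with a small behavioral sketch. It assumes that each unit interval (UI) is divided into 10 phases (the abstract mentions 10 equally spaced phases) and that a 0 and a 1 are sent as pulses 3 and 7 phases wide; those duty cycles are assumptions chosen for illustration, and the sketch does not model the paper's CDR circuit.

```python
from typing import List

PHASES_PER_UI = 10  # one UI split into 10 equally spaced phases
DUTY_ZERO = 3       # assumed pulse width (in phases) for a 0 bit
DUTY_ONE = 7        # assumed pulse width (in phases) for a 1 bit


def pwm_encode(bits: List[int]) -> List[int]:
    """Encode each bit as one UI: a high pulse followed by a low tail."""
    wave: List[int] = []
    for bit in bits:
        high = DUTY_ONE if bit else DUTY_ZERO
        wave.extend([1] * high + [0] * (PHASES_PER_UI - high))
    return wave


def rising_edges(wave: List[int]) -> List[int]:
    """Every UI starts with a rising edge, so the edges carry the clock."""
    return [i for i in range(len(wave))
            if wave[i] == 1 and (i == 0 or wave[i - 1] == 0)]


def pwm_decode(wave: List[int]) -> List[int]:
    """Recover data by thresholding the pulse width of each UI at half a UI."""
    bits = []
    for start in range(0, len(wave), PHASES_PER_UI):
        ui = wave[start:start + PHASES_PER_UI]
        bits.append(1 if 2 * sum(ui) > PHASES_PER_UI else 0)
    return bits
```

Because the rising edge position is fixed and only the falling edge moves with the data, a receiver can recover timing from the rising edges and data from the pulse width.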
Takaaki SAEKI Yuki SAITO Shinnosuke TAKAMICHI Hiroshi SARUWATARI
This paper proposes two high-fidelity and computationally efficient neural voice conversion (VC) methods based on a direct waveform modification using spectral differentials. The conventional spectral-differential VC method with a minimum-phase filter achieves high-quality conversion for narrow-band (16 kHz-sampled) VC but requires heavy computational cost in filtering. This is because the minimum phase obtained using a fixed lifter of the Hilbert transform often results in a long-tap filter. Furthermore, when we extend the method to full-band (48 kHz-sampled) VC, the computational cost is heavy due to increased sampling points, and the converted-speech quality degrades due to large fluctuations in the high-frequency band. To construct a short-tap filter, we propose a lifter-training method for data-driven phase reconstruction that trains a lifter of the Hilbert transform by taking into account filter truncation. We also propose a frequency-band-wise modeling method based on sub-band multi-rate signal processing (sub-band modeling method) for full-band VC. It enhances the computational efficiency by reducing sampling points of signals converted with filtering and improves converted-speech quality by modeling only the low-frequency band. We conducted several objective and subjective evaluations to investigate the effectiveness of the proposed methods through implementation of the real-time, online, full-band VC system we developed, which is based on the proposed methods. The results indicate that 1) the proposed lifter-training method for narrow-band VC can shorten the tap length to 1/16 without degrading the converted-speech quality, and 2) the proposed sub-band modeling method for full-band VC can improve the converted-speech quality while reducing the computational cost, and 3) our real-time, online, full-band VC system can convert 48 kHz-sampled speech in real time attaining the converted speech with a 3.6 out of 5.0 mean opinion score of naturalness.
Zhen LI Baojun ZHAO Wenzheng WANG Baoxian WANG
Hyperspectral images (HSIs) are generally susceptible to various types of noise, such as Gaussian and stripe noise. Recently, numerous denoising algorithms have been proposed to recover HSIs. However, those approaches cannot use spectral information efficiently and are weak at removing stripe noise. Here, we propose a tensor decomposition method with two different constraints to remove mixed noise from HSIs. For an HSI cube, we first employ the tensor singular value decomposition (t-SVD) to effectively preserve the low-rank information of HSIs. Considering the continuity of HSI spectra, we design a simple smoothness constraint using Tikhonov regularization for tensor decomposition to enhance the denoising performance. Moreover, we design a new unidirectional total variation (TV) constraint to filter the stripe noise from HSIs. This strategy preserves image details better than the original TV models. The developed method is evaluated on both synthetic and real noisy HSIs and shows favorable results.
Yusuke SAKUMOTO Hiroyuki OHSAKI
Various graph algorithms have been developed with multiple random walks, the movement of several independent random walkers on a graph. Designing an efficient graph algorithm based on multiple random walks requires investigating multiple random walks theoretically to attain a deep understanding of their characteristics. The first meeting time is one of the important metrics for multiple random walks. The first meeting time on a graph is defined as the time it takes for multiple random walkers to meet at the same node in a graph. This time is closely related to the rendezvous problem, a fundamental problem in computer science. The first meeting time of multiple random walks has been analyzed previously, but many of these analyses focused on regular graphs. In this paper, we analyze the first meeting time of multiple random walks in arbitrary graphs and clarify the effects of graph structures on expected values. First, we derive the spectral formula of the expected first meeting time on the basis of spectral graph theory. Then, we examine the principal component of the expected first meeting time using the derived spectral formula. The clarified principal component reveals that (a) the expected first meeting time is almost dominated by $n/(1+d_{\rm std}^2/d_{\rm avg}^2)$ and (b) the expected first meeting time is independent of the starting nodes of random walkers, where $n$ is the number of nodes of the graph, and $d_{\rm avg}$ and $d_{\rm std}$ are the average and the standard deviation of weighted node degrees, respectively. Characteristic (a) is useful for understanding the effect of the graph structure on the first meeting time. According to the revealed effect of graph structures, a larger coefficient of variation $d_{\rm std}/d_{\rm avg}$ (degree heterogeneity) of weighted degrees facilitates the meeting of random walkers.
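The dominant term in characteristic (a) is easy to evaluate from a degree sequence. The sketch below computes it for a list of weighted node degrees; the function name is hypothetical, and the use of the population standard deviation (rather than the sample one) is an assumption, since the abstract does not say which is meant.

```python
from statistics import mean, pstdev
from typing import Sequence


def meeting_time_scale(weighted_degrees: Sequence[float]) -> float:
    """Evaluate n / (1 + d_std^2 / d_avg^2) for the given weighted degrees.

    Assumption: d_std is the population standard deviation of the degrees.
    """
    n = len(weighted_degrees)
    d_avg = mean(weighted_degrees)
    d_std = pstdev(weighted_degrees)
    return n / (1 + (d_std / d_avg) ** 2)


# A regular graph has d_std = 0, so the term equals n.
regular = meeting_time_scale([4] * 10)

# A star graph on 5 nodes (hub degree 4, four leaves of degree 1) is
# heterogeneous, so the term drops below n = 5.
star = meeting_time_scale([4, 1, 1, 1, 1])
```

For the star graph the term is 3.2, compared with 5 for a regular graph of the same size, which matches the statement that degree heterogeneity facilitates meeting.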
Toshiki YAMADA Takahiro KAJI Chiyumi YAMADA Akira OTOMO
We previously developed a new terahertz (THz) wave detection method that utilizes the Stark effect of nonlinear optical (NLO) polymers. The new method provided gapless detection, a wide detection bandwidth, and a simpler optical geometry for THz wave detection. In this paper, polarization dependences in THz wave detection by the Stark effect were investigated. The projection model was employed to analyze the polarization dependences, and qualitative consistency with experiments was observed, supporting the use of the first-order Stark effect in this method. The relations between THz wave detection by the Stark effect and Stark spectroscopy or electroabsorption spectroscopy are also discussed.
Yahui WANG Wenxi ZHANG Xinxin KONG Yongbiao WANG Hongxin ZHANG
Laser speech detection uses a non-contact Laser Doppler Vibrometry (LDV)-based acoustic sensor to obtain speech signals by precisely measuring voice-generated surface vibrations. Over long distances, however, the detected signal is very weak and full of speckle noise. To enhance the quality and intelligibility of the detected signal, we designed a two-sided Linear Prediction Coding (LPC)-based locator and interpolator to detect and replace speckle noise. We first studied the characteristics of speckle noise in detected signals and developed a binary-state statistical model for speckle noise generation. A two-sided LPC-based locator was then designed to locate the polluted samples, composed of an inverse decorrelator, nonlinear filter and threshold estimator. This greatly improves the detectability of speckle noise and avoids false/missed detection by improving the noise-to-signal-ratio (NSR). Finally, samples from both sides of the speckle noise were used to estimate the parameters of the interpolator and to code samples for replacing the polluted samples. Real-world speckle noise removal experiments and simulation-based comparative experiments were conducted and the results show that the proposed method is better able to locate speckle noise in laser detected speech and highly effective at replacing it.
Young-Kyoon SUH Seounghyeon KIM Joo-Young LEE Hawon CHU Junyoung AN Kyong-Ha LEE
In this letter, we analyze the economic value of GPUs for analytical processing in GPU-accelerated database management systems (DBMSes). To this end, we conducted rigorous experiments with TPC-H across three popular GPU DBMSes. The results show that co-processing with CPU and GPU in the GPU DBMSes was cost-effective despite the concerns that have been raised.
This letter proposes a downlink multiple-input multiple-output (MIMO) non-orthogonal multiple access technique that mitigates multi-cell interference (MCI) at cell-edge users, regardless of the number of interfering cells, thereby improving the spectral efficiency. This technique employs specific receive beamforming vectors at the cell-edge users in clusters to minimize the MCI. Based on the receive beamforming vectors adopted by the cell-edge users, the transmit beamforming vectors for a base station (BS) and the receive beamforming vectors for cell-center users are designed to eliminate the inter-cluster interference and maximize the spectral efficiency. As each user can directly obtain its own receive beamforming vector, this technique does not require channel feedback from the users to a BS to design the receive beamforming vectors, thereby reducing the system overhead. We also derive the upper bound of the average sum rate achievable using the proposed technique. Finally, we demonstrate through simulations that the proposed technique achieves a better sum rate performance than the existing schemes and that the derived upper bound is valid.
Fumihiro YAMASHITA Daisuke GOTO Yasuyoshi KOJIMA Jun-ichi ABE Takeshi ONIZAWA
We have developed a direct spectrum division transmission (DSDT) technique that can divide a single-carrier signal into multiple sub-spectra and assign them to dispersed frequency resources of the satellite transponder to improve the spectrum efficiency of the whole system. This paper summarizes the satellite experiments on DSDT over a single and/or multiple satellite transponders, while changing various parameters such as modulation schemes, roll-off ratios, and symbol rates. In addition, by considering practical use conditions, we present an evaluation of the performance when the spectral density of each sub-spectrum differed across transponders. The satellite experiments demonstrate that applying the proposal does not degrade the bit error rate (BER) performance. Thus, the DSDT technique is a practical approach to use the scattered unused frequency resources over not only a single transponder but also multiple ones.
Spectral graph theory provides an algebraic approach to investigating the characteristics of weighted networks using the eigenvalues and eigenvectors of a matrix (e.g., the normalized Laplacian matrix) that represents the structure of the network. However, it is difficult to accurately represent the structures of large-scale and complex networks (e.g., social networks) as a matrix. This difficulty can be avoided if there is a universality such that the eigenvalues are independent of the detailed structure of large-scale and complex networks. In this paper, we clarify Wigner's Semicircle Law for weighted networks as such a universality. The law indicates that the eigenvalues of the normalized Laplacian matrix of weighted networks can be calculated from a few network statistics (the average degree, average link weight, and square average link weight) when the weighted networks satisfy a sufficient condition on the node degrees and the link weights.
Narihiro NAKAMOTO Toru TAKAHASHI Toru FUKASAWA Naofumi YONEDA Hiroaki MIYASHITA
This paper proposes a dual linear-polarized open-ended waveguide subarray designed for use in phased array antennas. The proposed subarray is a one-dimensional linear array that consists of open-ended waveguide antenna elements and suspended stripline feed networks to realize vertical and horizontal polarizations. The antenna includes a novel suspended stripline-to-waveguide transition that combines double- and quad-ridge waveguides to minimize the size of the transition and enhance the port isolation. Metal posts are installed on the waveguide apertures to eliminate scan-blindness. Prototype subarrays are fabricated and tested in an array of 16 subarrays. The experimental tests and numerical simulations indicate that the prototype subarray offers a low reflection coefficient of less than -11.4dB, low cross-polarization of less than -26dB, and antenna efficiency above 69% in the frequency bandwidth of 14%.
Kiyoshi KURIHARA Nobumasa SEIYAMA Tadashi KUMANO
This paper describes a method to control prosodic features using phonetic and prosodic symbols as input of attention-based sequence-to-sequence (seq2seq) acoustic modeling (AM) for neural text-to-speech (TTS). The method involves inserting a sequence of prosodic symbols between phonetic symbols that are then used to reproduce prosodic acoustic features, i.e. accents, pauses, accent breaks, and sentence endings, in several seq2seq AM methods. The proposed phonetic and prosodic labels have simple descriptions and a low production cost. By contrast, the labels of conventional statistical parametric speech synthesis methods are complicated, and the cost of time alignments such as aligning the boundaries of phonemes is high. The proposed method does not need the boundary positions of phonemes. We propose an automatic conversion method for conventional labels and show how to automatically reproduce pitch accents and phonemes. The results of objective and subjective evaluations show the effectiveness of our method.
Kohei SHIMATANI Shigemasa TAKAI
We consider the bisimilarity control problem for partially observed nondeterministic discrete event systems with deterministic specifications. This problem requires us to synthesize a supervisor that achieves bisimulation equivalence of the supervised system and the deterministic specification under partial observation. We present necessary and sufficient conditions for the existence of such a deterministic supervisor and show that these conditions can be verified polynomially.
Yuki NISHIO Osamu TAKYU Hayato SOYA Keiichiro SHIRAI Mai OHTA Takeo FUJII
Dynamic spectrum access (DSA) exploits vacant frequency resources via distributed wireless access. The two nodes of DSA, master and slave, access different channels and thus cannot communicate with each other. To compensate for this access-channel mismatch, a rendezvous channel, which exchanges control signals between the two nodes, has been considered. The rendezvous channel based on the channel-occupancy ratio (COR) adaptively constructs the channel in accordance with the channel occupancy of other systems, and achieves both a high-speed rendezvous channel and high usage efficiency of the frequency resource by exploiting vacant channels. In the COR-based rendezvous channel, the master and the slave each recognize the channel with the minimum measured COR as the superior channel. The master sends the control signals through the superior channel it recognizes, while the slave accesses the superior channel it recognizes at a higher rate than the other channels. As a result, the slave can receive the control signals with high probability, and a high-speed rendezvous channel is achieved. If the master and the slave recognize different channels as the superior channel, the access rate to the other channels should be larger. This is because the slave then has the opportunity to receive the control signals through a channel other than the superior channel it recognizes, so the high probability that the slave receives the control signals is maintained. Therefore, the access rate of the slave should be set in accordance with the recognition of the superior channel by the master and the slave. In this paper, the access rate of the slave to the superior channel is optimized using the analyzed probability of completing the rendezvous channel. This analysis includes the recognition of the superior channel by the master and the slave. Even if the master and the slave recognize different channels, the resulting access rate of the slave can maintain a high-speed rendezvous channel. Theoretical analysis and computer simulation show that the COR-based rendezvous channel with the optimal access rate to the channel with the lowest COR reduces the time required for the rendezvous channel.
Isao ECHIZEN Noboru BABAGUCHI Junichi YAMAGISHI Naoko NITTA Yuta NAKASHIMA Kazuaki NAKAMURA Kazuhiro KONO Fuming FANG Seiko MYOJIN Zhenzhong KUANG Huy H. NGUYEN Ngoc-Dung T. TIEU
With the spread of high-performance sensors and social network services (SNS) and the remarkable advances in machine learning technologies, fake media such as fake videos, spoofed voices, and fake reviews that are generated using high-quality learning data and are very close to the real thing are causing serious social problems. We launched a research project, the Media Clone (MC) project, to protect receivers of replicas of real media called media clones (MCs) skillfully fabricated by means of media processing technologies. Our aim is to achieve a communication system that can defend against MC attacks and help ensure safe and reliable communication. This paper describes the results of research in two of the five themes in the MC project: 1) verification of the capability of generating various types of media clones such as audio, visual, and text derived from fake information and 2) realization of a protection shield for media clones' attacks by recognizing them.
Silver electrical contacts were separated at a constant opening speed in a 200V-500VDC/10A resistive circuit. Break arcs were extinguished by magnetic blow-out using the transverse magnetic field of a permanent magnet. The permanent magnet was positioned so as to simplify the lengthened shape of the break arcs. The magnetic flux density of the transverse magnetic field was varied from 20 to 140 mT. Images of the break arcs were observed from the horizontal and vertical directions using two high-speed cameras simultaneously. The arc length just before extinction was analyzed from the observed images. Owing to the appropriate arrangement of the magnet, the shapes of the break arcs were simple enough to trace most of the arc paths under all experimental conditions. The arc length increased with increasing supply voltage and decreased with increasing magnetic flux density. These results are discussed from the viewpoints of arc lengthening time and arc lengthening velocity.
Shusuke NARIEDA Daiki CHO Hiromichi OGASAWARA Kenta UMEBAYASHI Takeo FUJII Hiroshi NARUSE
This paper provides theoretical analyses of maximum cyclic autocorrelation selection (MCAS)-based spectrum sensing techniques in cognitive radio networks. MCAS-based spectrum sensing techniques have low computational complexity in comparison with some cyclostationary detection techniques. However, the characteristics of MCAS-based spectrum sensing have never been derived theoretically. In this study, we derive closed-form solutions for the signal detection probability and false alarm probability of MCAS-based spectrum sensing. The theoretical values are compared with numerical examples, and the values match each other well.
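The quantity underlying this kind of sensing, the cyclic autocorrelation function (CAF), can be sketched briefly. The abstract does not describe the MCAS decision rule, so the rule below is an assumption made for illustration: the signal is declared present when the CAF magnitude at the target cyclic frequency exceeds the CAF magnitude at every other candidate cyclic frequency. Function and parameter names are hypothetical, and cyclic frequencies are normalized to cycles per sample.

```python
import cmath
import math
from typing import Sequence


def cyclic_autocorrelation(x: Sequence[complex], alpha: float, lag: int = 0) -> complex:
    """Estimate the CAF: average of x[n+lag] * conj(x[n]) * exp(-j*2*pi*alpha*n)."""
    n_terms = len(x) - lag
    total = 0j
    for n in range(n_terms):
        total += (x[n + lag] * complex(x[n]).conjugate()
                  * cmath.exp(-2j * math.pi * alpha * n))
    return total / n_terms


def mcas_detect(x: Sequence[complex], target_alpha: float,
                other_alphas: Sequence[float], lag: int = 0) -> bool:
    """Assumed selection rule: detect when the CAF magnitude at the target
    cyclic frequency is the maximum among all candidate cyclic frequencies."""
    peak = abs(cyclic_autocorrelation(x, target_alpha, lag))
    non_peaks = [abs(cyclic_autocorrelation(x, a, lag)) for a in other_alphas]
    return peak > max(non_peaks)


# A real sinusoid at frequency 0.1 has a CAF peak at cyclic frequency 0.2.
target_signal = [math.cos(2 * math.pi * 0.1 * n) for n in range(200)]
# A sinusoid at 0.065 peaks at cyclic frequency 0.13 instead.
other_signal = [math.cos(2 * math.pi * 0.065 * n) for n in range(200)]
```

A rule of this form needs no noise-dependent threshold, only a comparison among a few CAF values. These noise-free sinusoids only exercise the mechanics; they say nothing about detection or false alarm probabilities, which are what the paper derives.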
Koichiro SAWA Yoshitada WATANABE Takahiro UENO Hirotasu MASUBUCHI
The authors have been investigating the deterioration process of an Au-plated slip-ring and Ag-Pd brush system with lubricant in order to realize a stable and long lifetime. Past tests made clear that the lubricant is very important for a long lifetime, and a simple model of the deterioration process was proposed. However, how the lubricant deteriorates, and how lubricant deterioration relates to contact voltage behavior, remain open issues. In this paper, the contact voltage waveforms were recorded regularly during the test and analyzed to obtain the time change of the peak voltage and standard deviation during one rotation. Based on these results, we discuss what happens at the interface between ring and brush with the lubricant, and the following results are made clear. Fluctuation of the voltage waveforms, especially peaks of pulse-like fluctuation, occurs more easily for minus rings than for plus rings. Further, the peak values of the pulse-like fluctuation rapidly decrease and disappear at lower rotation speeds, as mentioned in previous works. In addition, each peak of the pulse-like fluctuation is identified with a position on the ring periphery. From these results, it can be assumed that a lubricant film exists between the brush and the ring surface and that electric conduction is realized by the tunnel effect. In other words, the fluctuation would be caused by the lubricant layer, not only by the ring surface. Finally, an electric conduction model is proposed, and the above results can be explained by this model.