Hao WANG Yao MA Jianyong DUAN Li HE Xin LI
Chinese Spelling Correction (CSC) is an important natural language processing task. Existing methods for CSC mostly utilize BERT models, which select a character from a candidate list to correct errors in the sentence. World knowledge refers to structured information and relationships spanning a wide range of domains and subjects, while definition knowledge pertains to textual explanations or descriptions of specific words or concepts. Both forms of knowledge have the potential to enhance a model’s ability to comprehend contextual nuances. As BERT lacks sufficient guidance from world knowledge for error correction and existing models overlook the rich definition knowledge in Chinese dictionaries, the performance of spelling correction models is somewhat compromised. To address these issues, within the world knowledge network, this study injects world knowledge from knowledge graphs into the model to assist in correcting spelling errors caused by a lack of world knowledge. Additionally, the definition knowledge network in this model improves the error correction capability by utilizing the definitions from the Chinese dictionary through a comparative learning approach. Experimental results on the SIGHAN benchmark dataset validate the effectiveness of our approach.
Zhishuo ZHANG Chengxiang TAN Xueyan ZHAO Min YANG
Entity alignment (EA) is a crucial task for integrating cross-lingual and cross-domain knowledge graphs (KGs), which aims to discover entities referring to the same real-world object from different KGs. Most existing embedding-based methods generate aligning entity representation by mining the relevance of triple elements, paying little attention to triple indivisibility and entity role diversity. In this paper, a novel framework named TTEA - Type-enhanced Ensemble Triple Representation via Triple-aware Attention for Cross-lingual Entity Alignment is proposed to overcome the above shortcomings from the perspective of ensemble triple representation, considering triple specificity and the diversity of entity roles. Specifically, the ensemble triple representation is derived by regarding the relation as an information carrier between the semantic and type spaces, and hence the noise influence during spatial transformation and information propagation can be smoothly controlled via specificity-aware triple attention. Moreover, the role diversity of triple elements is modeled via triple-aware entity enhancement in TTEA for EA-oriented entity representation. Extensive experiments on three real-world cross-lingual datasets demonstrate that our framework achieves competitive results.
Tetsuya ARAKI Shin-ichi NAKANO
The dispersion problem is a variant of the facility location problem that has been extensively studied. Given a polygon with n edges on a plane, we want to find k points in the polygon so that the minimum pairwise Euclidean distance of the k points is maximized. We call the problem the k-dispersion problem in a polygon. Intuitively, for an island, we want to locate k drone bases far away from each other in flying distance to avoid congestion in the sky. In this paper, we give a polynomial-time approximation scheme (PTAS) for this problem when k is a constant and ε < 1 (where ε is a positive real number). Our proposed algorithm runs in O(((1/ε)^2 + n/ε)^k) time with a 1/(1 + ε) approximation ratio and is the first PTAS developed for this problem. Additionally, we consider three variations of the dispersion problem and design a PTAS for each of them.
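To make the objective concrete, here is a minimal Python sketch of the k-dispersion goal. It is not the PTAS described in the abstract: it simply samples candidate points on a grid inside the polygon and exhaustively picks the k candidates whose minimum pairwise Euclidean distance is largest. The function names, the grid step, and the brute-force search are all assumptions made for illustration, and the search cost grows with the number of candidates to the power k.

```python
from itertools import combinations
from math import dist


def point_in_polygon(point, polygon):
    """Ray-casting test; reliable for points strictly inside or outside."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def grid_candidates(polygon, step):
    """Grid points (offset by step/2 from the bounding box) that lie inside the polygon."""
    xs = [vx for vx, _ in polygon]
    ys = [vy for _, vy in polygon]
    candidates = []
    x = min(xs) + step / 2
    while x < max(xs):
        y = min(ys) + step / 2
        while y < max(ys):
            if point_in_polygon((x, y), polygon):
                candidates.append((x, y))
            y += step
        x += step
    return candidates


def min_pairwise_distance(points):
    """The quantity the dispersion problem maximizes (needs at least two points)."""
    return min(dist(a, b) for a, b in combinations(points, 2))


def best_dispersion(candidates, k):
    """Brute force over all k-subsets of the candidates; k must be at least 2."""
    return max(combinations(candidates, k), key=min_pairwise_distance)


square = [(0, 0), (4, 0), (4, 4), (0, 4)]
chosen = best_dispersion(grid_candidates(square, step=1.0), k=2)
print(chosen, min_pairwise_distance(chosen))
```

A finer grid step brings the grid optimum closer to the true optimum at the cost of more candidates, which is the same accuracy-versus-time trade-off that the ε parameter expresses in the stated running time.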
Hongliang FU Qianqian LI Huawei TAO Chunhua ZHU Yue XIE Ruxue GUO
Speech emotion recognition (SER) is a key research technology to realize the third generation of artificial intelligence, which is widely used in human-computer interaction, emotion diagnosis, interpersonal communication and other fields. However, the aliasing of language and semantic information in speech tends to distort the alignment of emotion features, which affects the performance of cross-corpus SER systems. This paper proposes a cross-corpus SER model based on causal emotion information representation (CEIR). The model uses the reconstruction loss of the deep autoencoder network and the source domain label information to realize the preliminary separation of causal features. Then, the causal correlation matrix is constructed, and the local maximum mean difference (LMMD) feature alignment technology is combined to make the causal features of different dimensions jointly distributed independent. Finally, the supervised fine-tuning of labeled data is used to achieve effective extraction of causal emotion information. The experimental results show that the average unweighted average recall (UAR) of the proposed algorithm is 3.4% to 7.01% higher than that of several recent algorithms in the field.
Ji XI Yue XIE Pengxu JIANG Wei JIANG
Currently, a significant portion of acoustic scene categorization (ASC) research is centered around utilizing Convolutional Neural Network (CNN) models. This preference is primarily due to CNN’s ability to effectively extract time-frequency information from audio recordings of scenes by employing spectrum data as input. The expression of many dimensions can be achieved by utilizing 2D spectrum characteristics. Nevertheless, the diverse interpretations of the same object’s existence in different positions on the spectrum map can be attributed to the discrepancies between spectrum properties and picture qualities. The lack of distinction between different aspects of input information in ASC-based CNN networks may result in a decline in system performance. Considering this, a feature pyramid segmentation (FPS) approach based on CNN is proposed. The proposed approach involves utilizing spectrum features as the input for the model. These features are split based on a preset scale, and each segment-level feature is then fed into the CNN network for learning. The SoftMax classifier will receive the output of all feature scales, and these high-level features will be fused and fed to it to categorize different scenarios. The experiment provides evidence to support the efficacy of the FPS strategy and its potential to enhance the performance of the ASC system.
Pengxu JIANG Yang YANG Yue XIE Cairong ZOU Qingyun WANG
Convolutional neural network (CNN) is widely used in acoustic scene classification (ASC) tasks. In most cases, local convolution is utilized to gather time-frequency information between spectrum nodes. It is challenging to adequately express the non-local link between frequency domains in a finite convolution region. In this paper, we propose a dual-path convolutional neural network based on band interaction block (DCNN-bi) for ASC, with mel-spectrogram as the model’s input. We build two parallel CNN paths to learn the high-frequency and low-frequency components of the input feature. Additionally, we have created three band interaction blocks (bi-blocks) to explore the pertinent nodes between various frequency bands, which are connected between two paths. Combining the time-frequency information from two paths, the bi-blocks with three distinct designs acquire non-local information and send it back to the respective paths. The experimental results indicate that the utilization of the bi-block has the potential to improve the initial performance of the CNN substantially. Specifically, when applied to the DCASE 2018 and DCASE 2020 datasets, the CNN exhibited performance improvements of 1.79% and 3.06%, respectively.
The steady-state and convergence performances are important indicators for evaluating adaptive algorithms, and the step-size affects both of them directly. Many scholars have proposed variable step-size adaptive algorithms to improve performance. However, these existing variable step-size adaptive algorithms still have problems, such as insufficient theoretical analysis, imbalanced performance, and unachievable parameters. These problems greatly influence the actual performance of some algorithms. Therefore, in this paper we further explore the inherent relationship between the key performance and the step-size. The variation of mean square deviation (MSD) is adopted as the cost function. Based on theoretical analyses and derivations, a novel variable step-size algorithm with a dynamic limited function (DLF) is proposed. In addition, a thorough theoretical analysis is conducted on the weight deviation and the convergence stability. The proposed algorithm is also compared with several typical algorithms in many different environments. Both the theoretical analysis and the experimental results verify that the proposed algorithm achieves superior performance.
Junko SHIROGANE Daisuke SAYAMA Hajime IWATA Yoshiaki FUKAZAWA
Webpage texts are often emphasized by decorations such as bold, italic, underline, and text color using HTML (HyperText Markup Language) tags and CSS (Cascading Style Sheets). However, users with visual impairment often struggle to recognize decorations appropriately because most screen readers do not read decorations appropriately. To overcome this limitation, we propose a method to read emphasized texts by changing the reading voice parameters of a screen reader and adding sound effects. First, the strong emphasis types and reading voices are investigated. Second, the intensity of the emphasis type is used to calculate a score. Then the score is used to assign the reading method for the emphasized text. Finally, the proposed method is evaluated by users with and without visual impairment. The proposed method can convey emphasized texts, but future improvements are necessary.
Takumasa ISHIOKA Tatsuya FUKUI Toshihito FUJIWARA Satoshi NARIKAWA Takuya FUJIHASHI Shunsuke SARUWATARI Takashi WATANABE
Cloud gaming systems allow users to play games that require high-performance computational capability on their mobile devices at any location. However, playing games through cloud gaming systems increases the Round-Trip Time (RTT) due to increased network delay. To simulate a local gaming experience for cloud users, we must minimize RTTs, which include network delays. Speculative video transmission pre-generates and encodes video frames corresponding to all possible user inputs and sends them to the user before the user’s input. Speculative video transmission mitigates the network delay, whereas a simple implementation significantly increases the video traffic. This paper proposes tile-wise delta detection for traffic reduction of speculative video transmission. More specifically, the proposed method determines a reference video frame from the generated video frames and divides the reference video frame into multiple tiles. We calculate the similarity between each tile of the reference video frame and other video frames based on a hash function. Based on the calculated similarity, we determine redundant tiles and do not transmit them, which reduces traffic volume in minimal processing time without implementing a high compression ratio video compression technique. Evaluations using commercial games showed that the proposed method reduced traffic volume by 40-50% when the SSIM index was around 0.98 in certain genres, compared with the speculative video transmission method. Furthermore, to evaluate the feasibility of the proposed method, we investigated the effectiveness of network delay reduction with existing computational capability and the requirements in the future. As a result, we found that the proposed scheme may mitigate network delay by one to two frames, even with existing computational capability under limited conditions.
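The tile-wise comparison can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are small grayscale grids of integers in 0..255, tiles are compared by exact equality of a short BLAKE2b digest (the abstract does not say which hash or similarity threshold is used), and the names `tile_hashes` and `tiles_to_send` are hypothetical.

```python
import hashlib


def tile_hashes(frame, tile):
    """Map (tile_row, tile_col) to a short digest of that tile's pixels."""
    height, width = len(frame), len(frame[0])
    hashes = {}
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            pixels = bytes(
                frame[y][x]
                for y in range(top, min(top + tile, height))
                for x in range(left, min(left + tile, width))
            )
            digest = hashlib.blake2b(pixels, digest_size=8).digest()
            hashes[(top // tile, left // tile)] = digest
    return hashes


def tiles_to_send(reference, frame, tile):
    """Tiles whose digest differs from the reference frame; the rest are redundant."""
    ref = tile_hashes(reference, tile)
    cur = tile_hashes(frame, tile)
    return sorted(key for key in cur if cur[key] != ref.get(key))


reference = [[0] * 4 for _ in range(4)]
speculative = [row[:] for row in reference]
speculative[3][3] = 255  # one pixel differs, in the bottom-right tile
print(tiles_to_send(reference, speculative, tile=2))
```

Exact hash equality only detects identical tiles; a perceptual or quantized hash would be needed to treat nearly identical tiles as redundant, which is presumably where the reported SSIM trade-off comes from.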
Qingqi ZHANG Xiaoan BAO Ren WU Mitsuru NAKATA Qi-Wei GE
Automatic detection of prohibited items is vital in helping security staff be more efficient while improving the public safety index. However, prohibited item detection within X-ray security inspection images is limited by various factors, including the imbalanced distribution of categories, the diversity of prohibited item scales, and the overlap between items. In this paper, we propose to leverage the Poisson blending algorithm with the Canny edge operator to alleviate the imbalanced distribution of categories as far as possible in the X-ray image dataset. Based on this, we improve the cascade network to deal with the other two difficulties. To address the scale diversity of prohibited items, we propose the Re-BiFPN feature fusion method, which includes a coordinate attention atrous spatial pyramid pooling (CA-ASPP) module and a recursive connection. The CA-ASPP module can implicitly extract direction-aware and position-aware information from the feature map. The recursive connection feeds the multi-scale feature map processed by the CA-ASPP module to the bottom-up backbone layer for further multi-scale feature extraction. In addition, a Rep-CIoU loss function is designed to address the overlapping problem in X-ray images. Extensive experimental results demonstrate that our method can successfully identify ten types of prohibited items, such as Knives, Scissors, Pressure, etc., and achieves an mAP of 83.4%, which is 3.8% higher than that of the original cascade network. Moreover, our method outperforms other mainstream methods by a significant margin.
Haijun LIANG Yukun LI Jianguo KONG Qicong HAN Chengyu YU
Air Traffic Control (ATC) communication suffers from issues such as high electromagnetic interference, fast speech rate, and low intelligibility, which pose challenges for downstream tasks like Automatic Speech Recognition (ASR). This article aims to research how to enhance the audio quality and intelligibility of civil aviation speech through speech enhancement methods, thereby improving the accuracy of speech recognition and providing support for the digitalization of civil aviation. We propose a speech enhancement model called DIUnet_V (DenseNet & Inception & U-Net & Volume) that combines both time-frequency and time-domain methods to effectively handle the specific characteristics of civil aviation speech, such as predominant electromagnetic interference and fast speech rate. For model evaluation, we assess the denoising and enhancement effects using three metrics: Signal-to-Noise Ratio (SNR), Mean Opinion Score (MOS), and speech recognition error rate. On a simulated ATC training recording dataset, DIUnet_Volume10 achieved an SNR value of 7.3861, showing a 4.5663 improvement compared to the original U-net model. To address the challenge of the absence of clean speech in the ATC working environment, which makes it difficult to accurately calculate SNR, we propose evaluating the denoising effects indirectly based on the recognition performance of an ATC speech recognition system. On a real ATC speech dataset, the average word error rate decreased by 1.79% absolute and the average sentence error rate decreased by 3% absolute for DIUnet_V processed speech compared to the unprocessed speech in the built speech recognition system.
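For reference, the SNR metric can be computed as in the Python sketch below when a clean reference signal is available. It assumes the usual decibel definition, 10·log10 of signal power over residual noise power; the abstract does not state the exact formula or unit behind the reported values, so treat `snr_db` as an illustrative helper rather than the authors' evaluation code.

```python
import math


def snr_db(clean, processed):
    """SNR in dB of a processed signal measured against its clean reference."""
    signal_power = sum(s * s for s in clean)
    # Everything in the processed signal that deviates from the clean one counts as noise.
    noise_power = sum((s - p) ** 2 for s, p in zip(clean, processed))
    if noise_power == 0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


clean = [1.0, 1.0, 1.0, 1.0]
enhanced = [1.1, 0.9, 1.1, 0.9]
print(round(snr_db(clean, enhanced), 3))
```

This also shows why the metric cannot be applied to real ATC recordings: without the `clean` reference there is nothing to subtract, which is the motivation for the indirect, recognition-based evaluation described above.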
Li HE Xiaowu ZHANG Jianyong DUAN Hao WANG Xin LI Liang ZHAO
Chinese spelling correction (CSC) models detect and correct a text typo based on the misspelled character and its context. Recently, Bert-based models have dominated the research of Chinese spelling correction. However, these methods only focus on the semantic information of the text during the pretraining stage, neglecting the learning of correcting spelling errors. Moreover, when multiple incorrect characters are in the text, the context introduces noisy information, making it difficult for the model to accurately detect the positions of the incorrect characters, leading to false corrections. To address these limitations, we apply the multimodal pre-trained language model ChineseBert to the task of spelling correction. We propose a self-distillation learning-based pretraining strategy, where a confusion set is used to construct text containing erroneous characters, allowing the model to jointly learn how to understand language and correct spelling errors. Additionally, we introduce a single-channel masking mechanism to mitigate the noise caused by the incorrect characters. This mechanism masks the semantic encoding channel while preserving the phonetic and glyph encoding channels, reducing the noise introduced by incorrect characters during the prediction process. Finally, experiments are conducted on widely used benchmarks. Our model outperforms state-of-the-art methods by a remarkable margin.
Tetsuo KOSAKA Kazuya SAEKI Yoshitaka AIZAWA Masaharu KATO Takashi NOSE
Emotional speech recognition is generally considered more difficult than non-emotional speech recognition. The acoustic characteristics of emotional speech differ from those of non-emotional speech. Additionally, acoustic characteristics vary significantly depending on the type and intensity of emotions. Regarding linguistic features, emotional and colloquial expressions are also observed in their utterances. To solve these problems, we aim to improve recognition performance by adapting acoustic and language models to emotional speech. We used Japanese Twitter-based Emotional Speech (JTES) as an emotional speech corpus. This corpus consisted of tweets and had an emotional label assigned to each utterance. Corpus adaptation is possible using the utterances contained in this corpus. However, regarding the language model, the amount of adaptation data is insufficient. To solve this problem, we propose an adaptation of the language model by using online tweet data downloaded from the internet. The sentences used for adaptation were extracted from the tweet data based on certain rules. We extracted the data of 25.86 M words and used them for adaptation. In the recognition experiments, the baseline word error rate was 36.11%, whereas that with the acoustic and language model adaptation was 17.77%. The results demonstrated the effectiveness of the proposed method.
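The word error rate quoted above is conventionally the word-level edit distance between the reference and the recognized text, divided by the number of reference words. A minimal Python sketch follows; it assumes whitespace-separated tokens, which would not hold for unsegmented Japanese text without a tokenizer, and the function name is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))  # one substitution and one deletion over four words
```

On this scale, going from 36.11% to 17.77% is an absolute reduction of 18.34 points, or about 50.8% relative to the baseline.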
Acoustic scene classification (ASC) is a fundamental domain within the realm of artificial intelligence classification tasks. ASC-based tasks commonly employ models based on convolutional neural networks (CNNs) that utilize log-Mel spectrograms as input for gathering acoustic features. In this paper, we designed a CNN-based multi-scale pooling (MSP) strategy for ASC. The log-Mel spectrograms are utilized as the input to CNN, which is partitioned into four frequency axis segments. Furthermore, we devised four CNN channels to acquire inputs from distinct frequency ranges. The high-level features extracted from outputs in various frequency bands are integrated through frequency pyramid average pooling layers at multiple levels. Subsequently, a softmax classifier is employed to classify different scenes. Our study demonstrates that the implementation of our designed model leads to a significant enhancement in the model's performance, as evidenced by the testing of two acoustic datasets.
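The frequency-axis partitioning can be illustrated with a small Python sketch. It assumes the spectrogram is a list of frequency bins, each a list of frame values, and that the number of bins divides evenly by four; the global average over each band is only a stand-in for the pooling layers, whose exact configuration the abstract does not give. Both function names are hypothetical.

```python
def split_frequency_bands(spectrogram, num_bands=4):
    """Split a [frequency_bin][time_frame] grid into contiguous, equal-size bands."""
    n_bins = len(spectrogram)
    if n_bins % num_bands != 0:
        raise ValueError("number of frequency bins must be divisible by num_bands")
    size = n_bins // num_bands
    return [spectrogram[i * size:(i + 1) * size] for i in range(num_bands)]


def average_pool(band):
    """Global average over one band (a simplified stand-in for a pooling layer)."""
    values = [value for row in band for value in row]
    return sum(values) / len(values)


# Toy log-Mel grid: 8 frequency bins x 3 frames, where bin f holds the value f.
log_mel = [[float(f)] * 3 for f in range(8)]
bands = split_frequency_bands(log_mel)
print([average_pool(band) for band in bands])
```

In a real model each band would feed its own CNN channel before pooling; here the pooling is applied directly to the raw values to keep the example short.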
Kenichi FUJITA Atsushi ANDO Yusuke IJIMA
This paper proposes a speech rhythm-based method for speaker embeddings to model phoneme duration using a few utterances by the target speaker. Speech rhythm is one of the essential factors among speaker characteristics, along with acoustic features such as F0, for reproducing individual utterances in speech synthesis. A novel feature of the proposed method is the rhythm-based embeddings extracted from phonemes and their durations, which are known to be related to speaking rhythm. They are extracted with a speaker identification model similar to the conventional spectral feature-based one. We conducted three experiments, speaker embeddings generation, speech synthesis with generated embeddings, and embedding space analysis, to evaluate the performance. The proposed method demonstrated a moderate speaker identification performance (15.2% EER), even with only phonemes and their duration information. The objective and subjective evaluation results demonstrated that the proposed method can synthesize speech with speech rhythm closer to the target speaker than the conventional method. We also visualized the embeddings to evaluate the relationship between the distance of the embeddings and the perceptual similarity. The visualization of the embedding space and the relation analysis between the closeness indicated that the distribution of embeddings reflects the subjective and objective similarity.
Chang SUN Xiaoyu SUN Jiamin LI Pengcheng ZHU Dongming WANG Xiaohu YOU
The application of millimeter wave (mmWave) directional transmission technology in high-speed railway (HSR) scenarios helps to achieve the goal of multiple gigabit data rates with low latency. However, due to the high mobility of trains, the traditional initial access (IA) scheme, with its high time consumption, struggles to guarantee effective beam alignment. In addition, the high path loss at the coverage edge of the millimeter wave remote radio unit (mmW-RRU) also poses great challenges to the stability of IA performance. Fortunately, the train trajectory in HSR scenarios is periodic and regular. Moreover, the cell-free network helps to improve the system coverage performance. Based on these observations, this paper proposes an efficient IA scheme based on location and history information in cell-free networks, where the train can flexibly select a set of mmW-RRUs according to the received signal quality. We specifically analyze the collaborative IA process based on the exhaustive search and based on location and history information, derive expressions for IA success probability and delay, and perform the numerical analysis. The results show that the proposed scheme can significantly reduce the IA delay and effectively improve the stability of IA success probability.
Chaorong ZHANG Yuyang PENG Ming YUE Fawaz AL-HAZEMI
As a potential member of next generation wireless communications, the reconfigurable intelligent surface (RIS) can control the reflected elements to adjust the phase of the transmitted signal with less energy consumption. A novel RIS-assisted index modulation scheme is proposed in this paper, which is named the generalized reflected phase modulation (GRPM). In the GRPM, the transmitted bits are mapped into the reflected phase combination which is conveyed through the reflected elements on the RIS, and detected by the maximum likelihood (ML) detector. The performance analysis of the GRPM with the ML detector is presented, in which the closed form expression of pairwise error probability is derived. The simulation results show the bit error rate (BER) performance of GRPM by comparing with various RIS-assisted index modulation schemes in the conditions of various spectral efficiency and number of antennas.
Yukihiro TOZAWA Takeshi ISHIDA Jiaqing WANG Osamu FUJIWARA
Measurements of contact discharge current waveforms from an ESD generator with a test voltage of 4 kV are conducted with the IEC-specified arrangement of a 2 m long return current cable in three different calibration environments, all of which comply with the IEC calibration standard, to identify the source of damped oscillations (ringing), which has remained unclear since contact discharge testing was first adopted in IEC publication 801-2 in 1989. Their frequency spectra are compared with the spectrum calculated from the ideal contact discharge current waveform without ringing (the IEC-specified waveform) given in IEC 61000-4-2, and with the spectra derived from a simplified equivalent circuit based on the IEC standard in combination with the measured input impedances of the return current cable grounded at one end, with the same arrangement and in the same calibration environment as for the current measurements. The results show that the measured contact discharge waveforms have ringing around the IEC-specified waveform after the falling edge of the peak, which affects their spectra from 20 MHz to 200 MHz. The spectra from 40 MHz to 200 MHz differ significantly depending on the calibration environment even for the same cable arrangement, whereas the environment has almost no effect on the spectra from 20 MHz to 40 MHz and above 200 MHz. In the calibration environment with the cable arranged close to the reference ground, the spectral shapes of the measured contact discharge currents and the frequencies of their multiple peaks and dips roughly correspond to the spectral distributions calculated from the simplified equivalent circuit using the measured cable input impedances. These findings reveal that the ringing is caused mainly by resonances of the return current cable, and that a calibration environment with the cable arranged away from the reference ground tends to mitigate the cable resonances.
Mashiho MUKAIDA Yoshiaki UEDA Noriaki SUETAKE
Recently, many low-light image enhancement methods have been proposed. However, these methods have problems such as the loss of fine details in bright regions and/or unnatural color tones. In this paper, we propose a new low-light image enhancement method to cope with these problems. In the proposed method, a pixel is represented by a convex combination of white, black, and pure color. Then, an equi-hue plane in RGB color space is represented as a triangle whose vertices correspond to white, black, and pure color. The visibility of a low-light image is improved by applying a modified gamma transform to the combination coefficients on an equi-hue plane in RGB color space. The contrast of the image is enhanced by the histogram specification method using a histogram smoothed by a filter whose kernel is determined based on a gamma distribution. In the experiments, the effectiveness of the proposed method is verified by comparison with state-of-the-art low-light image enhancement methods.
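The convex-combination representation can be made concrete with a short Python sketch. For an RGB pixel with channels in [0, 1], taking the white weight as the minimum channel, the black weight as one minus the maximum channel, and the pure-color weight as their difference gives three non-negative weights that sum to one. The `brighten` function below applies a plain gamma curve to the maximum channel while keeping the pure color fixed; that is an assumption made for illustration and not the modified gamma transform of the abstract, whose details are not given here.

```python
def decompose(rgb):
    """Return (w, k, c, pure) such that rgb = w*white + k*black + c*pure."""
    hi, lo = max(rgb), min(rgb)
    w = lo        # weight of white (1, 1, 1)
    k = 1.0 - hi  # weight of black (0, 0, 0)
    c = hi - lo   # weight of the pure color
    if c == 0:
        pure = (1.0, 0.0, 0.0)  # gray pixel: the hue is undefined, any pure color works
    else:
        pure = tuple((v - lo) / c for v in rgb)  # max channel is 1, min channel is 0
    return w, k, c, pure


def reconstruct(w, k, c, pure):
    """Black contributes nothing, so only the white and pure-color terms remain."""
    return tuple(w + c * p for p in pure)


def brighten(rgb, gamma=0.5):
    """Hue-preserving brightening: gamma on the max channel, weights rescaled to match."""
    w, k, c, pure = decompose(rgb)
    hi = w + c
    if hi == 0:
        return tuple(rgb)  # pure black stays black
    new_hi = hi ** gamma
    scale = new_hi / hi
    return reconstruct(w * scale, 1.0 - new_hi, c * scale, pure)


pixel = (0.2, 0.1, 0.05)
print(decompose(pixel))
print(brighten(pixel))
```

Because only the weights change and the pure color stays fixed, the output remains on the same equi-hue triangle as the input, which is the property the representation is meant to guarantee.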
Tu NGUYEN VAN Satoshi YAGITANI Kensuke SHIMIZU Shinjiro NISHI Mitsunori OZAKI Tomohiko IMACHI
A metasurface absorber capable of monitoring two-dimensional (2-d) electric field distributions has been developed, where a matrix of lumped resistors between surface patches formed on a mushroom-type structure works as a 2-d array of short dipole sensors. In this paper absorption and reflection of a spherical wave incident on the metasurface absorber are analyzed by numerical computation by the plane-wave spectrum (PWS) technique using 2-d Fourier analysis. The electromagnetic field of the spherical wave incident on the absorber surface is expanded into a large number of plane waves, for each of which the TE and TM reflection and absorption coefficients are applied. Then by synthesizing all the plane wave fields we obtain the spatial distributions of reflected and absorbed fields. The detailed formulation of the computation is described, and the computed field distributions are compared with those obtained by simulation and actual measurement when the spherical wave from a dipole is illuminated onto a metasurface absorber. It is demonstrated that the PWS technique is effective and efficient in obtaining the accurate field distributions of the spherical wave on and around the absorber. This is useful for evaluating the performance of the metasurface absorber to absorb and measure the spherical wave field distributions around an EM source.