Compared to subword-based neural machine translation (NMT), character-based NMT eschews linguistically motivated segmentation and operates directly on the raw character sequence, following a more fully end-to-end approach. This property is especially attractive for machine translation (MT) between Japanese and Chinese, both of which use consecutive logographic characters without explicit word boundaries. However, one disadvantage remains to be addressed: a character is a less meaning-bearing unit than a subword, which requires character models to be capable of sense discrimination. Specifically, two types of sense ambiguity exist, one in the source language and one in the target language. The former has been partially solved by deep encoders and several existing works. The latter, ambiguity on the target side, is interestingly rarely discussed. To address this problem, we propose two simple yet effective methods: a non-parametric pre-clustering for sense induction and a joint model that performs sense discrimination and NMT training simultaneously. Extensive experiments on Japanese⟷Chinese MT show that our proposed methods consistently outperform strong baselines and verify the effectiveness of using sense-discriminated representations for character-based NMT.
Li HE Xiaowu ZHANG Jianyong DUAN Hao WANG Xin LI Liang ZHAO
Chinese spelling correction (CSC) models detect and correct a text typo based on the misspelled character and its context. Recently, BERT-based models have dominated research on Chinese spelling correction. However, these methods focus only on the semantic information of the text during the pretraining stage and neglect learning to correct spelling errors. Moreover, when the text contains multiple incorrect characters, the context introduces noisy information, making it difficult for the model to accurately detect the positions of the incorrect characters and leading to false corrections. To address these limitations, we apply the multimodal pre-trained language model ChineseBert to the task of spelling correction. We propose a self-distillation learning-based pretraining strategy, in which a confusion set is used to construct text containing erroneous characters, allowing the model to jointly learn how to understand language and correct spelling errors. Additionally, we introduce a single-channel masking mechanism to mitigate the noise caused by the incorrect characters. This mechanism masks the semantic encoding channel while preserving the phonetic and glyph encoding channels, reducing the noise introduced by incorrect characters during prediction. Finally, experiments are conducted on widely used benchmarks. Our model outperforms state-of-the-art methods by a notable margin.
Kyohei MURAKATA Koichi KOBAYASHI Yuh YAMASHITA
The multi-agent surveillance problem is to find optimal trajectories of multiple agents that patrol a given area as evenly as possible. In this paper, we consider the multi-agent surveillance problem based on travel cost minimization. The surveillance area is given by an undirected graph. The penalty for each agent is introduced to evaluate the surveillance performance. Through a mixed logical dynamical system model, the multi-agent surveillance problem is reduced to a mixed integer linear programming (MILP) problem. In model predictive control, trajectories of agents are generated by solving the MILP problem at each discrete time. Furthermore, a condition that the MILP problem is always feasible is derived based on the Chinese postman problem. Finally, the proposed method is demonstrated by a numerical example.
Hojun SHIMOYAMA Soh YOSHIDA Takao FUJITA Mitsuji MUNEYASU
Recent character detectors have been modeled using deep neural networks and have achieved high performance in various tasks, such as text detection in natural scenes and character detection in historical documents. However, existing methods cannot achieve high detection accuracy for wooden slips because of their multi-scale character sizes and aspect ratios, high character density, and close character-to-character distance. In this study, we propose a new U-Net-based character detection and localization framework that learns character regions and boundaries between characters. The proposed method enhances the learning performance of character regions by simultaneously learning the vertical and horizontal boundaries between characters. Furthermore, by adding simple and low-cost post-processing using the learned regions of character boundaries, it is possible to more accurately detect the location of a group of characters in a close neighborhood. In this study, we construct a wooden slip dataset. Experiments demonstrated that the proposed method outperformed existing character detection methods, including state-of-the-art character detection methods for historical documents.
Tianbin WANG Ruiyang HUANG Nan HU Huansha WANG Guanghan CHU
Chinese Named Entity Recognition (NER) is a fundamental technology in the field of Chinese Natural Language Processing. It is extensively adopted in information extraction, intelligent question answering, and knowledge graphs. Nevertheless, due to the diversity and complexity of Chinese, most Chinese NER methods fail to sufficiently capture character-granularity semantics, which affects the performance of Chinese NER. In this work, we propose DSKE-Chinese NER: Chinese Named Entity Recognition based on Dictionary Semantic Knowledge Enhancement. We integrate character-granularity semantic information into the vector space of characters and obtain vector representations containing semantic information through the attention mechanism. In addition, we determine the appropriate number of semantic layers through a comparative experiment. Experiments on public Chinese datasets such as Weibo, Resume and MSRA show that the model outperforms character-based LSTM baselines.
Hao WANG Sirui LIU Jianyong DUAN Li HE Xin LI
Sememes are the smallest semantic units of human languages, the composition of which can represent the meaning of words. Sememes have been successfully applied to many downstream applications in the natural language processing (NLP) field. Annotating a word's sememes depends on language experts, which is both time-consuming and labor-intensive, limiting the large-scale application of sememes. Researchers have proposed some sememe prediction methods to automatically predict sememes for words. However, existing sememe prediction methods focus on information about the word itself, ignoring expert-annotated knowledge bases, which indicate the relations between words and should be valuable for sememe prediction. Therefore, we aim to incorporate expert-annotated knowledge bases into the sememe prediction process. To achieve that, we propose a CilinE-guided sememe prediction model, which employs an existing word knowledge base, CilinE, to remodel sememe prediction from a relational perspective. Experiments on HowNet, a widely used Chinese sememe knowledge base, show that CilinE has an obvious positive effect on sememe prediction. Furthermore, our proposed method can be integrated into existing methods and significantly improves their prediction performance. We will release the data and code to the public.
Qing-dao-er-ji REN Yuan LI Shi BAO Yong-chao LIU Xiu-hong CHEN
As the mainstream approach in the field of machine translation, neural machine translation (NMT) has achieved great improvements on many high-resource languages, but the performance of NMT for low-resource languages is not yet very good. This paper uses data augmentation technology to construct a Mongolian-Chinese pseudo-parallel corpus, so as to improve the translation ability of a Mongolian-Chinese translation model. Experiments show that the above methods can improve the translation ability of the translation model. Finally, a translation model trained with a large-scale pseudo-parallel corpus and integrated with soft contextual data augmentation is obtained, and its BLEU value is 39.3.
Jing ZHU Song HUANG Yaqing SHI Kaishun WU Yanqiu WANG
At present, there is no way to automatically obtain function points when using the function point analysis (FPA) method, especially for requirement documents written in Chinese. Given the characteristics of Chinese grammar with respect to word segmentation, Chinese words must be segmented accurately so that the subsequent entity recognition and disambiguation can be carried out over a smaller range, which lays a solid foundation for the efficient automatic extraction of function points. Therefore, this paper proposes a K-Means clustering method based on TF-IDF and conducts experiments with 24 software requirement documents written in Chinese. The results show that the best clustering effect is achieved when 55% to 75% of the extracted information is retained and the number of clusters takes the middle value of the total number of clusters. The method and conclusions of this paper apply not only to Chinese: they also provide an important reference for the automatic extraction of function points from software requirements documents written in other Oriental languages, and fill a gap in data preprocessing at the early stage of automatic function point calculation.
In this study, we present a construction method for self-orthogonal and self-dual quasi-cyclic codes that relies on the factorization of the modulus polynomials for cyclicity. Smaller generator polynomial matrices are used instead of the generator matrices of the codes viewed as linear codes. An algorithm based on the Chinese remainder theorem finds the generator polynomial matrix on the original modulus from the ones constructed on each factor. This method enables us to efficiently construct and search for these codes when the modulus polynomials factor into reciprocal polynomials.
Xiumin SHEN Xiaofei SONG Yanguo JIA Yubo LI
Binary sequence pairs with optimal periodic correlation have important applications in many fields of communication systems. In this letter, four new families of binary sequence pairs are presented based on the generalized cyclotomy over Z_{5q}, where q ≠ 5 is an odd prime. All these binary sequence pairs have optimal three-level correlation values {-1, 3}.
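The periodic correlation that the abstract above refers to can be illustrated with a minimal sketch. The function names are hypothetical, and the test sequence is a standard period-7 m-sequence chosen only because its autocorrelation is easy to check by hand; it is not one of the families constructed in the letter.

```python
def periodic_correlation(a, b):
    """Periodic cross-correlation of two binary (0/1) sequences of equal period.

    Entry tau is the sum over i of (-1)^(a[i] + b[(i + tau) mod n]).
    """
    n = len(a)
    if len(b) != n:
        raise ValueError("sequences must have the same period")
    # Each position contributes +1 when the bits agree and -1 when they differ.
    return [
        sum(1 if a[i] == b[(i + tau) % n] else -1 for i in range(n))
        for tau in range(n)
    ]


def out_of_phase_levels(a, b):
    """Distinct correlation values over the nonzero shifts tau = 1..n-1."""
    return sorted(set(periodic_correlation(a, b)[1:]))


# Illustrative input only: a period-7 m-sequence paired with itself.
m_seq = [1, 1, 1, 0, 1, 0, 0]
print(periodic_correlation(m_seq, m_seq))
print(out_of_phase_levels(m_seq, m_seq))
```

Passing two different sequences as `a` and `b` gives the cross-correlation of a sequence pair; counting the distinct values returned is one way to check how many correlation levels a candidate pair has.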
Shinichi KAWAMURA Yuichi KOMANO Hideo SHIMIZU Saki OSUKA Daisuke FUJIMOTO Yuichi HAYASHI Kentaro IMAFUKU
The residue number system (RNS) is a method for representing an integer x as an n-tuple of its residues with respect to a given set of moduli. In RNS, addition, subtraction, and multiplication can be carried out by independent operations with respect to each modulus. Therefore, an n-fold speedup can be achieved by parallel processing. The main disadvantage of RNS is that we cannot efficiently compare the magnitude of two integers or determine the sign of an integer. Two general methods of comparison are to transform a number in RNS to a mixed-radix system or to a radix representation using the Chinese remainder theorem (CRT). We use the CRT to derive an equation approximating the value of x relative to M, the product of the moduli. Then, we propose two algorithms that efficiently evaluate the equation and output a sign bit. The expected number of steps of these algorithms is of order n. The algorithms use a lookup table that is (n+3) times as large as M, which is reasonably small for most applications including cryptography.
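The RNS representation and the CRT-based conversion described above can be sketched as follows. This is only an illustration under stated assumptions: the moduli (3, 5, 7), the helper names, and the convention that residues in the upper half of [0, M) encode negative integers are all choices made here, and the sign is found by full CRT reconstruction rather than by the approximation algorithms the abstract proposes.

```python
from math import prod

MODULI = (3, 5, 7)   # assumed pairwise coprime moduli, illustration only
M = prod(MODULI)     # dynamic range: M = 105


def to_rns(x):
    """Represent integer x as the tuple of its residues."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Addition is done independently for each modulus."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Multiplication is also done independently for each modulus."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, MODULI))


def from_rns(r):
    """Reconstruct x in [0, M) with the Chinese remainder theorem."""
    x = 0
    for ri, m in zip(r, MODULI):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse of Mi modulo m (Python 3.8+).
        x += ri * Mi * pow(Mi, -1, m)
    return x % M


def signed_value(r):
    """Assumed convention: values with 2x >= M represent negative integers."""
    x = from_rns(r)
    return x - M if 2 * x >= M else x


product = rns_mul(to_rns(4), to_rns(-6))
print(product, signed_value(product))
```

The per-modulus loops in `rns_add` and `rns_mul` are the part that can run in parallel; `from_rns` is the costly step that a dedicated sign-detection algorithm tries to avoid.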
Faster R-CNN uses a region proposal network, which consists of a single-scale convolution filter and fully connected networks, to localize detected regions. However, a single-scale filter is not enough to detect the full regions of characters. In this letter, we propose a simple but effective approach, namely utilizing convolution filters of various sizes, to accurately detect Chinese characters of multiple scales in documents. We experimentally verified that our method improved IoU by 4% and detection rate by 3% compared with the previous single-scale Faster R-CNN method.
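For readers unfamiliar with the IoU (intersection over union) metric reported above, a minimal sketch follows. The box format (x1, y1, x2, y2) and the function name are assumptions made for this illustration; the abstract does not specify how boxes are encoded.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

In the example call, the boxes overlap in a unit square and cover a combined area of 7, so the score is 1/7.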
Named Entity Recognition (NER) systems are often realized by supervised methods such as CRF and neural network methods, which require large amounts of annotated data. In domains where only a small amount of annotated training data is available, multi-domain or multi-task learning methods are often used. In this paper, we explore methods that use the news domain and the Chinese Word Segmentation (CWS) task to improve the performance of Chinese named entity recognition in the weibo domain. We first propose two baseline models that combine multi-domain and multi-task information. The two baseline models share information between different domains and tasks simply by sharing parameters. Then, we propose a Double ADVersarial model (DoubADV model). The model uses two adversarial networks that consider the shared and private features in different domains and tasks. Experimental results show that our DoubADV model outperforms the other baseline models and achieves state-of-the-art performance compared with previous works in the multi-domain and multi-task setting.
Daiki SEKIZAWA Shinnosuke TAKAMICHI Hiroshi SARUWATARI
This article proposes a prosody correction method based on partial model adaptation for Chinese-accented Japanese hidden Markov model (HMM)-based text-to-speech synthesis. Although text-to-speech synthesis built from non-native speech accurately reproduces the speaker's individuality in synthetic speech, the naturalness of the synthetic speech is strongly degraded. In the proposed model, to improve the naturalness while preserving the speaker individuality of Chinese-accented Japanese text-to-speech synthesis, we partially utilize HMM parameters of native Japanese speech to synthesize prosody-corrected synthetic speech. Results of an experimental evaluation demonstrate that duration and F0 correction are significantly effective for improving naturalness.
Huu-Anh TRAN Heyan HUANG Phuoc TRAN Shumin SHI Huu NGUYEN
Word order is one of the most significant differences between Chinese and Vietnamese. In phrase-based statistical machine translation, the reordering model learns reordering rules from bilingual corpora. If the bilingual corpora are large and of good quality, the reordering rules are accurate and have good coverage. However, Chinese-Vietnamese is a low-resource language pair, so the extraction of reordering rules is limited. As a result, the quality of reordering in Chinese-Vietnamese machine translation is not high. In this paper, we combine Chinese dependency relations and Chinese-Vietnamese word alignment results to pre-order Chinese words so that they match Vietnamese word order. The experimental results show that our methodology improves machine translation performance compared to a translation system using only the reordering models of phrase-based statistical machine translation.
Xiaofei SONG Yanguo JIA Xiumin SHEN Yubo LI Xiuping PENG
In this letter, two new families of quaternary sequences with low four-level or five-level autocorrelation are constructed based on generalized cyclotomy over Z_{2p}. These quaternary sequences are balanced and the maximal absolute value of the out-of-phase autocorrelation is 4.
Integer codes are defined as error-correcting codes over the integers modulo a fixed positive integer. In this paper, we show that the construction of integer codes can be reduced to the case of prime-power moduli. We can efficiently search for integer codes with small prime-power moduli and can construct target integer codes with a large composite-number modulus. Moreover, we also show that this prime-factorization reduction is useful for the construction of self-orthogonal and self-dual integer codes, i.e., these properties in the prime-power moduli are preserved in the composite-number modulus. Numerical examples of integer codes and generator matrices demonstrate these facts and processes.
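The reduction to prime-power moduli can be illustrated on a single codeword. The sketch below is a toy example under assumptions made here, not the paper's construction: the moduli 4 and 9, the two words, and the helper names are all hypothetical. It lifts one word modulo 4 and one modulo 9 to a word modulo 36 coordinate by coordinate with the Chinese remainder theorem, then checks that self-orthogonality (inner product with itself congruent to 0) carries over.

```python
from math import gcd


def crt_combine(r1, m1, r2, m2):
    """Return the unique x in [0, m1*m2) with x = r1 (mod m1) and x = r2 (mod m2)."""
    if gcd(m1, m2) != 1:
        raise ValueError("moduli must be coprime")
    # Write x = r1 + m1 * t and solve for t modulo m2 (Python 3.8+ for pow).
    t = ((r2 - r1) * pow(m1, -1, m2)) % m2
    return (r1 + m1 * t) % (m1 * m2)


def lift_word(word1, m1, word2, m2):
    """Combine two equal-length words coordinate-wise into a word modulo m1*m2."""
    if len(word1) != len(word2):
        raise ValueError("words must have the same length")
    return [crt_combine(a, m1, b, m2) for a, b in zip(word1, word2)]


def self_inner_product(word, m):
    """Inner product of a word with itself, reduced modulo m."""
    return sum(c * c for c in word) % m


w4 = [1, 1, 1, 1]   # toy word modulo 4: 1+1+1+1 = 4, which is 0 mod 4
w9 = [0, 1, 2, 2]   # toy word modulo 9: 0+1+4+4 = 9, which is 0 mod 9
w36 = lift_word(w4, 4, w9, 9)
print(w36, self_inner_product(w36, 36))
```

Because the lifted word reduces to `w4` modulo 4 and to `w9` modulo 9, its self inner product vanishes modulo both factors and therefore modulo 36, which is the preservation property the abstract states for whole codes.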
Zhibao LIN Zhengqian LI Pinhui KE
Zero-difference balanced (ZDB) functions, which have many applications in coding theory and sequence design, have received a lot of attention in recent years. In this letter, based on two known classes of ZDB functions, a new class of ZDB functions defined on the group (Z_{2^e-1} × Z_n, +) is presented, where e is a prime and n = p_1^{m_1} p_2^{m_2} … p_k^{m_k}, with each p_i an odd prime satisfying e | (p_i - 1) for 1 ≤ i ≤ k. In the case of gcd(2^e-1, n) = 1, the newly constructed ZDB functions are cyclic.
Xiumin SHEN Yanguo JIA Xiaofei SONG Yubo LI
In this paper, a new generalized cyclotomy over Z_{pq} is presented based on cyclotomy and the Chinese remainder theorem, where p and q are different odd primes. Several new construction methods for binary sequence pairs of period pq with ideal two-level correlation are given by utilizing these generalized cyclotomic classes. All the binary sequence pairs from our constructions have both the ideal out-of-phase correlation value -1 and the optimum balance property.
Jing ZHANG Degen HUANG Kaiyu HUANG Zhuang LIU Fuji REN
Microblog data contains rich information about real-world events with great commercial value, so microblog-oriented natural language processing (NLP) tasks have attracted considerable attention from researchers. However, the performance of microblog-oriented Chinese Word Segmentation (CWS) based on deep neural networks (DNNs) is still unsatisfactory. One critical reason is that the existing microblog-oriented training corpus is inadequate for training effective weight matrices for DNNs. In this paper, we propose a novel active learning method to extend the scale of the training corpus for DNNs. However, due to the large number of partially overlapping sentences in microblogs, it is difficult to select samples with high annotation value from raw microblogs during the active learning procedure. To select samples with higher annotation value, a parameter λ is introduced to control the number of repeatedly selected samples. Meanwhile, various strategies are adopted to measure the overall annotation value of a sample during the active learning procedure. Experiments on the benchmark datasets of NLPCC 2015 show that our λ-active learning method outperforms the baseline system and the state-of-the-art method. In addition, the results demonstrate that the performance of the DNNs trained on the extended corpus is significantly improved.