1-12hit |
Jianyong DUAN Yuwei WU Mingli WU Hao WANG
The similarity of words extracted from the rich text relation network is the main way to calculate the semantic similarity. Complex relational information and text content in Wikipedia website, Community Question Answering and social network, provide abundant corpus for semantic similarity calculation. However, most typical research only focused on single relationship. In this paper, we propose a semantic similarity calculation model which integrates multiple relational information, and map multiple relationship to the same semantic space through learning representing matrix and semantic matrix to improve the accuracy of semantic similarity calculation. In experiments, we confirm that the semantic calculation method which integrates many kinds of relationships can improve the accuracy of semantic calculation, compared with other semantic calculation methods.
Hao WANG Sirui LIU Jianyong DUAN Li HE Xin LI
Sememes are the smallest semantic units of human languages, the composition of which can represent the meaning of words. Sememes have been successfully applied to many downstream applications in natural language processing (NLP) field. Annotation of a word's sememes depends on language experts, which is both time-consuming and labor-consuming, limiting the large-scale application of sememe. Researchers have proposed some sememe prediction methods to automatically predict sememes for words. However, existing sememe prediction methods focus on information of the word itself, ignoring the expert-annotated knowledge bases which indicate the relations between words and should value in sememe predication. Therefore, we aim at incorporating the expert-annotated knowledge bases into sememe prediction process. To achieve that, we propose a CilinE-guided sememe prediction model which employs an existing word knowledge base CilinE to remodel the sememe prediction from relational perspective. Experiments on HowNet, a widely used Chinese sememe knowledge base, have shown that CilinE has an obvious positive effect on sememe prediction. Furthermore, our proposed method can be integrated into existing methods and significantly improves the prediction performance. We will release the data and code to the public.
Jianyong DUAN Tianxiao JI Hao WANG
Automatic error correction of users' search terms for search engines is an important aspect of improving search engine retrieval efficiency, accuracy and user experience. In the era of big data, we can analyze and mine massive search engine logs to release the hidden mind with big data ideas. It can obtain better results through statistical modeling of query errors in search engine log data. But when we cannot find the error query in the log, we can't make good use of the information in the log to correct the query result. These undiscovered error queries are called Bad Case. This paper combines the error correction algorithm model and search engine query log mining analysis. First, we explored Bad Cases in the query error correction process through the search engine query logs. Then we quantified the characteristics of these Bad Cases and built a model to allow search engines to automatically mine Bad Cases with these features. Finally, we applied Bad Cases to the N-gram error correction algorithm model to check the impact of Bad Case mining on error correction. The experimental results show that the error correction based on Bad Case mining makes the precision rate and recall rate of the automatic error correction improved obviously. Users experience is improved and the interaction becomes more friendly.
Yingxun FU Shilin WEN Li MA Jianyong DUAN
With the rapid growth on data scale and complexity, single disk failure recovery becomes very important for erasure coded storage systems. In this paper, we propose a new strip-switched deployment method, which utilizes the feature that strips of each stripe of erasure codes could be switched, and uses simulated annealing algorithm to search for the proper strip-deployment on the stack level to balance the read accesses, in order to improve the recovery performance. The analysis and experiments results show that SSDM could effectively improve the single failure recovery performance.
Jianyong DUAN Liangcai LI Mei ZHANG Hao WANG
Personalized news recommendation is becoming increasingly important for online news platforms to help users alleviate information overload and improve news reading experience. A key problem in news recommendation is learning accurate user representations to capture their interest. However, most existing news recommendation methods usually learn user representation only from their interacted historical news, while ignoring the clustering features among users. Here we proposed a hierarchical user preference hash network to enhance the representation of users' interest. In the hash part, a series of buckets are generated based on users' historical interactions. Users with similar preferences are assigned into the same buckets automatically. We also learn representations of users from their browsed news in history part. And then, a Route Attention is adopted to combine these two parts (history vector and hash vector) and get the more informative user preference vector. As for news representation, a modified transformer with category embedding is exploited to build news semantic representation. By comparing the hierarchical hash network with multiple news recommendation methods and conducting various experiments on the Microsoft News Dataset (MIND) validate the effectiveness of our approach on news recommendation.
Hao WANG GaoJun LIU Jianyong DUAN Lei ZHANG
Existing studies on transportation mode detection from global positioning system (GPS) trajectories mainly adopt handcrafted features. These features require researchers with a professional background and do not always work well because of the complexity of traffic behavior. To address these issues, we propose a model using a sparse autoencoder to extract point-level deep features from point-level handcrafted features. A convolution neural network then aggregates the point-level deep features and generates a trajectory-level deep feature. A deep neural network incorporates the trajectory-level handcrafted features and the trajectory-level deep feature for detecting the users' transportation modes. Experiments conducted on Microsoft's GeoLife data show that our model can automatically extract the effective features and improve the accuracy of transportation mode detection. Compared with the model using only handcrafted features and shallow classifiers, the proposed model increases the maximum accuracy by 6%.
Yingxun FU Junyi GUO Li MA Jianyong DUAN
As the demand of data reliability becomes more and more larger, most of today's storage systems adopt erasure codes to assure the data could be reconstructed when suffering from physical device failures. In order to fast recover the lost data from a single failure, recovery optimization methods have attracted a lot of attention in recent years. However, most of the existing optimization methods focus on homogeneous devices, ignoring the fact that the storage devices are usually heterogeneous. In this paper, we propose a new recovery optimization method named HSR (Heterogeneous Storage Recovery) method, which uses both loads and speed rate among physical devices as the optimization target, in order to further improve the recovery performance for heterogeneous devices. The experiment results show that, compared to existing popular recovery optimization methods, HSR method gains much higher recovery speed over heterogeneous storage devices.
Li HE Jingxuan ZHAO Jianyong DUAN Hao WANG Xin LI
In Natural Language Understanding, intent detection and slot filling have been widely used to understand user queries. However, current methods tend to rely on single words and sentences to understand complex semantic concepts, and can only consider local information within the sentence. Therefore, they usually cannot capture long-distance dependencies well and are prone to problems where complex intentions in sentences are difficult to recognize. In order to solve the problem of long-distance dependency of the model, this paper uses ConceptNet as an external knowledge source and introduces its extensive semantic information into the multi-intent detection and slot filling model. Specifically, for a certain sentence, based on confidence scores and semantic relationships, the most relevant conceptual knowledge is selected to equip the sentence, and a concept context map with rich information is constructed. Then, the multi-head graph attention mechanism is used to strengthen context correlation and improve the semantic understanding ability of the model. The experimental results indicate that the model has significantly improved performance compared to other models on the MixATIS and MixSNIPS multi-intent datasets.
Li HE Xiaowu ZHANG Jianyong DUAN Hao WANG Xin LI Liang ZHAO
Chinese spelling correction (CSC) models detect and correct a text typo based on the misspelled character and its context. Recently, Bert-based models have dominated the research of Chinese spelling correction. However, these methods only focus on the semantic information of the text during the pretraining stage, neglecting the learning of correcting spelling errors. Moreover, when multiple incorrect characters are in the text, the context introduces noisy information, making it difficult for the model to accurately detect the positions of the incorrect characters, leading to false corrections. To address these limitations, we apply the multimodal pre-trained language model ChineseBert to the task of spelling correction. We propose a self-distillation learning-based pretraining strategy, where a confusion set is used to construct text containing erroneous characters, allowing the model to jointly learns how to understand language and correct spelling errors. Additionally, we introduce a single-channel masking mechanism to mitigate the noise caused by the incorrect characters. This mechanism masks the semantic encoding channel while preserving the phonetic and glyph encoding channels, reducing the noise introduced by incorrect characters during the prediction process. Finally, experiments are conducted on widely used benchmarks. Our model achieves superior performance against state-of-the-art methods by a remarkable gain.
Hao WANG Yao MA Jianyong DUAN Li HE Xin LI
Chinese Spelling Correction (CSC) is an important natural language processing task. Existing methods for CSC mostly utilize BERT models, which select a character from a candidate list to correct errors in the sentence. World knowledge refers to structured information and relationships spanning a wide range of domains and subjects, while definition knowledge pertains to textual explanations or descriptions of specific words or concepts. Both forms of knowledge have the potential to enhance a model’s ability to comprehend contextual nuances. As BERT lacks sufficient guidance from world knowledge for error correction and existing models overlook the rich definition knowledge in Chinese dictionaries, the performance of spelling correction models is somewhat compromised. To address these issues, within the world knowledge network, this study injects world knowledge from knowledge graphs into the model to assist in correcting spelling errors caused by a lack of world knowledge. Additionally, the definition knowledge network in this model improves the error correction capability by utilizing the definitions from the Chinese dictionary through a comparative learning approach. Experimental results on the SIGHAN benchmark dataset validate the effectiveness of our approach.
Jiakai LI Jianyong DUAN Hao WANG Li HE Qing ZHANG
Chinese spelling correction is a foundational task in natural language processing that aims to detect and correct spelling errors in text. Most spelling corrections in Chinese used multimodal information to model the relationship between incorrect and correct characters. However, feature information mismatch occured during fusion result from the different sources of features, causing the importance relationships between different modalities to be ignored, which in turn restricted the model from learning in an efficient manner. To this end, this paper proposes a multimodal language model-based Chinese spelling corrector, named as MISpeller. The method, based on ChineseBERT as the basic model, allows the comprehensive capture and fusion of character semantic information, phonetic information and graphic information in a single model without the need to construct additional neural networks, and realises the phenomenon of unequal fusion of multi-feature information. In addition, in order to solve the overcorrection issues, the replication mechanism is further introduced, and the replication factor is used as the dynamic weight to efficiently fuse the multimodal information. The model is able to control the proportion of original characters and predicted characters according to different input texts, and it can learn more specifically where errors occur. Experiments conducted on the SIGHAN benchmark show that the proposed model achieves the state-of-the-art performance of the F1 score at the correction level by an average of 4.36%, which validates the effectiveness of the model.
Jianyong DUAN Zheng TAN Mei ZHANG Hao WANG
With the widespread popularity of a large number of social platforms, an increasing number of new words gradually appear. However, such new words have made some NLP tasks like word segmentation more challenging. Therefore, new word detection is always an important and tough task in NLP. This paper aims to extract new words using the BiLSTM+CRF model which added some features selected by us. These features include word length, part of speech (POS), contextual entropy and degree of word coagulation. Comparing to the traditional new word detection methods, our method can use both the features extracted by the model and the features we select to find new words. Experimental results demonstrate that our model can perform better compared to the benchmark models.