Kensuke SUMOTO Kenta KANAKOGI Hironori WASHIZAKI Naohiko TSUDA Nobukazu YOSHIOKA Yoshiaki FUKAZAWA Hideyuki KANUKA
Security-related issues have become more significant due to the proliferation of IT. Collating security-related information in a database improves security. For example, Common Vulnerabilities and Exposures (CVE) is a security knowledge repository containing descriptions of vulnerabilities in software and source code. Although the descriptions include various entities, the entity structure is not uniform, which makes security analysis based on individual entities difficult. Developing a consistent entity structure will benefit the security field. Herein we propose a method to automatically label selected entities in CVE descriptions by applying the Named Entity Recognition (NER) technique. We manually labeled 3287 CVE descriptions and conducted experiments using a machine learning model called BERT to compare the proposed method to labeling with regular expressions. Machine learning with the proposed method significantly improves the labeling accuracy, achieving an F1 score of about 0.93, precision of about 0.91, and recall of about 0.95, demonstrating that our method has the potential to automatically label selected entities in CVE descriptions.
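As a rough illustration of the kind of pipeline this entails, the sketch below tags a CVE-style description with a pre-trained BERT token-classification model; the checkpoint name and its generic label set are stand-ins, not the authors' fine-tuned model or CVE-specific entity types.

```python
# Hedged sketch: tagging a CVE-style description with a pre-trained BERT
# token-classification model via Hugging Face Transformers. The checkpoint
# and its generic labels (PER/ORG/LOC/MISC) are stand-ins for illustration.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",       # stand-in; the paper fine-tunes its own model
    aggregation_strategy="simple",     # merge sub-word pieces into whole entities
)

description = ("Buffer overflow in the web interface of ExampleServer 2.1 "
               "allows remote attackers to execute arbitrary code.")
for entity in ner(description):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```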
Tianbin WANG Ruiyang HUANG Nan HU Huansha WANG Guanghan CHU
Chinese Named Entity Recognition is a fundamental technology in the field of Chinese natural language processing. It is widely used in information extraction, intelligent question answering, and knowledge graphs. Nevertheless, due to the diversity and complexity of Chinese, most Chinese NER methods fail to sufficiently capture character-level semantics, which affects the performance of Chinese NER. In this work, we propose DSKE-Chinese NER: Chinese Named Entity Recognition based on Dictionary Semantic Knowledge Enhancement. We integrate character-level semantic information into the vector space of characters and obtain vector representations containing semantic information through an attention mechanism. In addition, we verify the appropriate number of semantic layers through comparative experiments. Experiments on public Chinese datasets such as Weibo, Resume, and MSRA show that the model outperforms character-based LSTM baselines.
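The core step described in the abstract is attending over dictionary sense embeddings to enrich a character vector. A minimal sketch of that fusion, with toy dimensions and a layer structure that is an assumption rather than the paper's exact architecture, might look like this:

```python
# Rough sketch: fusing dictionary sense embeddings into a character vector
# with attention (toy sizes; not the authors' exact DSKE layer design).
import torch
import torch.nn.functional as F

d = 64
char_vec = torch.randn(1, d)       # representation of one Chinese character
sense_vecs = torch.randn(5, d)     # embeddings of its dictionary senses

# Attention weights: how relevant each dictionary sense is to this character.
weights = F.softmax(char_vec @ sense_vecs.T / d ** 0.5, dim=-1)   # (1, 5)
semantic_vec = weights @ sense_vecs                                # (1, d)

# Enhanced character representation carrying dictionary semantic knowledge.
enhanced = torch.cat([char_vec, semantic_vec], dim=-1)             # (1, 2d)
```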
Peng WANG Xiaohang CHEN Ziyu SHANG Wenjun KE
Multimodal named entity recognition (MNER) is the task of recognizing named entities in a multimodal context. Existing methods focus on utilizing co-attention mechanisms to discover the relationships between multiple modalities. However, they still have two deficiencies. First, current methods fail to fuse the multimodal representations in a fine-grained way, which may introduce noise from the visual modality. Second, current methods fail to bridge the semantic gap between heterogeneous modalities. To solve these issues, we propose a novel MNER method with bottleneck fusion and contrastive learning (BFCL). Specifically, we first incorporate a transformer-based bottleneck fusion mechanism, so that information between different modalities can only be exchanged through several bottleneck tokens, thus reducing noise propagation. Then we propose two decoupled image-text contrastive losses to align the unimodal representations, pulling the representations of semantically similar modalities closer while pushing the representations of semantically different modalities farther apart. Experimental results demonstrate that our method is competitive with state-of-the-art models, achieving 74.54% and 85.70% F1-scores on the Twitter-2015 and Twitter-2017 datasets, respectively.
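To make the bottleneck idea concrete, the following sketch shows one fusion layer in which each modality attends only to itself plus a small set of shared bottleneck tokens; dimensions, layer counts, and the averaging step are assumptions, not the exact BFCL architecture.

```python
# Rough sketch of bottleneck-token fusion (PyTorch, single layer, toy sizes;
# an illustration of the general mechanism, not the authors' exact model).
import torch
import torch.nn as nn

d, n_bottleneck = 256, 4
text = torch.randn(1, 32, d)       # text token features
image = torch.randn(1, 49, d)      # visual patch features
bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, d))

layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)

# Each modality sees only itself plus the shared bottleneck tokens, so
# cross-modal information flows exclusively through the bottleneck.
text_out = layer(torch.cat([text, bottleneck], dim=1))
image_out = layer(torch.cat([image, bottleneck], dim=1))

# Merge the two bottleneck views before the next fusion layer.
new_bottleneck = (text_out[:, -n_bottleneck:] + image_out[:, -n_bottleneck:]) / 2
```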
Named Entity Recognition (NER) systems are often realized by supervised methods such as CRFs and neural networks, which require large amounts of annotated data. In domains where only small annotated training sets are available, multi-domain or multi-task learning methods are often used. In this paper, we explore methods that use the news domain and the Chinese Word Segmentation (CWS) task to improve the performance of Chinese named entity recognition in the Weibo domain. We first propose two baseline models combining multi-domain and multi-task information; both share information between different domains and tasks simply through parameter sharing. Then, we propose a Double ADVersarial model (DoubADV model). The model uses two adversarial networks to separate the shared and private features across domains and tasks. Experimental results show that our DoubADV model outperforms the baseline models and achieves state-of-the-art performance compared with previous works in the multi-domain and multi-task setting.
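Adversarial feature separation of this kind is commonly trained with a gradient-reversal layer; the snippet below is a generic sketch of that mechanism under that assumption, not necessarily the exact training scheme used for DoubADV.

```python
# Sketch of a gradient-reversal layer, one common way to train adversarial
# discriminators over shared features (an assumption for illustration).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients so the shared encoder learns
        # domain/task-invariant features that fool the discriminator.
        return -ctx.lambd * grad_output, None

features = torch.randn(8, 128, requires_grad=True)   # shared encoder output
reversed_features = GradReverse.apply(features, 0.5)  # fed to the discriminator
```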
Van-Hai VU Quang-Phuoc NGUYEN Kiem-Hieu NGUYEN Joon-Choul SHIN Cheol-Young OCK
Since deep learning was introduced, a series of achievements has been published in the field of automatic machine translation (MT). However, Korean-Vietnamese MT systems face many challenges because of a lack of data, multiple meanings of individual words, and grammatical diversity that depends on context. Therefore, the quality of Korean-Vietnamese MT systems is still sub-optimal. This paper discusses a method for applying Named Entity Recognition (NER) and Part-of-Speech (POS) tagging to Vietnamese sentences to improve the performance of Korean-Vietnamese MT systems. In terms of implementation, we used a tool to tag NER and POS in Vietnamese sentences. In addition, we had access to a Korean-Vietnamese parallel corpus with more than 450K paired sentences from our previous work. The experimental results indicate that tagging NER and POS in Vietnamese sentences can improve the quality of Korean-Vietnamese Neural MT (NMT) in terms of the Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores. On average, our MT system improved by 1.21 BLEU points, or 2.33 TER points, after applying both NER and POS tagging to the Vietnamese corpus. Due to the structural features of the languages, MT systems in the Korean-to-Vietnamese direction always give better BLEU and TER results than those in the reverse direction.
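For illustration only, the snippet below shows one plausible way to enrich a Vietnamese sentence with POS and NER tags before it enters the NMT system; the token|POS|NER format and the example tags are assumptions, not the paper's exact annotation scheme or tool output.

```python
# Illustrative sketch of attaching POS and NER tags to a Vietnamese sentence
# before NMT training. The format and tag values are assumptions.
tokens = ["Hà_Nội", "là", "thủ_đô", "của", "Việt_Nam"]
pos = ["Np", "V", "N", "E", "Np"]
ner = ["B-LOC", "O", "O", "O", "B-LOC"]

tagged = " ".join(f"{t}|{p}|{n}" for t, p, n in zip(tokens, pos, ner))
print(tagged)   # Hà_Nội|Np|B-LOC là|V|O thủ_đô|N|O của|E|O Việt_Nam|Np|B-LOC
```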
Ping ZENG Qingping TAN Haoyu ZHANG Xiankai MENG Zhuo ZHANG Jianjun XU Yan LEI
Deep neural named entity recognition models automatically learn and extract entity features, avoiding the traditional models' heavy reliance on complex feature engineering and specialized professional knowledge; this has become a hot topic in recent years. Existing deep neural models only involve simple character feature learning and extraction methods, which limits their capability. To further explore the performance of deep neural models, we propose two character feature learning models based on a convolutional neural network (CNN) and a long short-term memory (LSTM) network. These two models consider the local semantic and position features of word characters. Experiments conducted on the CoNLL-2003 dataset show that the proposed models outperform traditional ones and demonstrate excellent performance.
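A minimal sketch of the CNN-based variant of character feature learning is shown below; the dimensions, kernel width, and pooling choice are assumptions for illustration rather than the paper's exact configuration.

```python
# Minimal sketch of a character-level CNN feature extractor of the kind the
# abstract describes (toy sizes; hyperparameters are assumptions).
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, n_chars=100, char_dim=30, n_filters=50, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(n_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, n_filters, kernel, padding=1)

    def forward(self, char_ids):                    # (batch, word_len)
        x = self.embed(char_ids).transpose(1, 2)    # (batch, char_dim, word_len)
        x = torch.relu(self.conv(x))                # local character n-gram features
        return x.max(dim=2).values                  # max-pool into one word vector

word_chars = torch.randint(0, 100, (4, 12))   # 4 words, 12 characters each
print(CharCNN()(word_chars).shape)            # torch.Size([4, 50])
```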
Pieces of personal information, such as personal names and relationships, are crucial in text mining applications. Obituaries are good sources for this kind of information. This study proposes an effective method for extracting various facts about people from obituary Web pages. Experiments show that the proposed method achieves high performance in terms of recall and precision.
Yanling LI Qingwei ZHAO Yonghong YAN
Semantic concepts in an utterance are obtained by a fuzzy matching method to address problems in spoken language understanding (SLU), such as word variations induced by automatic speech recognition (ASR) or key information fields omitted by users. A two-stage method is proposed: first, we adopt conditional random fields (CRF) to build probabilistic models that segment and label entity names in an input sentence. Second, fuzzy matching based on a similarity function is conducted between the named entities labeled by the CRF model and the reference characters of a dictionary. The experiments compare the performances in terms of accuracy and processing speed. Dice similarity and cosine similarity based on TF scores achieve the best accuracy among the four similarity measures, with F1-measures at or above 93%. In particular, the latter improves by 8.8% and 9% over q-gram and improved edit distance, respectively, which are two conventional methods for string fuzzy matching.
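For reference, a character-bigram Dice similarity, one of the measures compared here, can be sketched as follows; the bigram-set formulation is a common variant and the exact formulation used by the authors may differ.

```python
# Hedged sketch of character-level Dice similarity for fuzzy matching an
# ASR-garbled entity against dictionary references (bigram-set variant).
def dice_similarity(a: str, b: str) -> float:
    """2 * |X ∩ Y| / (|X| + |Y|) over character bigrams."""
    bigrams = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))

print(dice_similarity("北京大学", "北京大雪"))   # ~0.67
```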
Spoken language understanding (SLU) aims to map a user's speech into a semantic frame. Since most previous works use semantic structures for SLU, we verify that such structures are useful even for noisy input. We apply a structured prediction method to the SLU problem and compare it to an unstructured one. In addition, we present a combined method that embeds long-distance dependencies between entities in a cascaded manner. On air travel data, we show that our approach improves performance over baseline models.
In this paper, we describe a two-phase method for biomedical named entity recognition consisting of term boundary detection and biomedical category labeling. Term boundary detection can be defined as the task of assigning label sequences to a given sentence, while biomedical category labeling can be viewed as a local classification problem that does not require knowledge of the labels of other named entities in the sentence. The advantage of dividing the recognition process into two phases is that we can measure the effectiveness of the models at each phase and separately select the appropriate model for each subtask. In order to obtain better performance in biomedical named entity recognition, we conducted comparative experiments using several learning methods at each phase. Moreover, the results of these machine-learning-based models are refined by rule-based postprocessing. We tested our methods on the JNLPBA 2004 shared task and the GENIA corpus.
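A toy illustration of the two-phase decomposition is given below: phase 1 marks term boundaries with B/I/O tags and phase 2 assigns a biomedical category to each detected span; the example sentence, tags, and stubbed classifier are assumptions, not the paper's trained models.

```python
# Toy illustration of the two-phase scheme: boundary detection then category
# labeling. Real systems would use learned models trained on GENIA/JNLPBA.
tokens = ["IL-2", "gene", "expression", "requires", "NF-kappa", "B", "."]
boundary_tags = ["B", "I", "I", "O", "B", "I", "O"]   # phase 1 output (assumed)

def spans(tags):
    start = None
    for i, t in enumerate(tags + ["O"]):
        if t == "B":
            if start is not None:
                yield (start, i)
            start = i
        elif t == "O" and start is not None:
            yield (start, i)
            start = None

def classify(span_tokens):                            # phase 2 stub
    return "DNA" if "gene" in span_tokens else "protein"

for s, e in spans(boundary_tags):
    print(tokens[s:e], "->", classify(tokens[s:e]))
```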
Seungwoo LEE Joohui AN Byung-Kwan KWAK Gary Geunbae LEE
An important issue in applying machine learning algorithms to Natural Language Processing tasks such as Named Entity Recognition is overcoming the lack of tagged corpora. Several bootstrapping methods, such as co-training, have been proposed as a solution. In this paper, we present a different approach using Web resources. A Named Entity (NE) tagged corpus is generated from the Web using about 3,000 names as seeds. The generated corpus may have lower quality than a manually tagged corpus, but its size can be increased sufficiently. Several features are developed, and a decision list is learned from the generated corpus. Our method is verified by comparing it to both the decision list learned on the manual corpus and the DL-CoTrain method. We also present a two-level classification that cascades highly precise lexical patterns with the decision list to improve performance.