1-8hit |
This paper introduces a technique for automatically generating potential training data from sentences in which entity pairs are not apparently presented in a relation extraction. Most previous works on relation extraction by distant supervision ignored cases in which a relationship may be expressed via null-subjects or anaphora. However, natural language text basically has a network structure that is composed of several sentences. If they are closely related, this is not expressed explicitly in the text, which can make relation extraction difficult. This paper describes a new model that augments a paragraph with a “salient entity” that is determined without parsing. The entity can create additional tuple extraction environments as potential subjects in paragraphs. Including the salient entity as part of the sentential input may allow the proposed method to identify relationships that conventional methods cannot identify. This method also has promising potential applicability to languages for which advanced natural language processing tools are lacking.
Masatoshi SUZUKI Koji MATSUDA Satoshi SEKINE Naoaki OKAZAKI Kentaro INUI
This paper addresses the task of assigning labels of fine-grained named entity (NE) types to Wikipedia articles. Information of NE types are useful when extracting knowledge of NEs from natural language text. It is common to apply an approach based on supervised machine learning to named entity classification. However, in a setting of classifying into fine-grained types, one big challenge is how to alleviate the data sparseness problem since one may obtain far fewer instances for each fine-grained types. To address this problem, we propose two methods. First, we introduce a multi-task learning framework, in which NE type classifiers are all jointly trained with a neural network. The neural network has a hidden layer, where we expect that effective combinations of input features are learned across different NE types. Second, we propose to extend the input feature set by exploiting the hyperlink structure of Wikipedia. While most of previous studies are focusing on engineering features from the articles' contents, we observe that the information of the contexts the article is mentioned can also be a useful clue for NE type classification. Concretely, we propose to learn article vectors (i.e. entity embeddings) from Wikipedia's hyperlink structure using a Skip-gram model. Then we incorporate the learned article vectors into the input feature set for NE type classification. To conduct large-scale practical experiments, we created a new dataset containing over 22,000 manually labeled articles. With the dataset, we empirically show that both of our ideas gained their own statistically significant improvement separately in classification accuracy. Moreover, we show that our proposed methods are particularly effective in labeling infrequent NE types. We've made the learned article vectors publicly available. The labeled dataset is available if one contacts the authors.
In this paper, we propose a method to find similar sentences based on language resources for building a parallel corpus between English and Korean from Wikipedia. We use a Wiki-dictionary consisted of document titles from the Wikipedia and bilingual example sentence pairs from Web dictionary instead of traditional machine readable dictionary. In this way, we perform similarity calculation between sentences using sequential matching of the language resources, and evaluate the extracted parallel sentences. In the experiments, the proposed parallel sentences extraction method finally shows 65.4% of F1-score.
As one of the popular social media that many people turn to in recent years, collaborative encyclopedia Wikipedia provides information in a more “Neutral Point of View” way than others. Towards this core principle, plenty of efforts have been put into collaborative contribution and editing. The trajectories of how such collaboration appears by revisions are valuable for group dynamics and social media research, which suggest that we should extract the underlying derivation relationships among revisions from chronologically-sorted revision history in a precise way. In this paper, we propose a revision graph extraction method based on supergram decomposition in the document collection of near-duplicates. The plain text of revisions would be measured by its frequency distribution of supergram, which is the variable-length token sequence that keeps the same through revisions. We show that this method can effectively perform the task than existing methods.
Xinpeng ZHANG Yasuhito ASANO Masatoshi YOSHIKAWA
How do global warming and agriculture influence each other? It is possible to answer the question by searching knowledge about the relationship between global warming and agriculture. As exemplified by this question, strong demands exist for searching relationships between objects. Mining knowledge about relationships on Wikipedia has been studied. However, it is desired to search more diverse knowledge about relationships on the Web. By utilizing the objects constituting relationships mined from Wikipedia, we propose a new method to search images with surrounding text that include knowledge about relationships on the Web. Experimental results show that our method is effective and applicable in searching knowledge about relationships. We also construct a relationship search system named “Enishi” based on the proposed new method. Enishi supplies a wealth of diverse knowledge including images with surrounding text to help users to understand relationships deeply, by complementarily utilizing knowledge from Wikipedia and the Web.
Xiang WANG Yan JIA Ruhua CHEN Hua FAN Bin ZHOU
Text categorization, especially short text categorization, is a difficult and challenging task since the text data is sparse and multidimensional. In traditional text classification methods, document texts are represented with “Bag of Words (BOW)” text representation schema, which is based on word co-occurrence and has many limitations. In this paper, we mapped document texts to Wikipedia concepts and used the Wikipedia-concept-based document representation method to take the place of traditional BOW model for text classification. In order to overcome the weakness of ignoring the semantic relationships among terms in document representation model and utilize rich semantic knowledge in Wikipedia, we constructed a semantic matrix to enrich Wikipedia-concept-based document representation. Experimental evaluation on five real datasets of long and short text shows that our approach outperforms the traditional BOW method.
Junsan ZHANG Youli QU Shu GONG Shengfeng TIAN Haoliang SUN
Entity is an important information carrier in Web pages. Users would like to directly get a list of relevant entities instead of a list of documents when they submit a query to the search engine. So the research of related entity finding (REF) is a meaningful work. In this paper we investigate the most important task of REF: Entity Ranking. The wrong-type entities which don't belong to the target-entity type will pollute the ranking result. We propose a novel method to filter wrong-type entities. We focus on the acquisition of seed entities and automatically extracting the common Wikipedia categories of target-entity type. Also we demonstrate how to filter wrong-type entities using the proposed model. The experimental results show our method can filter wrong-type entities effectively and improve the results of entity ranking.
Xinpeng ZHANG Yasuhito ASANO Masatoshi YOSHIKAWA
Mining and explaining relationships between concepts are challenging tasks in the field of knowledge search. We propose a new approach for the tasks using disjoint paths formed by links in Wikipedia. Disjoint paths are easy to understand and do not contain redundant information. To achieve this approach, we propose a naive method, as well as a generalized flow based method, and a technique for mining more disjoint paths using the generalized flow based method. We also apply the approach to classification of relationships. Our experiments reveal that the generalized flow based method can mine many disjoint paths important for understanding a relationship, and the classification is effective for explaining relationships.