Xiaoguang YUAN Chaofan DAI Zongkai TIAN Xinyu FAN Yingyi SONG Zengwen YU Peng WANG Wenjun KE
Question answering (QA) systems are designed to answer questions based on given information or with the help of external information. Recent advances in QA systems are overwhelmingly contributed by deep learning techniques, which have been employed in a wide range of fields such as finance, sports and biomedicine. For generative QA in open-domain QA, although deep learning can leverage massive data to learn meaningful feature representations and generate free text as answers, there are still problems to limit the length and content of answers. To alleviate this problem, we focus on the variant YNQA of generative QA and propose a model CasATT (cascade prompt learning framework with the sentence-level attention mechanism). In the CasATT, we excavate text semantic information from document level to sentence level and mine evidence accurately from large-scale documents by retrieval and ranking, and answer questions with ranked candidates by discriminative question answering. Our experiments on several datasets demonstrate the superior performance of the CasATT over state-of-the-art baselines, whose accuracy score can achieve 93.1% on IR&QA Competition dataset and 90.5% on BoolQ dataset.
This paper addresses the novel task of detecting chorus sections in English and Japanese lyrics text. Although chorus-section detection using audio signals has been studied, whether chorus sections can be detected from text-only lyrics is an open issue. Another open issue is whether patterns of repeating lyric lines such as those appearing in chorus sections depend on language. To investigate these issues, we propose a neural-network-based model for sequence labeling. It can learn phrase repetition and linguistic features to detect chorus sections in lyrics text. It is, however, difficult to train this model since there was no dataset of lyrics with chorus-section annotations as there was no prior work on this task. We therefore generate a large amount of training data with such annotations by leveraging pairs of musical audio signals and their corresponding manually time-aligned lyrics; we first automatically detect chorus sections from the audio signals and then use their temporal positions to transfer them to the line-level chorus-section annotations for the lyrics. Experimental results show that the proposed model with the generated data contributes to detecting the chorus sections, that the model trained on Japanese lyrics can detect chorus sections surprisingly well in English lyrics, and that patterns of repeating lyric lines are language-independent.
Reo ERIGUCHI Noboru KUNIHIRO Koji NUIDA
Ramp secret sharing is a variant of secret sharing which can achieve better information ratio than perfect schemes by allowing some partial information on a secret to leak out. Strongly secure ramp schemes can control the amount of leaked information on the components of a secret. In this paper, we reduce the construction of strongly secure ramp secret sharing for general access structures to a linear algebraic problem. As a result, we show that previous results on strongly secure network coding imply two linear transformation methods to make a given linear ramp scheme strongly secure. They are explicit or provide a deterministic algorithm while the previous methods which work for any linear ramp scheme are non-constructive. In addition, we present a novel application of strongly secure ramp schemes to symmetric PIR in a multi-user setting. Our solution is advantageous over those based on a non-strongly secure scheme in that it reduces the amount of communication between users and servers and also the amount of correlated randomness that servers generate in the setup.
Considering the rapidly increasing number of academic papers, searching for and citing appropriate references has become a nontrivial task during manuscript composition. Recommending a handful of candidate papers to a working draft could ease the burden of the authors. Conventional approaches to citation recommendation generally consider recommending one ground-truth citation from an input manuscript for a query context. However, it is common for a given context to be supported by two or more co-citation pairs. Here, we propose a novel scientific paper modelling for citation recommendations, namely Multi-Positive BERT Model for Citation Recommendation (MP-BERT4REC), complied with a series of Multi-Positive Triplet objectives to recommend multiple positive citations for a query context. The proposed approach has the following advantages: First, the proposed multi-positive objectives are effective in recommending multiple positive candidates. Second, we adopt noise distributions on the basis of historical co-citation frequencies; thus, MP-BERT4REC is not only effective in recommending high-frequency co-citation pairs, but it also significantly improves the performance of retrieving low-frequency ones. Third, the proposed dynamic context sampling strategy captures macroscopic citing intents from a manuscript and empowers the citation embeddings to be content-dependent, which allows the algorithm to further improve performance. Single and multiple positive recommendation experiments confirmed that MP-BERT4REC delivers significant improvements over current methods. It also effectively retrieves the full list of co-citations and historically low-frequency pairs better than prior works.
Esrat FARJANA Natthawut KERTKEIDKACHORN Ryutaro ICHISE
The usefulness and usability of existing knowledge graphs (KGs) are mostly limited because of the incompleteness of knowledge compared to the growing number of facts about the real world. Most existing ontology-based KG completion methods are based on the closed-world assumption, where KGs are fixed. In these methods, entities and relations are defined, and new entity information cannot be easily added. In contrast, in open-world assumptions, entities and relations are not previously defined. Thus there is a vast scope to find new entity information. Despite this, knowledge acquisition under the open-world assumption is challenging because most available knowledge is in a noisy unstructured text format. Nevertheless, Open Information Extraction (OpenIE) systems can extract triples, namely (head text; relation text; tail text), from raw text without any prespecified vocabulary. Such triples contain noisy information that is not essential for KGs. Therefore, to use such triples for the KG completion task, it is necessary to identify competent triples for KGs from the extracted triple set. Here, competent triples are the triples that can contribute to add new information to the existing KGs. In this paper, we propose the Competent Triple Identification (CTID) model for KGs. We also propose two types of feature, namely syntax- and semantic-based features, to identify competent triples from a triple set extracted by a state-of-the-art OpenIE system. We investigate both types of feature and test their effectiveness. It is found that the performance of the proposed features is about 20% better compared to that of the ReVerb system in identifying competent triples.
Jun KURIHARA Toru NAKAMURA Ryu WATANABE
This paper investigates an adversarial model in the scenario of private information retrieval (PIR) from n coded storage servers, called Byzantine adversary. The Byzantine adversary is defined as the one altering b server responses and erasing u server responses to a user's query. In this paper, two types of Byzantine adversaries are considered; 1) the classic omniscient type that has the full knowledge on n servers as considered in existing literature, and 2) the reasonable limited-knowledge type that has information on only b+u servers, i.e., servers under the adversary's control. For these two types, this paper reveals that the resistance of a PIR scheme, i.e., the condition of b and u to correctly obtain the desired message, can be expressed in terms of a code parameter called the coset distance of linear codes employed in the scheme. For the omniscient type, the derived condition expressed by the coset distance is tighter and more precise than the estimation of the resistance by the minimum Hamming weight of the codes considered in existing researches. Furthermore, this paper also clarifies that if the adversary is limited-knowledge, the resistance of a PIR scheme could exceed that for the case of the omniscient type. Namely, PIR schemes can increase their resistance to Byzantine adversaries by allowing the limitation on adversary's knowledge.
The song-level feature summarization is an essential building block for browsing, retrieval, and indexing of digital music. This paper proposes a local pooling method to aggregate the feature vectors of a song over the universal background model. Two types of local activation patterns of feature vectors are derived; one representation is derived in the form of histogram, and the other is given by a binary vector. Experiments over three publicly-available music datasets show that the proposed local aggregation of the auditory features is promising for music-similarity computation.
Automatic music transcription from audio has long been one of the most intriguing problems and a challenge in the field of music information retrieval, because it requires a series of low-level tasks such as onset/offset detection and F0 estimation, followed by high-level post-processing for symbolic representation. In this paper, a comprehensive transcription system for monophonic singing voice based on harmonic structure analysis is proposed. Given a precise tracking of the fundamental frequency, a novel acoustic feature is derived to signify the harmonic structure in singing voice signals, regardless of the loudness and pitch. It is then used to generate a parametric mixture model based on the von Mises-Fisher distribution, so that the model represents the intrinsic harmonic structures within a region of smoothly connected notes. To identify the note boundaries, the local homogeneity in the harmonic structure is exploited by two different methods: the self-similarity analysis and hidden Markov model. The proposed system identifies the note attributes including the onset time, duration and note pitch. Evaluations are conducted from various aspects to verify the performance improvement of the proposed system and its robustness, using the latest evaluation methodology for singing transcription. The results show that the proposed system significantly outperforms other systems including the state-of-the-art systems.
Abu Nowshed CHY Md Zia ULLAH Masaki AONO
Microblog, especially twitter, has become an integral part of our daily life for searching latest news and events information. Due to the short length characteristics of tweets and frequent use of unconventional abbreviations, content-relevance based search cannot satisfy user's information need. Recent research has shown that considering temporal and contextual aspects in this regard has improved the retrieval performance significantly. In this paper, we focus on microblog retrieval, emphasizing the alleviation of the vocabulary mismatch, and the leverage of the temporal (e.g., recency and burst nature) and contextual characteristics of tweets. To address the temporal and contextual aspect of tweets, we propose new features based on query-tweet time, word embedding, and query-tweet sentiment correlation. We also introduce some popularity features to estimate the importance of a tweet. A three-stage query expansion technique is applied to improve the relevancy of tweets. Moreover, to determine the temporal and sentiment sensitivity of a query, we introduce query type determination techniques. After supervised feature selection, we apply random forest as a feature ranking method to estimate the importance of selected features. Then, we make use of ensemble of learning to rank (L2R) framework to estimate the relevance of query-tweet pair. We conducted experiments on TREC Microblog 2011 and 2012 test collections over the TREC Tweets2011 corpus. Experimental results demonstrate the effectiveness of our method over the baseline and known related works in terms of precision at 30 (P@30), mean average precision (MAP), normalized discounted cumulative gain at 30 (NDCG@30), and R-precision (R-Prec) metrics.
Masaya MURATA Hidehisa NAGANO Kaoru HIRAMATSU Kunio KASHINO Shin'ichi SATOH
In this paper, we first analyze the discriminative power in the Best Match (BM) 25 formula and provide its calculation method from the Bayesian point of view. The resulting, derived discriminative power is quite similar to the exponential inverse document frequency (EIDF) that we have previously proposed [1] but retains more preferable theoretical advantages. In our previous paper [1], we proposed the EIDF in the framework of the probabilistic information retrieval (IR) method BM25 to address the instance search task, which is a specific object search for videos using an image query. Although the effectiveness of our EIDF was experimentally demonstrated, we did not consider its theoretical justification and interpretation. We also did not describe the use of region-of-interest (ROI) information, which is supposed to be input to the instance search system together with the original image query showing the instance. Therefore, here, we justify the EIDF by calculating the discriminative power in the BM25 from the Bayesian viewpoint. We also investigate the effect of the ROI information for improving the instance search accuracy and propose two search methods incorporating the ROI effect into the BM25 video ranking function. We validated the proposed methods through a series of experiments using the TREC Video Retrieval Evaluation instance search task dataset.
Sentence similarity computation is an increasingly important task in applications of natural language processing such as information retrieval, machine translation, text summarization and so on. From the viewpoint of information theory, the essential attribute of natural language is that the carrier of information and the capacity of information can be measured by information content which is already successfully used for word similarity computation in simple ways. Existing sentence similarity methods don't emphasize the information contained by the sentence, and the complicated models they employ often need using empirical parameters or training parameters. This paper presents a fully unsupervised computational model of sentence semantic similarity. It is also a simply and straightforward model that neither needs any empirical parameter nor rely on other NLP tools. The method can obtain state-of-the-art experimental results which show that sentence similarity evaluated by the model is closer to human judgment than multiple competing baselines. The paper also tests the proposed model on the influence of external corpus, the performance of various sizes of the semantic net, and the relationship between efficiency and accuracy.
Shunsuke OHASHI Giovanni Yoko KRISTIANTO Goran TOPIC Akiko AIZAWA
Mathematical formulae play an important role in many scientific domains. Regardless of the importance of mathematical formula search, conventional keyword-based retrieval methods are not sufficient for searching mathematical formulae, which are structured as trees. The increasing number as well as the structural complexity of mathematical formulae in scientific articles lead to the necessity for large-scale structure-aware formula search techniques. In this paper, we formulate three types of measures that represent distinctive features of semantic similarity of math formulae, and develop efficient hash-based algorithms for the approximate calculation. Our experiments using NTCIR-11 Math-2 Task dataset, a large-scale test collection for math information retrieval with about 60-million formulae, show that the proposed method improves the search precision while also keeps the scalability and runtime efficiency high.
Yuto NAKANO Shinsaku KIYOMOTO Yutaka MIYAKE Kouichi SAKURAI
Oblivious RAM (ORAM) schemes, the concept introduced by Goldreich and Ostrovsky, are very useful technique for protecting users' privacy when storing data in remote untrusted servers and running software on untrusted systems. However they are usually considered impractical due to their huge overhead. In order to reduce overhead, many improvements have been presented. Thanks to these improvements, ORAM schemes can be considered practical on cloud environment where users can expect huge storage and high computational power. Especially for private information retrieval (PIR), some literatures demonstrated they are usable. Also dedicated PIRs have been proposed and shown that they are usable in practice. Yet, they are still impractical for protecting software running on untrusted systems. We first survey recent researches on ORAM and PIR. Then, we present a practical software-based memory protection scheme applicable to several environments. The main feature of our scheme is that it records the history of accesses and uses the history to hide the access pattern. We also address implementing issues of ORAM and propose practical solutions for these issues.
Kazuhiko KINOSHITA Nariyoshi YAMAI Koso MURAKAMI
The recent explosive growth in information networks has driven a huge increase in content. For efficient and flexible information retrieval over such large networks, agent technology has received much attention. We previously proposed an agent execution control method for time-constrained information retrieval that finds better results by terminating an agent that has already acquired results of high-enough quality or one that is unlikely to improve the quality of results with continued retrieval. However, this method assumed that all agents have identical time constraints. This leads to a disparity in the obtained score between users who give individual time constraints. In this paper, we propose a fair and efficient scheduling method based on the expected improvement of the highest score (EIS). The proposed method allocates all CPU resources to the agent that has the highest EIS to decrease the difference between users' scores and to increase the mean highest score of requested results.
Yasuhito ASANO Taihei OSHINO Masatoshi YOSHIKAWA
Graph pattern mining has played important roles in network analysis and information retrieval. However, temporal characteristics of networks have not been estimated sufficiently. We propose time graph pattern mining as a new concept of graph mining reflecting the temporal information of a network. We conduct two case studies of time graph pattern mining: extensively discussed topics on blog sites and a book recommendation network. Through examination of case studies, we ascertain that time graph pattern mining has numerous possibilities as a novel means for information retrieval and network analysis reflecting both structural and temporal characteristics.
Degen HUANG Shanshan WANG Fuji REN
Comparable Corpora are valuable resources for many NLP applications, and extensive research has been done on information mining based on comparable corpora in recent years. While there are not enough large-scale available public comparable corpora at present, this paper presents a bi-directional CLIR-based method for creating comparable corpora from two independent news collections in different languages. The original Chinese document collections and English documents collections are crawled from XinHuaNet respectively and formatted in a consistent manner. For each document from the two collections, the best query keywords are extracted to represent the essential content of the document, and then the keywords are translated into the language of the other collection. The translated queries are run against the collection in the same language to pick up the candidate documents in the other language and candidates are aligned based on their publication dates and the similarity scores. Results show that our approach significantly outperforms previous approaches to the construction of Chinese-English comparable corpora.
Yeo-Chan YOON Myung-Gil JANG Hyun-Ki KIM So-Young PARK
In this paper, we propose a duplicate document detection model recognizing both partial duplicates and near duplicates. The proposed model can detect partial duplicates as well as exact duplicates by splitting a large document into many small sentence fingerprints. Furthermore, the proposed model can detect even near duplicates, the result of trivial revisions, by filtering the common words and reordering the word sequence.
Zhengong CAI Xiaohu YANG Xinyu WANG Aleksander J. KAVS
Feature location is to identify source code that implements a given feature. It is essential for software maintenance and evolution. A large amount of research, including static analysis, dynamic analysis and the hybrid approaches, has been done on the feature location problems. The existing approaches either need plenty of scenarios or rely on domain experts heavily. This paper proposes a new approach to locate functional feature in source code by combining the change impact analysis and information retrieval. In this approach, the source code is instrumented and executed using a single scenario to obtain the execution trace. The execution trace is extended according to the control flow to cover all the potentially relevant classes. The classes are ranked by trace-based impact analysis and information retrieval. The ranking analysis takes advantages of the semantics and structural characteristics of source code. The identified results are of higher precision than the individual approaches. Finally, two open source cases have been studied and the efficiency of the proposed approach is verified.
Learning to rank refers to machine learning techniques for training the model in a ranking task. Learning to rank is useful for many applications in Information Retrieval, Natural Language Processing, and Data Mining. Intensive studies have been conducted on the problem and significant progress has been made [1],[2]. This short paper gives an introduction to learning to rank, and it specifically explains the fundamental problems, existing approaches, and future work of learning to rank. Several learning to rank methods using SVM techniques are described in details.
Kazuhiko KINOSHITA Atsushi NARISHIGE Yusuke HARA Nariyoshi YAMAI Koso MURAKAMI
Networks have gotten bigger recently, and users have a more difficult time finding the information that they want. The use of mobile agents to help users effectively retrieve information has garnered a lot of attention. In this paper, we propose an agent control method for time constrained information retrieval. We pay attention to the highest past score gained by the agents and control the agents with the expectation of achieving better scores. Using computer simulations, we confirmed that our control method gave the best improvement over the whole network while reducing the overall variance. From these results, we can say that our control method improves the quality of information retrieved by the agent.