Yuxin HUANG Yuanlin YANG Enchang ZHU Yin LIANG Yantuan XIAN
Chinese-Vietnamese cross-lingual event retrieval aims to retrieve the Vietnamese sentence describing the same event as a given Chinese query sentence from a set of Vietnamese sentences. Existing mainstream cross-lingual event retrieval methods rely on extracting textual representations from query texts and calculating their similarity with the textual representations in the candidate set of the other language. However, these methods ignore differences in event elements that arise in Chinese-Vietnamese cross-lingual retrieval. Consequently, sentences with similar meanings but different event elements may be incorrectly considered to describe the same event. To address this problem, we propose a cross-lingual retrieval method that integrates event elements. We introduce event elements as an additional supervisory signal, where we calculate the semantic similarity of event elements in two sentences using an attention mechanism to determine the attention score of the event elements. This allows us to establish a one-to-one correspondence between event elements in the text. Additionally, we leverage a multilingual pre-trained language model fine-tuned with contrastive learning to obtain cross-lingual sentence representations and calculate the semantic similarity of the sentence texts. By combining these two approaches, we obtain the final text similarity score. Experimental results demonstrate that our proposed method achieves higher retrieval accuracy than the baseline model.
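The combination of an element-level score and a sentence-level score described above can be sketched as follows. This is a minimal illustration, not the authors' model: the function names, the softmax temperature, the weight `alpha`, and the use of cosine similarity over precomputed embeddings are all assumptions made for the example.

```python
import numpy as np

def cosine_matrix(a, b):
    """Pairwise cosine similarity between the rows of a (m, d) and b (n, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def element_similarity(query_elems, cand_elems, temperature=0.1):
    """Attention-weighted similarity between two sets of event-element vectors.

    Each query element attends over the candidate elements (softmax over
    cosine similarities); a low temperature pushes the attention towards a
    near one-to-one alignment. The temperature value is an assumption.
    """
    sims = cosine_matrix(query_elems, cand_elems)        # shape (m, n)
    logits = sims / temperature
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=1, keepdims=True)
    # Expected similarity of each query element to its aligned counterpart,
    # averaged over the query elements.
    return float((attn * sims).sum(axis=1).mean())

def combined_score(q_sent, c_sent, q_elems, c_elems, alpha=0.5):
    """Weighted sum of sentence-level and element-level similarity.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    sent_sim = float(cosine_matrix(q_sent[None, :], c_sent[None, :])[0, 0])
    elem_sim = element_similarity(q_elems, c_elems)
    return alpha * sent_sim + (1.0 - alpha) * elem_sim
```

With such a score, two sentences whose sentence embeddings are close but whose event elements do not align receive a lower final score than a pair that agrees at both levels.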
Jinsoo SEO Junghyun KIM Hyemi KIM
Song-level feature summarization is fundamental for the browsing, retrieval, and indexing of digital music archives. This study proposes a deep neural network model, CQTXNet, for extracting a song-level feature summary for cover song identification. CQTXNet incorporates depth-wise separable convolution, residual network connections, and attention models to extend previous approaches. An experimental evaluation of the proposed CQTXNet was performed on two publicly available cover song datasets by varying the number of network layers and the type of attention modules.
Xiaoguang YUAN Chaofan DAI Zongkai TIAN Xinyu FAN Yingyi SONG Zengwen YU Peng WANG Wenjun KE
Question answering (QA) systems are designed to answer questions based on given information or with the help of external information. Recent advances in QA systems are overwhelmingly driven by deep learning techniques, which have been employed in a wide range of fields such as finance, sports and biomedicine. For generative QA in open-domain QA, although deep learning can leverage massive data to learn meaningful feature representations and generate free text as answers, there are still problems in limiting the length and content of answers. To alleviate this problem, we focus on YNQA, a variant of generative QA, and propose a model named CasATT (cascade prompt learning framework with the sentence-level attention mechanism). In CasATT, we excavate text semantic information from the document level to the sentence level, mine evidence accurately from large-scale documents by retrieval and ranking, and answer questions with the ranked candidates by discriminative question answering. Our experiments on several datasets demonstrate the superior performance of CasATT over state-of-the-art baselines: its accuracy reaches 93.1% on the IR&QA Competition dataset and 90.5% on the BoolQ dataset.
This paper addresses the novel task of detecting chorus sections in English and Japanese lyrics text. Although chorus-section detection using audio signals has been studied, whether chorus sections can be detected from text-only lyrics is an open issue. Another open issue is whether patterns of repeating lyric lines such as those appearing in chorus sections depend on language. To investigate these issues, we propose a neural-network-based model for sequence labeling. It can learn phrase repetition and linguistic features to detect chorus sections in lyrics text. It is, however, difficult to train this model since there was no dataset of lyrics with chorus-section annotations as there was no prior work on this task. We therefore generate a large amount of training data with such annotations by leveraging pairs of musical audio signals and their corresponding manually time-aligned lyrics; we first automatically detect chorus sections from the audio signals and then use their temporal positions to transfer them to the line-level chorus-section annotations for the lyrics. Experimental results show that the proposed model with the generated data contributes to detecting the chorus sections, that the model trained on Japanese lyrics can detect chorus sections surprisingly well in English lyrics, and that patterns of repeating lyric lines are language-independent.
Longjiao ZHAO Yu WANG Jien KATO Yoshiharu ISHIKAWA
Convolutional Neural Networks (CNNs) have recently demonstrated outstanding performance in image retrieval tasks. Local convolutional features extracted by CNNs, in particular, show exceptional capability in discrimination. Recent research in this field has concentrated on pooling methods that incorporate local features into global features and assess the global similarity of two images. However, the pooling methods sacrifice the image's local region information and spatial relationships, which are known to be the keys to robustness against occlusion and viewpoint changes. In this paper, instead of pooling methods, we propose an alternative method based on local similarity, determined by directly using local convolutional features. Specifically, we first define three forms of local similarity tensors (LSTs), which take into account information about local regions as well as spatial relationships between them. We then construct a similarity CNN model (SCNN) based on LSTs to assess the similarity between the query and gallery images. The ideal configuration of our method is sought through thorough experiments from three perspectives: local region size, local region content, and spatial relationships between local regions. The experimental results on a modified open dataset (where query images are limited to occluded ones) confirm that the proposed method outperforms the pooling methods owing to its improved robustness. Furthermore, testing on three public retrieval datasets shows that combining LSTs with conventional pooling methods achieves the best results.
Reo ERIGUCHI Noboru KUNIHIRO Koji NUIDA
Ramp secret sharing is a variant of secret sharing which can achieve better information ratio than perfect schemes by allowing some partial information on a secret to leak out. Strongly secure ramp schemes can control the amount of leaked information on the components of a secret. In this paper, we reduce the construction of strongly secure ramp secret sharing for general access structures to a linear algebraic problem. As a result, we show that previous results on strongly secure network coding imply two linear transformation methods to make a given linear ramp scheme strongly secure. They are explicit or provide a deterministic algorithm while the previous methods which work for any linear ramp scheme are non-constructive. In addition, we present a novel application of strongly secure ramp schemes to symmetric PIR in a multi-user setting. Our solution is advantageous over those based on a non-strongly secure scheme in that it reduces the amount of communication between users and servers and also the amount of correlated randomness that servers generate in the setup.
Considering the rapidly increasing number of academic papers, searching for and citing appropriate references has become a nontrivial task during manuscript composition. Recommending a handful of candidate papers for a working draft could ease the burden on the authors. Conventional approaches to citation recommendation generally consider recommending one ground-truth citation from an input manuscript for a query context. However, it is common for a given context to be supported by two or more co-citation pairs. Here, we propose a novel scientific paper modelling approach for citation recommendation, namely Multi-Positive BERT Model for Citation Recommendation (MP-BERT4REC), combined with a series of Multi-Positive Triplet objectives to recommend multiple positive citations for a query context. The proposed approach has the following advantages: First, the proposed multi-positive objectives are effective in recommending multiple positive candidates. Second, we adopt noise distributions on the basis of historical co-citation frequencies; thus, MP-BERT4REC is not only effective in recommending high-frequency co-citation pairs, but it also significantly improves the performance of retrieving low-frequency ones. Third, the proposed dynamic context sampling strategy captures macroscopic citing intents from a manuscript and empowers the citation embeddings to be content-dependent, which allows the algorithm to further improve performance. Single and multiple positive recommendation experiments confirmed that MP-BERT4REC delivers significant improvements over current methods. It also retrieves the full list of co-citations and historically low-frequency pairs more effectively than prior works.
Zhi LIU Fangyuan ZHAO Mengmeng ZHANG
In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, the proposed network introduces NetVLAD, which has fewer parameters and is feasible to train with small datasets. In addition, since the function of CLIP (Contrastive Language-Image Pre-training) can be considered as learning language models from visual supervision, it is introduced as the text encoder in the proposed network to avoid overfitting. Meanwhile, in order to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
Esrat FARJANA Natthawut KERTKEIDKACHORN Ryutaro ICHISE
The usefulness and usability of existing knowledge graphs (KGs) are mostly limited because of the incompleteness of knowledge compared to the growing number of facts about the real world. Most existing ontology-based KG completion methods are based on the closed-world assumption, where KGs are fixed. In these methods, entities and relations are defined, and new entity information cannot be easily added. In contrast, under the open-world assumption, entities and relations are not previously defined. Thus, there is a vast scope to find new entity information. Despite this, knowledge acquisition under the open-world assumption is challenging because most available knowledge is in a noisy unstructured text format. Nevertheless, Open Information Extraction (OpenIE) systems can extract triples, namely (head text; relation text; tail text), from raw text without any prespecified vocabulary. Such triples contain noisy information that is not essential for KGs. Therefore, to use such triples for the KG completion task, it is necessary to identify competent triples for KGs from the extracted triple set. Here, competent triples are the triples that can contribute new information to the existing KGs. In this paper, we propose the Competent Triple Identification (CTID) model for KGs. We also propose two types of feature, namely syntax- and semantic-based features, to identify competent triples from a triple set extracted by a state-of-the-art OpenIE system. We investigate both types of feature and test their effectiveness. It is found that the performance of the proposed features is about 20% better than that of the ReVerb system in identifying competent triples.
Local discriminative regions play important roles in fine-grained image analysis tasks. How to locate local discriminative regions with only category labels and learn discriminative representations from these regions has been an active research topic. In our work, we propose the Searching Discriminative Regions (SDR) and Learning Discriminative Regions (LDR) methods to search and learn local discriminative regions in images. The SDR method adopts an attention mechanism to iteratively search for high-response regions in images, and uses them as clues to locate local discriminative regions. Moreover, the LDR method is proposed to learn representations that are compact within a category and sparse between categories from the raw image and local images. Experimental results show that our proposed approach achieves excellent performance in both fine-grained image retrieval and classification tasks, which demonstrates its effectiveness.
Motohiro SUNOUCHI Masaharu YOSHIOKA
This paper proposes new acoustic feature signatures based on the multiscale fractal dimension (MFD), which are robust against the diversity of environmental sounds, for content-based similarity search. The diversity of sound sources and acoustic compositions is a typical feature of environmental sounds. Several acoustic features have been proposed for environmental sounds. Among them are the widely used Mel-Frequency Cepstral Coefficients (MFCCs), which describe frequency-domain features. However, in addition to these features in the frequency domain, environmental sounds have other important features in the time domain with various time scales. In our previous paper, we proposed the enhanced multiscale fractal dimension signature (EMFD) for environmental sounds. This paper extends EMFD by using the kernel density estimation method, which results in better performance in similarity search tasks. Furthermore, it newly proposes another acoustic feature signature based on MFD, namely the very-long-range multiscale fractal dimension signature (MFD-VL). The MFD-VL signature describes several features of the time-varying envelope over long periods of time. The MFD-VL signature is stable and robust against background noise and small fluctuations in the parameters of sound sources, which are produced in field recordings. We discuss the effectiveness of these signatures in similarity sound search by comparing them with acoustic features proposed in the DCASE 2018 challenges. Owing to the unique descriptiveness of our proposed signatures, we confirmed that the signatures are effective when they are used with other acoustic features.
Jun KURIHARA Toru NAKAMURA Ryu WATANABE
This paper investigates an adversarial model, called the Byzantine adversary, in the scenario of private information retrieval (PIR) from n coded storage servers. The Byzantine adversary is defined as one that alters b server responses and erases u server responses to a user's query. In this paper, two types of Byzantine adversaries are considered: 1) the classic omniscient type that has full knowledge of the n servers, as considered in the existing literature, and 2) the reasonable limited-knowledge type that has information on only b+u servers, i.e., the servers under the adversary's control. For these two types, this paper reveals that the resistance of a PIR scheme, i.e., the condition on b and u to correctly obtain the desired message, can be expressed in terms of a code parameter called the coset distance of the linear codes employed in the scheme. For the omniscient type, the derived condition expressed by the coset distance is tighter and more precise than the estimation of the resistance by the minimum Hamming weight of the codes considered in existing research. Furthermore, this paper also clarifies that if the adversary is of the limited-knowledge type, the resistance of a PIR scheme could exceed that for the omniscient type. Namely, PIR schemes can increase their resistance to Byzantine adversaries by assuming a limitation on the adversary's knowledge.
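For orientation only, the weight-based estimate referred to above is of the same kind as the classical coding-theoretic condition for decoding with errors and erasures. It is stated here as general background, not as the paper's coset-distance condition (which is not reproduced): a linear code with minimum Hamming distance d_min can uniquely recover a codeword from b altered and u erased symbols whenever

```latex
2b + u \le d_{\min} - 1
```

The abstract's claim is that replacing d_min with the coset distance gives a tighter characterization of the admissible pairs (b, u).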
Content-based image retrieval has been a hot topic among computer vision researchers for a long time. There have been many advances over the years, one of the recent ones being deep metric learning, inspired by the success of deep neural networks in many machine learning tasks. The goal of metric learning is to extract good high-level features from image pixel data using neural networks. These features provide useful abstractions, which can enable algorithms to perform visual comparison between images with human-like accuracy. To learn these features, supervised information of image similarity or relative similarity is often used. One important issue in deep metric learning is how to define similarity for multi-label or multi-object scenes in images. Traditionally, pairwise similarity is defined based on the presence of a single common label between two images. However, this definition is very coarse and not suitable for multi-label or multi-object data. Another common mistake is to completely ignore the multiplicity of objects in images, hence ignoring the multi-object facet of certain types of datasets. In our work, we propose an approach for learning deep image representations based on the relative similarity of both multi-label and multi-object image data. We introduce an intuitive and effective similarity metric based on the Jaccard similarity coefficient, which is equivalent to the intersection over union of two label sets. Hence we treat similarity as a continuous, as opposed to discrete quantity. We incorporate this similarity metric into a triplet loss with an adaptive margin, and achieve good mean average precision on image retrieval tasks. We further show, using a recently proposed quantization method, that the resulting deep feature can be quantized whilst preserving similarity. We also show that our proposed similarity metric performs better for multi-object images than a previously proposed cosine similarity-based metric. 
Our proposed method outperforms several state-of-the-art methods on two benchmark datasets.
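The label-set similarity named in the abstract above, the Jaccard coefficient, is simple to state in code. The sketch below also shows one hypothetical way to turn it into an adaptive triplet margin; the margin formula, `base_margin`, and the squared Euclidean distance are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def jaccard(labels_a, labels_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two label collections."""
    a, b = set(labels_a), set(labels_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty label sets
    return len(a & b) / len(union)

def adaptive_triplet_loss(anchor, pos, neg, sim_pos, sim_neg, base_margin=1.0):
    """Triplet hinge loss whose margin grows with the similarity gap.

    sim_pos / sim_neg are the Jaccard similarities of the anchor's labels to
    the positive's and negative's labels. The margin rule below is an
    assumed example, not the authors' definition.
    """
    margin = base_margin * (sim_pos - sim_neg)
    d_pos = float(np.sum((anchor - pos) ** 2))  # squared Euclidean distance
    d_neg = float(np.sum((anchor - neg) ** 2))
    return max(0.0, d_pos - d_neg + margin)
```

For example, `jaccard({"dog", "cat"}, {"dog", "car"})` is 1/3: one shared label out of three distinct labels. Because the value is continuous, an image sharing two of three labels with the anchor is treated as more similar than one sharing a single label, which a binary "any common label" rule cannot express.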
Rintaro YANAGI Ren TOGO Takahiro OGAWA Miki HASEYAMA
Various cross-modal retrieval methods that can retrieve images related to a query sentence without text annotations have been proposed. Although these methods achieve a high level of retrieval performance, they have been developed for a single-domain retrieval setting. When the candidate images come from various domains, the retrieval performance of these methods may degrade. To deal with this problem, we propose a new domain-adaptive cross-modal retrieval method. By translating the modality and domains of a query and candidate images, our method can retrieve desired images accurately in a different-domain retrieval setting. Experimental results on clipart and painting datasets show that the proposed method achieves better retrieval performance than other conventional and state-of-the-art methods.
Leilei KONG Yong HAN Haoliang QI Zhongyuan HAN
Source retrieval is the primary task of plagiarism detection. It searches for the documents that may be the sources of plagiarism in a suspicious document. The state-of-the-art approaches usually rely on classical information retrieval models, such as the probability model or the vector space model, to obtain the plagiarism sources. However, the goal of source retrieval is to obtain the source documents that contain the plagiarized parts of the suspicious document, rather than to rank the documents relevant to the whole suspicious document. To model the “partial matching” between documents, this paper proposes a Partial Matching Convolution Neural Network (PMCNN) for source retrieval. In detail, PMCNN exploits a sequential convolution neural network to extract the plagiarism patterns of contiguous text segments. The experimental results on the PAN 2013 and PAN 2014 plagiarism source retrieval corpora show that PMCNN boosts the performance of source retrieval significantly, outperforming other state-of-the-art document models.
The maximum inner product search (MIPS) problem has gained much attention in a wide range of applications. In order to overcome the curse of dimensionality in high-dimensional spaces, most existing methods first transform the MIPS problem into an approximate nearest neighbor search (ANNS) problem and then solve it by Locality Sensitive Hashing (LSH). However, due to the error incurred by the transformation and incomprehensive search strategies, these methods suffer from low precision and have loose probability guarantees. In this paper, we propose a novel search method named Adaptive-LSH (AdaLSH) to solve the MIPS problem more efficiently and more precisely. AdaLSH examines objects in the descending order of both norms and (the probably correctly estimated) cosine angles with a query object, supported by LSH with extendable windows. Such extendable windows bring not only efficiency in searching but also a probability guarantee of finding exact or approximate MIP objects. AdaLSH gives a better probability guarantee of success than conventional algorithms and achieves shorter running times on various datasets. In addition, AdaLSH can even support exact MIPS with a probability guarantee.
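The idea of examining objects in descending order of norm can be illustrated without any hashing. The sketch below is not AdaLSH: it is a plain exact scan that uses only the Cauchy-Schwarz bound (the inner product of q and x is at most the product of their norms) to stop early, and the function name and return values are chosen for this example.

```python
import numpy as np

def mips_norm_pruned(data, query):
    """Exact maximum inner product search with norm-ordered early stopping.

    Rows of `data` are visited from largest to smallest norm. Once
    ||x|| * ||q|| can no longer exceed the best inner product found so far,
    no remaining row can win, so the scan stops.
    Returns (best index, best inner product, number of rows evaluated).
    """
    norms = np.linalg.norm(data, axis=1)
    order = np.argsort(-norms)              # indices in descending norm order
    q_norm = np.linalg.norm(query)
    best_idx, best_ip = -1, -np.inf
    scanned = 0
    for i in order:
        if norms[i] * q_norm <= best_ip:    # upper bound cannot beat the best
            break
        scanned += 1
        ip = float(data[i] @ query)
        if ip > best_ip:
            best_idx, best_ip = int(i), ip
    return best_idx, best_ip, scanned
```

The pruning helps most when norms vary widely across the dataset. When all norms are nearly equal, the bound is rarely tight and the scan degenerates towards brute force, which is one reason angle information (estimated via LSH in the abstract) is also needed.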
Longjiao ZHAO Yu WANG Jien KATO
Recently, local features computed using convolutional neural networks (CNNs) have shown good performance in image retrieval. The local convolutional features obtained by the CNNs (LC features) are designed to be translation invariant; however, they are inherently sensitive to rotation perturbations. This leads to misjudgments in retrieval tasks. In this work, our objective is to enhance the robustness of LC features against image rotation. To do this, we conduct a thorough experimental evaluation of three candidate anti-rotation strategies (in-model data augmentation, in-model feature augmentation, and post-model feature augmentation) over two kinds of rotation attack (dataset attack and query attack). In the training procedure, we implement a data augmentation protocol and a network augmentation method. In the test procedure, we develop a local transformed convolutional (LTC) feature extraction method and evaluate it over different network configurations. We end up with a series of good practices backed by solid quantitative evidence, which lead to the best strategy for computing LC features with high rotation invariance in image retrieval.
This paper proposes a salient chromagram, obtained by removing the local trend, to improve cover song identification accuracy. The proposed salient chromagram emphasizes the tonal content of music, which is well preserved between an original song and its cover version, while reducing the effects of timbre differences. We apply the proposed salient chromagram to sequence-alignment-based cover song identification. Experiments on two cover song datasets confirm that the proposed salient chromagram improves the cover song identification accuracy.
Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and modality-specific information, it is insufficient to embed related entities close to each other. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby providing conditional similarities between entities. Unlike methods based on cross-modal attention applied to words and cropped images, TOD-Net is a post-process applied to the embedding space learned by existing embedding systems and improves their retrieval performance. In particular, when combined with cutting-edge models, TOD-Net yields a state-of-the-art image-caption retrieval model on the MS COCO and Flickr30k datasets. Qualitative analysis reveals that TOD-Net successfully emphasizes entity-specific concepts and retrieves diverse targets by handling higher levels of diversity than existing models.
Shoko IMAIZUMI Yusuke IZAWA Ryoichi HIRASAWA Hitoshi KIYA
We propose a reversible data hiding (RDH) method for compressible encrypted images called encryption-then-compression (EtC) images. The proposed method allows us to not only embed a payload in encrypted images but also compress the encrypted images containing the payload. In addition, the proposed RDH method can be applied to both plain images and encrypted ones, and the payload can be extracted flexibly in the encrypted domain or from the decrypted images. Various RDH methods have been studied in the encrypted domain, but they are not considered to be two-domain data hiding, and the resultant images cannot be compressed by using image coding standards, such as JPEG-LS and JPEG 2000. In our experiment, the proposed method shows high performance in terms of lossless compression efficiency with JPEG-LS and JPEG 2000, data hiding capacity, and marked image quality.