The search functionality is under construction.

Keyword Search Result

[Keyword] similarity search(14hit)

1-14hit
  • Continuous Similarity Search for Dynamic Text Streams

    Yuma TSUCHIDA  Kohei KUBO  Hisashi KOGA  

     
    PAPER-Data Engineering, Web Information Systems

      Pubricized:
    2023/09/21
      Vol:
    E106-D No:12
      Page(s):
    2026-2035

    Similarity search for data streams has attracted much attention for information recommendation. In this context, recent leading works regard the latest W items in a data stream as an evolving set and reduce similarity search for data streams to set similarity search. Whereas they consider standard sets composed of items, this paper uniquely studies similarity search for text streams and treats evolving sets whose elements are texts. Specifically, we formulate a new continuous range search problem named the CTS problem (Continuous similarity search for Text Sets). The task of the CTS problem is to find all the text streams from the database whose similarity to the query becomes larger than a threshold ε. It abstracts a scenario in which a user-based recommendation system searches similar users from social networking services. The CTS is important because it allows both the query and the database to change dynamically. We develop a fast pruning-based algorithm for the CTS. Moreover, we discuss how to speed up it with the inverted index.

  • Exact Algorithm to Solve Continuous Similarity Search for Evolving Queries and Its Variant

    Tomohiro YAMAZAKI  Hisashi KOGA  

     
    PAPER

      Pubricized:
    2022/02/07
      Vol:
    E105-D No:5
      Page(s):
    898-908

    We study the continuous similarity search problem for evolving queries which has recently been formulated. Given a data stream and a database composed of n sets of items, the purpose of this problem is to maintain the top-k most similar sets to the query which evolves over time and consists of the latest W items in the data stream. For this problem, the previous exact algorithm adopts a pruning strategy which, at the present time T, decides the candidates of the top-k most similar sets from past similarity values and computes the similarity values only for them. This paper proposes a new exact algorithm which shortens the execution time by computing the similarity values only for sets whose similarity values at T can change from time T-1. We identify such sets very fast with frequency-based inverted lists (FIL). Moreover, we derive the similarity values at T in O(1) time by updating the previous values computed at time T-1. Experimentally, our exact algorithm runs faster than the previous exact algorithm by one order of magnitude and as fast as the previous approximation algorithm.

  • Similarity Search in InterPlanetary File System with the Aid of Locality Sensitive Hash

    Satoshi FUJITA  

     
    PAPER-Information Network

      Pubricized:
    2021/07/08
      Vol:
    E104-D No:10
      Page(s):
    1616-1623

    To realize an information-centric networking, IPFS (InterPlanetary File System) generates a unique ContentID for each content by applying a cryptographic hash to the content itself. Although it could improve the security against attacks such as falsification, it makes difficult to realize a similarity search in the framework of IPFS, since the similarity of contents is not reflected in the proximity of ContentIDs. To overcome this issue, we propose a method to apply a locality sensitive hash (LSH) to feature vectors extracted from contents as the key of indexes stored in IPFS. By conducting experiments with 10,000 random points corresponding to stored contents, we found that more than half of randomly given queries return a non-empty result for the similarity search, and yield an accurate result which is outside the σ confidence interval of an ordinary flooding-based method. Note that such a collection of random points corresponds to the worst case scenario for the proposed scheme since the performance of similarity search could improve when points and queries follow an uneven distribution.

  • Multi-Feature Sensor Similarity Search for the Internet of Things

    Suyan LIU  Yuanan LIU  Fan WU  Puning ZHANG  

     
    PAPER-Network

      Pubricized:
    2017/12/08
      Vol:
    E101-B No:6
      Page(s):
    1388-1397

    The tens of billions of devices expected to be connected to the Internet will include so many sensors that the demand for sensor-based services is rising. The task of effectively utilizing the enormous numbers of sensors deployed is daunting. The need for automatic sensor identification has expanded the need for research on sensor similarity searches. The Internet of Things (IoT) features massive non-textual dynamic data, which is raising the critical challenge of efficiently and effectively searching for and selecting the sensors most related to a need. Unfortunately, single-attribute similarity searches are highly inaccurate when searching among similar attribute values. In this paper, we propose a group-fitting correlation calculation algorithm (GFC) that can identify the most similar clusters of sensors. The GFC method considers multiple attributes (e.g., humidity, temperature) to calculate sensor similarity; thus, it performs more accurate searches than do existing solutions.

  • Efficient Methods for Aggregate Reverse Rank Queries

    Yuyang DONG  Hanxiong CHEN  Kazutaka FURUSE  Hiroyuki KITAGAWA  

     
    PAPER

      Pubricized:
    2018/01/18
      Vol:
    E101-D No:4
      Page(s):
    1012-1020

    Given two data sets of user preferences and product attributes in addition to a set of query products, the aggregate reverse rank (ARR) query returns top-k users who regard the given query products as the highest aggregate rank than other users. ARR queries are designed to focus on product bundling in marketing. Manufacturers are mostly willing to bundle several products together for the purpose of maximizing benefits or inventory liquidation. This naturally leads to an increase in data on users and products. Thus, the problem of efficiently processing ARR queries become a big issue. In this paper, we reveal two limitations of the state-of-the-art solution to ARR query; that is, (a) It has poor efficiency when the distribution of the query set is dispersive. (b) It has to process a large portion user data. To address these limitations, we develop a cluster-and-process method and a sophisticated indexing strategy. From the theoretical analysis of the results and experimental comparisons, we conclude that our proposals have superior performance.

  • Pivot Generation Algorithm with a Complete Binary Tree for Efficient Exact Similarity Search

    Yuki YAMAGISHI  Kazuo AOYAMA  Kazumi SAITO  Tetsuo IKEDA  

     
    PAPER-Data Engineering, Web Information Systems

      Pubricized:
    2017/10/20
      Vol:
    E101-D No:1
      Page(s):
    142-151

    This paper presents a pivot-set generation algorithm for accelerating exact similarity search in a large-scale data set. To deal with the large-scale data set, it is important to efficiently construct a search index offline as well as to perform fast exact similarity search online. Our proposed algorithm efficiently generates competent pivots with two novel techniques: hierarchical data partitioning and fast pivot optimization techniques. To make effective use of a small number of pivots, the former recursively partitions a data set into two subsets with the same size depending on the rank order from each of two assigned pivots, resulting in a complete binary tree. The latter calculates a defined objective function for pivot optimization with a low computational cost by skillfully operating data objects mapped into a pivot space. Since the generated pivots provide the tight lower bounds on distances between a query object and the data objects, an exact similarity search algorithm effectively avoids unnecessary distance calculations. We demonstrate that the search algorithm using the pivots generated by the proposed algorithm reduces distance calculations with an extremely high rate regarding a range query problem for real large-scale image data sets.

  • Efficient Similarity Search with a Pivot-Based Complete Binary Tree

    Yuki YAMAGISHI  Kazuo AOYAMA  Kazumi SAITO  Tetsuo IKEDA  

     
    PAPER-Data Engineering, Web Information Systems

      Pubricized:
    2017/07/04
      Vol:
    E100-D No:10
      Page(s):
    2526-2536

    This paper presents an efficient similarity search method utilizing as an index a complete binary tree (CBT) based on optimized pivots for a large-scale and high-dimensional data set. A similarity search method, in general, requires high-speed performance on both index construction off-line and similarity search itself online. To fulfill the requirement, we introduce novel techniques into an index construction and a similarity search algorithm in the proposed method for a range query. The index construction algorithm recursively employs the following two main functions, resulting in a CBT index. One is a pivot generation function that obtains one effective pivot at each node by efficiently maximizing a defined objective function. The other is a node bisection function that partitions a set of objects at a node into two almost equal-sized subsets based on the optimized pivot. The similarity search algorithm employs a three-stage process that narrows down candidate objects within a given range by pruning unnecessary branches and filtering objects in each stage. Experimental results on one million real image data set with high dimensionality demonstrate that the proposed method finds an exact solution for a range query at around one-quarter to half of the computational cost of one of the state-of-the-art methods, by using a CBT index constructed off-line at a reasonable computational cost.

  • Flexible and Fast Similarity Search for Enriched Trajectories

    Hideaki OHASHI  Toshiyuki SHIMIZU  Masatoshi YOSHIKAWA  

     
    PAPER-Data Engineering, Web Information Systems

      Pubricized:
    2017/05/30
      Vol:
    E100-D No:9
      Page(s):
    2081-2091

    In this study, we focus on a method to search for similar trajectories. In the majority of previous works on searching for similar trajectories, only raw trajectory data were used. However, to obtain deeper insights, additional time-dependent trajectory features should be utilized depending on the search intent. For instance, to identify similar combination plays in soccer games, such additional features include the movements of the team players. In this paper, we develop a framework to flexibly search for similar trajectories associated with time-dependent features, which we call enriched trajectories. In this framework, weights, which represent the relative importance of each feature, can be flexibly given by users. Moreover, to facilitate fast searching, we first propose a lower bounding measure of the DTW distance between enriched trajectories, and then we propose algorithms based on this lower bounding measure. We evaluate the effectiveness of the lower bounding measure and compare the performances of the algorithms under various conditions using soccer data and synthetic data. Our experimental results suggest that the proposed lower bounding measure is superior to the existing measure, and one of the proposed algorithms, which is based on the threshold algorithm, is suitable for practical use.

  • A Loitering Discovery System Using Efficient Similarity Search Based on Similarity Hierarchy

    Jianquan LIU  Shoji NISHIMURA  Takuya ARAKI  Yuichi NAKAMURA  

     
    INVITED PAPER

      Vol:
    E100-A No:2
      Page(s):
    367-375

    Similarity search is an important and fundamental problem, and thus widely used in various fields of computer science including multimedia, computer vision, database, information retrieval, etc. Recently, since loitering behavior often leads to abnormal situations, such as pickpocketing and terrorist attacks, its analysis attracts increasing attention from research communities. In this paper, we present AntiLoiter, a loitering discovery system adopting efficient similarity search on surveillance videos. As we know, most of existing systems for loitering analysis, mainly focus on how to detect or identify loiterers by behavior tracking techniques. However, the difficulties of tracking-based methods are known as that their analysis results are heavily influenced by occlusions, overlaps, and shadows. Moreover, tracking-based methods need to track the human appearance continuously. Therefore, existing methods are not readily applied to real-world surveillance cameras due to the appearance discontinuity of criminal loiterers. To solve this problem, we abandon the tracking method, instead, propose AntiLoiter to efficiently discover loiterers based on their frequent appearance patterns in longtime multiple surveillance videos. In AntiLoiter, we propose a novel data structure Luigi that indexes data using only similarity value returned by a corresponding function (e.g., face matching). Luigi is adopted to perform efficient similarity search to realize loitering discovery. We conducted extensive experiments on both synthetic and real surveillance videos to evaluate the efficiency and efficacy of our approach. The experimental results show that our system can find out loitering candidates correctly and outperforms existing method by 100 times in terms of runtime.

  • Multiple Binary Codes for Fast Approximate Similarity Search

    Shinichi SHIRAKAWA  

     
    PAPER-Pattern Recognition

      Pubricized:
    2014/12/11
      Vol:
    E98-D No:3
      Page(s):
    671-680

    One of the fast approximate similarity search techniques is a binary hashing method that transforms a real-valued vector into a binary code. The similarity between two binary codes is measured by their Hamming distance. In this method, a hash table is often used when undertaking a constant-time similarity search. The number of accesses to the hash table, however, increases when the number of bits lengthens. In this paper, we consider a method that does not access data with a long Hamming radius by using multiple binary codes. Further, we attempt to integrate the proposed approach and the existing multi-index hashing (MIH) method to accelerate the performance of the similarity search in the Hamming space. Then, we propose a learning method of the binary hash functions for multiple binary codes. We conduct an experiment on similarity search utilizing a dataset of up to 50 million items and show that our proposed method achieves a faster similarity search than that possible with the conventional linear scan and hash table search.

  • Textual Approximation Methods for Time Series Classification: TAX and l-TAX Open Access

    Abdulla Al MARUF  Hung-Hsuan HUANG  Kyoji KAWAGOE  

     
    PAPER

      Vol:
    E97-D No:4
      Page(s):
    798-810

    A lot of work has been conducted on time series classification and similarity search over the past decades. However, the classification of a time series with high accuracy is still insufficient in applications such as ubiquitous or sensor systems. In this paper, a novel textual approximation of a time series, called TAX, is proposed to achieve high accuracy time series classification. l-TAX, an extended version of TAX that shows promising classification accuracy over TAX and other existing methods, is also proposed. We also provide a comprehensive comparison between TAX and l-TAX, and discuss the benefits of both methods. Both TAX and l-TAX transform a time series into a textual structure using existing document retrieval methods and bioinformatics algorithms. In TAX, a time series is represented as a document like structure, whereas l-TAX used a sequence of textual symbols. This paper provides a comprehensive overview of the textual approximation and techniques used by TAX and l-TAX

  • Optimal Pivot Selection Method Based on the Partition and the Pruning Effect for Metric Space Indexes

    Hisashi KURASAWA  Daiji FUKAGAWA  Atsuhiro TAKASU  Jun ADACHI  

     
    PAPER

      Vol:
    E94-D No:3
      Page(s):
    504-514

    This paper proposes a new method to reduce the cost of nearest neighbor searches in metric spaces. Many similarity search indexes recursively divide a region into subregions by using pivots, and construct a tree-structured index. Most of recently developed indexes focus on pruning objects and do not pay much attention to the tree balancing. As a result, indexes having imbalanced tree-structure may be constructed and the search cost is degraded. We propose a similarity search index called the Partitioning Capacity (PC) Tree. It selects the optimal pivot in terms of the PC that quantifies the balance of the regions partitioned by a pivot as well as the estimated effectiveness of the search pruning by the pivot. As a result, PCTree reduces the search cost for various data distributions. We experimentally compared PCTree with four indexes using synthetic data and five real datasets. The experimental results shows that the PCTree successfully reduces the search cost.

  • Margin-Based Pivot Selection for Similarity Search Indexes

    Hisashi KURASAWA  Daiji FUKAGAWA  Atsuhiro TAKASU  Jun ADACHI  

     
    PAPER-Multimedia Databases

      Vol:
    E93-D No:6
      Page(s):
    1422-1432

    When developing an index for a similarity search in metric spaces, how to divide the space for effective search pruning is a fundamental issue. We present Maximal Metric Margin Partitioning (MMMP), a partitioning scheme for similarity search indexes. MMMP divides the data based on its distribution pattern, especially for the boundaries of clusters. A partitioning boundary created by MMMP is likely to be located in a sparse area between clusters. Moreover, the partitioning boundary is at maximum distances from the two cluster edges. We also present an indexing scheme, named the MMMP-Index, which uses MMMP and pivot filtering. The MMMP-Index can prune many objects that are not relevant to a query, and it reduces the query execution cost. Our experimental results show that MMMP effectively indexes clustered data and reduces the search cost. For clustered data in a vector space, the MMMP-Index reduces the computational cost to less than two thirds that of comparable schemes.

  • Physical Database Design for Efficient Time-Series Similarity Search

    Sang-Wook KIM  Jinho KIM  Sanghyun PARK  

     
    LETTER-Multimedia Systems for Communications

      Vol:
    E91-B No:4
      Page(s):
    1251-1254

    Similarity search in time-series databases finds such data sequences whose changing patterns are similar to that of a query sequence. For efficient processing, it normally employs a multi-dimensional index. In order to alleviate the well-known dimensionality curse, the previous methods for similarity search apply the Discrete Fourier Transform (DFT) to data sequences, and take only the first two or three DFT coefficients as organizing attributes. Other than this ad-hoc approach, there have been no research efforts on devising a systematic guideline for choosing the best organizing attributes. This paper first points out the problems occurring in the previous methods, and proposes a novel solution to construct optimal multi-dimensional indexes. The proposed method analyzes the characteristics of a target time-series database, and identifies the organizing attributes having the best discrimination power. It also determines the optimal number of organizing attributes for efficient similarity search by using a cost model. Through a series of experiments, we show that the proposed method outperforms the previous ones significantly.