Author Search Result

[Author] Yasuo ARIKI (10 hits)

1-10 of 10 hits
  • Discriminating Unknown Objects from Known Objects Using Image and Speech Information

    Yuko OZASA  Mikio NAKANO  Yasuo ARIKI  Naoto IWAHASHI  

     
    PAPER-Multimedia Pattern Processing

    Publicized: 2014/12/16
    Vol: E98-D No:3
    Page(s): 704-711

    This paper deals with the problem of a robot identifying an object that a human asks it to bring by voice, given a set of objects that both the human and the robot can see. When the robot knows the requested object, it must identify it; when it does not know the object, it must say so. This paper presents a new method for discriminating unknown objects from known objects using object images and human speech. It uses a confidence measure that integrates image recognition confidences and speech recognition confidences based on logistic regression.
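
    A minimal sketch of the confidence-integration idea, assuming the image and speech recognition confidences are already computed; the training data, feature layout, and decision threshold below are illustrative, not taken from the paper.

    # Integrating image and speech confidences with logistic regression
    # (illustrative sketch; data and threshold are hypothetical).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row pairs an image-recognition confidence with a speech-recognition
    # confidence; the label says whether the requested object was known.
    X_train = np.array([[0.9, 0.8], [0.7, 0.9], [0.2, 0.3], [0.1, 0.4]])
    y_train = np.array([1, 1, 0, 0])  # 1 = known object, 0 = unknown

    clf = LogisticRegression().fit(X_train, y_train)

    def is_known(image_conf, speech_conf, threshold=0.5):
        """Integrated confidence; below the threshold, treat the object as unknown."""
        p = clf.predict_proba([[image_conf, speech_conf]])[0, 1]
        return p >= threshold, p

    print(is_known(0.85, 0.75))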

  • Organization and Retrieval of Video Data

    Katsumi TANAKA  Yasuo ARIKI  Kuniaki UEHARA  

     
    REVIEW PAPER

    Vol: E82-D No:1
    Page(s): 34-44

    This paper focuses on the problem of how to organize and retrieve video data effectively. First, we identify several issues that must be solved. Next, we overview our current research results, together with a brief survey of the research area of video databases. In particular, we describe the following results on the organization and retrieval of video data, obtained under the Japanese Ministry of Education's Grant-in-Aid for Scientific Research on Priority Area "Advanced Databases": Instance-Based Video Annotation Models, Self-Organization of Video Data, and A Query Model for Fragmentally Indexed Video.

  • Graph Cuts Segmentation by Using Local Texture Features of Multiresolution Analysis

    Keita FUKUDA  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER-Image Recognition, Computer Vision

    Vol: E92-D No:7
    Page(s): 1453-1461

    This paper proposes an approach to image segmentation using Iterated Graph Cuts based on local texture features of wavelet coefficients. Using Haar wavelet-based multiresolution analysis, the low-frequency range (smoothed image) is used for the n-link, and the high-frequency range (local texture features) is used for the t-link along with the color histogram. The proposed method can segment an object region that has not only noisy edges and colors similar to the background but also heavy texture change. Experimental results illustrate the validity of our method.
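
    The feature extraction described above might be sketched as follows, using PyWavelets for one level of Haar analysis; the n-link weighting function and its parameter are illustrative assumptions, not the paper's exact formulation.

    # Haar multiresolution features: the low-frequency band stands in for the
    # smoothed image (n-link term) and the summed high-frequency bands for the
    # local texture (t-link term). Weight form and sigma are hypothetical.
    import numpy as np
    import pywt

    def haar_features(gray):
        """One level of Haar multiresolution analysis on a grayscale image."""
        cA, (cH, cV, cD) = pywt.dwt2(gray.astype(float), 'haar')
        smoothed = cA                                   # low-frequency range
        texture = np.abs(cH) + np.abs(cV) + np.abs(cD)  # local texture features
        return smoothed, texture

    def nlink_weight(smoothed, p, q, sigma=10.0):
        """Boundary (n-link) weight between neighbouring coefficients p and q."""
        diff = smoothed[p] - smoothed[q]
        return np.exp(-(diff ** 2) / (2 * sigma ** 2))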

  • Acoustic Model Adaptation Using First-Order Linear Prediction for Reverberant Speech

    Tetsuya TAKIGUCHI  Masafumi NISHIMURA  Yasuo ARIKI  

     
    PAPER-Speech Recognition

    Vol: E89-D No:3
    Page(s): 908-914

    This paper describes a hands-free speech recognition technique based on acoustic model adaptation to reverberant speech. In hands-free speech recognition, the recognition accuracy is degraded by reverberation, since each segment of speech is affected by the reflection energy of the preceding segment. To compensate for the reflection signal, we introduce a frame-by-frame adaptation method that adds the reflection signal to the means of the acoustic model. The reflection signal is approximated by a first-order linear prediction from the observation signal at the preceding frame, and the linear prediction coefficient is estimated with a maximum likelihood method using the EM algorithm, which maximizes the likelihood of the adaptation data. Its effectiveness is confirmed by word recognition experiments on reverberant speech.
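
    The adaptation step could be sketched as below, with the EM estimation of the prediction coefficient omitted and the coefficient treated as given; the array shapes are assumptions for illustration.

    # Frame-by-frame mean adaptation: the reflection at frame t is approximated
    # as alpha * x[t-1] and added to the acoustic-model means before scoring.
    # Estimating alpha by maximum likelihood (EM) is omitted here.
    import numpy as np

    def adapt_means(means, observations, alpha):
        """means:        (n_dims,) clean-speech mean vector of one Gaussian
        observations: (n_frames, n_dims) observed reverberant features
        alpha:        first-order linear-prediction coefficient (given)"""
        adapted = np.tile(means, (len(observations), 1))
        adapted[1:] += alpha * observations[:-1]  # reflection of preceding frame
        return adapted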

  • Voice Conversion Based on Speaker-Dependent Restricted Boltzmann Machines

    Toru NAKASHIKA  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER-Voice Conversion and Speech Enhancement

    Vol: E97-D No:6
    Page(s): 1403-1410

    This paper presents a voice conversion technique using speaker-dependent Restricted Boltzmann Machines (RBM) to build high-order eigen spaces of the source/target speakers, where it is easier to convert the source speech to the target speech than in the traditional cepstrum space. We build a deep conversion architecture that concatenates the two speaker-dependent RBMs with neural networks, expecting that they automatically discover abstractions to express the original input features. Under this concept, if we train an RBM using only the speech of an individual speaker, which includes various phonemes while keeping the speaker individuality unchanged, the output features of the hidden layer can be considered to contain fewer phonemes and relatively more speaker individuality than the original acoustic features. Training the RBMs for a source speaker and a target speaker, we can then connect and convert the speaker-individuality abstractions using Neural Networks (NN). The converted abstraction of the source speaker is then back-propagated into the acoustic space (e.g., MFCC) using the RBM of the target speaker. We conducted speaker-voice conversion experiments and confirmed the efficacy of our method with respect to subjective and objective criteria, comparing it with the conventional Gaussian Mixture Model-based method and an ordinary NN.
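
    The conversion path might be summarized by the forward pass below; the weights are random placeholders standing in for RBM pre-training and back-propagation, and the layer sizes are assumptions.

    # Forward pass of the deep conversion architecture: source RBM up-pass,
    # NN mapping between speaker abstractions, target RBM down-pass.
    # Weights here are placeholders; real ones come from training.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    n_feat, n_hid = 24, 64                               # e.g. cepstral dim, hidden units
    W_src = rng.normal(scale=0.1, size=(n_feat, n_hid))  # source-speaker RBM
    W_tgt = rng.normal(scale=0.1, size=(n_feat, n_hid))  # target-speaker RBM
    W_nn = rng.normal(scale=0.1, size=(n_hid, n_hid))    # connecting NN

    def convert(x_src):
        h_src = sigmoid(x_src @ W_src)  # source abstraction (up-pass)
        h_tgt = sigmoid(h_src @ W_nn)   # convert speaker individuality
        return h_tgt @ W_tgt.T          # back into the acoustic space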

  • Noise-Robust Voice Conversion Based on Sparse Spectral Mapping Using Non-negative Matrix Factorization

    Ryo AIHARA  Ryoichi TAKASHIMA  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER-Voice Conversion and Speech Enhancement

    Vol: E97-D No:6
    Page(s): 1411-1418

    This paper presents a voice conversion (VC) technique for noisy environments based on a sparse representation of speech. Sparse-representation-based VC using non-negative matrix factorization (NMF) is employed for noise-added spectral conversion between different speakers. In our previous exemplar-based VC method, source exemplars and target exemplars are extracted from parallel training data, having the same texts uttered by the source and target speakers. The input source signal is represented using the source exemplars and their weights. Then, the converted speech is constructed from the target exemplars and the weights related to the source exemplars. However, this exemplar-based approach needs to hold all training exemplars (frames), and it requires high computation times to obtain the weights of the source exemplars. In this paper, we propose a framework to train the basis matrices of the source and target exemplars so that they have a common weight matrix. By using the basis matrices instead of the exemplars, the VC is performed with lower computation times than with the exemplar-based method. The effectiveness of this method was confirmed in speaker conversion experiments using noise-added speech data, comparing it with an exemplar-based method and a conventional Gaussian mixture model (GMM)-based method.
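
    Conversion with pre-trained basis matrices might look like the sketch below, using the standard Euclidean multiplicative NMF update for the shared activities; basis training on parallel data is omitted, and the update variant is an assumption (the paper may use a divergence-based one).

    # Estimate a common activity matrix H from the source spectrogram against
    # the source basis, then apply the target basis to the same activities.
    import numpy as np

    def estimate_activities(V, B, n_iter=200, eps=1e-9):
        """Solve V ~= B @ H for H >= 0 with the basis B held fixed."""
        H = np.abs(np.random.default_rng(0).normal(size=(B.shape[1], V.shape[1])))
        for _ in range(n_iter):
            H *= (B.T @ V) / (B.T @ B @ H + eps)  # multiplicative update
        return H

    def convert(V_src, B_src, B_tgt):
        H = estimate_activities(V_src, B_src)  # common weight matrix
        return B_tgt @ H                       # converted spectrogram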

  • A Low-Power Real-Time SIFT Descriptor Generation Engine for Full-HDTV Video Recognition

    Kosuke MIZUNO  Hiroki NOGUCHI  Guangji HE  Yosuke TERACHI  Tetsuya KAMINO  Tsuyoshi FUJINAGA  Shintaro IZUMI  Yasuo ARIKI  Hiroshi KAWAGUCHI  Masahiko YOSHIMOTO  

     
    PAPER

    Vol: E94-C No:4
    Page(s): 448-457

    This paper describes a SIFT (Scale Invariant Feature Transform) descriptor generation engine that features a VLSI-oriented SIFT algorithm, a three-stage pipelined architecture, and novel systolic array architectures for Gaussian filtering and key-point extraction. An ROI-based scheme has been employed for the VLSI-oriented algorithm. The novel systolic array architecture drastically reduces the number of operation cycles and memory accesses. The cycle count of the Gaussian filtering module is reduced by 82% compared with the SIMD architecture. The numbers of memory accesses of the Gaussian filtering module and the key-point extraction module are reduced by 99.8% and 66%, respectively, compared with the results obtained assuming the SIMD architecture. The proposed schemes provide processing capability for HDTV-resolution video (1920 × 1080 pixels) at 30 frames per second (fps). The test chip has been fabricated in 65 nm CMOS technology and occupies 4.2 × 4.2 mm², containing 1.1 M gates and 1.38 Mbit of on-chip memory. The measured data demonstrate 38.2 mW power consumption at 78 MHz and 1.2 V.
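
    A rough software analogue of the first two pipeline stages (Gaussian filtering and key-point extraction on an ROI) is sketched below; it only illustrates the data flow, since the paper's contribution is the systolic-array hardware realization, and the scales and threshold are illustrative.

    # Stage 1: Gaussian filtering at several scales; stage 2: key-point
    # extraction as difference-of-Gaussians extrema. Descriptor generation
    # (stage 3) is omitted for brevity.
    import numpy as np
    from scipy.ndimage import gaussian_filter, maximum_filter

    def sift_stages_roi(roi, sigmas=(1.6, 3.2, 6.4), thresh=0.03):
        blurred = [gaussian_filter(roi.astype(float), s) for s in sigmas]
        dog = [b2 - b1 for b1, b2 in zip(blurred, blurred[1:])]
        keypoints = []
        for d in dog:
            peaks = (d == maximum_filter(d, size=3)) & (np.abs(d) > thresh)
            keypoints.extend(zip(*np.nonzero(peaks)))
        return keypoints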

  • Language Modeling Using PLSA-Based Topic HMM

    Atsushi SAKO  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER-Language Modeling

    Vol: E91-D No:3
    Page(s): 522-528

    In this paper, we propose a PLSA-based language model for sports-related live speech. This model is implemented using a unigram rescaling technique that combines a topic model and an n-gram. In the conventional method, unigram rescaling is performed with a topic distribution estimated from a recognized transcription history. This method can improve the performance, but it cannot express topic transition. By incorporating the concept of topic transition, it is expected that the recognition performance will be improved. Thus, the proposed method employs a "Topic HMM" instead of a history to estimate the topic distribution. The Topic HMM is an Ergodic HMM that expresses typical topic distributions as well as topic transition probabilities. Word accuracy results from our experiments confirmed the superiority of the proposed method over a trigram and a PLSA-based conventional method that uses a recognized history.
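
    The unigram-rescaling step with a Topic-HMM-derived topic distribution could be sketched as follows; the array layout is an assumption, and the topic posterior is taken as given (in the paper it would come from the Topic HMM).

    # Rescale n-gram probabilities by the ratio of the topic-mixture unigram
    # to the corpus unigram, then renormalize over the vocabulary.
    import numpy as np

    def unigram_rescale(p_ngram, p_w_given_topic, p_unigram, topic_posterior):
        """p_ngram:         (V,)   n-gram probabilities over the vocabulary
        p_w_given_topic: (T, V) per-topic unigram probabilities
        p_unigram:       (V,)   corpus unigram probabilities
        topic_posterior: (T,)   topic distribution from the Topic HMM"""
        p_topic = topic_posterior @ p_w_given_topic  # mixture unigram P(w | topics)
        scaled = p_ngram * (p_topic / p_unigram)
        return scaled / scaled.sum()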

  • Exemplar-Based Voice Conversion Using Sparse Representation in Noisy Environments

    Ryoichi TAKASHIMA  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER

    Vol: E96-A No:10
    Page(s): 1946-1953

    This paper presents a voice conversion (VC) technique for noisy environments, where parallel exemplars are introduced to encode the source speech signal and synthesize the target speech signal. The parallel exemplars (dictionary) consist of source exemplars and target exemplars, having the same texts uttered by the source and target speakers. The input source signal is decomposed into the source exemplars, noise exemplars, and their weights (activities). Then, by using the weights of the source exemplars, the converted signal is constructed from the target exemplars. We carried out speaker conversion tasks using clean speech data and noise-added speech data. The effectiveness of this method was confirmed by comparing it with a conventional Gaussian Mixture Model (GMM)-based method.
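
    The decomposition-and-synthesis step might be sketched per frame as below; the dictionaries are placeholders, and the simple Euclidean update stands in for the divergence-based one commonly used in exemplar-based VC.

    # Represent a noisy source frame over stacked speech and noise exemplars,
    # keep only the speech activities, and synthesize with the target exemplars.
    import numpy as np

    def convert_frame(v, A_src, A_noise, A_tgt, n_iter=200, eps=1e-9):
        A = np.hstack([A_src, A_noise])        # combined parallel + noise dictionary
        h = np.abs(np.random.default_rng(0).normal(size=A.shape[1]))
        for _ in range(n_iter):
            h *= (A.T @ v) / (A.T @ (A @ h) + eps)  # non-negative activities
        h_speech = h[:A_src.shape[1]]          # drop the noise activities
        return A_tgt @ h_speech                # synthesize from target exemplars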

  • LLC Revisit: Scene Classification with k-Farthest Neighbours

    Katsuyuki TANAKA  Tetsuya TAKIGUCHI  Yasuo ARIKI  

     
    PAPER-Image Recognition, Computer Vision

    Publicized: 2016/02/08
    Vol: E99-D No:5
    Page(s): 1375-1383

    This paper introduces a simple but effective way to boost the performance of scene classification through a novel approach to the LLC coding process. In our proposed method, a local descriptor is encoded not only with its k-nearest visual words but also with its k-farthest visual words to produce a more discriminative code. Since the proposed method is a simple modification of the image classification model, it can be easily integrated into existing BoF models proposed in various areas, such as coding and pooling, to boost their scene classification performance. The results of experiments conducted with three scene datasets (15-Scenes, MIT-Indoor67, and Sun367) show that adding k-farthest visual words enhances scene classification performance better than increasing the number of k-nearest visual words.
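
    One plausible reading of the coding step is sketched below: a descriptor is encoded over both its k-nearest and k-farthest codewords using the standard closed-form LLC solution; the exact treatment of the farthest codes in the paper may differ, and the parameter values are illustrative.

    # LLC coding over the union of k-nearest and k-farthest visual words,
    # solving the locally-constrained least squares with a sum-to-one constraint.
    import numpy as np

    def llc_code(x, codebook, k_near=5, k_far=5, reg=1e-4):
        d = np.linalg.norm(codebook - x, axis=1)
        order = np.argsort(d)
        idx = np.concatenate([order[:k_near], order[-k_far:]])
        B = codebook[idx] - x                     # shifted local bases
        C = B @ B.T + reg * np.eye(len(idx))      # regularized local covariance
        w = np.linalg.solve(C, np.ones(len(idx)))
        w /= w.sum()                              # enforce sum(w) = 1
        code = np.zeros(len(codebook))
        code[idx] = w
        return code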