
Author Search Result

[Author] Jien KATO (13 hits)

  • Learning Local Similarity with Spatial Interrelations on Content-Based Image Retrieval

    Longjiao ZHAO  Yu WANG  Jien KATO  Yoshiharu ISHIKAWA  

     
    PAPER-Image Processing and Video Processing

    Publicized: 2023/02/14
    Vol: E106-D No:5
    Page(s): 1069-1080

    Convolutional Neural Networks (CNNs) have recently demonstrated outstanding performance in image retrieval tasks. Local convolutional features extracted by CNNs, in particular, show exceptional discriminative capability. Recent research in this field has concentrated on pooling methods that aggregate local features into global features and assess the global similarity of two images. However, pooling methods sacrifice the image's local region information and spatial relationships, which are precisely the keys to robustness against occlusion and viewpoint changes. In this paper, instead of pooling methods, we propose an alternative method based on local similarity, determined directly from local convolutional features. Specifically, we first define three forms of local similarity tensors (LSTs), which take into account information about local regions as well as the spatial relationships between them. We then construct a similarity CNN model (SCNN) based on LSTs to assess the similarity between query and gallery images. The ideal configuration of our method is sought through thorough experiments from three perspectives: local region size, local region content, and spatial relationships between local regions. The experimental results on a modified open dataset (where query images are limited to occluded ones) confirm that the proposed method outperforms pooling methods owing to its enhanced robustness. Furthermore, testing on three public retrieval datasets shows that combining LSTs with conventional pooling methods achieves the best results.
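
    The idea of a local similarity tensor can be illustrated with a minimal sketch. All shapes and the cosine-similarity form below are illustrative assumptions, not the paper's exact definitions: every pair of local regions from the query and gallery feature maps is compared while the spatial layout is preserved.

```python
# Hypothetical sketch: one simple form of local similarity tensor, built as
# pairwise cosine similarities between the spatial grids of two feature maps.
import numpy as np

def local_similarity_tensor(fq, fg, eps=1e-8):
    """fq, fg: (H, W, C) local convolutional feature maps of query/gallery."""
    q = fq.reshape(-1, fq.shape[-1])                    # (H*W, C)
    g = fg.reshape(-1, fg.shape[-1])
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + eps)
    g = g / (np.linalg.norm(g, axis=1, keepdims=True) + eps)
    return q @ g.T                                      # (H*W, H*W) similarities

lst = local_similarity_tensor(np.random.rand(7, 7, 512),
                              np.random.rand(7, 7, 512))
print(lst.shape)  # (49, 49): every local region pair, spatial order preserved
```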

  • Social Relation Atmosphere Recognition with Relevant Visual Concepts

    Ying JI  Yu WANG  Kensaku MORI  Jien KATO  

     
    PAPER

    Publicized: 2023/06/02
    Vol: E106-D No:10
    Page(s): 1638-1649

    Social relationships (e.g., couples, opponents) are a foundational part of society. The social relation atmosphere describes the overall interaction environment between social relationships. Discovering the social relation atmosphere can help machines better comprehend human behaviors and improve the performance of social intelligence applications. Most existing research focuses on investigating social relationships while ignoring the social relation atmosphere, which, due to the complexity of expressions in video data and its inherent uncertainty, is difficult even to define and evaluate. In this paper, we innovatively analyze the social relation atmosphere in video data. We introduce Relevant Visual Concepts (RVCs) from the social relationship recognition task to facilitate social relation atmosphere recognition, because social relationships contain useful information about human interactions and surrounding environments, which are crucial clues for social relation atmosphere recognition. Our approach consists of two main steps: (1) we first generate a group of visual concepts that preserve the inherent social relationship information by utilizing a 3D explanation module; (2) the extracted relevant visual concepts are then used to supplement social relation atmosphere recognition. In addition, we present a new dataset based on the existing Video Social Relation Dataset, in which each video is annotated with four kinds of social relation atmosphere attributes and one social relationship. We evaluate the proposed method on this dataset. Experiments with various 3D ConvNets and fusion methods demonstrate that the proposed method effectively improves recognition accuracy compared to end-to-end ConvNets. The visualization results also indicate that essential information in social relationships can be discovered and used to enhance social relation atmosphere recognition.
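
    One way the RVC features could supplement the atmosphere classifier is simple late fusion, sketched below. The dimensions, module name, and concatenation scheme are assumptions for illustration; the paper compares several fusion methods and this is only one plausible instance.

```python
# Hypothetical sketch: fuse 3D-ConvNet video features with RVC features by
# concatenation before the atmosphere classification layer.
import torch
import torch.nn as nn

class AtmosphereHead(nn.Module):
    def __init__(self, video_dim=512, rvc_dim=128, n_classes=4):
        super().__init__()
        self.fc = nn.Linear(video_dim + rvc_dim, n_classes)

    def forward(self, video_feat, rvc_feat):
        # concatenate video features with relevant-visual-concept features
        return self.fc(torch.cat([video_feat, rvc_feat], dim=1))

head = AtmosphereHead()
logits = head(torch.randn(8, 512), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 4]): four atmosphere attributes assumed
```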

  • Adaptive Metric Learning for People Re-Identification

    Guanwen ZHANG  Jien KATO  Yu WANG  Kenji MASE  

     
    PAPER-Image Processing and Video Processing

    Vol: E97-D No:11
    Page(s): 2888-2902

    Two intrinsic issues exist in multiple-shot person re-identification: (1) large differences in camera view, illumination, and non-rigid deformation of posture make the intra-class variance even larger than the inter-class variance; (2) only a small amount of training data is available for learning tasks in a realistic re-identification scenario. In our previous work, we proposed a local distance comparison framework to deal with the first issue. In this paper, to deal with the second issue (i.e., to derive a reliable distance metric from limited training data), we propose an adaptive learning method that learns an adaptive distance metric, integrating prior knowledge learned from a large existing auxiliary dataset with task-specific information extracted from a much smaller training dataset. Experimental results on several public benchmark datasets show that, combined with the local distance comparison framework, our adaptive learning method is superior to conventional approaches.
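
    The prior-plus-task-specific combination can be sketched as follows. The shrinkage form, the KISSME-style task metric, and all names are assumptions standing in for the paper's actual adaptive learning method.

```python
# Hypothetical sketch: an adaptive Mahalanobis metric that shrinks a
# task-specific estimate toward a prior metric learned on a large auxiliary
# dataset; alpha trades prior knowledge against the small training set.
import numpy as np

def adaptive_metric(M_prior, X_same, X_diff, alpha=0.7):
    """M_prior: (D, D) metric from auxiliary data.
    X_same/X_diff: (N, D) difference vectors of same/different-identity pairs."""
    S_same = np.cov(X_same, rowvar=False)
    S_diff = np.cov(X_diff, rowvar=False)
    eye = 1e-3 * np.eye(S_same.shape[0])            # ridge for stability
    M_task = np.linalg.inv(S_same + eye) - np.linalg.inv(S_diff + eye)
    return alpha * M_prior + (1 - alpha) * M_task

rng = np.random.default_rng(0)
M = adaptive_metric(np.eye(4), rng.standard_normal((50, 4)),
                    rng.standard_normal((60, 4)))
print(M.shape)  # (4, 4)
```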

  • A Highway Surveillance System Using an HMM-Based Segmentation Method

    Jien KATO  Toyohide WATANABE  Hiroyuki HASE  

     
    PAPER

    Vol: E85-D No:11
    Page(s): 1767-1775

    Automatic traffic surveillance based on visual tracking techniques has been desired for many years. This paper proposes a basic highway surveillance system using an HMM-based segmentation method. The presented system meets the essential requirement of ITS: real-time operation. Another advantage is its robustness to the shadows of moving objects, which have been recognized as one of the main obstacles to robust car tracking. Using the system, we can currently estimate vehicle velocity with high accuracy. To acquire metric information in the real world, the system does not require precise calibration; it only needs four point correspondences between the image plane and the ground plane.
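
    Recovering a metric ground plane from four point correspondences amounts to estimating a homography, which a short sketch can make concrete. The coordinates below are invented for illustration; the DLT construction itself is standard.

```python
# Sketch: image-to-ground homography from four point correspondences, which
# is all the calibration needed to turn pixel displacements into metres.
import numpy as np

def homography_4pt(src, dst):
    """src, dst: (4, 2) corresponding points on image plane / ground plane."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 3)          # null-space vector = homography

def to_ground(H, pt):
    p = H @ np.array([pt[0], pt[1], 1.0])
    return p[:2] / p[2]                  # metric ground-plane coordinates

src = np.array([[100, 400], [500, 410], [480, 200], [120, 190]], float)
dst = np.array([[0, 0], [7, 0], [7, 30], [0, 30]], float)  # metres, assumed
H = homography_4pt(src, dst)
print(to_ground(H, (300, 300)))
```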

  • People Re-Identification with Local Distance Comparison Using Learned Metric

    Guanwen ZHANG  Jien KATO  Yu WANG  Kenji MASE  

     
    PAPER-Image Processing and Video Processing

    Vol: E97-D No:9
    Page(s): 2461-2472

    In this paper, we propose a novel approach for multiple-shot people re-identification. Due to high variance in camera view, illumination, non-rigid deformation of posture, and so on, there exists a crucial inter-/intra-class variance issue: the same person may look considerably different, whereas different people may look extremely similar. This issue leads to an intractable, multimodal distribution of people's appearance in feature space. To deal with such multimodal properties of the data, we solve the re-identification problem under a local distance comparison framework, which significantly alleviates the difficulty induced by the varying appearance of each individual. Furthermore, we build an energy-based loss function to measure the similarity between appearance instances by calculating the distance between corresponding subsets in feature space. This loss function not only favors small distances that indicate high similarity between appearances of the same person, but also penalizes small distances or undesirable overlaps between subsets that reflect high similarity between appearances of different people. In this way, effective people re-identification can be achieved in a manner robust to the inter-/intra-class variance issue. The performance of our approach has been evaluated on the public benchmark datasets ETHZ and CAVIAR4REID. Experimental results show significant improvements over previous reports.
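
    The push-pull behavior of such an energy-based loss on subset distances can be sketched minimally. The margin form and closest-pair set distance below are assumptions, not the paper's exact energy function.

```python
# Hypothetical sketch: a contrastive-style energy on subset-to-subset
# distances that rewards small distances for same-identity appearance sets
# and penalizes overlap (small distances) between different-identity sets.
import numpy as np

def subset_distance(A, B):
    """A, B: (n, D), (m, D) feature subsets; closest-pair Euclidean distance."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)).min()

def energy_loss(A, B, same_identity, margin=1.0):
    d = subset_distance(A, B)
    return d ** 2 if same_identity else max(0.0, margin - d) ** 2

A, B = np.random.rand(5, 16), np.random.rand(7, 16)
print(energy_loss(A, B, same_identity=False))
```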

  • Predicting Violence Rating Based on Pairwise Comparison

    Ying JI  Yu WANG  Jien KATO  Kensaku MORI  

     
    PAPER-Data Engineering, Web Information Systems

    Publicized: 2020/08/28
    Vol: E103-D No:12
    Page(s): 2578-2589

    With the rapid development of multimedia, violent videos can be easily accessed in games, movies, websites, and so on. Identifying violent videos and rating their violence extent is of great importance to media filtering and child protection. Many previous studies address only violence scene detection and violent action recognition; the violence rating problem remains unsolved. In this paper, we present a novel video-level rating prediction method to estimate violence extent automatically. It has two main characteristics: (1) a two-stream network is fine-tuned to construct effective representations of violent videos; (2) a violence rating prediction machine is designed to learn the strength relationship among different videos. Furthermore, we present a novel violent video dataset with a total of 1,930 human-involved violent videos designed for violence rating analysis. Each video is annotated with 6 fine-grained objective attributes that are considered closely related to violence extent. The ground truth of the violence rating is given by the pairwise comparison method. The dataset is evaluated for both stability and convergence. Experimental results on this dataset demonstrate the effectiveness of our method compared with state-of-the-art classification methods.
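
    Learning a strength relationship from pairwise comparisons is typically cast as a ranking problem; the sketch below uses a margin ranking loss, with the feature dimension and linear scorer as illustrative assumptions rather than the paper's rating machine.

```python
# Hypothetical sketch: learn a violence-strength scorer from pairwise
# comparisons so the video judged more violent receives the higher score.
import torch
import torch.nn as nn

scorer = nn.Linear(512, 1)                 # assumed two-stream feature dim
rank_loss = nn.MarginRankingLoss(margin=1.0)

feat_more = torch.randn(16, 512)           # videos rated more violent in pairs
feat_less = torch.randn(16, 512)           # their less violent counterparts
target = torch.ones(16)                    # +1: first input should score higher

loss = rank_loss(scorer(feat_more).squeeze(1),
                 scorer(feat_less).squeeze(1), target)
loss.backward()
```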

  • Rethinking the Rotation Invariance of Local Convolutional Features for Content-Based Image Retrieval

    Longjiao ZHAO  Yu WANG  Jien KATO  

     
    PAPER-Image Processing and Video Processing

    Publicized: 2020/10/14
    Vol: E104-D No:1
    Page(s): 174-182

    Recently, local features computed using convolutional neural networks (CNNs) have shown good performance in image retrieval. The local convolutional features obtained by CNNs (LC features) are designed to be translation invariant; however, they are inherently sensitive to rotation perturbations, which leads to misjudgments in retrieval tasks. In this work, our objective is to enhance the robustness of LC features against image rotation. To do this, we conduct a thorough experimental evaluation of three candidate anti-rotation strategies (in-model data augmentation, in-model feature augmentation, and post-model feature augmentation) over two kinds of rotation attack (dataset attack and query attack). In the training procedure, we implement a data augmentation protocol and a network augmentation method. In the test procedure, we develop a local transformed convolutional (LTC) feature extraction method and evaluate it over different network configurations. We end up with a series of good practices, backed by steady quantitative support, that lead to the best strategy for computing LC features with high rotation invariance in image retrieval.
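
    Post-model feature augmentation against rotation can be illustrated with a minimal sketch: the query feature map is rotated in 90-degree steps and the best match over all rotations is kept. The shapes and 90-degree step set are assumptions, not the paper's LTC method itself.

```python
# Hypothetical sketch: rotation-robust matching of local convolutional
# feature maps via max over rotated copies of the query.
import numpy as np

def rotation_robust_score(fq, fg):
    """fq, fg: (H, W, C) local convolutional feature maps, H == W."""
    def score(a, b):
        a, b = a.ravel(), b.ravel()
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    # np.rot90 rotates the spatial axes; the channel axis stays aligned
    return max(score(np.rot90(fq, k, axes=(0, 1)), fg) for k in range(4))

fq, fg = np.random.rand(7, 7, 256), np.random.rand(7, 7, 256)
print(rotation_robust_score(fq, fg))
```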

  • Discriminative Part CNN for Pedestrian Detection

    Yu WANG  Cong CAO  Jien KATO  

     
    PAPER-Image Recognition, Computer Vision

    Publicized: 2021/12/06
    Vol: E105-D No:3
    Page(s): 700-712

    Pedestrian detection is a significant task in computer vision. In recent years, it has been widely used in applications such as intelligent surveillance systems and automated driving systems. Although it has been exhaustively studied in the last decade, the occlusion handling issue remains unsolved. One convincing idea is to first detect human body parts and then utilize the part information to estimate the pedestrians' presence. Many part-based pedestrian detection approaches have been proposed based on this idea. However, in most of these approaches, low-quality part mining and clumsy part detector combination are bottlenecks that limit detection performance. To eliminate these bottlenecks, we propose the Discriminative Part CNN (DP-CNN). Our approach makes two main contributions: (1) We propose a high-quality body part mining method based on both convolutional layer features and body part subclasses. The mined part clusters are not only discriminative but also representative, and can help to construct powerful pedestrian detectors. (2) We propose a novel method to combine multiple part detectors. We convert the part detectors into a middle layer of a CNN and optimize the whole detection pipeline by fine-tuning that CNN. Experiments show that the optimization is highly effective and that the resulting detector is robust in handling occlusion.
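
    Converting linear part detectors into a CNN middle layer can be sketched as folding their weights into a 1x1 convolution, so per-location part scores become ordinary activations that backpropagation can refine. The sizes and number of parts below are assumptions.

```python
# Hypothetical sketch: fold K linear part detectors into a 1x1 convolution
# so the whole detection pipeline becomes end-to-end fine-tunable.
import torch
import torch.nn as nn

K, C = 8, 512                       # assumed: 8 part detectors on 512-d features
part_w = torch.randn(K, C)          # weights of the pre-trained part detectors
part_b = torch.randn(K)

part_layer = nn.Conv2d(C, K, kernel_size=1)
with torch.no_grad():
    part_layer.weight.copy_(part_w.view(K, C, 1, 1))
    part_layer.bias.copy_(part_b)

feat = torch.randn(1, C, 14, 14)    # convolutional feature map
scores = part_layer(feat)           # (1, K, 14, 14) per-location part scores
print(scores.shape)
```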

  • Multiple-Shot People Re-Identification by Patch-Wise Learning

    Guanwen ZHANG  Jien KATO  Yu WANG  Kenji MASE  

     
    PAPER-Pattern Recognition

    Publicized: 2015/08/31
    Vol: E98-D No:12
    Page(s): 2257-2270

    In this paper, we propose a patch-wise learning based approach to the multiple-shot people re-identification task. In the proposed approach, re-identification is formulated as a patch-wise set-to-set matching problem, with each patch set matched using a specifically learned Mahalanobis distance metric. The proposed approach has two advantages: (1) a patch-wise representation that reduces the ambiguity of a non-rigid matching problem (of the human body) to an approximately rigid one (of body parts); (2) a patch-wise learning algorithm that enables more constraints to be included in the learning process and yields distance metrics of high quality. We evaluate the proposed approach on popular benchmark datasets and confirm its competitive performance compared to state-of-the-art methods.
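
    Patch-wise set-to-set matching under per-patch Mahalanobis metrics can be sketched briefly. The closest-pair set distance, summation over patches, and all shapes are assumptions standing in for the paper's formulation.

```python
# Hypothetical sketch: each body-part patch p has its own learned metric M[p];
# the final person distance aggregates the per-patch set-to-set distances.
import numpy as np

def patch_distance(A, B, M):
    """A, B: (n, D), (m, D) descriptor sets of one patch; M: (D, D) metric."""
    diff = A[:, None, :] - B[None, :, :]
    d2 = np.einsum('nmd,de,nme->nm', diff, M, diff)
    return np.sqrt(np.maximum(d2, 0)).min()   # closest-pair set distance

def person_distance(patches_a, patches_b, metrics):
    return sum(patch_distance(a, b, M)
               for a, b, M in zip(patches_a, patches_b, metrics))

rng = np.random.default_rng(0)
parts_a = [rng.standard_normal((4, 8)) for _ in range(3)]
parts_b = [rng.standard_normal((5, 8)) for _ in range(3)]
print(person_distance(parts_a, parts_b, [np.eye(8)] * 3))
```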

  • Attention-Guided Spatial Transformer Networks for Fine-Grained Visual Recognition

    Dichao LIU  Yu WANG  Jien KATO  

     
    PAPER-Image Recognition, Computer Vision

    Publicized: 2019/09/04
    Vol: E102-D No:12
    Page(s): 2577-2586

    The aim of this paper is to propose effective attentional regions for fine-grained visual recognition. Based on the Spatial Transformers' capability of spatial manipulation within networks, we propose an extension model, the Attention-Guided Spatial Transformer Networks (AG-STNs). This model first guides the Spatial Transformers with hard-coded attentional regions. The guidance can then be turned off, and the network model adjusts the learned regions in terms of location and scale. This adjustment is conditioned on the classification loss, so it is optimized for better recognition results. With this model, we are able to capture detailed attentional information successfully. The AG-STNs also capture attentional information at multiple levels, and in our experiments the different levels of attentional information are complementary to each other; fusing them brings better results.
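
    One plausible reading of "guide, then release" is an auxiliary loss pulling the predicted transform toward a hard-coded region, annealed to zero later. The sketch below makes that reading concrete; the L2 guidance term, dimensions, and names are assumptions, not the authors' exact scheme.

```python
# Hypothetical sketch: a spatial transformer whose localization output is
# pulled toward a hard-coded attentional region by an auxiliary L2 term;
# setting guide_w to 0 "turns off" the guidance.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedSTN(nn.Module):
    def __init__(self, in_dim=512):
        super().__init__()
        self.loc = nn.Linear(in_dim, 6)        # predicts a 2x3 affine theta

    def forward(self, feat, x):
        theta = self.loc(feat).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False), theta

stn = GuidedSTN()
x, feat = torch.randn(4, 3, 224, 224), torch.randn(4, 512)
warped, theta = stn(feat, x)
theta_guide = torch.tensor([[1., 0., 0.], [0., 1., 0.]]).expand(4, 2, 3)
guide_w = 1.0                                   # anneal to 0 to release guidance
guidance_loss = guide_w * F.mse_loss(theta, theta_guide)
```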

  • Efficient Local Feature Encoding for Human Action Recognition with Approximate Sparse Coding

    Yu WANG  Jien KATO  

     
    PAPER-Image Recognition, Computer Vision

    Publicized: 2016/01/06
    Vol: E99-D No:4
    Page(s): 1212-1220

    Local spatio-temporal features are popular in the human action recognition task. In practice, they are usually coupled with a feature encoding approach, which helps to obtain video-level vector representations that can be used in learning and recognition. In this paper, we present an efficient local feature encoding approach called Approximate Sparse Coding (ASC). In the offline learning phase, ASC computes sparse codes for a large collection of prototype local feature descriptors using Sparse Coding (SC); in the encoding phase, it looks up the nearest prototype's precomputed sparse code for each to-be-encoded local feature using Approximate Nearest Neighbour (ANN) search. ASC thus shares the low dimensionality of SC and the high speed of ANN, both desirable properties for a local feature encoding approach. ASC has been extensively evaluated on the KTH dataset and the HMDB51 dataset. We confirmed that it is able to encode large quantities of local video features into discriminative low-dimensional representations efficiently.
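
    The two-phase recipe is compact enough to sketch directly. All sizes are invented, and exact nearest-neighbor search stands in for the ANN index the paper uses; the structure (offline sparse coding of prototypes, online lookup) follows the abstract.

```python
# Sketch of the ASC recipe: precompute sparse codes for prototype descriptors
# offline, then encode each new local feature by copying the sparse code of
# its nearest prototype.
import numpy as np
from sklearn.decomposition import sparse_encode
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
dictionary = rng.standard_normal((64, 128))       # (n_atoms, feature_dim)
prototypes = rng.standard_normal((1000, 128))     # prototype local descriptors

# offline: true sparse coding, done once
proto_codes = sparse_encode(prototypes, dictionary,
                            algorithm='lasso_lars', alpha=0.1)
index = NearestNeighbors(n_neighbors=1).fit(prototypes)  # exact NN as ANN stand-in

# online: encoding new local features reduces to a lookup
feats = rng.standard_normal((50, 128))
_, nn_idx = index.kneighbors(feats)
codes = proto_codes[nn_idx[:, 0]]                 # (50, 64) approximate codes
print(codes.shape)
```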

  • Recursive Multi-Scale Channel-Spatial Attention for Fine-Grained Image Classification

    Dichao LIU  Yu WANG  Kenji MASE  Jien KATO  

     
    PAPER-Image Recognition, Computer Vision

    Publicized: 2021/12/22
    Vol: E105-D No:3
    Page(s): 713-726

    Fine-grained image classification is a difficult problem, and previous studies mainly address it by locating multiple discriminative regions at different scales and then aggregating the complementary information explored from the located regions. However, locating discriminative regions introduces heavy overhead and is not suitable for real-world applications. In this paper, we propose the recursive multi-scale channel-spatial attention module (RMCSAM) to address this problem. Following the experience of previous research on fine-grained image classification, RMCSAM explores multi-scale attentional information. However, the attentional information is explored by recursively refining the deep feature maps of a convolutional neural network (CNN) to better correspond to multi-scale channel-wise and spatial-wise attention, instead of localizing attention regions. In this way, RMCSAM provides a lightweight module that can be inserted into standard CNNs. Experimental results show that RMCSAM improves classification accuracy and attention capturing ability over baselines. RMCSAM also performs better than other state-of-the-art attention modules in fine-grained image classification and is complementary to some state-of-the-art approaches. Code is available at https://github.com/Dichao-Liu/Recursive-Multi-Scale-Channel-Spatial-Attention-Module.
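
    The flavor of recursively refining a feature map with channel-wise and spatial-wise attention can be sketched as below. The reduction ratio, kernel size, and recursion depth are assumptions; the authors' implementation is in the repository linked above.

```python
# Hypothetical sketch: one channel-spatial attention block applied
# recursively to refine a CNN feature map in place.
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Sequential(nn.Conv2d(channels, 1, 7, padding=3),
                                     nn.Sigmoid())

    def forward(self, x, steps=2):
        for _ in range(steps):            # recursive refinement
            x = x * self.channel(x)       # channel-wise re-weighting
            x = x * self.spatial(x)       # spatial-wise re-weighting
        return x

att = ChannelSpatialAttention(512)
print(att(torch.randn(2, 512, 14, 14)).shape)  # torch.Size([2, 512, 14, 14])
```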

  • Pedestrian Detection with Sparse Depth Estimation

    Yu WANG  Jien KATO  

     
    PAPER-Image Recognition, Computer Vision

    Vol: E94-D No:8
    Page(s): 1690-1699

    In this paper, we deal with the pedestrian detection task in outdoor scenes. Because of the complexity of such scenes, generally used gradient-feature-based detectors do not work well on them. We propose to use sparse 3D depth information as an additional cue to do the detection task, in order to achieve a fast improvement in performance. Our proposed method uses a probabilistic model to integrate image-feature-based classification with sparse depth estimation. Benefiting from the depth estimates, we map the prior distribution of human's actual height onto the image, and update the image-feature-based classification result probabilistically. We have two contributions in this paper: 1) a simplified graphical model which can efficiently integrate depth cue in detection; and 2) a sparse depth estimation method which could provide fast and reliable estimation of depth information. An experiment shows that our method provides a promising enhancement over baseline detector within minimal additional time.