David WONG Daisuke DEGUCHI Ichiro IDE Hiroshi MURASE
Advances in intelligent vehicle systems have led to modern automobiles being able to aid drivers with tasks such as lane following and automatic braking. Such automated driving tasks increasingly require reliable ego-localization. Although many sensors can be employed for this purpose, the use of a single camera remains one of the most appealing, but also one of the most challenging. GPS localization in urban environments may not be reliable enough for automated driving systems, and various combinations of range sensors and inertial navigation systems are often too complex and expensive for a consumer setup. Therefore, accurate localization with a single camera is a desirable goal. In this paper, we propose a method for vehicle localization using images captured from a single vehicle-mounted camera and a pre-constructed database. Image feature points are extracted, but the calculation of camera poses is not required; instead, we make use of the feature points' scales. For image feature-based localization methods, matching many features against candidate database images is time consuming, and database sizes can become large. Therefore, we propose a method that constructs a database from pre-matched features with known good scale stability. This limits the number of unused and incorrectly matched features, and allows the database scales to be recorded as “tracklets”. These “feature scale tracklets” are used for fast image match voting based on scale comparison with corresponding query image features. This process reduces the number of image-to-image matching iterations that need to be performed while improving localization stability. We also present an analysis of the system's performance using a dataset with high-accuracy ground truth, and demonstrate robust vehicle positioning even in challenging lane-change and real traffic situations.
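As an illustration of the scale-based voting described above, the sketch below compares query feature scales against pre-recorded tracklet scales and votes for candidate database images. The tracklet layout, feature IDs, and the relative tolerance `tol` are assumptions for illustration, not the paper's actual data structures.

```python
import numpy as np

def vote_by_scale(query_scales, db_tracklets, tol=0.2):
    """Vote for candidate database images by comparing feature scales.

    query_scales : dict mapping feature ID -> scale in the query image.
    db_tracklets : dict mapping feature ID -> {image ID -> recorded scale},
                   i.e. one "feature scale tracklet" per pre-matched feature.
    tol          : assumed relative scale tolerance for casting a vote.
    """
    votes = {}
    for fid, q_scale in query_scales.items():
        tracklet = db_tracklets.get(fid)
        if tracklet is None:
            continue
        for img_id, db_scale in tracklet.items():
            # Vote when the query scale is close to the recorded scale.
            if abs(q_scale - db_scale) / db_scale < tol:
                votes[img_id] = votes.get(img_id, 0) + 1
    # The database image with the most votes is the localization hypothesis.
    return max(votes, key=votes.get) if votes else None

# Toy usage: both query features agree in scale with image "frame_0012".
query = {7: 2.1, 9: 4.0}
db = {7: {"frame_0012": 2.0, "frame_0013": 3.5}, 9: {"frame_0012": 3.9}}
print(vote_by_scale(query, db))  # -> "frame_0012"
```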
Lina Tomokazu TAKAHASHI Ichiro IDE Hiroshi MURASE
We propose the construction of an appearance manifold with an embedded view-dependent covariance matrix to recognize 3D objects that are affected by geometric distortions and quality degradation. The appearance manifold captures the pose variability, while the covariance matrix learns the distribution of samples to gain noise invariance. However, since the appearance of an object in the captured image differs for every pose, the covariance matrix also differs for every pose position. It is therefore important to embed view-dependent covariance matrices in the manifold of an object. We propose two models for constructing an appearance manifold with view-dependent covariance matrices: the View-dependent Covariance matrix by training-Point Interpolation (VCPI) and the View-dependent Covariance matrix by Eigenvector Interpolation (VCEI) methods. In the VCPI method, the embedded view-dependent covariance matrix is obtained by interpolating each training point at one pose to the corresponding training points at a consecutive pose. In the VCEI method, it is obtained by interpolating only the eigenvectors and eigenvalues, without considering the correspondences between individual training images. As they embed the covariance matrices in the manifold, our view-dependent covariance matrix methods are robust to pose changes and invariant to noise. Our main goal is to construct a robust and efficient manifold with embedded view-dependent covariance matrices for recognizing objects in images affected by various degradation effects.
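As a rough sketch of the VCEI idea, the example below interpolates the eigenvalues and eigenvectors of two neighboring pose covariances. The specifics (linear interpolation, sign alignment, QR re-orthonormalization) are illustrative assumptions, not necessarily the authors' exact formulation.

```python
import numpy as np

def interpolate_covariance(cov_a, cov_b, t):
    """Sketch of VCEI-style interpolation between two pose covariances.

    Eigen-decompose both matrices, linearly interpolate eigenvalues and
    eigenvectors, then re-orthonormalize the interpolated eigenvectors.
    t in [0, 1] is the position between the two training poses.
    """
    wa, va = np.linalg.eigh(cov_a)
    wb, vb = np.linalg.eigh(cov_b)
    # Align eigenvector signs so interpolation does not cancel them out.
    signs = np.sign(np.sum(va * vb, axis=0))
    vb = vb * signs
    w = (1 - t) * wa + t * wb            # interpolated eigenvalues
    v = (1 - t) * va + t * vb            # interpolated eigenvectors
    q, _ = np.linalg.qr(v)               # re-orthonormalize
    return q @ np.diag(w) @ q.T

# Toy usage with two 2x2 covariances of neighboring poses.
c0 = np.array([[2.0, 0.3], [0.3, 1.0]])
c1 = np.array([[1.0, -0.2], [-0.2, 2.5]])
print(interpolate_covariance(c0, c1, 0.5))
```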
Xi LI Tomokazu TAKAHASHI Daisuke DEGUCHI Ichiro IDE Hiroshi MURASE
This paper presents an approach for cross-pose face recognition by virtual view generation using an appearance clustering based local view transition model. Previously, the traditional global pattern based view transition model (VTM) method was extended to its local version called LVTM, which learns the linear transformation of pixel values between frontal and non-frontal image pairs from training images using partial image in a small region for each location, instead of transforming the entire image pattern. In this paper, we show that the accuracy of the appearance transition model and the recognition rate can be further improved by better exploiting the inherent linear relationship between frontal-nonfrontal face image patch pairs. This is achieved based on the observation that variations in appearance caused by pose are closely related to the corresponding 3D structure and intuitively frontal-nonfrontal patch pairs from more similar local 3D face structures should have a stronger linear relationship. Thus for each specific location, instead of learning a common transformation as in the LVTM, the corresponding local patches are first clustered based on an appearance similarity distance metric and then the transition models are learned separately for each cluster. In the testing stage, each local patch for the input non-frontal probe image is transformed using the learned local view transition model corresponding to the most visually similar cluster. The experimental results on a real-world face dataset demonstrated the superiority of the proposed method in terms of recognition rate.
Hitoshi NISHIMURA Naoya MAKIBUCHI Kazuyuki TASAKA Yasutomo KAWANISHI Hiroshi MURASE
Multiple human tracking is widely used in various fields such as marketing and surveillance. The typical approach associates human detection results between consecutive frames using the features and bounding boxes (position and size) of detected humans. Some methods use an omnidirectional camera to cover a wider area, but ID switches often occur during association due to the following two factors: i) the features are adversely affected because the bounding box includes many background regions when a human is captured from an oblique angle, and ii) the position and size change dramatically between consecutive frames because the distance metric is non-uniform in an omnidirectional image. In this paper, we propose a novel method that accurately tracks humans using an association metric designed for omnidirectional images. The proposed method has two key points: i) for feature extraction, we introduce local rectification, which reduces the effect of background regions in the bounding box; ii) for distance calculation, we describe positions in a world coordinate system, where the distance metric is uniform. In the experiments, we confirmed that the Multiple Object Tracking Accuracy (MOTA) improved by 3.3 points on the LargeRoom dataset and by 2.3 points on the SmallRoom dataset.
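The world-coordinate distance idea can be sketched as follows. The ceiling-mounted equidistant fisheye model, camera height, and calibration constant below are illustrative assumptions, not the authors' calibration.

```python
import numpy as np

def omni_to_world(u, v, cx, cy, cam_height=2.5, k=0.002):
    """Project an omnidirectional image position onto the floor plane.

    Assumes (illustratively) an equidistant ceiling-mounted fisheye:
    radial pixel distance maps linearly to the incidence angle, and
    every person stands on a flat floor below the camera.
    """
    du, dv = u - cx, v - cy
    r = np.hypot(du, dv)                 # radial distance in pixels
    theta = k * r                        # incidence angle (rad), assumed linear
    ground_r = cam_height * np.tan(theta)
    phi = np.arctan2(dv, du)
    return ground_r * np.cos(phi), ground_r * np.sin(phi)

def association_cost(det_a, det_b, cx=640, cy=480):
    """Euclidean distance in world coordinates, uniform across the image."""
    xa, ya = omni_to_world(*det_a, cx, cy)
    xb, yb = omni_to_world(*det_b, cx, cy)
    return np.hypot(xa - xb, ya - yb)

# Two detections far from the image center: their pixel distance is small,
# but the world-coordinate distance is directly comparable across the image.
print(association_cost((900, 700), (930, 720)))
```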
Lina Tomokazu TAKAHASHI Ichiro IDE Hiroshi MURASE
We propose an appearance manifold with a view-dependent covariance matrix for face recognition from video sequences in two learning frameworks: supervised learning and incremental unsupervised learning. The advantages of this method are as follows. First, the appearance manifold with the view-dependent covariance matrix model is robust to pose changes and invariant to noise, since the embedded covariance matrices are calculated according to their poses in order to learn the sample distributions along the manifold. Second, the proposed incremental unsupervised-learning framework is more realistic for real-world face recognition applications, since it is difficult to collect large amounts of face sequences covering the complete range of poses (from the left side view to the right side view) for training. The incremental unsupervised-learning framework allows us to train the system with the initially available sequences and to update its knowledge incrementally each time an unlabelled sequence is input. In addition, we integrate the appearance manifold with the view-dependent covariance matrix model with a pose estimation system to improve the classification accuracy and to easily detect sequences with overlapping poses for the merging process in the incremental unsupervised-learning framework. Experimental results show that, in both frameworks, the proposed method recognizes faces from video sequences accurately.
Mahmud Dwi SULISTIYO Yasutomo KAWANISHI Daisuke DEGUCHI Ichiro IDE Takatsugu HIRAYAMA Jiang-Yu ZHENG Hiroshi MURASE
Numerous applications such as autonomous driving, satellite imagery sensing, and biomedical imaging use computer vision as an important tool for perception tasks. Intelligent Transportation Systems (ITS) in particular require scenes in sensor data to be recognized and located precisely. Semantic segmentation is one of the computer vision methods intended for such tasks. However, existing semantic segmentation tasks label each pixel with a single object class. Recognizing object attributes, e.g., pedestrian orientation, would be more informative and help toward a better scene understanding. Thus, we propose a method that performs semantic segmentation and pedestrian attribute recognition simultaneously. We introduce an attribute-aware loss function that can be applied to an arbitrary base model. Furthermore, we re-annotate the existing Cityscapes dataset, enriching the ground-truth labels with pedestrian orientation attributes. We implement the proposed method and compare it experimentally with other methods. The attribute-aware semantic segmentation outperforms the baseline methods both in the traditional object segmentation task and in the expanded attribute detection task.
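One possible realization of an attribute-aware loss on top of an arbitrary base model is sketched below; the two-head layout, the weighting factor `lam`, and the use of an ignore index for non-pedestrian pixels are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def attribute_aware_loss(seg_logits, attr_logits, seg_gt, attr_gt,
                         ignore_index=255, lam=0.5):
    """Joint loss: per-pixel class loss plus per-pixel attribute loss.

    seg_logits  : (B, C, H, W) class scores from the base model.
    attr_logits : (B, A, H, W) attribute scores (e.g., pedestrian
                  orientation bins) from an added head.
    attr_gt uses ignore_index for pixels without attribute labels
    (non-pedestrian pixels), so only pedestrians contribute.
    """
    seg_loss = F.cross_entropy(seg_logits, seg_gt, ignore_index=ignore_index)
    attr_loss = F.cross_entropy(attr_logits, attr_gt, ignore_index=ignore_index)
    return seg_loss + lam * attr_loss

# Toy usage: 2 images, 19 classes, 4 orientation bins, 8x8 resolution.
B, C, A, H, W = 2, 19, 4, 8, 8
seg_logits = torch.randn(B, C, H, W)
attr_logits = torch.randn(B, A, H, W)
seg_gt = torch.randint(0, C, (B, H, W))
attr_gt = torch.full((B, H, W), 255, dtype=torch.long)  # mostly unlabeled...
attr_gt[:, 2:4, 2:4] = 1                                # ...except a pedestrian
print(attribute_aware_loss(seg_logits, attr_logits, seg_gt, attr_gt))
```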
Kazuma TAKAHASHI Tatsumi HATTORI Keisuke DOMAN Yasutomo KAWANISHI Takatsugu HIRAYAMA Ichiro IDE Daisuke DEGUCHI Hiroshi MURASE
We introduce a method to estimate the attractiveness of a food photo. It extracts image features focusing on the appearances of 1) the entire food and 2) the main ingredients. To estimate the attractiveness of an arbitrary food photo, these features are integrated in a regression scheme. We also constructed and released a food image dataset composed of images of ten food categories taken from 36 angles and accompanied by attractiveness values. Evaluation results showed the effectiveness of integrating the two kinds of image features.
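A minimal sketch of the integration step, assuming support vector regression over concatenated features (the abstract states only that the two kinds of features are integrated in a regression scheme); all feature values below are placeholders.

```python
import numpy as np
from sklearn.svm import SVR

# Assumed placeholder features: 64-dim whole-food appearance vectors and
# 32-dim main-ingredient appearance vectors, with attractiveness scores.
rng = np.random.default_rng(0)
whole_food = rng.normal(size=(100, 64))
ingredients = rng.normal(size=(100, 32))
attractiveness = rng.uniform(1.0, 5.0, size=100)

# Integrate the two kinds of features by concatenation, then regress.
X = np.hstack([whole_food, ingredients])
model = SVR(kernel="rbf").fit(X, attractiveness)

# Estimate the attractiveness of a new food photo's feature vector.
query = np.hstack([rng.normal(size=64), rng.normal(size=32)])
print(model.predict(query[None, :])[0])
```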
Brahmastro KRESNARAMAN Yasutomo KAWANISHI Daisuke DEGUCHI Tomokazu TAKAHASHI Yoshito MEKADA Ichiro IDE Hiroshi MURASE
This paper addresses the attribute recognition problem, a field of research dominated by studies in the visible spectrum; only a few works exist in the thermal spectrum, which is fundamentally different from the visible one. This research performs recognition specifically on wearable attributes, such as glasses and masks. These attributes are usually small relative to the human body, which itself exhibits large intra-class variation, so recognizing them is not an easy task. Our method utilizes a decomposition framework based on Robust Principal Component Analysis (RPCA) to extract attribute information for recognition. However, because it is difficult to separate the body and the attributes without any prior knowledge, noise is extracted along with the attributes, hampering recognition. We therefore make use of prior knowledge, namely the location where each attribute is likely to be present. This knowledge, referred to as the Probability Map, is incorporated as a weight in the RPCA decomposition, allowing an attribute-wise decomposition. The results show a significant improvement over the baseline, and the proposed method achieved the highest average performance, with an F-score of 0.83.
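The weighted decomposition can be sketched with a simple RPCA iteration in which the Probability Map lowers the sparsity threshold where an attribute is likely; the ADMM-style updates and the exact weighting form are illustrative assumptions.

```python
import numpy as np

def shrink(M, tau):
    """Elementwise soft-thresholding."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def weighted_rpca(D, P, lam=None, mu=1.0, n_iter=100):
    """Sketch of RPCA (D = L + S) with a per-pixel weight map P.

    D : (m, n) data matrix, e.g. vectorized thermal images as columns.
    P : (m, n) Probability Map in [0, 1]; high values mark locations
        where an attribute is likely, lowering the sparsity penalty
        there so attribute pixels flow into the sparse term S.
    """
    if lam is None:
        lam = 1.0 / np.sqrt(max(D.shape))
    L, S, Y = np.zeros_like(D), np.zeros_like(D), np.zeros_like(D)
    for _ in range(n_iter):
        # Low-rank update: singular value thresholding.
        U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = U @ np.diag(shrink(sig, 1.0 / mu)) @ Vt
        # Sparse update: threshold weighted by the Probability Map.
        S = shrink(D - L + Y / mu, lam * (1.0 - P) / mu)
        Y = Y + mu * (D - L - S)
    return L, S

# Toy usage: low-rank "body" plus a localized "attribute" blob.
rng = np.random.default_rng(0)
D = np.outer(rng.normal(size=50), rng.normal(size=40))
D[10:15, 5:10] += 5.0
P = np.zeros_like(D); P[10:15, 5:10] = 0.9
L, S = weighted_rpca(D, P)
print(np.abs(S[10:15, 5:10]).mean() > np.abs(S).mean())  # blob lands in S
```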
Hitoshi NISHIMURA Satoshi KOMORITA Yasutomo KAWANISHI Hiroshi MURASE
Multiple human tracking is a fundamental problem in understanding the context of a visual scene. Although both accuracy and speed are required in real-world applications, recent tracking methods based on deep learning focus on accuracy and require a substantial amount of running time. We aim to improve running speed by performing human detection only at certain frame intervals, since detection accounts for most of the running time. The question is how to maintain accuracy while skipping human detection. In this paper, we propose a method that interpolates the detection results using optical flow, based on the fact that a person's appearance does not change much between adjacent frames. To maintain tracking accuracy, we introduce robust interest point detection within the human regions and a tracking termination metric defined by the distribution of the interest points. On the MOT17 and MOT20 datasets of the MOTChallenge, the proposed SDOF-Tracker achieved the best performance in terms of total running time while maintaining the MOTA metric. Our code is available at https://github.com/hitottiez/sdof-tracker.
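The interpolation step can be sketched with OpenCV's pyramidal Lucas-Kanade tracker; the Shi-Tomasi interest points, median-flow box shift, and minimum-point termination rule are assumptions consistent with, but not identical to, the paper's description.

```python
import cv2
import numpy as np

def interpolate_box(prev_gray, next_gray, box):
    """Shift a detection box to the next frame using sparse optical flow.

    box is (x, y, w, h). Interest points are detected only inside the
    human region (Shi-Tomasi here, as an assumption) and tracked with
    pyramidal Lucas-Kanade; the box moves by the median point motion.
    Returns None when too few points survive, a possible termination cue.
    """
    x, y, w, h = box
    mask = np.zeros_like(prev_gray)
    mask[y:y + h, x:x + w] = 255
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=50,
                                  qualityLevel=0.01, minDistance=3, mask=mask)
    if pts is None:
        return None
    nxt, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good_old = pts[status.ravel() == 1]
    good_new = nxt[status.ravel() == 1]
    if len(good_new) < 5:                # too few points: stop tracking
        return None
    shift = np.median(good_new - good_old, axis=0).ravel()
    return (int(x + shift[0]), int(y + shift[1]), w, h)

# Toy usage: a bright square moving 4 pixels to the right.
f0 = np.zeros((120, 160), np.uint8); f0[40:70, 50:80] = 200
f1 = np.zeros((120, 160), np.uint8); f1[40:70, 54:84] = 200
print(interpolate_box(f0, f1, (48, 38, 36, 36)))
```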
Hiroyuki ISHIDA Tomokazu TAKAHASHI Ichiro IDE Yoshito MEKADA Hiroshi MURASE
We present a novel training method for recognizing traffic sign symbols. Symbol images captured by a car-mounted camera suffer from various forms of image degradation; to cope with these degradations, similarly degraded images should be used as training data. Our method artificially generates such training data from the original templates of traffic sign symbols: we establish degradation models and a GA-based algorithm that simulates actually captured images. The proposed method enables us to obtain training data for all categories without exhaustively collecting them. Experimental results show the effectiveness of the proposed method for traffic sign symbol recognition.
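A rough sketch of the data generation idea: apply parameterized degradation models to a template and let a simplified GA (selection and mutation only) tune the parameters toward a captured image. The degradation set (blur, brightness, noise) and all GA details are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(template, params):
    """Apply assumed degradation models: blur, brightness gain, noise."""
    sigma, gain, noise = params
    img = gaussian_filter(template.astype(float), sigma)
    img = img * gain + np.random.default_rng(0).normal(0, noise, img.shape)
    return np.clip(img, 0, 255)

def ga_fit(template, captured, pop=20, gens=30, seed=0):
    """Toy GA: evolve (sigma, gain, noise) so degrade(template) ~ captured."""
    rng = np.random.default_rng(seed)
    lo, hi = [0.1, 0.5, 0.0], [3.0, 1.5, 20.0]
    P = rng.uniform(lo, hi, size=(pop, 3))
    for _ in range(gens):
        fitness = [-np.mean((degrade(template, p) - captured) ** 2) for p in P]
        order = np.argsort(fitness)[::-1]
        elite = P[order[: pop // 2]]                        # selection
        children = np.clip(elite + rng.normal(0, 0.05, elite.shape), lo, hi)
        P = np.vstack([elite, children])                    # mutation
    return P[0]

# Toy usage: recover the parameters that produced a "captured" sign image.
tmpl = np.zeros((32, 32)); tmpl[8:24, 8:24] = 255.0
cap = degrade(tmpl, (1.2, 0.9, 5.0))
print(ga_fit(tmpl, cap))  # roughly (1.2, 0.9, 5.0)
```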
Yuki IMAEDA Takatsugu HIRAYAMA Yasutomo KAWANISHI Daisuke DEGUCHI Ichiro IDE Hiroshi MURASE
We propose a method for estimating pedestrian detectability that considers the driver's visual adaptation to drastic illumination changes, which has not been studied in previous works. We assume that the driver's visual characteristics change in proportion to the elapsed time after an illumination change. As a solution, we construct multiple estimators corresponding to different elapsed periods and estimate the detectability by switching among them according to the elapsed period. To evaluate the proposed method, we built an experimental setup that presents a participant with illumination changes and conducted a preliminary simulated experiment to measure and estimate pedestrian detectability as a function of the elapsed period. Results show that the proposed method can estimate detectability accurately even after a drastic illumination change.
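The switching scheme itself is simple to express; the sketch below assumes three illustrative elapsed-time bins and placeholder linear estimators.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class SwitchingDetectabilityEstimator:
    """Switch among estimators by elapsed time since an illumination change.

    bins: list of (upper_bound_seconds, estimator); each estimator maps a
    feature vector to a detectability score. The bin bounds and the linear
    placeholder estimators in the usage example are assumptions.
    """
    bins: List[Tuple[float, Callable[[list], float]]]

    def estimate(self, features: list, elapsed: float) -> float:
        for upper, estimator in self.bins:
            if elapsed < upper:
                return estimator(features)
        return self.bins[-1][1](features)  # fully adapted regime

# Toy usage: detectability is suppressed right after the change and
# recovers as the driver's eyes adapt.
est = SwitchingDetectabilityEstimator(bins=[
    (1.0,  lambda f: 0.2 * sum(f)),    # just after the change
    (5.0,  lambda f: 0.6 * sum(f)),    # partially adapted
    (float("inf"), lambda f: sum(f)),  # adapted
])
print(est.estimate([0.4, 0.3], elapsed=0.5))   # 0.14
print(est.estimate([0.4, 0.3], elapsed=10.0))  # 0.7
```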
Hiroshi FUKUI Takayoshi YAMASHITA Yuji YAMAUCHI Hironobu FUJIYOSHI Hiroshi MURASE
Pedestrian attribute information is an important input for advanced driver assistance systems (ADAS). Pedestrian attributes such as body pose, face orientation, and an open umbrella indicate the intended action or state of the pedestrian. Generally, such information is recognized using an independent classifier for each task, but performing all of these separate tasks is too time-consuming at the testing stage, and the processing time increases with the number of tasks. To address this problem, multi-task learning or heterogeneous learning is used to train a single classifier to perform multiple tasks. In particular, heterogeneous learning can simultaneously train a classifier for regression and recognition tasks, which reduces both training and testing time. However, heterogeneous learning tends to yield lower accuracy for classes with few training samples. In this paper, we propose a method to improve the performance of heterogeneous learning for such classes. We introduce a rarity rate based on the importance and class probability of each task, and assign the appropriate rarity rate to each training sample. The samples in a mini-batch for training a deep convolutional neural network are then augmented according to this rarity rate to focus on the classes with few samples. Our heterogeneous learning approach with the rarity rate performs pedestrian attribute recognition better, especially for classes with few training samples.
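The rarity-rate idea can be sketched as class-probability-based oversampling within a mini-batch; the inverse-probability formula and the single importance weight below are illustrative assumptions.

```python
import numpy as np

def rarity_rates(labels, importance=1.0):
    """Assign each sample a rarity rate from its class probability.

    Illustrative definition: rarity = importance / class probability,
    normalized to sum to 1 so it can be used as a sampling distribution.
    """
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    prob = dict(zip(classes, counts / len(labels)))
    r = np.array([importance / prob[y] for y in labels])
    return r / r.sum()

def sample_minibatch(labels, batch_size, seed=0):
    """Draw a mini-batch biased toward rare classes by the rarity rate."""
    rng = np.random.default_rng(seed)
    p = rarity_rates(labels)
    return rng.choice(len(labels), size=batch_size, replace=True, p=p)

# Toy usage: class 1 ("open umbrella", say) has few samples but is
# drawn about as often as the frequent class within the mini-batch.
labels = [0] * 95 + [1] * 5
idx = sample_minibatch(labels, batch_size=32)
print(np.bincount(np.asarray(labels)[idx]))
```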
Ichiro IDE Tomoyoshi KINOSHITA Tomokazu TAKAHASHI Hiroshi MO Norio KATAYAMA Shin'ichi SATOH Hiroshi MURASE
Recent advances in digital storage technology have enabled us to archive large volumes of video data. Thanks to this trend, we have archived more than 1,800 hours of video data from a daily Japanese news show over the last ten years. For the effective use of such a large news video archive, we assume that analyzing its chronological and semantic structure is important. We also consider that presenting the development of news topics helps users understand current affairs better than providing a list of relevant news stories, as most current news video retrieval systems do. Therefore, in this paper, we propose a structuring method for a news video archive, together with an interface that visualizes the structure, so that users can efficiently track the development of news topics according to their interests. The proposed structure, the “topic thread structure”, is obtained by analyzing the chronological and semantic relations between news stories. The proposed interface, “mediaWalker II”, allows users to track the development of news topics along the topic thread structure while watching the video footage corresponding to each news story. Analyses of the topic thread structures obtained by applying the proposed method to actual news video footage revealed interesting and comprehensible relations between news topics in the real world. At the same time, analyses of their size quantified the efficiency of tracking a user's topic of interest based on the proposed topic thread structure. We consider this a first step toward facilitating video authoring by users based on existing content in a large-scale news video archive.
Esmaeil POURJAM Daisuke DEGUCHI Ichiro IDE Hiroshi MURASE
Human body segmentation has many applications in a wide variety of image processing tasks, from intelligent vehicles to entertainment. A substantial amount of research has been done in the field of segmentation and it is still one of the active research areas, resulting in introduction of many innovative methods in literature. Still, until today, a method that can overcome the human segmentation problems and adapt itself to different kinds of situations, has not been introduced. Many of methods today try to use the graph-cut framework to solve the segmentation problem. Although powerful, these methods rely on a distance penalty term (intensity difference or RGB color distance). This term does not always lead to a good separation between two regions. For example, if two regions are close in color, even if they belong to two different objects, they will be grouped together, which is not acceptable. Also, if one object has multiple parts with different colors, e.g. humans wear various clothes with different colors and patterns, each part will be segmented separately. Although this can be overcome by multiple inputs from user, the inherent problem would not be solved. In this paper, we have considered solving the problem by making use of a human probability map, super-pixels and Grab-cut framework. Using this map relives us from the need for matching the model to the actual body, thus helps to improve the segmentation accuracy. As a result, not only the accuracy has improved, but also it also became comparable to the state-of-the-art interactive methods.