Nenghuan ZHANG Yongbin WANG Xiaoguang WANG Peng YU
Recently, multi-modal fusion methods based on remote sensing data and social sensing data have been widely used in the field of urban region function recognition. However, due to the high complexity of the noise problem, most existing methods are not robust enough when applied to real-world scenes, which seriously affects their value in urban planning and management. In addition, how to extract valuable periodic features from social sensing data still needs further study. To this end, we propose a multi-modal fusion network guided by feature co-occurrence for urban region function recognition, which leverages the co-occurrence relationship between multi-modal features to identify abnormal noise features, so as to guide the fusion network to suppress noise features and focus on clean features. Furthermore, we employ a graph convolutional network that incorporates a node weighting layer and an interactive update layer to effectively extract valuable periodic features from social sensing data. Finally, experimental results on publicly available datasets indicate that our proposed method yields promising improvements in both accuracy and robustness over several state-of-the-art methods.
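A minimal sketch of the co-occurrence-guided gating idea, assuming the two modality features have already been extracted as fixed-length vectors in a shared embedding space; the gate below (cosine similarity used as a co-occurrence score) and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def cooccurrence_gated_fusion(remote_feat, social_feat, eps=1e-8):
    """Fuse two modality feature vectors, down-weighting features that
    do not co-occur (i.e., disagree) across modalities."""
    # Co-occurrence score: cosine similarity between the two modality vectors.
    score = np.dot(remote_feat, social_feat) / (
        np.linalg.norm(remote_feat) * np.linalg.norm(social_feat) + eps)
    # Map the score to a [0, 1] gate; low co-occurrence suppresses the
    # (presumably noisy) social-sensing features before fusion.
    gate = 1.0 / (1.0 + np.exp(-5.0 * score))
    fused = np.concatenate([remote_feat, gate * social_feat])
    return fused, gate

# Example: consistent features keep a gate close to 1.
rng = np.random.default_rng(0)
clean = rng.normal(size=64)
fused, g = cooccurrence_gated_fusion(clean, clean + 0.1 * rng.normal(size=64))
print(round(float(g), 3))
```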
Zhi LIU Fangyuan ZHAO Mengmeng ZHANG
In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multi-modal aggregation network for video-text retrieval is proposed. Different from prior work that uses MMT to fuse video features, NetVLAD is introduced in the proposed network. It has fewer parameters and is feasible to train with small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be viewed as learning language models from visual supervision, it is introduced as the text encoder in the proposed network to avoid overfitting. Meanwhile, in order to make full use of the pre-trained model, a two-step training scheme is designed. Experiments show that the proposed model achieves competitive results compared with the latest work.
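For concreteness, here is a minimal NumPy sketch of NetVLAD-style aggregation of frame-level video features into a single clip descriptor. In practice the cluster centres and assignments are learned end to end; the random inputs and hard-coded sizes below are purely illustrative.

```python
import numpy as np

def netvlad_aggregate(features, centers, alpha=10.0):
    """Aggregate N local descriptors (N x D) into one K*D vector
    given K cluster centres (K x D), NetVLAD style."""
    # Soft-assignment of each descriptor to each cluster.
    logits = -alpha * ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = np.exp(logits - logits.max(axis=1, keepdims=True))
    assign /= assign.sum(axis=1, keepdims=True)              # N x K
    # Accumulate residuals to each centre, weighted by the assignment.
    residuals = features[:, None, :] - centers[None, :, :]   # N x K x D
    vlad = (assign[:, :, None] * residuals).sum(axis=0)      # K x D
    # Intra-normalisation per cluster, then global L2 normalisation.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.flatten()
    return vlad / (np.linalg.norm(vlad) + 1e-12)

frames = np.random.randn(30, 512)   # 30 frame-level video features
centers = np.random.randn(8, 512)   # 8 learnable cluster centres
video_vec = netvlad_aggregate(frames, centers)  # shape (8*512,)
```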
Shuhei YAMAMOTO Takeshi KURASHIMA Hiroyuki TODA
Front video and sensor data captured by vehicle-mounted event recorders are used not only as traffic accident evidence but also as near-miss incident data for safe-driving education. However, most event recorder (ER) data shows only regular driving events. To utilize near-miss data for safe-driving education, we need to be able to easily and rapidly locate the appropriate data within large amounts of ER data through labels attached to the scenes/events of interest. This paper proposes a method that can automatically identify near-misses with objects such as pedestrians and bicycles by processing ER data. The proposed method extracts two deep feature representations that consider the car's status and the environment surrounding the car. The first feature representation is generated by considering the temporal transitions of the car's status. The second extracts the positional relationship between the car and surrounding objects by processing object detection results. Experiments on actual ER data demonstrate that the proposed method can accurately identify and tag near-miss events.
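As a rough illustration only: the snippet below uses hand-crafted stand-ins for the two representations, just to make concrete what "car status" and "positional relationship" features might capture. The actual method learns deep representations; signal names and thresholds here are assumptions.

```python
import numpy as np

def car_status_features(speed, accel):
    """Temporal-transition summary of car status over a clip
    (speed/accel are 1-D arrays sampled at a fixed rate)."""
    return np.array([speed.mean(), speed.std(),
                     np.diff(speed).min(),   # hardest deceleration
                     np.abs(accel).max()])

def surrounding_object_features(detections, frame_w, frame_h):
    """Positional relationship between the ego car and detected objects.
    `detections` is a list of (x, y, w, h) bounding boxes in one frame."""
    if not detections:
        return np.array([1e9, 0.0])          # sentinel: no object nearby
    centers = np.array([(x + w / 2, y + h) for x, y, w, h in detections])
    # Distance from the bottom-centre of the frame, a crude proxy
    # for distance from the ego vehicle.
    ego = np.array([frame_w / 2, frame_h])
    dists = np.linalg.norm(centers - ego, axis=1)
    return np.array([dists.min(), float(len(detections))])
```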
Ruicong ZHI Caixia ZHOU Junwei YU Tingting LI Ghada ZAMZMI
Pain is an essential physiological phenomenon of human beings. Accurate assessment of pain is important for developing proper treatment. Although the self-report method is the gold standard of pain assessment, it is not applicable to individuals with communicative impairment. Non-verbal pain indicators such as pain-related facial expressions and changes in physiological parameters can provide valuable insights for pain assessment. In this paper, we propose a multimodal Stream Integrated Neural Network with Different Frame Rates (SINN) that combines facial expressions and biomedical signals for automatic pain assessment. The main contributions of this research are threefold. (1) The SINN takes four stream inputs for facial expression feature extraction. The resulting facial features are integrated with biomedical features, and the joint features are utilized for pain assessment. (2) The dynamic facial features are learned in both implicit and explicit manners to better represent the facial changes that occur during the pain experience. (3) Multiple modalities, including facial expressions and biomedical signals, are utilized to identify various pain states. Experiments are conducted on publicly available pain datasets, and the performance is compared with several deep learning models. The experimental results illustrate the superiority of the proposed model: it achieves the highest accuracy of 68.2%, up to 5% higher than the baseline deep learning models on binary pain classification.
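A schematic sketch of the four-stream idea under strong simplifications: the streams below merely subsample the same clip of pre-extracted facial features at different frame rates before pooling, and the fusion with biomedical features is a plain concatenation. The real SINN learns these streams jointly; shapes and rates here are assumptions.

```python
import numpy as np

def stream_pool(frame_feats, rate):
    """Temporal average pooling of frame features taken every `rate` frames."""
    return frame_feats[::rate].mean(axis=0)

def sinn_style_features(frame_feats, biomedical_feats, rates=(1, 2, 4, 8)):
    """Combine four streams at different frame rates with biomedical signals."""
    streams = [stream_pool(frame_feats, r) for r in rates]
    return np.concatenate(streams + [biomedical_feats])

clip = np.random.randn(64, 128)         # 64 frames of 128-D facial features
bio = np.random.randn(16)               # pooled physiological features
joint = sinn_style_features(clip, bio)  # fed to a pain-state classifier
```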
Dongni HU Chengxin CHEN Pengyuan ZHANG Junfeng LI Yonghong YAN Qingwei ZHAO
Recently, automated recognition and analysis of human emotion have attracted increasing attention from multidisciplinary communities. However, it is challenging to utilize emotional information from multiple modalities simultaneously. Previous studies have explored different fusion methods, but they mainly focused on either inter-modality interaction or intra-modality interaction. In this letter, we propose a novel two-stage fusion strategy named modality attention flow (MAF) to model the intra- and inter-modality interactions simultaneously in a unified end-to-end framework. Experimental results show that the proposed approach outperforms widely used late fusion methods, and achieves even better performance when the number of stacked MAF blocks increases.
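A compact sketch of one attention-based fusion block in the spirit of MAF, assuming each modality is a sequence of feature vectors; the single-head dot-product attention and the residual update order are generic choices, not the exact MAF formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention (single head)."""
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ value

def fusion_block(audio, text):
    """Each modality first attends to the other (inter-modality),
    then to itself (intra-modality), with residual connections."""
    audio = audio + attention(audio, text, text)
    text = text + attention(text, audio, audio)
    audio = audio + attention(audio, audio, audio)
    text = text + attention(text, text, text)
    return audio, text

a = np.random.randn(50, 64)   # audio frames
t = np.random.randn(20, 64)   # word embeddings
a, t = fusion_block(a, t)     # blocks can be stacked for deeper interaction
```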
Shaojie ZHU Lei ZHANG Bailong LIU Shumin CUI Changxing SHAO Yun LI
Multi-modal semantic trajectory prediction has become a new challenge due to the rapid growth of multi-modal semantic trajectories with text messages. Traditional RNN trajectory prediction methods have the following problems when processing multi-modal semantic trajectories. The distribution of multi-modal trajectory samples shifts gradually during training, which leads to difficult convergence and long training times. Moreover, each modal feature shifts in a different direction, which produces multiple distributions in the dataset. To solve these problems, MNERM (Mode Normalization Enhanced Recurrent Model) for multi-modal semantic trajectories is proposed. MNERM embeds multiple modal features together and uses an LSTM network to capture the long-term dependencies of the trajectory. In addition, it designs a Mode Normalization mechanism to normalize samples with multiple means and variances so that each normalized distribution falls into the active region of the activation function, improving prediction efficiency while greatly accelerating training. Experiments on a real dataset show that, compared with SERM, MNERM reduces sensitivity to the learning rate, improves training speed by 9.120 times, increases HR@1 by 0.03, and reduces the ADE by 120 meters.
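A small sketch of the mode-normalization idea, assuming samples have already been softly assigned to K modes by a gating function; computing per-mode means and variances from the current batch is one reasonable reading of the mechanism, not necessarily the authors' exact formulation.

```python
import numpy as np

def mode_normalize(x, gates, eps=1e-5):
    """Normalize a batch with K sets of statistics.

    x:     (N, D) batch of embedded trajectory features
    gates: (N, K) soft assignment of each sample to K modes (rows sum to 1)
    """
    n_k = gates.sum(axis=0) + eps                          # samples per mode
    mean = (gates.T @ x) / n_k[:, None]                    # K x D per-mode means
    var = (gates.T @ (x ** 2)) / n_k[:, None] - mean ** 2  # K x D per-mode variances
    # Each sample is normalized by every mode's statistics, weighted by its
    # gate, so every normalized distribution stays in the active region
    # of the activation function.
    x_hat = np.zeros_like(x)
    for k in range(gates.shape[1]):
        x_hat += gates[:, [k]] * (x - mean[k]) / np.sqrt(var[k] + eps)
    return x_hat
```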
Kenji KANAI Keigo OGAWA Masaru TAKEUCHI Jiro KATTO Toshitaka TSUDA
To reduce the backbone video traffic generated by video surveillance, we propose an intelligent video surveillance system that offers multi-modal sensor-based event detection and event-driven video rate adaptation. The proposed system can detect pedestrian presence and movements in the monitoring area by using multi-modal sensors (camera, laser scanner, and infrared distance sensor) and controls surveillance video quality according to the detected events. We evaluate event detection accuracy and video traffic volume in experiment scenarios where up to six pedestrians pass through and/or stop at the monitoring area. The evaluation results show that our system can significantly reduce video traffic while ensuring high-quality surveillance.
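A toy sketch of event-driven rate control, assuming the multi-modal detector already outputs one of a few event labels per time step; the event names and bitrate table below are invented for illustration.

```python
# Hypothetical bitrate table (kbit/s) keyed by detected event.
BITRATE_KBPS = {
    "no_pedestrian": 200,          # low-quality keep-alive stream
    "pedestrian_passing": 2000,
    "pedestrian_stopping": 4000,   # highest quality while someone lingers
}

def select_bitrate(event: str) -> int:
    """Map a detected event to a target encoding bitrate."""
    return BITRATE_KBPS.get(event, BITRATE_KBPS["no_pedestrian"])

# Example: traffic drops sharply whenever the monitoring area is empty.
for event in ["no_pedestrian", "pedestrian_passing", "no_pedestrian"]:
    print(event, select_bitrate(event), "kbit/s")
```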
We present a simple technique for enhancing multi-modal images. Unsharp masking (UM) is first nonlinearized to prevent halos around large edges. This edge-preserving UM is then extended to cross-sharpening of multi-modal images, in which a component image is sharpened with the aid of clearer edges in another component image.
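A minimal sketch of the two ideas, assuming grayscale float images; the soft-limiting function used to nonlinearize UM (tanh) is one common choice and not necessarily the one in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_preserving_um(img, sigma=2.0, gain=1.0, limit=0.05):
    """Unsharp masking with a soft-limited detail term: large detail values
    are compressed by tanh, which suppresses halos around strong edges."""
    detail = img - gaussian_filter(img, sigma)
    return img + gain * limit * np.tanh(detail / limit)

def cross_sharpen(target, guide, sigma=2.0, gain=1.0, limit=0.05):
    """Sharpen `target` using the (clearer) edges of another component
    image `guide` from the same multi-modal set."""
    detail = guide - gaussian_filter(guide, sigma)
    return target + gain * limit * np.tanh(detail / limit)
```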
We present the PCA self-cross bilateral filter for denoising multi-modal images. We first apply principal component analysis to the input multi-modal images. We then smooth the first principal component with a preliminary filter and use it as a supplementary image for cross bilateral filtering of the input images. Among several preliminary filters, the undecimated wavelet transform is useful for effective denoising of various multi-modal images such as color, multi-lighting, and medical images.
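A sketch of the pipeline under simplifying assumptions: the multi-modal image is an H x W x C float array, the first principal component is pre-smoothed with a Gaussian filter (standing in for the undecimated wavelet transform), and a brute-force cross bilateral filter is applied per channel with the smoothed component as the supplementary image.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def first_pc(image):
    """First principal component of an H x W x C multi-modal image."""
    flat = image.reshape(-1, image.shape[-1])
    flat = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    return (flat @ vt[0]).reshape(image.shape[:2])

def cross_bilateral(channel, guide, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Brute-force cross bilateral filter: spatial kernel on `channel`,
    range kernel computed from the supplementary `guide` image."""
    H, W = channel.shape
    out = np.zeros_like(channel)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys ** 2 + xs ** 2) / (2 * sigma_s ** 2))
    pad_c = np.pad(channel, radius, mode="reflect")
    pad_g = np.pad(guide, radius, mode="reflect")
    for i in range(H):
        for j in range(W):
            win_c = pad_c[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            win_g = pad_g[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            w = spatial * np.exp(-(win_g - guide[i, j]) ** 2 / (2 * sigma_r ** 2))
            out[i, j] = (w * win_c).sum() / w.sum()
    return out

def pca_self_cross_bilateral(image):
    guide = gaussian_filter(first_pc(image), 1.5)   # preliminary smoothing
    return np.dstack([cross_bilateral(image[..., c], guide)
                      for c in range(image.shape[-1])])
```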
Kouichi KATSURADA Hiroaki ADACHI Kunitoshi SATO Hirobumi YAMADA Tsuneo NITTA
We have developed Interaction Builder (IB), a rapid prototyping tool for constructing web-based Multi-Modal Interaction (MMI) applications. The goal of IB is to make it easy to develop MMI applications with speech recognition, life-like agents, speech synthesis, web browsing, etc. For this purpose, IB provides the following interface and functions: (1) a GUI for implementing MMI systems without knowledge of the details of MMI or the MMI description language, (2) functionality for handling synchronized multimodal inputs/outputs, and (3) a test-run mode for run-time testing. The results of evaluation tests showed that the application development cycle using IB was significantly shorter than that using a text editor, both for experts in the MMI description language and for beginners.
Seungzoo JEONG Naoki HASHIMOTO Makoto SATO
Many immersive displays developed in previous research are strongly influenced by the design concept of the CAVE, the origin of immersive displays. From the viewpoint of human-scale interactive systems for virtual environments (VE), existing immersive systems do not fully exploit the potential of the human senses. These displays require a more complicated structure for flexible extension and are more restrictive of the user's movement. We therefore propose a novel multi-projector display for immersive VE with a haptic interface for more flexible and dynamic interaction. The display part of our system, named "D-vision," has a hybrid curved screen that combines flat and curved screens. It renders images seamlessly in real time and generates high-quality stereovision using a PC cluster and a two-pass rendering technique. Furthermore, a human-scale string-based haptic device is integrated with the D-vision for a more interactive and immersive VE. In this paper, we give an overview of the D-vision and the technologies used for the human-scale haptic interface.
Hiromitsu BAN Chiyomi MIYAJIMA Katsunobu ITOU Kazuya TAKEDA Fumitada ITAKURA
Behavioral synchronization between speech and finger tapping provides a novel approach to improving speech recognition accuracy. We combine a sequence of finger-tapping timings recorded alongside an utterance using two distinct methods: in the first method, HMM state transition probabilities at word boundaries are controlled by the timing of the finger tapping; in the second, the probability (relative frequency) of the finger tapping is used as a feature and combined with MFCC in an HMM recognition system. We evaluate these methods through connected digit recognition under different noise conditions (AURORA-2J). Leveraging the synchrony between speech and finger tapping provides a 46% relative improvement in connected digit recognition experiments.
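A sketch of the second method only (tapping used as an extra feature stream), assuming tap timestamps in seconds and a 10 ms frame shift; converting tap timings into a frame-level "tap likelihood" with a Gaussian bump is an illustrative choice, not necessarily the paper's exact feature.

```python
import numpy as np

def tap_feature(tap_times, n_frames, frame_shift=0.01, width=0.05):
    """Convert finger-tap timestamps (seconds) into a per-frame feature by
    placing a Gaussian bump of `width` seconds around each tap."""
    t = np.arange(n_frames) * frame_shift
    feat = np.zeros(n_frames)
    for tap in tap_times:
        feat += np.exp(-0.5 * ((t - tap) / width) ** 2)
    return feat[:, None]                          # n_frames x 1

def append_tap_to_mfcc(mfcc, tap_times):
    """Augment an (n_frames x 13) MFCC matrix with the tap feature so a
    standard HMM front end can consume the combined observation vectors."""
    return np.hstack([mfcc, tap_feature(tap_times, len(mfcc))])
```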
Hanxi ZHU Ikuo YOSHIHARA Kunihito YAMAMORI Moritoshi YASUNAGA
We have developed Multi-modal Neural Networks (MNN) to improve the accuracy of symbolic sequence pattern classification. The basic structure of the MNN is composed of several sub-classifiers using neural networks and a decision unit. Two types of MNN are proposed: a primary MNN and a twofold MNN. In the primary MNN, each sub-classifier is a conventional three-layer neural network, and the decision unit uses majority voting to produce the final decision from the outputs of the sub-classifiers. In the twofold MNN, each sub-classifier is a primary MNN performing partial classification, and the decision unit is a three-layer neural network that produces the final decision. In the latter type, the structure of the primary MNN is folded into the sub-classifier, so the basic structure of the MNN is used twice; hence the name twofold MNN. The MNN is validated on two benchmark tests: EPR (English Pronunciation Reasoning) and prediction of protein secondary structure. The reasoning accuracy on EPR is improved from 85.4% with a three-layer neural network to 87.7% with the primary MNN. In the prediction of protein secondary structure, the average accuracy is improved from 69.1% with a three-layer neural network to 74.6% with the primary MNN and 75.6% with the twofold MNN. The prediction test is based on a database of 126 non-homologous protein sequences.
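A minimal sketch of the primary MNN's decision unit (majority vote over sub-classifier outputs); sub-classifiers are represented only by their predicted class labels here, with the neural networks themselves omitted.

```python
from collections import Counter

def majority_decision(predictions):
    """Final decision of the primary MNN: the class predicted by the
    largest number of sub-classifiers (ties broken by first occurrence)."""
    return Counter(predictions).most_common(1)[0][0]

# Example: three neural-network sub-classifiers vote on a residue's
# secondary-structure class (H = helix, E = strand, C = coil).
print(majority_decision(["H", "C", "H"]))  # -> "H"
```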
Kazumasa MURAI Satoshi NAKAMURA
This paper discusses "face-to-talk" audio-visual speech detection for robust speech recognition in noisy environments, which consists of a facial-orientation-based switch and audio-visual speech section detection. Most of today's speech recognition systems must actually be turned on and off by a switch, e.g., "push-to-talk," to indicate which utterance should be recognized, and a specific speech section must be detected prior to any further analysis. To improve usability and performance, we have investigated how to extract useful information from the visual modality. We implemented a facial-orientation-based switch that activates speech recognition while the speaker is facing the camera. The speech section is then detected by analyzing the image of the face. Visual speech detection is robust to audio noise, but because articulation starts before the speech and lasts longer than the speech, the detected section tends to be too long and leads to insertion errors. Therefore, we fuse the sections detected by the audio and visual modalities. Our experiments confirm that the proposed audio-visual speech detection method improves recognition performance in noisy environments.
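A small sketch of one way to fuse the audio- and visual-detected speech sections, reflecting the observation above that the visual section starts earlier and ends later: the audio section is kept only where a visual (lip-motion) section confirms it. The interval representation and intersection rule are illustrative, not the paper's exact fusion scheme.

```python
def fuse_sections(audio_sections, visual_sections):
    """Intersect audio-detected speech sections with visual sections,
    trimming the visually over-long spans. Sections are (start, end)
    tuples in seconds."""
    fused = []
    for a_start, a_end in audio_sections:
        for v_start, v_end in visual_sections:
            start, end = max(a_start, v_start), min(a_end, v_end)
            if start < end:
                fused.append((start, end))
    return fused

# Visual detection spans 0.8-3.4 s, but the audible speech is 1.0-3.1 s.
print(fuse_sections([(1.0, 3.1)], [(0.8, 3.4)]))  # -> [(1.0, 3.1)]
```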
Seigou YASUDA Akira OKAMOTO Hiroshi HASEGAWA Yoshito MEKADA Masao KASUGA Kazuo KAMATA
For people with serious disabilities, it is very important to be able to use the same communication methods as ordinary people, such as the telephone and electronic mail (e-mail), in order to lead a normal social life and communicate with others. In particular, access to e-mail is a very effective means of communication that enables them to convey their intentions to other people directly while keeping their privacy. However, it takes them much time and effort to input an e-mail text on a computer, and they also need considerable support from their attendants. From this point of view, we propose a multi-modal communication system composed of a voice recognizer, a pointing device, and a text composer. This system is intended to improve the man-machine interface for people with physical disabilities. In this system, our voice recognition technology plays a key role in providing a good interface between disabled people and the personal computer. When generating e-mail contents, users access a database containing user keywords and a guidance menu, from which they select the appropriate word by voice. Our experimental results suggest that this communication system improves not only the time efficiency of text composition but also the readiness of disabled people to communicate with other people. The disabled subject in this paper cannot move his body, legs, or hands because he suffers from muscular dystrophy; he can move only his fingers and speaks command words with the assistance of a respirator.
Keiko WATANUKI Kenji SAKAMOTO Fumio TOGAWA
We are developing multimodal man-machine interfaces through which users can communicate by integrating speech, gaze, facial expressions, and gestures such as nodding and finger pointing. Such multimodal interfaces are expected to provide more flexible, natural, and productive communication between humans and computers. To achieve this goal, we have taken the approach of modeling human behavior in the context of ordinary face-to-face conversations. As a first step, we have implemented a system that uses video and audio recording equipment to capture verbal and nonverbal information in interpersonal communication. Using this system, we collected data from a task-oriented conversation between a guest (subject) and a receptionist at a company reception desk, and quantitatively analyzed the data with respect to the modalities that would be functional in fluid interactions. This paper presents detailed analyses of the collected data: (1) head nodding and eye contact are related to the beginning and end of speaking turns, acting to supplement speech information; (2) listener responses occur an average of 0.35 sec after the receptionist's utterance of a keyword, and turn-taking for tag-questions occurs after an average of 0.44 sec; and (3) there is rhythmical coordination between speakers and listeners.