Zhi LIU Fangyuan ZHAO Mengmeng ZHANG
In the video-text retrieval task, the mainstream framework consists of three parts: a video encoder, a text encoder, and similarity calculation. MMT (Multi-modal Transformer) achieves remarkable performance on this task; however, it suffers from insufficient training data. In this paper, an efficient multimodal aggregation network for video-text retrieval is proposed. Unlike prior work that uses MMT to fuse video features, the proposed network introduces NetVLAD, which has fewer parameters and is feasible to train on small datasets. In addition, since CLIP (Contrastive Language-Image Pre-training) can be regarded as learning a language model from visual supervision, it is adopted as the text encoder to avoid overfitting. Moreover, a two-step training scheme is designed to make full use of the pre-trained model. Experiments show that the proposed model achieves competitive results compared with the latest work.
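The aggregation step can be pictured with a minimal NetVLAD-style module; the sketch below assumes a PyTorch implementation, and the cluster count, feature dimension, and frame count are illustrative placeholders rather than the paper's settings.

```python
# Minimal NetVLAD-style aggregation sketch (PyTorch). Cluster count, feature
# dimension, and frame count are illustrative assumptions, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, num_clusters=8, dim=512):
        super().__init__()
        self.num_clusters = num_clusters
        self.conv = nn.Conv1d(dim, num_clusters, kernel_size=1, bias=True)
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x):                               # x: (batch, frames, dim)
        b, n, d = x.shape
        soft_assign = F.softmax(self.conv(x.transpose(1, 2)), dim=1)   # (b, K, n)
        residual = x.unsqueeze(1) - self.centroids.view(1, self.num_clusters, 1, d)
        vlad = (soft_assign.unsqueeze(-1) * residual).sum(dim=2)       # (b, K, d)
        vlad = F.normalize(vlad, p=2, dim=2)            # intra-normalization
        return F.normalize(vlad.flatten(1), p=2, dim=1) # (b, K*d) video embedding

frame_features = torch.randn(2, 30, 512)   # 30 frame-level features per video
video_embedding = NetVLAD()(frame_features)
```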
Soh YOSHIDA Mitsuji MUNEYASU Takahiro OGAWA Miki HASEYAMA
In this paper, we address the problem of analyzing the topics included in a social video group to improve video retrieval performance. Unlike previous methods that focus on an individual visual aspect of videos, the proposed method leverages the “mutual reinforcement” of heterogeneous modalities such as the tags and users associated with videos on the Internet. To represent multiple types of relationships between the heterogeneous modalities, the proposed method constructs three subgraphs: user-tag, video-video, and video-tag graphs. We combine the three graphs into a heterogeneous graph. The extraction of latent features, i.e., topics, then becomes feasible by applying graph-based soft clustering to the heterogeneous graph. By estimating the membership of each grouped cluster for each video, the proposed method defines a new video similarity measure. Since the understanding of video content is enhanced by exploiting latent features obtained from different types of data that complement each other, the proposed method improves visual reranking performance. Experiments on a dataset of YouTube-8M videos show the effectiveness of the proposed method, which achieves a 24.3% improvement in mean normalized discounted cumulative gain in a search ranking task compared with the baseline method.
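As one way to picture the pipeline, the sketch below builds a block heterogeneous adjacency from the three subgraphs and uses plain NMF as a stand-in for graph-based soft clustering; the sizes, random data, and the cosine similarity over memberships are illustrative assumptions, not the paper's algorithm.

```python
# Sketch: combine user-tag, video-video, and video-tag subgraphs into one
# heterogeneous adjacency, obtain soft cluster memberships (NMF stand-in),
# and derive a video similarity measure from the memberships.
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

n_users, n_videos, n_tags, n_topics = 5, 8, 6, 3
rng = np.random.default_rng(0)
A_ut = rng.random((n_users, n_tags))       # user-tag subgraph
A_vv = rng.random((n_videos, n_videos))    # video-video subgraph
A_vv = (A_vv + A_vv.T) / 2
A_vt = rng.random((n_videos, n_tags))      # video-tag subgraph

# Block adjacency over the node ordering [users | videos | tags]
N = n_users + n_videos + n_tags
A = np.zeros((N, N))
A[:n_users, n_users + n_videos:] = A_ut
A[n_users:n_users + n_videos, n_users:n_users + n_videos] = A_vv
A[n_users:n_users + n_videos, n_users + n_videos:] = A_vt
A = np.maximum(A, A.T)                     # symmetrize the heterogeneous graph

membership = NMF(n_components=n_topics, init="nndsvda", max_iter=500).fit_transform(A)
video_membership = membership[n_users:n_users + n_videos]
video_similarity = cosine_similarity(video_membership)  # new similarity measure
```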
Masaya MURATA Hidehisa NAGANO Kaoru HIRAMATSU Kunio KASHINO Shin'ichi SATOH
In this paper, we first analyze the discriminative power in the Best Match 25 (BM25) formula and derive its calculation from a Bayesian point of view. The derived discriminative power is quite similar to the exponential inverse document frequency (EIDF) that we previously proposed [1] but retains more desirable theoretical properties. In our previous paper [1], we proposed the EIDF within the framework of the probabilistic information retrieval (IR) method BM25 to address the instance search task, which searches videos for a specific object using an image query. Although the effectiveness of the EIDF was experimentally demonstrated, we did not consider its theoretical justification and interpretation. We also did not describe the use of region-of-interest (ROI) information, which is supposed to be input to the instance search system together with the original image query showing the instance. Therefore, in this paper, we justify the EIDF by calculating the discriminative power in BM25 from the Bayesian viewpoint. We also investigate the effect of ROI information on instance search accuracy and propose two search methods that incorporate the ROI effect into the BM25 video ranking function. We validate the proposed methods through a series of experiments using the TREC Video Retrieval Evaluation instance search task dataset.
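For context, the standard Okapi BM25 scoring referred to above can be written as a short reference function; the k1 and b values are the usual textbook defaults, and the paper's EIDF weighting and ROI-aware variants are not reproduced here.

```python
# Reference implementation of the standard BM25 ranking formula.
import math

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_doc_len, k1=1.2, b=0.75):
    """Score one document against a bag-of-words query with standard BM25."""
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        tf = doc_terms.count(term)             # term frequency in the document
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)             # number of documents containing the term
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score
```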
Yasutaka HATAKEYAMA Takahiro OGAWA Satoshi ASAMIZU Miki HASEYAMA
A novel video retrieval method based on Web community extraction using the audio, visual, and textual features of video materials is proposed in this paper. In the proposed method, canonical correlation analysis is applied to these three kinds of features calculated from video materials and their Web pages, which enables transformation of each feature into a common variate space. The transformed variates reflect the relationships between the visual, audio, and textual features of video materials, and the similarity between video materials can be calculated in this common space for each feature. Next, the proposed method introduces the obtained similarities of video materials into the link relationships between their Web pages. Furthermore, by performing link analysis on the obtained weighted link relationships, this approach extracts Web communities containing similar topics and provides the degree of attribution of video materials to each Web community for each feature. By calculating the similarities of the degrees of attribution between the Web communities extracted from the three kinds of features, the desired communities are automatically selected. Consequently, by monitoring the degrees of attribution to the obtained Web communities, the proposed method can perform effective video retrieval. Experimental results obtained by applying the proposed method to video materials from actual Web pages verify its effectiveness.
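The first step can be illustrated with scikit-learn's two-view CCA applied to one pairing of modalities (the paper transforms three feature types into the common space); the dimensions and random data below are placeholders.

```python
# Sketch: canonical correlation analysis mapping two feature types (e.g.,
# visual and textual) into a shared variate space where cross-modal
# similarity can be computed. Data and dimensions are illustrative only.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
visual = rng.random((50, 64))    # visual features of 50 video materials
textual = rng.random((50, 32))   # textual features from their Web pages

cca = CCA(n_components=8).fit(visual, textual)
visual_c, textual_c = cca.transform(visual, textual)
similarity = cosine_similarity(visual_c)   # similarity in the common variate space
```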
Gamhewage C. DE SILVA Toshihiko YAMASAKI Kiyoharu AIZAWA
Automated capture and retrieval of experiences at home is interesting because of the wide variety and personal significance of such experiences. We present a system for retrieval and summarization of continuously captured multimedia data from Ubiquitous Home, a two-room house equipped with a large number of cameras and microphones. Data from pressure-based floor sensors are analyzed to segment the footsteps of different persons. Video and audio handover are implemented to retrieve continuous video streams that follow a moving person. An adaptive algorithm based on the rate of footsteps summarizes these video streams. A novel audio segmentation method using multiple microphones enables highly accurate video retrieval based on sounds. An experiment in which a family lived in this house for twelve days was conducted. The system was evaluated by the residents, who used it to retrieve their own experiences; we report and discuss the results.
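The adaptive summarization step might look like the following sketch, which keeps more frames in segments with a higher footstep rate; the rates, the linear mapping, and the segment length are hypothetical, not the algorithm reported in the paper.

```python
# Hypothetical sketch of footstep-rate-driven summarization: segments where
# the resident moves more get denser frame sampling.
def frames_to_keep(footstep_rates, min_fps=0.2, max_fps=2.0, segment_seconds=10):
    """Return how many frames to keep per segment, given footsteps per second."""
    peak = max(max(footstep_rates), 1e-6)
    kept = []
    for rate in footstep_rates:
        fps = min_fps + (max_fps - min_fps) * (rate / peak)
        kept.append(round(fps * segment_seconds))
    return kept

print(frames_to_keep([0.0, 0.5, 1.5, 3.0]))   # busier segments keep more frames
```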
Jong Myeon JEONG Young Shik MOON
In this paper, efficient algorithms for content-based video retrieval using motion information are proposed. We describe an algorithm for temporal-scale-invariant retrieval based on the distance transformation and an algorithm for temporal-scale-absolute retrieval based on a Motion Retrieval Code. The effectiveness of the proposed algorithms is verified by experimental results.
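One plausible reading of the temporal-scale-invariant step is chamfer-style matching against a distance transformation of a rasterized motion trajectory, sketched below with SciPy; the grid size, rasterization, and score are illustrative assumptions rather than the paper's method.

```python
# Sketch: match motion trajectories via a distance transformation so that the
# same spatial path matches regardless of how long it took (temporal scale).
import numpy as np
from scipy.ndimage import distance_transform_edt

def trail_image(points, size=32):
    """Rasterize a normalized (x, y) motion trajectory onto a binary grid."""
    img = np.zeros((size, size), dtype=bool)
    for x, y in points:
        img[int(y * (size - 1)), int(x * (size - 1))] = True
    return img

def chamfer_distance(query_pts, db_pts, size=32):
    """Average distance from query trajectory pixels to the database trail."""
    dt = distance_transform_edt(~trail_image(db_pts, size))   # distance to trail
    q = trail_image(query_pts, size)
    return dt[q].mean()

query = [(t / 9, (t / 9) ** 2) for t in range(10)]       # short trajectory
database = [(t / 29, (t / 29) ** 2) for t in range(30)]  # same path, longer duration
print(chamfer_distance(query, database))                  # small despite different lengths
```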
Teruyuki HASEGAWA Toru HASEGAWA Toshihiko KATO Kenji SUZUKI
Most current real-time video retrieval systems use video transfer protocols in which servers simply transmit video packets at the same rate as clients play them. If any packets are corrupted during transmission, they are lost and cannot be recovered by retransmission. In video retrieval systems, however, the video data are stored on servers, and clients can prefetch them prior to playing. It is therefore possible for video retrieval systems to have corrupted video packets retransmitted before the play-out deadline. However, applying existing reliable protocols causes a problem: if a packet does not arrive before the deadline because of retransmission, the packets following it are not delivered to the upper layer even if they have already arrived. In this paper, we discuss how to apply reliable protocols to real-time video retrieval systems and propose a new real-time video transfer protocol over an ATM network, which provides video data prefetch, flow control for the video buffer, selective retransmission with a skipping function for video packets that miss the play-out deadline, and a resynchronization function for the video buffer. We have implemented an experimental system using our protocol and evaluated its performance. The results show that the proposed protocol greatly decreases the amount of unplayed video data when transmission errors are introduced in the ATM network.
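The skipping rule can be sketched as a simple deadline test: a missing packet is re-requested only if a retransmission can still arrive before its play-out deadline, and is otherwise skipped so that later packets are released to the decoder. The timing values and callbacks below are illustrative, not the protocol specification.

```python
# Sketch of selective retransmission with skipping for late video packets.
import time

def handle_loss(seq, deadline, rtt_estimate, request_retransmit, skip):
    """Decide whether to retransmit a missing packet or skip past it."""
    now = time.monotonic()
    if now + rtt_estimate < deadline:
        request_retransmit(seq)    # retransmission can still meet the deadline
    else:
        skip(seq)                  # too late: release subsequent packets instead

handle_loss(seq=42,
            deadline=time.monotonic() + 0.5,
            rtt_estimate=0.08,
            request_retransmit=lambda s: print(f"retransmit {s}"),
            skip=lambda s: print(f"skip {s}"))
```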