1-13hit |
Mayu OTANI Atsushi NISHIDA Yuta NAKASHIMA Tomokazu SATO Naokazu YOKOYA
Finding important regions is essential for applications, such as content-aware video compression and video retargeting to automatically crop a region in a video for small screens. Since people are one of main subjects when taking a video, some methods for finding important regions use a visual attention model based on face/pedestrian detection to incorporate the knowledge that people are important. However, such methods usually do not distinguish important people from passers-by and bystanders, which results in false positives. In this paper, we propose a deep neural network (DNN)-based method, which classifies a person into important or unimportant, given a video containing multiple people in a single frame and captured with a hand-held camera. Intuitively, important/unimportant labels are highly correlated given that corresponding people's spatial motions are similar. Based on this assumption, we propose to boost the performance of our important/unimportant classification by using conditional random fields (CRFs) built upon the DNN, which can be trained in an end-to-end manner. Our experimental results show that our method successfully classifies important people and the use of a DNN with CRFs improves the accuracy.
We propose a method for automatic emphasis estimation using conditional random fields. In our experiments, the value of F-measure obtained using our proposed method (0.31) was higher than that obtained using a random emphasis method (0.20), a method using TF-IDF (0.21), and a method based on LexRank (0.26). On the contrary, the value of F-measure of obtained using our proposed method (0.28) was slightly worse as compared with that obtained using manual estimation (0.26-0.40, with an average of 0.35).
Masayuki SUZUKI Ryo KUROIWA Keisuke INNAMI Shumpei KOBAYASHI Shinya SHIMIZU Nobuaki MINEMATSU Keikichi HIROSE
When synthesizing speech from Japanese text, correct assignment of accent nuclei for input text with arbitrary contents is indispensable in obtaining naturally-sounding synthetic speech. A phenomenon called accent sandhi occurs in utterances of Japanese; when a word is uttered in a sentence, its accent nucleus may change depending on the contexts of preceding/succeeding words. This paper describes a statistical method for automatically predicting the accent nucleus changes due to accent sandhi. First, as the basis of the research, a database of Japanese text was constructed with labels of accent phrase boundaries and accent nucleus positions when uttered in sentences. A single native speaker of Tokyo dialect Japanese annotated all the labels for 6,344 Japanese sentences. Then, using this database, a conditional-random-field-based method was developed using this database to predict accent phrase boundaries and accent nuclei. The proposed method predicted accent nucleus positions for accent phrases with 94.66% accuracy, clearly surpassing the 87.48% accuracy obtained using our rule-based method. A listening experiment was also conducted on synthetic speech obtained using the proposed method and that obtained using the rule-based method. The results show that our method significantly improved the naturalness of synthetic speech.
Naoki SAWADA Hiromitsu NISHIZAKI
This study proposes a two-pass spoken term detection (STD) method. The first pass uses a phoneme-based dynamic time warping (DTW)-based STD, and the second pass recomputes detection scores produced by the first pass using conditional random fields (CRF)-based triphone detectors. In the second-pass, we treat STD as a sequence labeling problem. We use CRF-based triphone detection models based on features generated from multiple types of phoneme-based transcriptions. The models train recognition error patterns such as phoneme-to-phoneme confusions in the CRF framework. Consequently, the models can detect a triphone comprising a query term with a detection probability. In the experimental evaluation of two types of test collections, the CRF-based approach worked well in the re-ranking process for the DTW-based detections. CRF-based re-ranking showed 2.1% and 2.0% absolute improvements in F-measure for each of the two test collections.
Jun-Li LU Makoto P. KATO Takehiro YAMAMOTO Katsumi TANAKA
We address the problem of entity identification on a microblog with special attention to indirect reference cases in which entities are not referred to by their names. Most studies on identifying entities referred to them by their full/partial name or abbreviation, while there are many indirectly mentioned entities in microblogs, which are difficult to identify in short text such as microblogs. We therefore tackled indirect reference cases by developing features that are particularly important for certain types of indirect references and modeling dependency among referred entities by a Conditional Random Field (CRF) model. In addition, we model non-sequential order dependency while keeping the inference tractable by dynamically building dependency among entities. The experimental results suggest that our features were effective for indirect references, and our CRF model with adaptive dependency was robust even when there were multiple mentions in a microblog and achieved the same high performance as that with the fully connected CRF model.
Suofei ZHANG Zhixin SUN Xu CHENG Lin ZHOU
This work presents an object tracking framework which is based on integration of Deformable Part based Models (DPMs) and Dynamic Conditional Random Fields (DCRF). In this framework, we propose a DCRF based novel way to track an object and its details on multiple resolutions simultaneously. Meanwhile, we tackle drastic variations in target appearance such as pose, view, scale and illumination changes with DPMs. To embed DPMs into DCRF, we design specific temporal potential functions between vertices by explicitly formulating deformation and partial occlusion respectively. Furthermore, temporal transition functions between mixture models bring higher robustness to perspective and pose changes. To evaluate the efficacy of our proposed method, quantitative tests on six challenging video sequences are conducted and the results are analyzed. Experimental results indicate that the method effectively addresses serious problems in object tracking and performs favorably against state-of-the-art trackers.
Yuta NAKASHIMA Noboru BABAGUCHI Jianping FAN
The recent popularization of social network services (SNSs), such as YouTube, Dailymotion, and Facebook, enables people to easily publish their personal videos taken with mobile cameras. However, at the same time, such popularity has raised a new problem: video privacy. In such social videos, the privacy of people, i.e., their appearances, must be protected, but naively obscuring all people might spoil the video content. To address this problem, we focus on videographers' capture intentions. In a social video, some persons are usually essential for the video content. They are intentionally captured by the videographers, called intentionally captured persons (ICPs), and the others are accidentally framed-in (non-ICPs). Videos containing the appearances of the non-ICPs might violate their privacy. In this paper, we developed a system called BEPS, which adopts a novel conditional random field (CRF)-based method for ICP detection, as well as a novel approach to obscure non-ICPs and preserve ICPs using background estimation. BEPS reduces the burden of manually obscuring the appearances of the non-ICPs before uploading the video to SNSs. Compared with conventional systems, the following are the main advantages of BEPS: (i) it maintains the video content, and (ii) it is immune to the failure of person detection; false positives in person detection do not violate privacy. Our experimental results successfully validated these two advantages.
Yinhui ZHANG Zifen HE Changyu LIU
Segmenting foreground objects from highly dynamic scenes with missing data is very challenging. We present a novel unsupervised segmentation approach that can cope with extensive scene dynamic as well as a substantial amount of missing data that present in dynamic scene. To make this possible, we exploit convex optimization of total variation beforehand for images with missing data in which depletion mask is available. Inpainting depleted images using total variation facilitates detecting ambiguous objects from highly dynamic images, because it is more likely to yield areas of object instances with improved grayscale contrast. We use a conditional random field that adapts to integrate both appearance and motion knowledge of the foreground objects. Our approach segments foreground object instances while inpainting the highly dynamic scene with a variety amount of missing data in a coupled way. We demonstrate this on a very challenging dataset from the UCSD Highly Dynamic Scene Benchmarks (HDSB) and compare our method with two state-of-the-art unsupervised image sequence segmentation algorithms and provide quantitative and qualitative performance comparisons.
Yanling LI Qingwei ZHAO Yonghong YAN
Semantic concept in an utterance is obtained by a fuzzy matching methods to solve problems such as words' variation induced by automatic speech recognition (ASR), or missing field of key information by users in the process of spoken language understanding (SLU). A two-stage method is proposed: first, we adopt conditional random field (CRF) for building probabilistic models to segment and label entity names from an input sentence. Second, fuzzy matching based on similarity function is conducted between the named entities labeled by a CRF model and the reference characters of a dictionary. The experiments compare the performances in terms of accuracy and processing speed. Dice similarity and cosine similarity based on TF score can achieve better accuracy performance among four similarity measures, which equal to and greater than 93% in F1-measure. Especially the latter one improved by 8.8% and 9% respectively compared to q-gram and improved edit-distance, which are two conventional methods for string fuzzy matching.
Yasuhisa FUJII Kazumasa YAMAMOTO Seiichi NAKAGAWA
In this paper, we propose Hidden Conditional Neural Fields (HCNF) for continuous phoneme speech recognition, which are a combination of Hidden Conditional Random Fields (HCRF) and a Multi-Layer Perceptron (MLP), and inherit their merits, namely, the discriminative property for sequences from HCRF and the ability to extract non-linear features from an MLP. HCNF can incorporate many types of features from which non-linear features can be extracted, and is trained by sequential criteria. We first present the formulation of HCNF and then examine three methods to further improve automatic speech recognition using HCNF, which is an objective function that explicitly considers training errors, provides a hierarchical tandem-style feature and includes a deep non-linear feature extractor for the observation function. We show that HCNF can be trained realistically without any initial model and outperforms HCRF and the triphone hidden Markov model trained by the minimum phone error (MPE) manner using experimental results for continuous English phoneme recognition on the TIMIT core test set and Japanese phoneme recognition on the IPA 100 test set.
Xiaoxuan WANG Lei XIE Mimi LU Bin MA Eng Siong CHNG Haizhou LI
In this paper, we propose integration of multimodal features using conditional random fields (CRFs) for the segmentation of broadcast news stories. We study story boundary cues from lexical, audio and video modalities, where lexical features consist of lexical similarity, chain strength and overall cohesiveness; acoustic features involve pause duration, pitch, speaker change and audio event type; and visual features contain shot boundaries, anchor faces and news title captions. These features are extracted in a sequence of boundary candidate positions in the broadcast news. A linear-chain CRF is used to detect each candidate as boundary/non-boundary tags based on the multimodal features. Important interlabel relations and contextual feature information are effectively captured by the sequential learning framework of CRFs. Story segmentation experiments show that the CRF approach outperforms other popular classifiers, including decision trees (DTs), Bayesian networks (BNs), naive Bayesian classifiers (NBs), multilayer perception (MLP), support vector machines (SVMs) and maximum entropy (ME) classifiers.
Chinese new words and their part-of-speech (POS) are particularly problematic in Chinese natural language processing. With the fast development of internet and information technology, it is impossible to get a complete system dictionary for Chinese natural language processing, as new words out of the basic system dictionary are always being created. A latent semi-CRF model, which combines the strengths of LDCRF (Latent-Dynamic Conditional Random Field) and semi-CRF, is proposed to detect the new words together with their POS synchronously regardless of the types of the new words from the Chinese text without being pre-segmented. Unlike the original semi-CRF, the LDCRF is applied to generate the candidate entities for training and testing the latent semi-CRF, which accelerates the training speed and decreases the computation cost. The complexity of the latent semi-CRF could be further adjusted by tuning the number of hidden variables in LDCRF and the number of the candidate entities from the Nbest outputs of the LDCRF. A new-words-generating framework is proposed for model training and testing, under which the definitions and distributions of the new words conform to the ones existing in real text. Specific features called "Global Fragment Information" for new word detection and POS tagging are adopted in the model training and testing. The experimental results show that the proposed method is capable of detecting even low frequency new words together with their POS tags. The proposed model is found to be performing competitively with the state-of-the-art models presented.
Xuan-Hieu PHAN Le-Minh NGUYEN Yasushi INOGUCHI Susumu HORIGUCHI
Conditional random fields (CRFs) have been successfully applied to various applications of predicting and labeling structured data, such as natural language tagging & parsing, image segmentation & object recognition, and protein secondary structure prediction. The key advantages of CRFs are the ability to encode a variety of overlapping, non-independent features from empirical data as well as the capability of reaching the global normalization and optimization. However, estimating parameters for CRFs is very time-consuming due to an intensive forward-backward computation needed to estimate the likelihood function and its gradient during training. This paper presents a high-performance training of CRFs on massively parallel processing systems that allows us to handle huge datasets with hundreds of thousand data sequences and millions of features. We performed the experiments on an important natural language processing task (text chunking) on large-scale corpora and achieved significant results in terms of both the reduction of computational time and the improvement of prediction accuracy.