Maoxi LI Qingyu XIANG Zhiming CHEN Mingwen WANG
The state-of-the-art neural quality estimation (QE) model for machine translation consists of two sub-networks that are tuned separately: a bidirectional recurrent neural network (RNN) encoder-decoder trained for neural machine translation, called the predictor, and an RNN trained for the sentence-level QE task, called the estimator. We propose to combine the two sub-networks into a single network, called the unified neural network. During training, the bidirectional RNN encoder-decoder is initialized and pre-trained on the bilingual parallel corpus, and then the whole network is trained jointly to minimize the mean absolute error over the QE training samples. Compared with the predictor-estimator approach, the unified neural network yields parameters that are better suited to the QE task. Experimental results on the benchmark data set of the WMT17 sentence-level QE shared task show that the proposed unified neural network approach consistently outperforms the predictor-estimator approach and significantly outperforms the other baseline QE approaches.
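A minimal sketch of how the joint fine-tuning described above might look, assuming a PyTorch implementation; the hidden sizes, the fixed mean-pooled context, and the feature interface between predictor and estimator are illustrative assumptions, not the authors' exact architecture.

```python
# Hedged sketch: a predictor (bi-RNN encoder-decoder) and an estimator (RNN
# regressor) trained jointly with a mean-absolute-error objective.
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Bidirectional RNN encoder-decoder, reduced here to producing
    per-target-token feature vectors from source/target token ids."""
    def __init__(self, vocab=10000, emb=64, hid=64):
        super().__init__()
        self.src_emb = nn.Embedding(vocab, emb)
        self.tgt_emb = nn.Embedding(vocab, emb)
        self.encoder = nn.GRU(emb, hid, bidirectional=True, batch_first=True)
        self.decoder = nn.GRU(emb + 2 * hid, hid, batch_first=True)

    def forward(self, src, tgt):
        enc, _ = self.encoder(self.src_emb(src))          # (B, S, 2*hid)
        ctx = enc.mean(dim=1, keepdim=True)               # crude fixed context
        ctx = ctx.expand(-1, tgt.size(1), -1)             # (B, T, 2*hid)
        dec_in = torch.cat([self.tgt_emb(tgt), ctx], dim=-1)
        feats, _ = self.decoder(dec_in)                   # (B, T, hid)
        return feats

class Estimator(nn.Module):
    """RNN mapping the predictor's token-level features to a sentence score."""
    def __init__(self, hid=64):
        super().__init__()
        self.rnn = nn.GRU(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, 1)

    def forward(self, feats):
        _, h = self.rnn(feats)
        return self.out(h[-1]).squeeze(-1)                # (B,)

predictor, estimator = Predictor(), Estimator()
# ... pre-train `predictor` on the bilingual parallel corpus here ...
params = list(predictor.parameters()) + list(estimator.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
mae = nn.L1Loss()                                         # mean absolute error

src = torch.randint(0, 10000, (8, 20))                    # toy QE batch
tgt = torch.randint(0, 10000, (8, 22))
hter = torch.rand(8)                                      # sentence-level labels
opt.zero_grad()
loss = mae(estimator(predictor(src, tgt)), hter)          # joint objective
loss.backward()
opt.step()
```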
Yu ZHANG Pengyuan ZHANG Qingwei ZHAO
In this letter, we explore the use of spatio-temporal information in one unified framework to improve the performance of multichannel speech recognition. Generalized cross-correlation (GCC) serves as spatial feature compensation, and an attention mechanism across time is embedded within long short-term memory (LSTM) neural networks. Experiments on the AMI meeting corpus show that the proposed method provides an 8.2% relative improvement in word error rate (WER) over the model trained directly on the concatenation of multiple microphone outputs.
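The GCC spatial feature is not specified in detail above; a common choice is GCC-PHAT between microphone pairs. A minimal sketch under that assumption, with illustrative frame length and lag range:

```python
# Hedged sketch of a GCC-PHAT spatial feature for one microphone pair and
# one analysis frame; window length and number of retained lags are assumptions.
import numpy as np

def gcc_phat(x1, x2, max_lag=25, n_fft=512):
    """GCC-PHAT between two microphone channels for one frame."""
    X1 = np.fft.rfft(x1, n=n_fft)
    X2 = np.fft.rfft(x2, n=n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                        # PHAT weighting
    cc = np.fft.irfft(cross, n=n_fft)
    cc = np.concatenate([cc[-max_lag:], cc[:max_lag + 1]])  # center the lags
    return cc                                             # (2 * max_lag + 1,) feature

frame = np.random.randn(400)                              # toy 25 ms frame at 16 kHz
delayed = np.roll(frame, 3)                               # second mic ~3 samples later
feat = gcc_phat(frame, delayed)
print(feat.argmax() - 25)                                 # ≈ -3: recovered time difference
```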
Satoshi KAWASE Takayuki ITO Takanobu OTSUKA Akihisa SENGOKU Shun SHIRAMATSU Tokuro MATSUO Tetsuya OISHI Rieko FUJITA Naoki FUKUTA Katsuhide FUJITA
Performance based on multi-party discussion has been reported to be superior to that based on individuals. However, it is impossible for all participants to express opinions simultaneously, owing to time and space limitations in a large-scale discussion. In particular, only a few representative discussants and audience members can speak in conventional unidirectional discussions (e.g., panel discussions), although many participants gather for the discussion. To solve these problems, in this study we proposed a cyber-physical discussion using “COLLAGREE,” which we developed for building consensus in large-scale online discussions. COLLAGREE is equipped with functions such as facilitation, a point ranking system, and display of the discussion in a tree structure. We focused on the relationship between satisfaction with the discussion and participants' desire to express opinions. We conducted the experiment in the panel discussion of an actual international conference. Participants who were audience members on the floor used COLLAGREE during the panel discussion and responded to questionnaires after the experiment. The main findings are as follows: (1) participation in the online discussion was associated with participant satisfaction; (2) participants who wished to actively express opinions joined the cyber-space discussion; and (3) the satisfaction of participants who expressed opinions in the cyber-space discussion was higher than that of participants who expressed opinions in the real-space discussion and that of participants who expressed opinions in neither the cyber- nor the real-space discussion. Overall, active behavior in the cyber-space discussion was associated with participants' satisfaction with the entire discussion, suggesting that cyberspace provides useful alternative opportunities to express opinions for audience members who would otherwise listen to conventional unidirectional discussions passively. In addition, a complementary relationship exists between participation in the cyber-space and real-space discussions. These findings can serve to create a user-friendly discussion environment.
Yang LI Zhuang MIAO Ming HE Yafei ZHANG Hang LI
How to represent images as highly compact binary codes is a critical issue in many computer vision tasks. Existing deep hashing methods typically focus on designing loss functions using pairwise or triplet labels. However, these methods ignore the attention mechanism of the human visual system. In this letter, we propose a novel Deep Attention Residual Hashing (DARH) method, which directly learns hash codes based on a simple pointwise classification loss function. Compared to previous methods, our method does not need to generate all possible pairwise or triplet labels from the training dataset. Specifically, we develop a new type of attention layer that can learn human eye fixation and significantly improves the representation ability of hash codes. In addition, we embed the attention layer into a residual network to simultaneously learn discriminative image features and hash codes in an end-to-end manner. Extensive experiments on standard benchmarks demonstrate that our method preserves instance-level similarity and outperforms state-of-the-art deep hashing methods in the image retrieval application.
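A hedged sketch of the two ingredients named above, an attention layer inside a residual block and hash learning from a pointwise classification loss, assuming a PyTorch implementation; the layer sizes, hash length, and tanh relaxation are illustrative choices.

```python
# Hedged sketch: attention-modulated residual block plus a pointwise
# classification head over relaxed hash codes (no pairs or triplets needed).
import torch
import torch.nn as nn

class AttentionResidualBlock(nn.Module):
    """Residual block whose output is reweighted by a learned spatial attention map."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        self.attn = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())  # soft mask

    def forward(self, x):
        f = self.body(x)
        return torch.relu(x + f * self.attn(f))           # attention-modulated residual

class DeepHashNet(nn.Module):
    def __init__(self, bits=48, classes=10):
        super().__init__()
        self.stem = nn.Conv2d(3, 64, 7, stride=2, padding=3)
        self.block = AttentionResidualBlock(64)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.hash = nn.Linear(64, bits)                    # hash layer (tanh-relaxed bits)
        self.cls = nn.Linear(bits, classes)                # pointwise classification head

    def forward(self, x):
        h = self.pool(self.block(torch.relu(self.stem(x)))).flatten(1)
        codes = torch.tanh(self.hash(h))
        return codes, self.cls(codes)

net = DeepHashNet()
images = torch.randn(4, 3, 64, 64)
labels = torch.randint(0, 10, (4,))
codes, logits = net(images)
loss = nn.CrossEntropyLoss()(logits, labels)               # simple pointwise loss
loss.backward()
binary = torch.sign(codes)                                  # binary codes at retrieval time
```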
Dengchao HE Hongjun ZHANG Wenning HAO Rui ZHANG Huan HAO
The purpose of document modeling is to learn accurate low-dimensional semantic representations of text for natural language processing tasks. In this paper, we propose a novel attention-based hybrid neural network model that extracts semantic features of text hierarchically. Concretely, our model adopts a bidirectional LSTM module with word-level attention to extract semantic information for each sentence in the text and subsequently learns higher-level features via a dynamic convolutional neural network module. Experimental results demonstrate that our proposed approach is effective and achieves better performance than conventional methods.
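A minimal sketch of the described hierarchy, assuming a PyTorch implementation: word-level attention over a bidirectional LSTM per sentence, followed by a convolution over the sentence vectors. The dimensions and the fixed max-pooling (in place of the paper's dynamic pooling) are assumptions.

```python
# Hedged sketch: sentence encoding by attentive BiLSTM, then a convolution
# over the sequence of sentence vectors to obtain a document representation.
import torch
import torch.nn as nn

class AttentiveBiLSTMSentence(nn.Module):
    """Encode one sentence: BiLSTM states weighted by word-level attention."""
    def __init__(self, vocab=5000, emb=64, hid=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hid, 1)

    def forward(self, words):                              # (B, n_words)
        h, _ = self.lstm(self.emb(words))                  # (B, n_words, 2*hid)
        a = torch.softmax(self.attn(h), dim=1)             # word-level attention weights
        return (a * h).sum(dim=1)                          # (B, 2*hid) sentence vector

class HybridDocumentModel(nn.Module):
    def __init__(self, hid=64, classes=5):
        super().__init__()
        self.sent_enc = AttentiveBiLSTMSentence(hid=hid)
        self.conv = nn.Conv1d(2 * hid, 128, kernel_size=3, padding=1)
        self.out = nn.Linear(128, classes)

    def forward(self, doc):                                # (B, n_sents, n_words)
        B, S, W = doc.shape
        sents = self.sent_enc(doc.view(B * S, W)).view(B, S, -1)
        feats = torch.relu(self.conv(sents.transpose(1, 2)))  # convolution over sentences
        return self.out(feats.max(dim=2).values)           # max-pool, then classify

model = HybridDocumentModel()
doc = torch.randint(0, 5000, (2, 6, 12))                   # 2 docs, 6 sentences, 12 words
print(model(doc).shape)                                    # torch.Size([2, 5])
```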
Hironori TAKIMOTO Syuhei HITOMI Hitoshi YAMAUCHI Mitsuyoshi KISHIHARA Kensuke OKUBO
It is estimated that 80% of the information entering the human brain is obtained through the eyes. Therefore, it is commonly believed that drawing human attention to particular objects is effective in assisting human activities. In this paper, we propose a novel image modification method for guiding user attention to specific regions of interest, using a saliency map model based on spatial frequency components. We modify the frequency components on the basis of the obtained saliency map so as to decrease the visual saliency outside the specified region. By applying our modification method to an image, human attention can be guided to the specified region because the saliency inside the region becomes higher than that outside it. Using gaze measurements, we show that the proposed saliency map matches well with the distribution of actual human attention. Moreover, we evaluate the effectiveness of the proposed modification method using an eye tracking system.
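One plausible reading of the frequency-component modification is to blend regions outside the specified area toward a low-pass version of the image, weighted by their current saliency. A sketch under that assumption; the Gaussian filter, blending rule, and strength parameter are illustrative, not the authors' formulation.

```python
# Hedged sketch: lower saliency outside a region of interest by attenuating
# spatial-frequency content there, proportionally to the current saliency.
import numpy as np

def suppress_outside_roi(img, roi_mask, saliency, strength=0.8, cutoff=0.08):
    """img: (H, W) grayscale in [0, 1]; roi_mask, saliency: (H, W) in [0, 1]."""
    H, W = img.shape
    fy = np.fft.fftfreq(H)[:, None]
    fx = np.fft.fftfreq(W)[None, :]
    lowpass = np.exp(-(fx ** 2 + fy ** 2) / (2 * cutoff ** 2))   # Gaussian low-pass
    smooth = np.real(np.fft.ifft2(np.fft.fft2(img) * lowpass))   # high frequencies removed
    # Blend toward the low-pass image outside the ROI, more strongly where the
    # current saliency is high, so the ROI becomes the most salient area.
    w = strength * saliency * (1.0 - roi_mask)
    return (1.0 - w) * img + w * smooth

img = np.random.rand(128, 128)
roi = np.zeros_like(img); roi[40:90, 40:90] = 1.0
sal = np.random.rand(128, 128)                                   # stand-in saliency map
out = suppress_outside_roi(img, roi, sal)
print(out.shape)
```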
Takatsugu HIRAYAMA Toshiya OHIRA Kenji MASE
Intelligent information systems captivate people's attention. Examples of such systems include driving support vehicles capable of sensing driver state and communication robots capable of interacting with humans. Modeling how people search for visual information is indispensable for designing these kinds of systems. In this paper, we focus on human visual attention, which is closely related to visual search behavior. We propose a computational model to estimate human visual attention during a visual target search task. Existing models estimate visual attention using the ratio between a representative value of a visual feature of the target stimulus and that of distractors or the background. These models, however, often cannot achieve good performance on difficult search tasks that require a sequential spotlighting process. For such tasks, the linear separability effect of a visual feature distribution should be considered. Hence, we introduce this effect into spatially localized activation. Concretely, our top-down model estimates target-specific visual attention using Fisher's variance ratio between the visual feature distribution of a local region in the field of view and that of the target stimulus. We confirm the effectiveness of our computational model through a visual search experiment.
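A hedged sketch of how Fisher's variance ratio between a local-region feature distribution and the target's feature distribution might be turned into an attention map; the one-dimensional feature, the window size, and the assumption that low separability implies high attention are illustrative choices, not the authors' exact formulation.

```python
# Hedged sketch: Fisher's variance ratio per local window versus the target's
# feature samples; windows that are hard to separate from the target (low ratio)
# are assumed to attract attention (illustrative interpretation).
import numpy as np

def fisher_ratio(local_feats, target_feats):
    """Between-class separation over within-class spread for 1-D features."""
    m_l, m_t = local_feats.mean(), target_feats.mean()
    v_l, v_t = local_feats.var(), target_feats.var()
    return (m_l - m_t) ** 2 / (v_l + v_t + 1e-12)

def attention_map(feature_img, target_feats, win=15):
    H, W = feature_img.shape
    r = win // 2
    attn = np.zeros((H, W))
    for y in range(r, H - r):
        for x in range(r, W - r):
            local = feature_img[y - r:y + r + 1, x - r:x + r + 1].ravel()
            attn[y, x] = 1.0 / (1.0 + fisher_ratio(local, target_feats))
    return attn

field = np.random.rand(64, 64)             # e.g., an orientation or color feature map
target = np.random.rand(50) * 0.2 + 0.7    # feature samples from the target stimulus
print(attention_map(field, target).shape)  # (64, 64)
```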
Hironori TAKIMOTO Tatsuhiko KOKUI Hitoshi YAMAUCHI Mitsuyoshi KISHIHARA Kensuke OKUBO
It is commonly believed that, to improve interaction between humans and electronic devices, it is effective to draw the viewer's attention to a particular object. Augmented reality (AR) applications can call attention to real objects by overlaying highlight effects or visual stimuli (such as arrows) on a physical scene. Sometimes more subtle effects are desirable, in which case it is necessary to smoothly and naturally guide the user's gaze without external stimuli. Here, a novel image modification method is proposed for directing a viewer's gaze to specific regions of interest. The proposed method uses saliency analysis and color modulation to create modified images in which the region of interest is the most salient region in the entire image. The saliency map model used during saliency analysis reduces computational costs and improves the naturalness of the image by using the LAB color space and simplified normalization. During color modulation, the modulation value of each LAB component is determined by considering the relationship between the LAB components and the saliency value. With the image obtained in this manner, the viewer's attention is smoothly and very naturally attracted to the specific region. Gaze measurements as well as subjective experiments were conducted to prove the effectiveness of the proposed method. The results show that a viewer's visual attention is indeed attracted toward the specified region, without any sense of discomfort or disruption, when the proposed method is used.
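A minimal sketch of saliency-driven color modulation in the LAB color space, assuming scikit-image for the color conversion; the fixed per-channel modulation amounts stand in for the paper's saliency-dependent modulation values.

```python
# Hedged sketch: raise the LAB components inside the region of interest and
# lower them outside, so the region becomes the most salient part of the image.
import numpy as np
from skimage import color

def modulate(rgb, roi_mask, delta=(8.0, 5.0, 5.0)):
    """rgb: (H, W, 3) floats in [0, 1]; roi_mask: (H, W) in {0, 1}."""
    lab = color.rgb2lab(rgb)
    sign = 2.0 * roi_mask - 1.0                       # +1 inside ROI, -1 outside
    for c, d in enumerate(delta):                     # L, a, b channels
        lab[..., c] = lab[..., c] + sign * d          # fixed illustrative shifts
    lab[..., 0] = np.clip(lab[..., 0], 0.0, 100.0)    # keep lightness valid
    return np.clip(color.lab2rgb(lab), 0.0, 1.0)

img = np.random.rand(64, 64, 3)
roi = np.zeros((64, 64)); roi[20:44, 20:44] = 1.0
out = modulate(img, roi)
print(out.shape)
```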
Ruiyu LIANG Huawei TAO Guichen TANG Qingyun WANG Li ZHAO
A salient feature extraction algorithm is proposed to improve the recognition rate of speech emotion. Firstly, the spectrogram of the emotional speech is calculated. Secondly, imitating the selective attention mechanism, the color, direction, and brightness maps of the spectrogram are computed. Each map is normalized and down-sampled to form a low-resolution feature matrix. Then each feature matrix is converted to a row vector, and principal component analysis (PCA) is used to reduce feature redundancy and make the subsequent classification algorithm more practical. Finally, the speech emotion is classified with a support vector machine. Compared with traditional features, the improvement in recognition rate reaches 15%.
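A hedged sketch of the pipeline steps listed above, using only a brightness-like spectrogram map for brevity; the frame parameters, down-sampled resolution, and PCA dimensionality are illustrative assumptions.

```python
# Hedged sketch: spectrogram -> normalized low-resolution map -> row vector
# -> PCA -> SVM classifier, on toy random data.
import numpy as np
from scipy.signal import spectrogram
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def emotion_feature(wave, sr=16000, out_shape=(16, 16)):
    _, _, S = spectrogram(wave, fs=sr, nperseg=400, noverlap=240)
    S = np.log1p(S)                                        # brightness-like map
    S = (S - S.min()) / (S.max() - S.min() + 1e-12)        # normalize
    ys = np.linspace(0, S.shape[0] - 1, out_shape[0]).astype(int)
    xs = np.linspace(0, S.shape[1] - 1, out_shape[1]).astype(int)
    return S[np.ix_(ys, xs)].ravel()                       # low-resolution row vector

# Toy corpus: 40 random one-second utterances with 4 emotion labels.
X = np.stack([emotion_feature(np.random.randn(16000)) for _ in range(40)])
y = np.repeat(np.arange(4), 10)
X = PCA(n_components=20).fit_transform(X)                  # reduce feature redundancy
clf = SVC(kernel="rbf").fit(X, y)
print(clf.score(X, y))
```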
Selective visual attention is an integral mechanism of the human visual system that is often neglected when designing perceptually relevant image and video quality metrics. Disregarding attention mechanisms assumes that all distortions in the visual content impact equally on the overall quality perception, which is typically not the case. Over the past years we have performed several experiments to study the effect of visual attention on quality perception. In addition to gaining a deeper scientific understanding of this matter, we were also able to use this knowledge to further improve various quality prediction models. In this article, I review our work with the aim to increase awareness on the importance of visual attention mechanisms for the effective design of quality prediction models.
Integrating a visual attention (VA) model into an objective image quality metric is a rapidly evolving area in modern image quality assessment (IQA) research because of the significant opportunities the VA information presents. So far, the literature has suggested using either a task-free saliency map or a quality-task one for integration into a quality metric. A hybrid integration approach that takes advantage of both saliency maps is presented in this paper. We compare our hybrid integration scheme with existing integration schemes using simple quality metrics. Results show that the proposed method performs better than the previous techniques in terms of prediction accuracy.
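A minimal sketch of the integration idea, assuming the hybrid map is a convex combination of the task-free and quality-task saliency maps and the base metric is a simple squared-error map; both choices are illustrative, not the paper's exact scheme.

```python
# Hedged sketch: weight a local distortion map by a hybrid saliency map.
import numpy as np

def saliency_weighted_mse(ref, dist, sal_free, sal_task, alpha=0.5):
    """All inputs are (H, W) arrays; saliency maps are non-negative."""
    sal = alpha * sal_free + (1.0 - alpha) * sal_task    # hybrid saliency map
    err = (ref - dist) ** 2                              # local distortion map
    return float((sal * err).sum() / (sal.sum() + 1e-12))

ref = np.random.rand(64, 64)
dist = ref + 0.05 * np.random.randn(64, 64)
s_free = np.random.rand(64, 64)
s_task = np.random.rand(64, 64)
print(saliency_weighted_mse(ref, dist, s_free, s_task))
```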
Akihiko KITAMURA Hiroshi NAITO Takahiko KIMURA Kazumitsu SHINOHARA Takashi SASAKI Haruhiko OKUMURA
This study investigated the distribution of attention over frontal space in augmented reality (AR). We conducted two experiments to compare binocular and monocular observation when an AR image was presented. According to a previous study, when participants observed an AR image monocularly, they perceived it as more distant than in binocular viewing. Therefore, we predicted that attention would need to be shifted between the AR image and the background in the binocular observation but not in the monocular one, which would enable an observer to distribute visual attention across a wider space under monocular observation. In the experiments, participants performed two tasks concurrently to measure the size of the useful field of view (UFOV). One task was letter/number discrimination, in which an AR image was presented in the central field of view (the central task). The other task was luminance change detection, in which dots were presented in the peripheral field of view (the peripheral task). A depth difference existed between the AR image and the location of the peripheral task in Experiment 1 but not in Experiment 2. The results of Experiment 1 indicated that the UFOV became wider in the monocular observation than in the binocular observation. In Experiment 2, the size of the UFOV in the monocular observation was equivalent to that in the binocular observation. It becomes difficult for a participant to observe stimuli on the background under binocular observation when there is a depth difference between the AR image and the background. These results indicate that monocular presentation in AR is superior to binocular presentation and that, even under the condition most favorable to binocular presentation, monocular presentation is equivalent to it in terms of the UFOV.
Xing ZHANG Keli HU Lei WANG Xiaolin ZHANG Yingguan WANG
In this study, we address the problem of salient region detection. Recently, contrast-based approaches to saliency detection have been shown to give promising results. However, different individual features exhibit different performance. In this paper, we show that the combination of color uniqueness and color spatial distribution is an effective way to detect saliency. A Color Adaptive Thresholding Watershed Fusion Segmentation (CAT-WFS) method is first given to retain boundary information and delete unnecessary details. Based on this segmentation, color uniqueness and color spatial distribution are defined separately. Color uniqueness denotes the color rareness of the salient object, while color spatial distribution represents the color attributes of the background. Aiming at highlighting the salient object and downplaying the background, we combine the two characteristics to generate the final saliency map. Experimental results demonstrate that the proposed algorithm outperforms existing salient object detection methods.
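A hedged sketch of the two per-segment cues, computed from assumed segment statistics (mean color, centroid, area); the exponential weighting and the multiplicative fusion are illustrative choices rather than the paper's exact definitions.

```python
# Hedged sketch: color uniqueness (rarity of a segment's mean color) and color
# spatial distribution (spatial spread of similar colors), fused per segment.
import numpy as np

def saliency_from_segments(mean_color, centroid, area, sigma_c=0.2):
    """mean_color: (N, 3) in [0, 1]; centroid: (N, 2) normalized; area: (N,)."""
    cdist = np.linalg.norm(mean_color[:, None] - mean_color[None, :], axis=2)
    w_area = area / area.sum()
    uniqueness = (cdist * w_area[None, :]).sum(axis=1)          # rare colors score high
    sim = np.exp(-cdist ** 2 / (2 * sigma_c ** 2))              # color similarity weights
    sim /= sim.sum(axis=1, keepdims=True)
    mu = sim @ centroid                                          # weighted mean position
    spread = (sim * np.linalg.norm(centroid[None, :, :] - mu[:, None, :], axis=2)).sum(axis=1)
    distribution = np.exp(-spread)                               # compact colors score high
    sal = uniqueness * distribution                              # object up, background down
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)

colors = np.random.rand(12, 3)
cents = np.random.rand(12, 2)
areas = np.random.rand(12) + 0.1
print(saliency_from_segments(colors, cents, areas).round(3))
```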
Rong WANG Zhiliang WANG Xirong MA
For the problem of indoor home scene classification, this paper proposes a bag-of-words (BOW) model based on local feature information gain. The experimental results show that not only is the performance improved but the computation is also reduced. Consequently, this method outperforms the state-of-the-art approach.
Akisato KIMURA Ryo YONETANI Takatsugu HIRAYAMA
We humans are easily able to instantaneously detect the regions in a visual scene that are most likely to contain something of interest. Exploiting this pre-selection mechanism called visual attention for image and video processing systems would make them more sophisticated and therefore more useful. This paper briefly describes various computational models of human visual attention and their development, as well as related psychophysical findings. In particular, our objective is to carefully distinguish several types of studies related to human visual attention and saliency as a measure of attentiveness, and to provide a taxonomy from several viewpoints such as the main objective, the use of additional cues and mathematical principles. This survey finally discusses possible future directions for research into human visual attention and saliency computation.
Zhenfeng SHI Liyang YU Ahmed A. ABD EL-LATIF Xiamu NIU
Incorporating insights from human visual perception into 3D object processing has become an important research field in computer graphics during the past decades. Many computational models for different applications have been proposed, such as mesh saliency, mesh roughness, and mesh skeletons. In this letter, we present a novel Skeleton Modulated Topological Visual Perception Map (SMTPM) that integrates visual attention and visual masking mechanisms. A new skeletonisation map is presented and used to modulate the weights of saliency and roughness. Inspired by salient viewpoint selection, a new rapid viewpoint selection algorithm based on Loop subdivision stencil decision and our new visual perception map is also proposed. Experimental results show that the SMTPM scheme captures richer visual perception information and that our rapid viewpoint selection achieves high efficiency.
Visual saliency detection provides an alternative methodology for image description in many applications such as adaptive content delivery and image retrieval. One of the main aims of visual attention in computer vision is to detect and segment the salient regions in an image. In this paper, we employ matrix decomposition to detect salient objects in natural images. To efficiently eliminate high-contrast noise regions in the background, we integrate global context information into saliency detection. Therefore, the most salient region can easily be selected as the one that is globally most isolated. The proposed approach intrinsically provides an alternative methodology for modeling attention, with low implementation complexity. Experiments show that our approach achieves much better performance than existing state-of-the-art methods.
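The matrix decomposition is not detailed above; one common reading is a low-rank (background) plus residual (salient) split of a patch-feature matrix. A sketch under that assumption, omitting the global-context term; the patch features and rank are illustrative.

```python
# Hedged sketch: split a patch-feature matrix into a low-rank background part
# and a residual part, and read the residual magnitude as patch-level saliency.
import numpy as np

def decomposition_saliency(img, patch=8, rank=3):
    """img: (H, W) grayscale with H, W divisible by `patch`."""
    H, W = img.shape
    cols = [img[y:y + patch, x:x + patch].ravel()
            for y in range(0, H, patch) for x in range(0, W, patch)]
    F = np.stack(cols, axis=1)                        # (patch*patch, n_patches)
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # low-rank background part
    E = F - L                                          # residual, salient part
    sal_patch = np.linalg.norm(E, axis=0)              # one score per patch
    sal = sal_patch.reshape(H // patch, W // patch)
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)

img = np.random.rand(64, 64)
img[24:40, 24:40] += 1.0                               # a high-contrast "object"
print(decomposition_saliency(img).shape)               # (8, 8) patch-level map
```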
Xin HE Huiyun JING Qi HAN Xiamu NIU
We propose a novel saliency detection model based on Bayes' theorem. The model integrates the two parts of Bayes' equation to measure saliency, whereas previous models considered each part separately. The proposed model measures saliency by computing the local kernel density estimate of features in the center-surround region and the global kernel density estimate of features at each pixel across the whole image. Under the proposed model, a saliency detection method is presented that extracts the DCT (discrete cosine transform) magnitude of the local region around each pixel as the feature. Experiments show that the proposed model not only performs competitively on psychological patterns and better than the current state-of-the-art models on human visual fixation data, but is also robust against signal uncertainty.
Hong BAO Song-He FENG De XU Shuoyan LIU
Localized content-based image retrieval (LCBIR) has recently emerged as a hot topic because, in the CBIR scenario, the user is often interested in only a portion of the image, and the rest of the image is irrelevant. In this paper, we propose a novel region-level relevance feedback method to solve the LCBIR problem. Firstly, a visual attention model is employed to measure the regional saliency of each image in the feedback image set provided by the user. Secondly, the regions in the image set are used to construct an affinity matrix, and a novel propagation energy function is defined that takes both low-level visual features and regional significance into consideration. After the iteration, regions in the positive images with high confidence scores are selected as the candidate query set to conduct the next round of retrieval, until the retrieval results are satisfactory. Experimental results on the SIVAL dataset demonstrate the effectiveness of the proposed approach.
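A hedged sketch of score propagation over a region affinity matrix, in the spirit of the propagation described above; the Gaussian affinity, the saliency weighting, and the propagation coefficient are assumptions rather than the paper's exact energy function.

```python
# Hedged sketch: iterative propagation of relevance scores over an affinity
# matrix that combines region features with regional significance (saliency).
import numpy as np

def propagate_scores(features, saliency, positive, alpha=0.85, sigma=0.3):
    """features: (N, D) region descriptors; saliency: (N,) regional significance;
    positive: (N,) initial relevance from user feedback (1 = relevant region)."""
    d = np.linalg.norm(features[:, None] - features[None, :], axis=2)
    W = np.exp(-d ** 2 / (2 * sigma ** 2)) * np.outer(saliency, saliency)
    np.fill_diagonal(W, 0.0)
    D = np.diag(1.0 / np.sqrt(W.sum(axis=1) + 1e-12))
    S = D @ W @ D                                   # symmetrically normalized affinity
    f = positive.astype(float)
    for _ in range(50):                              # iterate toward convergence
        f = alpha * S @ f + (1.0 - alpha) * positive
    return f                                          # high scores -> candidate query regions

feats = np.random.rand(10, 8)
sal = np.random.rand(10)
pos = np.zeros(10); pos[:3] = 1.0                    # regions marked relevant by the user
print(propagate_scores(feats, sal, pos).round(3))
```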
Jingjing ZHONG Siwei LUO Jiao WANG
The key problem of object-based attention is the definition of objects, while contour grouping methods aim at detecting the complete boundaries of objects in images. In this paper, we develop a new contour grouping method with several characteristics. First, it is guided by global saliency information: by detecting multiple boundaries in a hierarchical way, we in effect construct an object-based attention model. Second, it is optimized by a grouping cost that is determined both by Gestalt cues of directed tangents and by region saliency. Third, it gives a new definition of Gestalt cues for tangents that includes image information as well as tangent information. In this way, we improve the robustness of our model against noise. Experimental results are presented, with comparisons against another grouping model and a space-based attention model.