Dongni HU Chengxin CHEN Pengyuan ZHANG Junfeng LI Yonghong YAN Qingwei ZHAO
Recently, automated recognition and analysis of human emotion has attracted increasing attention from multidisciplinary communities. However, it is challenging to utilize the emotional information simultaneously from multiple modalities. Previous studies have explored different fusion methods, but they mainly focused on either inter-modality interaction or intra-modality interaction. In this letter, we propose a novel two-stage fusion strategy named modality attention flow (MAF) to model the intra- and inter-modality interactions simultaneously in a unified end-to-end framework. Experimental results show that the proposed approach outperforms the widely used late fusion methods, and achieves even better performance when the number of stacked MAF blocks increases.
Yong HE Ji LI Xuanhong ZHOU Zewei CHEN Xin LIU
6DoF pose estimation from a monocular RGB image is a challenging but fundamental task. The methods based on unit direction vector-field representation and Hough voting strategy achieved state-of-the-art performance. Nevertheless, they apply the smooth l1 loss to learn the two elements of the unit vector separately, and thus do not take into account the prior distance between the pixel and the keypoint, even though the positioning error is significantly affected by this distance. In this work, we propose a Prior Distance Augmented Loss (PDAL) to exploit the prior distance for more accurate vector-field representation. Furthermore, we propose a lightweight channel-level attention module for adaptive feature fusion. Embedding this Adaptive Fusion Attention Module (AFAM) into the U-Net, we build an Attention Voting Network to further improve the performance of our method. We conduct extensive experiments to demonstrate the effectiveness and performance improvement of our methods on the LINEMOD, OCCLUSION and YCB-Video datasets. Our experiments show that the proposed methods bring significant performance gains and outperform state-of-the-art RGB-based methods without any post-refinement.
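The role of the prior distance can be illustrated with a little geometry. The sketch below is not the PDAL formulation, which this abstract does not specify; it only shows the unit direction vector from a pixel to a keypoint, the standard smooth l1 loss, and how far a voting ray misses the keypoint for a given angular error. The function names and the `beta` parameter are illustrative assumptions.

```python
import math

def unit_vector(pixel, keypoint):
    """Unit direction from pixel to keypoint, plus the pixel-keypoint distance.

    Assumes pixel != keypoint (otherwise the direction is undefined).
    """
    dx, dy = keypoint[0] - pixel[0], keypoint[1] - pixel[1]
    dist = math.hypot(dx, dy)
    return (dx / dist, dy / dist), dist

def smooth_l1(error, beta=1.0):
    """Standard smooth l1 loss on a single scalar error (beta is an assumed default)."""
    a = abs(error)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta

def ray_miss_distance(dist, angle_error_rad):
    """Perpendicular distance by which a ray with the given angular error
    misses a keypoint located `dist` away from the pixel."""
    return dist * abs(math.sin(angle_error_rad))

if __name__ == "__main__":
    direction, dist = unit_vector((0.0, 0.0), (3.0, 4.0))
    print(direction, dist)
    # The same angular error costs more at a larger pixel-keypoint distance.
    for d in (10.0, 100.0):
        print(d, ray_miss_distance(d, math.radians(2.0)))
```

A per-element smooth l1 loss sees the same error for both pixels in the loop, while the geometric miss distance grows linearly with the distance to the keypoint.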
Yuta UKON Shimpei SATO Atsushi TAKAHASHI
Advanced information-processing services such as computer vision require a high-performance digital circuit to perform high-load processing at high speed. To achieve high-speed processing, several image-processing applications use an approximate computing technique to reduce idle time of the circuit. However, it is difficult to design the high-speed image-processing circuit while controlling the error rate so as not to degrade service quality, and this technique is used for only a few applications. In this paper, we propose a method that achieves high-speed processing effectively in which processing time for each task is changed by roughly detecting its completion. Using this method, a high-speed processing circuit with a low error rate can be designed. The error rate is controllable, and a circuit design method to minimize the error rate is also presented in this paper. To confirm the effectiveness of our proposal, a ripple-carry adder (RCA), 2-dimensional discrete cosine transform (2D-DCT) circuit, and histogram of oriented gradients (HOG) feature calculation circuit are evaluated. Effective clock periods of these circuits obtained by our method with around 1% error rate are improved by about 64%, 6%, and 12%, respectively, compared with circuits without error. Furthermore, the impact of the miscalculation on a video monitoring service using an object detection application is investigated. As a result, more than 99% of the detection points required to be obtained are detected, and it is confirmed that the miscalculation hardly degrades the service quality.
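To see why the processing time of a ripple-carry adder varies from task to task, the following sketch simulates an RCA bit by bit and records the longest carry propagation chain for each addition. It is a software illustration only, not the completion-detection circuit proposed in the paper; the function names and the idea of counting additions that settle within a chain-length limit are assumptions made for this example.

```python
def ripple_carry_add(a, b, width=8):
    """Add two unsigned integers bit by bit.

    Returns (sum modulo 2**width, carry_out, longest carry chain), where a
    chain starts at a bit that generates a carry and grows while the carry
    is propagated through the following bits.
    """
    carry = 0
    result = 0
    chain = 0
    longest = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        generate = x & y
        propagate = x ^ y
        result |= (propagate ^ carry) << i
        if generate:
            chain = 1              # a new carry starts here
        elif propagate and carry:
            chain += 1             # the incoming carry ripples one bit further
        else:
            chain = 0              # the carry dies or none is present
        carry = generate | (propagate & carry)
        longest = max(longest, chain)
    return result, carry, longest

def fraction_settled_within(limit, width=8):
    """Fraction of all input pairs whose longest carry chain is <= limit."""
    total = settled = 0
    for a in range(1 << width):
        for b in range(1 << width):
            total += 1
            if ripple_carry_add(a, b, width)[2] <= limit:
                settled += 1
    return settled / total

if __name__ == "__main__":
    for limit in range(9):
        print(limit, fraction_settled_within(limit, width=8))
```

Worst-case inputs such as 255 + 1 ripple through every bit, but most input pairs have much shorter chains, which is the slack a completion-aware design can exploit.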
Rizal Setya PERDANA Yoshiteru ISHIDA
Automatic generation of textual stories from visual data representation, known as visual storytelling, is a recent advancement in the problem of images-to-text. Instead of using a single image as input, visual storytelling processes a sequential array of images into coherent sentences. A story contains non-visual concepts as well as descriptions of literal object(s). While previous approaches have applied external knowledge, our approach was to regard the non-visual concept as the semantic correlation between visual modality and textual modality. This paper, therefore, presents a new feature representation based on a canonical correlation analysis between the two modalities. An attention mechanism is adopted as the underlying architecture of the image-to-text problem, rather than standard encoder-decoder models. Canonical Correlation Attention Mechanism (CAAM), the proposed end-to-end architecture, extracts time series correlation by maximizing the cross-modal correlation. Extensive experiments on the VIST dataset ( http://visionandlanguage.net/VIST/dataset.html ) were conducted to demonstrate the effectiveness of the architecture in terms of automatic metrics, with additional experiments showing the impact of the modality fusion strategy.
Chaoran ZHOU Jianping ZHAO Tai MA Xin ZHOU
In Internet applications, when users search for information, the search engines invariably return some invalid webpages that do not contain valid information. These invalid webpages interfere with the users' access to useful information, affect the efficiency of users' information query and occupy Internet resources. Accurate and fast filtering of invalid webpages can purify the Internet environment and provide convenience for netizens. This paper proposes an invalid webpage filtering model (HAIF) based on deep learning and hierarchical attention mechanism. HAIF improves the semantic and sequence information representation of webpage text by concatenating lexical-level embeddings and paragraph-level embeddings. HAIF introduces hierarchical attention mechanism to optimize the extraction of text sequence features and webpage tag features. Among them, the local-level attention layer optimizes the local information in the plain text. By concatenating the input embeddings and the feature matrix after local-level attention calculation, it enriches the representation of information. The tag-level attention layer introduces webpage structural feature information on the attention calculation of different HTML tags, so that HAIF is better applicable to the Internet resource field. In order to evaluate the effectiveness of HAIF in filtering invalid pages, we conducted various experiments. Experimental results demonstrate that, compared with other baseline models, HAIF has improved to various degrees on various evaluation criteria.
Yuanbo FANG Hongliang FU Huawei TAO Ruiyu LIANG Li ZHAO
Speech based deception detection using deep learning is one of the technologies to realize a deception detection system with high recognition rate in the future. Multi-network feature extraction technology can effectively improve the recognition performance of the system, but due to the limited labeled data and the lack of effective feature fusion methods, the performance of the network is limited. Based on this, a novel hybrid network model based on attentional multi-feature fusion (HN-AMFF) is proposed. Firstly, the static features of large amounts of unlabeled speech data are input into DAE for unsupervised training. Secondly, the frame-level features and static features of a small amount of labeled speech data are simultaneously input into the LSTM network and the encoded output part of DAE for joint supervised training. Finally, a feature fusion algorithm based on attention mechanism is proposed, which can get the optimal feature set in the training process. Simulation results show that the proposed feature fusion method is significantly better than traditional feature fusion methods, and the model can achieve advanced performance with only a small amount of labeled data.
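The general idea of attention-based feature fusion can be sketched as a softmax-weighted sum of feature vectors. This is a generic illustration, not the HN-AMFF fusion algorithm, whose details are not given in this abstract; in particular, scoring each feature vector by a dot product with a `query` vector is an assumption made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_fuse(features, query):
    """Fuse equally sized feature vectors into one vector.

    Each vector is scored by its dot product with `query` (a hypothetical
    learned parameter); the scores become softmax weights for a weighted sum.
    """
    scores = [sum(f_i * q_i for f_i, q_i in zip(f, query)) for f in features]
    weights = softmax(scores)
    dim = len(features[0])
    fused = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return fused, weights

if __name__ == "__main__":
    frame_level = [0.2, 0.9, 0.1]   # e.g. a feature vector from a recurrent branch
    static = [0.7, 0.1, 0.4]        # e.g. a feature vector from an encoder branch
    fused, weights = attention_fuse([frame_level, static], query=[1.0, 0.0, 0.5])
    print(weights, fused)
```

In a trained model the query (or a small scoring network) would be learned jointly with the rest of the network, so the weights adapt to whichever feature source is more informative.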
Akinori SAKAGUCHI Takashi TAKIMOTO Toshimitsu USHIO
In our previous work, we developed a quadrotor with a tilting frame using the parallel link mechanism. It can tilt its frame in the pitch direction by driving only one servo motor. However, it has a singularity such that the input torque in the pitch direction equals 0 in the ±π/2 tilted state. In this letter, we analyze the Hopf bifurcation of the controlled quadrotor around the singularity and show, by simulation and experiments, that a stable limit cycle occurs in the pitch direction.
Kohei NAKAI Takashi MATSUBARA Kuniaki UEHARA
The recent development of neural architecture search (NAS) has enabled us to automatically discover architectures of neural networks with high performance within a few days. Convolutional neural networks extract fruitful features by repeatedly applying standard operations (convolutions and poolings). However, these operations also extract useless or even disturbing features. Attention mechanisms enable neural networks to discard information of no interest and have achieved state-of-the-art performance. While a variety of attentions for CNNs have been proposed, current NAS methods have paid little attention to them. In this study, we propose a novel NAS method that searches attentions as well as operations. We examined several patterns to arrange attentions and operations, and found that attentions work better when they have their own search space and follow operations. We demonstrate the superior performance of our method in experiments on CIFAR-10, CIFAR-100, and ImageNet datasets. The found architecture achieved lower classification error rates and required fewer parameters compared to those found by current NAS methods.
Zizheng JI Zhengchao LEI Tingting SHEN Jing ZHANG
The joint representation of knowledge graphs has become an important approach to improving the quality of knowledge graphs, which is beneficial to machine learning, data mining, and artificial intelligence applications. However, previous work suffers severely from the noise in text when modeling the text information. To overcome this problem, this paper mines high-quality reference sentences of the entities in the knowledge graph to enhance the representation ability of the entities. A novel framework for joint representation learning of knowledge graphs and text information based on reference sentence noise-reduction is proposed, which embeds the entities, the relations, and the words into a unified vector space. The proposed framework consists of a knowledge graph representation learning module, a textual relation representation learning module, and a textual entity representation learning module. Experiments on entity prediction, relation prediction, and triple classification tasks are conducted, and the results show that the proposed framework can significantly improve the performance of mining and fusing the text information. In particular, compared with the state-of-the-art method [15], the proposed framework improves the metric of H@10 by 5.08% and 3.93% in the entity prediction task and the relation prediction task, respectively, and improves the metric of accuracy by 5.08% in the triple classification task.
Mobile edge computing (MEC) is a new computing paradigm, which provides computing support for resource-constrained user equipments (UEs). In this letter, we design an effective incentive framework to encourage MEC operators to provide computing service for UEs. The problem of jointly allocating communication and computing resources to maximize the revenue of MEC operators is studied. Based on auction theory, we design a multi-round iterative auction (MRIA) algorithm to solve the problem. Extensive simulations have been conducted to evaluate the performance of the proposed algorithm and it is shown that the proposed algorithm can significantly improve the overall revenue of MEC operators.
Mingming YANG Min ZHANG Kehai CHEN Rui WANG Tiejun ZHAO
Attention mechanism, which selectively focuses on source-side information to learn a context vector for generating target words, has been shown to be an effective method for neural machine translation (NMT). In fact, generating target words depends on not only the source-side information but also the target-side information. Although the vanilla NMT can acquire target-side information implicitly by recurrent neural networks (RNN), RNN cannot adequately capture the global relationship between target-side words. To solve this problem, this paper proposes a novel target-attention approach to capture this information, thus enhancing target word predictions in NMT. Specifically, we propose three variants of target-attention model to directly obtain the global relationship among target words: 1) a forward target-attention model that uses a target attention mechanism to incorporate previous historical target words into the prediction of the current target word; 2) a reverse target-attention model that adopts a reverse RNN model to obtain the entire reverse target words information, and then to combine with source context information to generate target sequence; 3) a bidirectional target-attention model that combines the forward target-attention model and reverse target-attention model together, which can make full use of target words to further improve the performance of NMT. Our methods can be integrated into both RNN based NMT and self-attention based NMT, and help NMT get global target-side information to improve translation performance. Experiments on the NIST Chinese-to-English and the WMT English-to-German translation tasks show that the proposed models achieve significant improvements over state-of-the-art baselines.
Based on the License Assisted Access (LAA) small cell architecture, LAA coexisting with Wi-Fi in heterogeneous networks provides LTE mobile users with high bandwidth efficiency, as the unlicensed channels are shared between LAA and Wi-Fi. However, LAA and Wi-Fi affect each other when both systems use the same unlicensed channel in the heterogeneous networks. In such a network, unlicensed band allocation for LAA and Wi-Fi is an important issue that may significantly affect the quality of service (QoS) of both systems. In this paper, we propose an analytical model and conduct simulation experiments to study two allocations for the unlicensed band, unlicensed full allocation (UFA) and unlicensed time-division allocation (UTA), together with the corresponding buffering mechanism for the LAA data packets. We evaluate the performance of these unlicensed band allocation schemes in terms of the acceptance rate of both LAA and Wi-Fi packet data in the LAA buffer queue. Our study provides guidelines for designing the channel occupation phase and the buffer size of an LAA small cell.
Umme Aymun SIDDIQUA Abu Nowshed CHY Masaki AONO
Stance detection in Twitter aims at mining user stances expressed in a tweet towards a single or multiple target entities. Detecting and analyzing user stances from massive opinion-oriented Twitter posts provides enormous opportunities to journalists, governments, companies, and other organizations. Most of the prior studies have explored traditional deep learning models, e.g., long short-term memory (LSTM) and gated recurrent unit (GRU), for detecting stance in tweets. However, compared to these traditional approaches, the recently proposed densely connected bidirectional LSTM and nested LSTM architectures effectively address the vanishing-gradient and overfitting problems as well as deal with long-term dependencies. In this paper, we propose a neural network model that adopts the strengths of these two LSTM variants to learn better long-term dependencies, where each module is coupled with an attention mechanism that amplifies the contribution of important elements in the final representation. We also employ a multi-kernel convolution on top of them to extract higher-level tweet representations. Results of extensive experiments on single and multi-target benchmark stance detection datasets show that our proposed method achieves substantial improvement over the current state-of-the-art deep learning based methods.
Jinna LV Bin WU Yunlei ZHANG Yunpeng XIAO
Recently, social relation analysis has received an increasing amount of attention, moving from text to image data. However, social relation analysis from video, although an important problem, is lacking in the current literature. There are still some challenges: 1) it is hard to learn a satisfactory mapping function from low-level pixels to high-level social relation space; 2) it is unclear how to efficiently select the most relevant information from noisy and unsegmented video. In this paper, we present an Attentive Sequences Recurrent Network model, called ASRN, to deal with the above challenges. First, in order to explore multiple clues, we design a Multiple Feature Attention (MFA) mechanism to fuse multiple visual features (i.e. image, motion, body, and face). In this manner, we can generate an appropriate mapping function from low-level video pixels to high-level social relation space. Second, we design a sequence recurrent network based on a Global and Local Attention (GLA) mechanism. Specifically, an attention mechanism is used in GLA to integrate the global feature with the local sequence feature to select more relevant sequences for the recognition task. Therefore, the GLA module can better deal with noisy and unsegmented video. Finally, extensive experiments on the SRIV dataset demonstrate the performance of our ASRN model.
Goichiro HANAOKA Takahiro MATSUDA Jacob C. N. SCHULDT
Key encapsulation mechanism (KEM) combiners, recently formalized by Giacon, Heuer, and Poettering (PKC'18), enable hedging against insecure KEMs or weak parameter choices by combining ingredient KEMs into a single KEM that remains secure assuming just one of the underlying ingredient KEMs is secure. This seems particularly relevant when considering quantum-resistant KEMs which are often based on arguably less well-understood hardness assumptions and parameter choices. We propose a new simple KEM combiner based on a one-time secure message authentication code (MAC) and two-time correlated input secure hash. Instantiating the correlated input secure hash with a t-wise independent hash for an appropriate value of t yields a KEM combiner based on a strictly weaker additional primitive than the standard model construction of Giacon et al. and furthermore removes the need to do n full passes over the encapsulation, where n is the number of ingredient KEMs, which Giacon et al. highlight as a disadvantage of their scheme. However, unlike Giacon et al., our construction requires the public key of the combined KEM to include a hash key, and furthermore requires a MAC tag to be added to the encapsulation of the combined KEM.
Wei BAI Yuli ZHANG Meng WANG Jin CHEN Han JIANG Zhan GAO Donglin JIAO
This paper investigates the spectrum allocation problem. Under the current spectrum management mode, a large amount of spectrum resource is wasted due to the uncertainty of user demand. To reduce the impact of this uncertainty, a presale mechanism is designed based on a spectrum pool. In this mechanism, the spectrum manager provides spectrum resource at a favorable price for presale, aiming to share with the user the risk caused by the uncertainty of demand. Because of the hierarchical characteristic, we build a spectrum market Stackelberg game, in which the manager acts as the leader and the user as the follower. Then a proof of the uniqueness and optimality of the Stackelberg Equilibrium is given. Simulation results show that the presale mechanism can promote profits for both sides and reduce temporary scheduling.
Li HUANG Xiao ZHENG Shuai DING Zhi LIU Jun HUANG
The Cuckoo Search (CS) is apt to be trapped in local optima when applied to complex target functions. This drawback has been recognized as the bottleneck of its widespread use. This paper, with the purpose of improving CS, puts forward a Cuckoo Search algorithm featuring Multi-Learning Strategies (LSCS). In LSCS, the Converted Learning Module, which features the Comprehensive Learning Strategy and Optimal Learning Strategy, tries to coordinate exploration and exploitation, and the switching in this part is decided by the transition probability Pc. When the nest fails to be renewed after m iterations, the Elite Learning Perturbation Module provides extra diversity for the current nest, which helps avoid stagnation. The Boundary Handling Approach adjusted by the Gauss map is utilized to reset the location of a nest beyond the boundary. The proposed algorithm is evaluated by two different tests: Test Group A (ten simple unimodal and multimodal functions) and Test Group B (the CEC2013 test suite). Experimental results show that LSCS demonstrates significant advantages in terms of convergence speed and optimization capability in solving complex problems.
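One plausible reading of boundary handling with a Gauss map is sketched below. The Gauss map used here is the common chaotic map x → (1/x) mod 1 (with 0 mapped to 0); how LSCS actually applies it is not described in this abstract, so the rule of resampling an out-of-range coordinate at `lower + chaos * (upper - lower)` is an assumption, as are the function names.

```python
def gauss_map(x):
    """One step of the chaotic Gauss map on [0, 1): 0 -> 0, x -> (1/x) mod 1."""
    return 0.0 if x == 0 else (1.0 / x) % 1.0

def reset_out_of_bounds(position, lower, upper, chaos):
    """Replace each coordinate outside [lower, upper] with a chaotic sample.

    `chaos` is the current state of the Gauss map; the updated state is
    returned so that successive calls continue the same chaotic sequence.
    Note that the map can land exactly on 0 and stay there, so a practical
    implementation may want to reseed the state in that case.
    """
    new_position = []
    for value in position:
        if value < lower or value > upper:
            chaos = gauss_map(chaos)
            value = lower + chaos * (upper - lower)
        new_position.append(value)
    return new_position, chaos

if __name__ == "__main__":
    nest = [-1.3, 0.4, 7.9]
    repaired, state = reset_out_of_bounds(nest, lower=-5.0, upper=5.0, chaos=0.37)
    print(repaired, state)
```

Compared with clamping to the nearest bound, resampling inside the range avoids piling nests up on the boundary.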
Rui SUN Huihui WANG Jun ZHANG Xudong ZHANG
As a research hotspot and difficulty in the field of computer vision, pedestrian detection has been widely used in intelligent driving and traffic monitoring. The popular detection method at present uses a region proposal network (RPN) to generate candidate regions, and then classifies the regions. However, the RPN produces many erroneous candidate areas, which increases the number of false-positive region proposals. This letter uses an improved residual attention network to capture the visual attention map of images, which is then normalized to obtain the attention score map. The attention score map is used to guide the RPN to generate more precise candidate regions containing potential target objects. The region proposals, confidence scores, and features generated by the RPN are used to train a cascaded boosted forest classifier to obtain the final results. The experimental results show that our proposed approach achieves highly competitive results on the Caltech and ETH datasets.
Yuhei FUKUI Aleksandar SHURBEVSKI Hiroshi NAGAMOCHI
In the obnoxious facility game, we design mechanisms that output a location of an undesirable facility based on the locations of players reported by themselves. The benefit of a player is defined to be the distance between her location and the facility. A player may try to manipulate the output of the mechanism by strategically misreporting her location. We wish to design a λ-group strategy-proof mechanism, i.e., one in which, for every group of players, at least one player in the group cannot gain strictly more than λ times her primary benefit by having the entire group change their reports simultaneously. In this paper, we design a k-candidate λ-group strategy-proof mechanism for the obnoxious facility game in the metric defined by k half-lines with a common endpoint, such that each candidate is a point on one of the half-lines at the same distance from the common endpoint as the other candidates. Then, we show that the benefit ratio of the mechanism is at most 1+2/(k-1)λ. Finally, we prove that the bound is nearly tight.
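The setting can be made concrete with a small sketch of the metric and of player benefits. A point is represented as a pair (half-line index, distance from the common endpoint); this representation, the function names, and the use of the total benefit to compare candidates are assumptions for illustration. The code only evaluates candidate locations and does not implement the strategy-proof mechanism of the paper.

```python
def star_distance(p, q):
    """Distance between two points in the metric of half-lines sharing an endpoint.

    Points on the same half-line are |r_p - r_q| apart; points on different
    half-lines are connected through the common endpoint.
    """
    (line_p, r_p), (line_q, r_q) = p, q
    if line_p == line_q:
        return abs(r_p - r_q)
    return r_p + r_q

def candidates(k, d):
    """One candidate per half-line, each at distance d from the common endpoint."""
    return [(line, d) for line in range(k)]

def total_benefit(players, facility):
    """Sum of player benefits, where a benefit is the distance to the facility."""
    return sum(star_distance(p, facility) for p in players)

if __name__ == "__main__":
    players = [(0, 1.0), (0, 3.0), (1, 2.0)]
    for c in candidates(k=3, d=2.0):
        print(c, total_benefit(players, c))
```

With truthful reports, the candidate on the empty half-line gives the largest total benefit in this example; a strategy-proof mechanism must additionally ensure that misreporting groups cannot profit beyond the factor λ.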
Yue XIE Ruiyu LIANG Zhenlin LIANG Li ZHAO
Despite the widespread use of deep learning for speech emotion recognition, deep models are severely restricted by the information loss in the high layers of deep neural networks, as well as by the degradation problem. In order to efficiently utilize information and solve degradation, attention-based dense long short-term memory (LSTM) is proposed for speech emotion recognition. LSTM networks, which are able to process time series such as speech, are constructed, and attention-based dense connections are introduced into them. That is, weight coefficients are added to the skip-connections of each layer to distinguish the difference in emotional information between layers and to avoid interference of redundant information from the bottom layer with the effective information from the top layer. The experiments demonstrate that the proposed method improves the recognition performance by 12% and 7% on the eNTERFACE and IEMOCAP corpora, respectively.