Mengbo ZHANG Lunwen WANG Yanqing FENG Haibo YIN
Spectrum sensing is the first task performed by cognitive radio (CR) networks. In this paper we propose a spectrum sensing algorithm for orthogonal frequency division multiplexing (OFDM) signals based on deep learning and the covariance matrix graph, applying the strength of deep learning in image processing to the spectrum sensing of OFDM signals. We first build the spectrum sensing model of the OFDM signal and analyze the structural characteristics of its covariance matrix (CM). Once the CM has been normalized and transformed into a gray-level representation, the gray scale map of the covariance matrix (GSM-CM) is established. A convolutional neural network (CNN) is then designed based on the LeNet-5 network and trained on these maps to learn increasingly abstract features hierarchically. Finally, the test data are fed into the trained network, which completes the spectrum sensing of OFDM signals. Simulation results show that this method can accomplish the sensing task by exploiting the GSM-CM representation, and achieves better sensing performance for OFDM signals at low SNR than existing methods.
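To make the GSM-CM construction concrete, the following is a minimal Python/NumPy sketch, assuming the received samples are stacked into length-M segments before covariance estimation; the segment length and normalization details here are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def gsm_cm(x, M=64):
    # stack received samples into length-M segments and estimate the M x M covariance
    segs = x[: len(x) // M * M].reshape(-1, M)
    C = np.abs(np.cov(segs, rowvar=False))            # magnitude handles complex samples
    C = (C - C.min()) / (C.max() - C.min() + 1e-12)   # normalize to [0, 1]
    return (255 * C).astype(np.uint8)                 # gray scale map fed to the CNN
```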
Yundong LI Weigang ZHAO Xueyan ZHANG Qichen ZHOU
Crack detection is a vital task for maintaining a bridge's health and safety. Traditional computer-vision-based methods are easily disturbed by noise and clutter in real bridge inspections. To address this limitation, we propose a two-stage crack detection approach based on convolutional neural networks (CNNs) in this letter. A predictor with a small receptive field is exploited in the first detection stage, while another predictor with a large receptive field refines the detection results in the second stage. Benefiting from the fusion of the confidence maps produced by both predictors, our method can accurately predict the probability that each pixel belongs to a cracked area. Experimental results show that the proposed method is superior to a state-of-the-art method on real concrete surface images.
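A minimal sketch of one possible fusion rule for the two confidence maps (the letter's exact fusion strategy may differ); the elementwise product retains only pixels that both predictors agree on:

```python
import numpy as np

def fuse_confidence(conf_small, conf_large, thresh=0.5):
    # elementwise product keeps only pixels both predictors are confident about
    fused = conf_small * conf_large
    return fused, fused > thresh  # per-pixel crack probability and binary mask
```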
Taku NAKAHARA Kazunori URUMA Tomohiro TAKAHASHI Toshihiro FURUKAWA
Recently, demand for the digitization of manga has increased. For old manga whose original artwork has been lost, digitization must be performed from the printed comics. However, scanning such comics causes the show-through phenomenon, since the pages are printed on both sides. This letter proposes a manga show-through cancellation method based on a deep convolutional neural network (CNN). Numerical results show the effectiveness of the proposed method.
A fusion framework combining a CNN and an RNN is proposed specifically for air-writing recognition. By modeling air-writing with both spatial and temporal features, the proposed network can learn more information than existing techniques. Its performance is evaluated on the alphabet and numeric datasets of the public 6DMG database. The average accuracy of the proposed fusion network outperforms other techniques, reaching 99.25% on the alphabet gestures and 99.83% on the numeric gestures. A simplified RNN structure is also proposed, which attains roughly a two-fold speed-up over an ordinary BLSTM network. It is also confirmed that the distance between consecutive sampling points alone is sufficient to attain high recognition performance.
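The distance feature highlighted above is straightforward to compute; a minimal NumPy sketch for an N x 3 trajectory of sampled positions:

```python
import numpy as np

def point_distances(trajectory):
    # trajectory: (N, 3) array of sampled 3-D positions from the motion sensor
    return np.linalg.norm(np.diff(trajectory, axis=0), axis=1)  # (N-1,) step lengths
```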
Research on inertial-sensor-based human action detection and recognition (HADR) is a new area in machine learning. We propose a novel time-sequence-based interval convolutional neural network framework for HADR, which combines a generator of interesting interval proposals with an interval-based classifier. Experiments demonstrate the good performance of our method.
Ting ZHANG Huihui BAI Mengmeng ZHANG Yao ZHAO
Multiple description (MD) coding is an attractive framework for robust information transmission over non-prioritized and unpredictable networks. In this paper, a novel MD image coding scheme based on convolutional neural networks (CNNs) is proposed, which aims to improve the reconstructed quality of the side and central decoders. To this end, a given image is first encoded into two independent descriptions by sub-sampling. This design makes the proposed method compatible with existing image coding standards. At the decoder, to achieve high-quality side and central reconstruction, three CNNs, two side-decoder sub-networks and one central-decoder sub-network, are combined into an end-to-end reconstruction framework. Experimental results show the improvement achieved by the proposed scheme in terms of both peak signal-to-noise ratio and subjective quality, demonstrating better rate-distortion performance for both the central and side reconstructions.
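A minimal sketch of how sub-sampling can produce two independent descriptions and how the central decoder can re-interleave them before CNN refinement; even/odd row splitting is an illustrative assumption, not necessarily the paper's exact scheme:

```python
import numpy as np

def encode_descriptions(img):
    # split into two descriptions by even/odd row sub-sampling
    return img[0::2], img[1::2]

def central_decode(d0, d1):
    # when both descriptions arrive, re-interleave them; the central-decoder
    # CNN sub-network would then refine this initial reconstruction
    out = np.empty((d0.shape[0] + d1.shape[0],) + d0.shape[1:], dtype=d0.dtype)
    out[0::2], out[1::2] = d0, d1
    return out
```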
Hang CUI Shoichi HIRASAWA Hiroaki KOBAYASHI Hiroyuki TAKIZAWA
Sparse matrix-vector multiplication (SpMV) is a computational kernel widely used in many applications, and because of its importance, many different implementations have been proposed to accelerate it. The performance characteristics of these SpMV implementations differ considerably, and it is generally difficult to select the best-performing implementation for a given sparse matrix without performance profiling. One existing approach to this best-code selection problem uses manually predefined features and a machine learning model. However, it is hard to manually define features that fully capture the characteristics of the original sparse matrix relevant to code selection, and some information is inevitably lost. This paper therefore presents an effective deep learning mechanism for selecting the SpMV code best suited to a given sparse matrix. Instead of manually predefined features, a feature image and a deep learning network are used to map each sparse matrix, prior to execution, to the implementation expected to perform best. The benefits of the proposed mechanism are assessed in terms of prediction accuracy and performance. According to the evaluation, the mechanism selects an optimal or near-optimal implementation for most unseen sparse matrices in the test data set. These results demonstrate that, with deep learning, the whole sparse matrix can be used to predict the best implementation, and the resulting prediction accuracy is higher than that obtained with predefined features.
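One plausible way to build the feature image from a sparse matrix is to bin its nonzero pattern into a fixed-size density map; a sketch using SciPy (the paper's exact encoding may differ):

```python
import numpy as np
from scipy import sparse

def feature_image(A, size=128):
    A = sparse.coo_matrix(A)
    img = np.zeros((size, size), dtype=np.float32)
    r = A.row.astype(np.int64) * size // A.shape[0]   # bin each nonzero into a size x size grid
    c = A.col.astype(np.int64) * size // A.shape[1]
    np.add.at(img, (r, c), 1.0)                       # accumulate nonzero counts per cell
    return img / img.max() if A.nnz else img          # normalized density image for the CNN
```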
Zhengxue CHENG Masaru TAKEUCHI Kenji KANAI Jiro KATTO
Image quality assessment (IQA) is an inherent problem in the field of image processing. Recently, deep learning-based IQA has attracted increased attention owing to its high prediction accuracy. In this paper, we propose a fully-blind and fast image quality predictor (FFIQP) using convolutional neural networks, built on two strategies. First, we propose a distortion clustering strategy based on the distribution of intermediate-layer results in the convolutional neural network (CNN) to make IQA fully blind. Second, by analyzing the relationship between image saliency information and the CNN prediction error, we use a pre-computed saliency map to skip non-salient patches and thereby accelerate IQA. Experimental results verify that our method achieves high accuracy (0.978) with respect to subjective quality scores, outperforming existing IQA methods. Moreover, the proposed method is computationally appealing, offering a flexible complexity-accuracy trade-off by assigning different thresholds in the saliency map.
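A minimal sketch of saliency-guided patch skipping as described above; the patch size, threshold, and function names are illustrative assumptions:

```python
import numpy as np

def salient_patches(img, saliency, patch=32, thresh=0.2):
    H, W = saliency.shape
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            if saliency[y:y + patch, x:x + patch].mean() > thresh:
                yield (y, x), img[y:y + patch, x:x + patch]  # only salient patches reach the CNN
```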
Junyang ZHANG Yang GUO Xiao HU Rongzhen LI
In recent years, deep-learning-based image recognition, speech recognition, text translation, and other related applications have brought great convenience to people's lives. With the advent of the internet of everything, running computationally intensive deep learning algorithms on resource-limited edge devices is a major challenge. For an edge-oriented vector processor and a specific neural network model, we propose a new data layout that places the input feature maps in DDR memory and rearranges the convolution kernel parameters in the core memory banks. To overcome the difficulty of parallelizing two-dimensional convolution, the convolution computation is parallelized along the third (channel) dimension, and by initializing the max-pooling vector register to zero, the rectified linear unit (ReLU) activation and the pooling operation are fused, reducing repeated accesses to intermediate data. On the basis of the single-core implementation, a multi-core implementation scheme for the Inception structure is proposed. Finally, using the proposed vectorization method, we implement five neural network models, AlexNet, VGG16, VGG19, GoogLeNet, and ResNet18, and present performance statistics and analysis on a CPU, a GTX 1080 Ti GPU, and the FT2000. Experimental results show that the vector processor has computational advantages over the CPU and GPU and can run large-scale neural network models in real time.
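The ReLU/pooling fusion rests on the identity maxpool(ReLU(x)) = max(0, maxpool(x)): seeding the running maximum with zero makes the ReLU implicit. A scalar NumPy sketch of the idea (the actual implementation operates on the processor's vector registers):

```python
import numpy as np

def fused_relu_maxpool(fmap, k=2):
    # fused ReLU + k*k max pooling for a (C, H, W) feature map
    C, H, W = fmap.shape
    Ho, Wo = H // k, W // k
    out = np.zeros((C, Ho, Wo), dtype=fmap.dtype)   # zero start acts as the ReLU clamp
    for i in range(k):                              # scan the k*k pooling window offsets
        for j in range(k):
            out = np.maximum(out, fmap[:, i:i + Ho * k:k, j:j + Wo * k:k])
    return out                                      # equals maxpool(relu(fmap))
```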
Yinan LIU Qingbo WU Liangzhi TANG Linfeng XU
In this paper, we propose a novel self-supervised approach to video representation learning that can anticipate the video category after reading only a short clip. The key idea is to employ a Siamese convolutional network that casts self-supervised feature learning as two different image matching problems. Using frame encoding, the proposed video representation can be extracted at different temporal scales. We refine the training process via a motion-based temporal segmentation strategy. The learned video representations can be applied not only to action anticipation but also to action recognition. We verify the effectiveness of the proposed approach on both tasks using two datasets, UCF101 and HMDB51. The experiments show that we achieve results comparable to state-of-the-art self-supervised learning methods on both tasks.
Yoshikatsu NAKAJIMA Hideo SAITO
We propose a novel object recognition system that is able to (i) work in real time, reconstructing segmented 3D maps while simultaneously recognizing objects in a scene, (ii) manage various kinds of objects, including those with smooth surfaces and those drawn from a large number of categories, by utilizing a CNN for feature extraction, and (iii) maintain high accuracy regardless of how the camera moves, by distributing the viewpoints for each object uniformly and aggregating the recognition results from the distributed viewpoints with equal weight. Through experiments on the UW RGB-D Dataset and Scenes and on our own scenes, prepared to verify the effectiveness of the viewpoint-class-based approach, the advantages of our system over current state-of-the-art object recognition approaches are demonstrated.
Recently, mobile applications for recording everyday meals have drawn much attention for self-dietary management. However, most such applications either return food calorie values simply associated with the estimated food categories or require users to indicate the rough amount of food manually. In fact, estimating food calories from a food photo with practical accuracy has not yet been achieved and remains an unsolved problem. In this paper, we propose estimating food calories from a food photo by simultaneously learning food calories, categories, ingredients, and cooking directions using deep learning. Since food calories generally correlate strongly with food categories, ingredients, and cooking directions, we expect that training on them simultaneously boosts performance compared with independent single-task training. To this end, we use a multi-task CNN. In addition, we construct two calorie-annotated recipe datasets, one collected from Japanese recipe sites on the Web and one collected from an American recipe site. In the experiments, we trained both multi-task and single-task CNNs and compared them. The multi-task CNN achieved better performance on both food category estimation and food calorie estimation than the single-task CNNs: introducing the multi-task CNN improved the correlation coefficient by 0.039 on the Japanese recipe dataset and by 0.090 on the American recipe dataset relative to the single-task CNN. In addition, we showed that the proposed multi-task CNN-based method outperforms previously proposed search-based methods.
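A minimal PyTorch sketch of a multi-task CNN with a shared backbone and separate calorie-regression and category-classification heads; the tiny backbone, the loss weighting, and the omission of the ingredient and cooking-direction heads are simplifications, not the paper's exact architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskFoodNet(nn.Module):
    def __init__(self, n_categories):
        super().__init__()
        self.backbone = nn.Sequential(               # tiny stand-in for the shared CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.calorie_head = nn.Linear(64, 1)              # regression: kcal
        self.category_head = nn.Linear(64, n_categories)  # classification

    def forward(self, x):
        h = self.backbone(x)
        return self.calorie_head(h).squeeze(1), self.category_head(h)

def joint_loss(cal_pred, cal_true, cat_logits, cat_true, w=0.5):
    # weighted sum of the two task losses (the weighting is an assumption)
    return w * F.mse_loss(cal_pred, cal_true) + (1 - w) * F.cross_entropy(cat_logits, cat_true)
```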
Xianxu HOU Jiasong ZHU Ke SUN Linlin SHEN Guoping QIU
Motivated by the observation that certain convolutional channels of a convolutional neural network (CNN) exhibit object-specific responses, we seek to discover and exploit the channels whose neurons are activated by the presence of specific objects in the input image. We develop a method for explicitly fine-tuning a pre-trained CNN to induce an object-specific channel (OSC) and for systematically identifying it for human faces. In this paper, we introduce a multi-scale approach to constructing robust face heatmaps from OSC features for rapidly filtering out non-face regions, thus significantly improving search efficiency for face detection. We show that multi-scale OSCs can be used to build simple and compact face detectors that achieve state-of-the-art performance in unconstrained settings.
In this letter, we propose a sequential convolutional residual network. We first analyze a tangled network architecture using simplified equations and determine the critical point at which to untangle it. Although the residual network performs well, its learning efficiency at deeper layers is lower than expected because the network is excessively intertwined. To solve this problem, we propose a network in which information is transmitted sequentially: the output of the neighboring layer is added to the input of the current layer, and the result is iteratively passed to the next sequential layer. The proposed network thus improves learning efficiency and performance by successfully mitigating the complexity of deep networks. We show that the proposed network performs well on the CIFAR-10 and CIFAR-100 datasets and, in particular, that its advantage over the baseline method grows as the depth increases.
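One plausible reading of the sequential information flow described above, sketched in PyTorch; this is an interpretation for illustration, not the authors' exact architecture:

```python
import torch.nn as nn

class SequentialResidualChain(nn.Module):
    def __init__(self, channels, depth):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                          nn.BatchNorm2d(channels),
                          nn.ReLU(inplace=True))
            for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x) + x   # neighboring layer output added to the current input,
        return x               # then passed on to the next sequential layer
```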
Yande XIANG Jiahui LUO Taotao ZHU Sheng WANG Xiaoyan XIANG Jianyi MENG
Arrhythmia classification based on the electrocardiogram (ECG) is crucial for automatic cardiovascular disease diagnosis. The classification methods used in current practice largely depend on hand-crafted features. However, extracting hand-crafted features may introduce significant computational complexity, especially in transform domains. In this study, an accurate method for patient-specific ECG beat classification is proposed, which combines morphological features and timing information. For the morphological features of a heartbeat, an attention-based two-level 1-D CNN is incorporated to automatically extract features of different granularity by focusing on different parts of the heartbeat. For the timing information, the difference between the preceding and following RR intervals is computed as a dynamic feature. Both the extracted morphological features and the interval difference are fed to a multi-layer perceptron (MLP) to classify the ECG signals. In addition, to reduce the memory required to store ECG data and to suppress noise to some extent, an adaptive heartbeat normalization technique is adopted, comprising amplitude unification, resolution modification, and signal differencing. On the MIT-BIH arrhythmia database, the proposed method achieved a sensitivity Sen=93.4% and positive predictivity Ppr=94.9% for ventricular ectopic beat (VEB) detection, Sen=86.3% and Ppr=80.0% for supraventricular ectopic beat (SVEB) detection, and an overall accuracy OA=97.8% at 6-bit ECG signal resolution. Compared with state-of-the-art automatic ECG classification methods, these results show that the proposed method achieves comparable heartbeat classification accuracy even though the ECG signals are represented at a lower resolution.
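The RR-interval timing feature is simple to derive from detected R-peak positions; a NumPy sketch (fs = 360 Hz matches the MIT-BIH database; the paper's exact feature definition may differ slightly):

```python
import numpy as np

def rr_features(r_peaks, fs=360):
    # r_peaks: sample indices of detected R peaks; fs: sampling rate in Hz
    rr = np.diff(r_peaks) / fs                # RR intervals in seconds
    pre_rr, post_rr = rr[:-1], rr[1:]         # intervals before/after each inner beat
    return pre_rr, post_rr, post_rr - pre_rr  # the difference is the dynamic feature
```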
Jinhua WANG Weiqiang WANG Guangmei XU Hongzhe LIU
In this paper, we describe the direct learning of an end-to-end mapping between under-/over-exposed images and well-exposed images. The mapping is represented as a deep convolutional neural network (CNN) that takes multiple-exposure images as input and outputs a high-quality fused image. Our CNN has a lightweight structure, yet delivers state-of-the-art fusion quality. Furthermore, for a given pixel, the influence of the surrounding pixels increases as their distance decreases; if the only pixels considered are those within the convolution kernel neighborhood, the final result suffers. To overcome this, the kernel size is often increased, but doing so also increases the number of network parameters and the training time. Instead, we present a method in which a number of sub-images of the source image are processed by the same CNN model, providing more neighborhood information for the convolution operation. Experimental results demonstrate that the proposed method achieves better performance in terms of both objective evaluation and visual quality.
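A minimal sketch of one plausible sub-image construction: spatially shifted, sub-sampled copies of the source that the same CNN can process, so each output pixel draws on a wider neighborhood without a larger kernel (the paper's exact scheme may differ):

```python
import numpy as np

def shifted_subimages(img, k=2):
    # k*k spatially shifted, sub-sampled copies of the source image
    H, W = img.shape[:2]
    img = img[: H - H % k, : W - W % k]   # crop so all copies match in size
    return np.stack([img[i::k, j::k] for i in range(k) for j in range(k)])
```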
Object detection has been a hot topic in image processing, computer vision, and pattern recognition. In recent years, training a model from labeled images using machine learning techniques has become popular. However, existing approaches usually ignore the relationships between training samples. To address this problem, a novel approach is proposed that trains a Siamese convolutional neural network on feature pairs and then fine-tunes the network with a small number of training samples. Since the proposed method considers not only the discriminative information between objects and background but also the relationships among intra-class features, it outperforms the state-of-the-art methods on real images.
Xiaoqing YE Jiamao LI Han WANG Xiaolin ZHANG
Accurate stereo matching remains challenging in the presence of weakly textured areas, discontinuities, and occlusions. In this letter, a novel stereo matching method is presented that leverages a feature ensemble network to compute matching costs, an error detection network to predict outliers, and priority-based occlusion disambiguation for refinement. Experiments on the Middlebury benchmark demonstrate that the proposed method yields competitive results against state-of-the-art algorithms.
Yang LI Zhuang MIAO Jiabao WANG Yafei ZHANG Hang LI
The latest deep hashing methods perform hash-code learning and image feature learning simultaneously using pairwise or triplet labels. However, generating all possible pairwise or triplet labels from the training dataset quickly becomes intractable, and the majority of those samples produce small losses, resulting in slow convergence. In this letter, we propose a novel deep discriminative supervised hashing method, called DDSH, which directly learns hash codes based on a new combined loss function. Compared with previous methods, our method takes full advantage of the annotated data in terms of both pairwise similarity and image identities. Extensive experiments on standard benchmarks demonstrate that our method preserves instance-level similarity and outperforms state-of-the-art deep hashing methods in image retrieval. Remarkably, our 16-bit binary representation surpasses the performance of existing 48-bit representations, demonstrating that our method can effectively improve both the speed and the precision of large-scale image retrieval systems.
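A sketch of a combined loss in the spirit described above, joining a pairwise-similarity term on relaxed hash codes with an identity-classification term; the exact formulation and the weighting are assumptions, not DDSH's published loss:

```python
import torch
import torch.nn.functional as F

def combined_hash_loss(codes, labels, logits, alpha=0.5):
    # codes: (B, bits) tanh-relaxed hash codes, labels: (B,) ids, logits: (B, C)
    sim = (labels[:, None] == labels[None, :]).float()   # 1 for same-identity pairs
    inner = codes @ codes.t() / codes.shape[1]           # normalized inner product in [-1, 1]
    pairwise = F.mse_loss(inner, 2.0 * sim - 1.0)        # pull similar codes together
    identity = F.cross_entropy(logits, labels)           # keep codes class-discriminative
    return alpha * pairwise + (1.0 - alpha) * identity
```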
Donghyun YOO Youngjoong KO Jungyun SEO
In this paper, we propose a deep learning model for classifying speech acts using a convolutional neural network (CNN). The model uses bigram features, including part-of-speech (POS) tag bigrams and dependency-relation bigrams, which represent the syntactic structure of utterances. Previous CNN-based classification approaches have commonly exploited word embeddings of morpheme unigrams. In contrast, the proposed model first extracts two kinds of bigram features that well reflect the syntactic structure of utterances and then represents them as vectors using a word embedding technique. As a result, the proposed model with bigram embeddings achieves an accuracy of 89.05%, a relative improvement of 2.8% over competitive models from previous studies.
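A minimal sketch of building POS-tag bigram tokens that can then be looked up in an embedding table like words; dependency-relation bigrams would be built analogously (illustrative, not the paper's exact pipeline):

```python
def pos_bigrams(tagged_utterance):
    # tagged_utterance: list of (morpheme, POS-tag) pairs
    tags = [tag for _, tag in tagged_utterance]
    return [f"{a}_{b}" for a, b in zip(tags, tags[1:])]

# e.g. pos_bigrams([("I", "PRP"), ("agree", "VBP"), (".", ".")]) -> ["PRP_VBP", "VBP_."]
```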