Lie GUO Yibing ZHAO Jiandong GAO
The commonly used object detection algorithm based on convolutional neural network is difficult to meet the real-time requirement on embedded platform due to its large size of model, large amount of calculation, and long inference time. It is necessary to use model compression to reduce the amount of network calculation and increase the speed of network inference. This paper conducts compression of vehicle and pedestrian detection network by pruning and removing redundant parameters. The vehicle and pedestrian detection network is trained based on YOLOv3 model by using K-means++ to cluster the anchor boxes. The detection accuracy is improved by changing the proportion of categorical losses and regression losses for each category in the loss function because of the unbalanced number of targets in the dataset. A layer and channel pruning algorithm is proposed by combining global channel pruning thresholds and L1 norm, which can reduce the time cost of the network layer transfer process and the amount of computation. Network layer fusion based on TensorRT is performed and inference is performed using half-precision floating-point to improve the speed of inference. Results show that the vehicle and pedestrian detection compression network pruned 84% channels and 15 Shortcut modules can reduce the size by 32% and the amount of calculation by 17%. While the network inference time can be decreased to 21 ms, which is 1.48 times faster than the network pruned 84% channels.
Xianyu WANG Cong LI Heyi LI Rui ZHANG Zhifeng LIANG Hai WANG
Visual object tracking is always a challenging task in computer vision. During the tracking, the shape and appearance of the target may change greatly, and because of the lack of sufficient training samples, most of the online learning tracking algorithms will have performance bottlenecks. In this paper, an improved real-time algorithm based on deep learning features is proposed, which combines multi-feature fusion, multi-scale estimation, adaptive updating of target model and re-detection after target loss. The effectiveness and advantages of the proposed algorithm are proved by a large number of comparative experiments with other excellent algorithms on large benchmark datasets.
Yongtang BAO Pengfei ZHOU Yue QI Zhihui WANG Qing FAN
A frontal and realistic face image was synthesized from a single profile face image. It has a wide range of applications in face recognition. Although the frontal face method based on deep learning has made substantial progress in recent years, there is still no guarantee that the generated face has identity consistency and illumination consistency in a significant posture. This paper proposes a novel pixel-based feature regression generative adversarial network (PFR-GAN), which can learn to recover local high-frequency details and preserve identity and illumination frontal face images in an uncontrolled environment. We first propose a Reslu block to obtain richer feature representation and improve the convergence speed of training. We then introduce a feature conversion module to reduce the artifacts caused by face rotation discrepancy, enhance image generation quality, and preserve more high-frequency details of the profile image. We also construct a 30,000 face pose dataset to learn about various uncontrolled field environments. Our dataset includes ages of different races and wild backgrounds, allowing us to handle other datasets and obtain better results. Finally, we introduce a discriminator used for recovering the facial structure of the frontal face images. Quantitative and qualitative experimental results show our PFR-GAN can generate high-quality and high-fidelity frontal face images, and our results are better than the state-of-art results.
In this paper, we propose improved Generative Adversarial Networks with attention module in Generator, which can enhance the effectiveness of Generator. Furthermore, recent work has shown that Generator conditioning affects GAN performance. Leveraging this insight, we explored the effect of different normalization (spectral normalization, instance normalization) on Generator and Discriminator. Moreover, an enhanced loss function called Wasserstein Divergence distance, can alleviate the problem of difficult to train module in practice.
In recent years, driver's visual attention has been actively studied for driving automation technology. However, the number of models is few to perceive an insight understanding of driver's attention in various moments. All attention models process multi-level image representations by a two-stream/multi-stream network, increasing the computational cost due to an increment of model parameters. However, multi-level image representation such as optical flow plays a vital role in tasks involving videos. Therefore, to reduce the computational cost of a two-stream network and use multi-level image representation, this work proposes a single stream driver's visual attention model for a critical situation. The experiment was conducted using a publicly available critical driving dataset named BDD-A. Qualitative results confirm the effectiveness of the proposed model. Moreover, quantitative results highlight that the proposed model outperforms state-of-the-art visual attention models according to CC and SIM. Extensive ablation studies verify the presence of optical flow in the model, the position of optical flow in the spatial network, the convolution layers to process optical flow, and the computational cost compared to a two-stream model.
Conghui LI Quanlin ZHONG Baoyin LI
In recent years, the applications of deep learning have facilitated the development of green intelligent transportation system (ITS), and carbon dioxide estimation has been one of important issues in green ITS. Furthermore, the carbon dioxide estimation could be modelled as the fuel consumption estimation. Therefore, a clustering-based neural network is proposed to analyze clusters in accordance with fuel consumption behaviors and obtains the estimated fuel consumption and the estimated carbon dioxide. In experiments, the mean absolute percentage error (MAPE) of the proposed method is only 5.61%, and the performance of the proposed method is higher than other methods.
Masaaki MIYASHITA Norihiko SHINOMIYA Daisuke KASAMATSU Genya ISHIGAKI
Online social networks have increased their impact on the real world, which motivates information senders to control the propagation process of information to promote particular actions of online users. However, the existing works on information provisioning seem to oversimplify the users' decision-making process that involves information reception, internal actions of social networks, and external actions of social networks. In particular, characterizing the best practices of information provisioning that promotes the users' external actions is a complex task due to the complexity of the propagation process in OSNs, even when the variation of information is limited. Therefore, we propose a new information diffusion model that distinguishes user behaviors inside and outside of OSNs, and formulate an optimization problem to maximize the number of users who take the external actions by providing information over multiple rounds. Also, we define a robust provisioning policy for the problem, which selects a message sequence to maximize the expected number of desired users under the probabilistic uncertainty of OSN settings. Our experiment results infer that there could exist an information provisioning policy that achieves nearly-optimal solutions in different types of OSNs. Furthermore, we empirically demonstrate that the proposed robust policy can be such a universally optimal solution.
Jingjing YANG Yuchun GUO Yishuai CHEN
Microservice architecture has been widely adopted for large-scale applications because of its benefits of scalability, flexibility, and reliability. However, microservice architecture also proposes new challenges in diagnosing root causes of performance degradation. Existing methods rely on labeled data and suffer a high computation burden. This paper proposes MicroState, an unsupervised and lightweight method to pinpoint the root cause with detailed descriptions. We decompose root cause diagnosis into element location and detailed reason identification. To mitigate the impact of element heterogeneity and dynamic invocations, MicroState generates elements' invoked states, quantifies elements' abnormality by warping-based state comparison, and infers the anomalous group. MicroState locates the root cause element with the consideration of anomaly frequency and persistency. To locate the anomalous metric from diverse metrics, MicroState extracts metrics' trend features and evaluates metrics' abnormality based on their trend feature variation, which reduces the reliance on anomaly detectors. Our experimental evaluation based on public data of the Artificial intelligence for IT Operations Challenge (AIOps Challenge 2020) shows that MicroState locates root cause elements with 87% precision and diagnoses anomaly reasons accurately.
Hiroshi YAMAMOTO Shota NISHIURA Yoshihiro HIGASHIURA
In order to improve crop production and efficiency of farming operations, an IoT (Internet of Things) system for remote monitoring has been attracting a lot of attention. The existing studies have proposed agricultural sensing systems such that environmental information is collected from many sensor nodes installed in farmland through wireless communications (e.g., Wi-Fi, ZigBee). Especially, Low-Power Wide-Area (LPWA) is a focus as a candidate for wireless communication that enables the support of vast farmland for a long time. However, it is difficult to achieve long distance communication even when using the LPWA because a clear line of sight is difficult to keep due to many obstacles such as crops and agricultural machinery in the farmland. In addition, a sensor node cannot run permanently on batteries because the battery capacity is not infinite. On the other hand, an Unmanned Aerial Vehicle (UAV) that can move freely and stably in the sky has been leveraged for agricultural sensor network systems. By utilizing a UAV as the gateway of the sensor network, the gateway can move to the appropriate location to ensure a clear line of sight from the sensor nodes. In addition, the coverage area of the sensor network can be expanded as the UAV travels over a wide area even when short-range and ultra-low-power wireless communication (e.g., Bluetooth Low Energy (BLE)) is adopted. Furthermore, various wireless technologies (e.g., wireless power transfer, wireless positioning) that have the possibility to improve the coverage area and the lifetime of the sensor network have become available. Therefore, in this study, we propose and develop two kinds of new agricultural sensing systems utilizing a UAV and various wireless technologies. The objective of the proposed system is to provide the solution for achieving the wide-area and long-term sensing for the vast farmland. Depending on which problem is in a priority, the proposed system chooses one of two designs. The first design of the system attempts to achieve the wide-area sensing, and so it is based on the LPWA for wireless communication. In the system, to efficiently collect the environmental information, the UAV autonomously travels to search for the locations to maintain the good communication properties of the LPWA to the sensor nodes dispersed over a wide area of farmland. In addition, the second design attempts to achieve the long-term sensing, so it is based on BLE, a typical short-range and ultra-low-power wireless communication technology. In this design, the UAV autonomously flies to the location of sensor nodes and supplies power to them using a wireless power transfer technology for achieving a battery-less sensor node. Through experimental evaluations using a prototype system, it is confirmed that the combination of the UAV and various wireless technologies has the possibility to achieve a wide-area and long-term sensing system for monitoring vast farmland.
Yosuke OBE Hiroaki YAMAMOTO Hiroshi FUJIWARA
Let us consider a regular expression r of length m and a text string T of length n over an alphabet Σ. Then, the RE minimal substring search problem is to find all minimal substrings of T matching r. Yamamoto proposed O(mn) time and O(m) space algorithm using a Thompson automaton. In this paper, we improve Yamamoto's algorithm by introducing parallelism. The proposed algorithm runs in O(mn) time in the worst case and in O(mn/p) time in the best case, where p denotes the number of processors. Besides, we show a parameter related to the parallel time of the proposed algorithm. We evaluate the algorithm experimentally.
The high-precision indoor positioning technology has gradually become one of the research hotspots in indoor mobile robots. Relax and Recover (RAR) is an indoor positioning algorithm using distance observations. The algorithm restores the robot's trajectory through curve fitting and does not require time synchronization of observations. The positioning can be successful with few observations. However, the algorithm has the disadvantages of poor resistance to gross errors and cannot be used for real-time positioning. In this paper, while retaining the advantages of the original algorithm, the RAR algorithm is improved with the adaptive Kalman filter (AKF) based on the innovation sequence to improve the anti-gross error performance of the original algorithm. The improved algorithm can be used for real-time navigation and positioning. The experimental validation found that the improved algorithm has a significant improvement in accuracy when compared to the original RAR. When comparing to the extended Kalman filter (EKF), the accuracy is also increased by 12.5%, which can be used for high-precision positioning of indoor mobile robots.
Tianbin WANG Ruiyang HUANG Nan HU Huansha WANG Guanghan CHU
Chinese Named Entity Recognition is the fundamental technology in the field of the Chinese Natural Language Process. It is extensively adopted into information extraction, intelligent question answering, and knowledge graph. Nevertheless, due to the diversity and complexity of Chinese, most Chinese NER methods fail to sufficiently capture the character granularity semantics, which affects the performance of the Chinese NER. In this work, we propose DSKE-Chinese NER: Chinese Named Entity Recognition based on Dictionary Semantic Knowledge Enhancement. We novelly integrate the semantic information of character granularity into the vector space of characters and acquire the vector representation containing semantic information by the attention mechanism. In addition, we verify the appropriate number of semantic layers through the comparative experiment. Experiments on public Chinese datasets such as Weibo, Resume and MSRA show that the model outperforms character-based LSTM baselines.
Kenya SAKAMOTO Shizuka SHIRAI Noriko TAKEMURA Jason ORLOSKY Hiroyuki NAGATAKI Mayumi UEDA Yuki URANISHI Haruo TAKEMURA
This study explores significant eye-gaze features that can be used to estimate subjective difficulty while reading educational comics. Educational comics have grown rapidly as a promising way to teach difficult topics using illustrations and texts. However, comics include a variety of information on one page, so automatically detecting learners' states such as subjective difficulty is difficult with approaches such as system log-based detection, which is common in the Learning Analytics field. In order to solve this problem, this study focused on 28 eye-gaze features, including the proposal of three new features called “Variance in Gaze Convergence,” “Movement between Panels,” and “Movement between Tiles” to estimate two degrees of subjective difficulty. We then ran an experiment in a simulated environment using Virtual Reality (VR) to accurately collect gaze information. We extracted features in two unit levels, page- and panel-units, and evaluated the accuracy with each pattern in user-dependent and user-independent settings, respectively. Our proposed features achieved an average F1 classification-score of 0.721 and 0.742 in user-dependent and user-independent models at panel unit levels, respectively, trained by a Support Vector Machine (SVM).
In this paper, we propose a new method for non-dominant limb training. The method is that a learner aims at a motion which is generated by reversing his/her own motion of dominant limb, when he/she tries to train himself/herself for non-dominant limb training. In addition, we designed and developed interface for the new method which can select feedback types. One is an interface using AR and sound, and the other is an interface using AR and vibration. We found that vibration feedback was effective for non-dominant hand training of pitching motion, while sound feedback was not so effective as vibration.
Wenkai LIU Cuizhu QIN Menglong WU Wenle BAI Hongxia DONG
Pose estimation is a research hot spot in computer vision tasks and the key to computer perception of human activities. The core concept of human pose estimation involves describing the motion of the human body through major joint points. Large receptive fields and rich spatial information facilitate the keypoint localization task, and how to capture features on a larger scale and reintegrate them into the feature space is a challenge for pose estimation. To address this problem, we propose a multi-scale convergence network (MSCNet) with a large receptive field and rich spatial information. The structure of the MSCNet is based on an hourglass network that captures information at different scales to present a consistent understanding of the whole body. The multi-scale receptive field (MSRF) units provide a large receptive field to obtain rich contextual information, which is then selectively enhanced or suppressed by the Squeeze-Excitation (SE) attention mechanism to flexibly perform the pose estimation task. Experimental results show that MSCNet scores 73.1% AP on the COCO dataset, an 8.8% improvement compared to the mainstream CMUPose method. Compared to the advanced CPN, the MSCNet has 68.2% of the computational complexity and only 55.4% of the number of parameters.
Yue XIE Ruiyu LIANG Zhenlin LIANG Xiaoyan ZHAO Wenhao ZENG
To enhance the emotion feature and improve the performance of speech emotion recognition, an attention mechanism is employed to recognize the important information in both time and feature dimensions. In the time dimension, multi-heads attention is modified with the last state of the long short-term memory (LSTM)'s output to match the time accumulation characteristic of LSTM. In the feature dimension, scaled dot-product attention is replaced with additive attention that refers to the method of the state update of LSTM to construct multi-heads attention. This means that a nonlinear change replaces the linear mapping in classical multi-heads attention. Experiments on IEMOCAP datasets demonstrate that the attention mechanism could enhance emotional information and improve the performance of speech emotion recognition.
Min Ho KWAK Youngwoo KIM Kangin LEE Jae Young CHOI
This letter proposes a novel lightweight deep learning object detector named LW-YOLOv4-tiny, which incorporates the convolution block feature addition module (CBFAM). The novelty of LW-YOLOv4-tiny is the use of channel-wise convolution and element-wise addition in the CBFAM instead of utilizing the concatenation of different feature maps. The model size and computation requirement are reduced by up to 16.9 Mbytes, 5.4 billion FLOPs (BFLOPS), and 11.3 FPS, which is 31.9%, 22.8%, and 30% smaller and faster than the most recent version of YOLOv4-tiny. From the MSCOCO2017 and PASCAL VOC2012 benchmarks, LW-YOLOv4-tiny achieved 40.2% and 69.3% mAP, respectively.
Meng ZHAO Junfeng WU Hong YU Haiqing LI Jingwen XU Siqi CHENG Lishuai GU Juan MENG
Accurate fish detection is of great significance in aquaculture. However, the non-uniform strong reflection in aquaculture ponds will affect the precision of fish detection. This paper combines YOLOv4 and CVAE to accurately detect fishes in the image with non-uniform strong reflection, in which the reflection in the image is removed at first and then the reflection-removed image is provided for fish detecting. Firstly, the improved YOLOv4 is applied to detect and mask the strong reflective region, to locate and label the reflective region for the subsequent reflection removal. Then, CVAE is combined with the improved YOLOv4 for inferring the priori distribution of the Reflection region and restoring the Reflection region by the distribution so that the reflection can be removed. For further improving the quality of the reflection-removed images, the adversarial learning is appended to CVAE. Finally, YOLOV4 is used to detect fishes in the high quality image. In addition, a new image dataset of pond cultured takifugu rubripes is constructed,, which includes 1000 images with fishes annotated manually, also a synthetic dataset including 2000 images with strong reflection is created and merged with the generated dataset for training and verifying the robustness of the proposed method. Comprehensive experiments are performed to compare the proposed method with the state-of-the-art fish detecting methods without reflection removal on the generated dataset. The results show that the fish detecting precision and recall of the proposed method are improved by 2.7% and 2.4% respectively.
Chengkai CAI Kenta IWAI Takanobu NISHIURA
The acquisition of distant sound has always been a hot research topic. Since sound is caused by vibration, one of the best methods for measuring distant sound is to use a laser Doppler vibrometer (LDV). This laser has high directivity, that enables it to acquire sound from far away, which is of great practical use for disaster relief and other situations. However, due to the vibration characteristics of the irradiated object itself and the reflectivity of its surface (or other reasons), the acquired sound is often lacking frequency components in certain frequency bands and is mixed with obvious noise. Therefore, when using LDV to acquire distant speech, if we want to recognize the actual content of the speech, it is necessary to enhance the acquired speech signal in some way. Conventional speech enhancement methods are not generally applicable due to the various types of degradation in observed speech. Moreover, while several speech enhancement methods for LDV have been proposed, they are only effective when the irradiated object is known. In this paper, we present a speech enhancement method for LDV that can deal with unknown irradiated objects. The proposed method is composed of noise reduction, pitch detection, power spectrum envelope estimation, power spectrum reconstruction, and phase estimation. Experimental results demonstrate the effectiveness of our method for enhancing the acquired speech with unknown irradiated objects.
In this letter, we consider the problem of joint selection of transmitters and receivers in a distributed multi-input multi-output radar network for localization. Different from previous works, we consider a more mathematically challenging but generalized situation that the transmitting signals are not perfectly orthogonal. Taking Cramér Rao lower bound as performance metric, we propose a scheme of joint selection of transmitters and receivers (JSTR) aiming at optimizing the localization performance under limited number of nodes. We propose a bi-convex relaxation to replace the resultant NP hard non-convex problem. Using the bi-convexity, the surrogate problem can be efficiently resolved by nonlinear alternating direction method of multipliers. Simulation results reveal that the proposed algorithm has very close performance compared with the computationally intensive but global optimal exhaustive search method.