Haochen LYU Jianjun LI Yin YE Chin-Chen CHANG
The purpose of Facial Beauty Prediction (FBP) is to automatically assess facial attractiveness based on human aesthetics. Most neural network-based prediction methods do not consider the ranking information in the task. For scoring tasks like facial beauty prediction, there is abundant ranking information both between images and within images. Reasonable utilization of these information during training can greatly improve the performance of the model. In this paper, we propose a novel end-to-end Convolutional Neural Network (CNN) model based on ranking information of images, incorporating a Rank Module and an Adaptive Weight Module. We also design pairwise ranking loss functions to fully leverage the ranking information of images. Considering training efficiency and model inference capability, we choose ResNet-50 as the backbone network. We conduct experiments on the SCUT-FBP5500 dataset and the results show that our model achieves a new state-of-the-art performance. Furthermore, ablation experiments show that our approach greatly contributes to improving the model performance. Finally, the Rank Module with the corresponding ranking loss is plug-and-play and can be extended to any CNN model and any task with ranking information. Code is available at https://github.com/nehcoah/Rank-Info-Net.
Images captured in low-light environments have low visibility and high noise, which will seriously affect subsequent visual tasks such as target detection and face recognition. Therefore, low-light image enhancement is of great significance in obtaining high-quality images and is a challenging problem in computer vision tasks. A low-light enhancement model, LLFormer, based on the Vision Transformer, uses axis-based multi-head self-attention and a cross-layer attention fusion mechanism to reduce the complexity and achieve feature extraction. This algorithm can enhance images well. However, the calculation of the attention mechanism is complex and the number of parameters is large, which limits the application of the model in practice. In response to this problem, a lightweight module, PoolFormer, is used to replace the attention module with spatial pooling, which can increase the parallelism of the network and greatly reduce the number of model parameters. To suppress image noise and improve visual effects, a new loss function is constructed for model optimization. The experiment results show that the proposed method not only reduces the number of parameters by 49%, but also performs better in terms of image detail restoration and noise suppression compared with the baseline model. On the LOL dataset, the PSNR and SSIM were 24.098dB and 0.8575 respectively. On the MIT-Adobe FiveK dataset, the PSNR and SSIM were 27.060dB and 0.9490. The evaluation results on the two datasets are better than the current mainstream low-light enhancement algorithms.
Wen LIU Yixiao SHAO Shihong ZHAI Zhao YANG Peishuai CHEN
Automatic continuous tracking of objects involved in a construction project is required for such tasks as productivity assessment, unsafe behavior recognition, and progress monitoring. Many computer-vision-based tracking approaches have been investigated and successfully tested on construction sites; however, their practical applications are hindered by the tracking accuracy limited by the dynamic, complex nature of construction sites (i.e. clutter with background, occlusion, varying scale and pose). To achieve better tracking performance, a novel deep-learning-based tracking approach called the Multi-Domain Convolutional Neural Networks (MD-CNN) is proposed and investigated. The proposed approach consists of two key stages: 1) multi-domain representation of learning; and 2) online visual tracking. To evaluate the effectiveness and feasibility of this approach, it is applied to a metro project in Wuhan China, and the results demonstrate good tracking performance in construction scenarios with complex background. The average distance error and F-measure for the MDNet are 7.64 pixels and 67, respectively. The results demonstrate that the proposed approach can be used by site managers to monitor and track workers for hazard prevention in construction sites.
Fairuz SAFWAN MAHAD Masakazu IWAMURA Koichi KISE
3D reconstruction methods using neural networks are popular and have been studied extensively. However, the resulting models typically lack detail, reducing the quality of the 3D reconstruction. This is because the network is not designed to capture the fine details of the object. Therefore, in this paper, we propose two networks designed to capture both the coarse and fine details of the object to improve the reconstruction of the detailed parts of the object. To accomplish this, we design two networks. The first network uses a multi-scale architecture with skip connections to associate and merge features from other levels. For the second network, we design a multi-branch deep generative network that separately learns the local features, generic features, and the intermediate features through three different tailored components. In both network architectures, the principle entails allowing the network to learn features at different levels that can reconstruct the fine parts and the overall shape of the reconstructed 3D model. We show that both of our methods outperformed state-of-the-art approaches.
Wen SHAO Rei KAWAKAMI Takeshi NAEMURA
Previous studies on anomaly detection in videos have trained detectors in which reconstruction and prediction tasks are performed on normal data so that frames on which their task performance is low will be detected as anomalies during testing. This paper proposes a new approach that involves sorting video clips, by using a generative network structure. Our approach learns spatial contexts from appearances and temporal contexts from the order relationship of the frames. Experiments were conducted on four datasets, and we categorized the anomalous sequences by appearance and motion. Evaluations were conducted not only on each total dataset but also on each of the categories. Our method improved detection performance on both anomalies with different appearance and different motion from normality. Moreover, combining our approach with a prediction method produced improvements in precision at a high recall.
Xiang SHEN Dezhi HAN Chin-Chen CHANG Liang ZONG
Visual Question Answering (VQA) is multi-task research that requires simultaneous processing of vision and text. Recent research on the VQA models employ a co-attention mechanism to build a model between the context and the image. However, the features of questions and the modeling of the image region force irrelevant information to be calculated in the model, thus affecting the performance. This paper proposes a novel dual self-guided attention with sparse question networks (DSSQN) to address this issue. The aim is to avoid having irrelevant information calculated into the model when modeling the internal dependencies on both the question and image. Simultaneously, it overcomes the coarse interaction between sparse question features and image features. First, the sparse question self-attention (SQSA) unit in the encoder calculates the feature with the highest weight. From the self-attention learning of question words, the question features of larger weights are reserved. Secondly, sparse question features are utilized to guide the focus on image features to obtain fine-grained image features, and to also prevent irrelevant information from being calculated into the model. A dual self-guided attention (DSGA) unit is designed to improve modal interaction between questions and images. Third, the sparse question self-attention of the parameter δ is optimized to select these question-related object regions. Our experiments with VQA 2.0 benchmark datasets demonstrate that DSSQN outperforms the state-of-the-art methods. For example, the accuracy of our proposed model on the test-dev and test-std is 71.03% and 71.37%, respectively. In addition, we show through visualization results that our model can pay more attention to important features than other advanced models. At the same time, we also hope that it can promote the development of VQA in the field of artificial intelligence (AI).
Fairuz Safwan MAHAD Masakazu IWAMURA Koichi KISE
Neural network-based three-dimensional (3D) reconstruction methods have produced promising results. However, they do not pay particular attention to reconstructing detailed parts of objects. This occurs because the network is not designed to capture the fine details of objects. In this paper, we propose a network designed to capture both the coarse and fine details of objects to improve the reconstruction of the fine parts of objects.
Yusuke HARA Xueting WANG Toshihiko YAMASAKI
Video inpainting is a task of filling missing regions in videos. In this task, it is important to efficiently use information from other frames and generate plausible results with sufficient temporal consistency. In this paper, we present a video inpainting method jointly using affine transformation and deformable convolutions for frame alignment. The former is responsible for frame-scale rough alignment and the latter performs pixel-level fine alignment. Our model does not depend on 3D convolutions, which limits the temporal window, or troublesome flow estimation. The proposed method achieves improved object removal results and better PSNR and SSIM values compared with previous learning-based methods.
Karthikeyan PANJAPPAGOUNDER RAJAMANICKAM Sakthivel PERIYASAMY
Background subtraction algorithms generate a background model of the monitoring scene and compare the background model with the current video frame to detect foreground objects. In general, most of the background subtraction algorithms fail to detect foreground objects when the scene illumination changes. An entropy based background subtraction algorithm is proposed to address this problem. The proposed method adapts to illumination changes by updating the background model according to differences in entropy value between the current frame and the previous frame. This entropy based background modeling can efficiently handle both sudden and gradual illumination variations. The proposed algorithm is tested in six video sequences and compared with four algorithms to demonstrate its efficiency in terms of F-score, similarity and frame rate.
Ming LI Li SHI Xudong CHEN Sidan DU Yang LI
The large computational complexity makes stereo matching a big challenge in real-time application scenario. The problem of stereo matching in a video sequence is slightly different with that in a still image because there exists temporal correlation among video frames. However, no existing method considered temporal consistency of disparity for algorithm acceleration. In this work, we proposed a scheme called the dynamic disparity range (DDR) to optimize matching cost calculation and cost aggregation steps by narrowing disparity searching range, and a scheme called temporal cost aggregation path to optimize the cost aggregation step. Based on the schemes, we proposed the DDR-SGM and the DDR-MCCNN algorithms for the stereo matching in video sequences. Evaluation results showed that the proposed algorithms significantly reduced the computational complexity with only very slight loss of accuracy. We proved that the proposed optimizations for the stereo matching are effective and the temporal consistency in stereo video is highly useful for either improving accuracy or reducing computational complexity.
Junfeng SHI Wenming MA Peng SONG
To learn the working situation of rod-pumped wells under ground, we always need to analyze dynamometer diagrams, which are generated by the load sensor and displacement sensor. Rod-pumped wells are usually located in the places with extreme weather, and these sensors are installed on some special oil equipments in the open air. As time goes by, sensors are prone to generating unstable and incorrect data. Unfortunately, load sensors are too expensive to frequently reinstall. Therefore, the resulting dynamometer diagrams sometimes cannot make an accurate diagnosis. Instead, as an absolutely necessary equipment of the rod-pumped well, the electric motor has much longer life and cannot be easily impacted by the weather. The electric power curve during a swabbing period can also reflect the working situation under ground, but is much harder to explain than the dynamometer diagram. This letter presented a novel deep learning architecture, which can transform the electric power curve into the dimensionless dynamometer diagram image. We conduct our experiments on a real-world dataset, and the results show that our method can get an impressive transformation accuracy.
Shanlin XIAO Tsuyoshi ISSHIKI Dongju LI Hiroaki KUNIEDA
Object detection is an essential and expensive process in many computer vision systems. Standard off-the-shelf embedded processors are hard to achieve performance-power balance for implementation of object detection applications. In this work, we explore an Application Specific Instruction set Processor (ASIP) for object detection using Histogram of Oriented Gradients (HOG) feature. Algorithm simplifications are adopted to reduce memory bandwidth requirements and mathematical complexity without losing reliability. Also, parallel histogram generation and on-the-fly Support Vector Machine (SVM) calculation architecture are employed to reduce the necessary cycle counts. The HOG algorithm on the proposed ASIP was accelerated by a factor of 63x compared to the pure software implementation. The ASIP was synthesized for a standard 90nm CMOS library, with a silicon area of 1.31mm2 and 47.8mW power consumption at a 200MHz frequency. Our object detection processor can achieve 42 frames-per-second (fps) on VGA video. The evaluation and implementation results show that the proposed ASIP is both area-efficient and power-efficient while being competitive with commercial CPUs/DSPs. Furthermore, our ASIP exhibits comparable performance even with hard-wire designs.
Bin YAO Lifeng HE Shiying KANG Xiao ZHAO Yuyan CHAO
The Euler number of a binary image is an important topological property for pattern recognition, image analysis, and computer vision. A famous method for computing the Euler number of a binary image is by counting certain patterns of bit-quads in the image, which has been improved by scanning three rows once to process two bit-quads simultaneously. This paper studies the bit-quad-based Euler number computing problem. We show that for a bit-quad-based Euler number computing algorithm, with the increase of the number of bit-quads being processed simultaneously, on the one hand, the average number of pixels to be checked for processing a bit-quad will decrease in theory, and on the other hand, the length of the codes for implementing the algorithm will increase, which will make the algorithm less efficient in practice. Experimental results on various types of images demonstrated that scanning five rows once and processing four bit-quads simultaneously is the optimal tradeoff, and that the optimal bit-quad-based Euler number computing algorithm is more efficient than other Euler number computing algorithms.
Shanlin XIAO Tsuyoshi ISSHIKI Dongju LI Hiroaki KUNIEDA
Object detection is at the heart of nearly all the computer vision systems. Standard off-the-shelf embedded processors are hard to meet the trade-offs among performance, power consumption and flexibility required by object detection applications. Therefore, this paper presents an Application Specific Instruction set Processor (ASIP) for object detection using AdaBoost-based learning algorithm with Haar-like features as weak classifiers. Algorithm optimizations are employed to reduce memory bandwidth requirements without losing reliability. In the proposed ASIP, Single Instruction Multiple Data (SIMD) architecture is adopted for fully exploiting data-level parallelism inherent to the target algorithm. With adding pipeline stages, application-specific hardware components and custom instructions, the AdaBoost algorithm is accelerated by a factor of 13.7x compared to the optimized pure software implementation. Compared with ARM946 and TMS320C64+, our ASIP shows 32x and 7x better throughput, 10x and 224x better area efficiency, 6.8x and 18.8x better power efficiency, respectively. Furthermore, compared to hard-wired designs, evaluation results show an advantage of the proposed architecture in terms of chip area efficiency while maintain a reliable performance and achieve real-time object detection at 32fps on VGA video.
Inter-person occlusion handling is a critical issue in the field of tracking, and it has been extensively researched. Several state-of-the-art methods have been proposed, such as focusing on the appearance of the targets or utilizing knowledge of the scene. In contrast with the approaches proposed in the literature, we propose to address this issue using a social interaction model, which allows us to explore spatio-temporal information pertaining to the targets involved in the occlusion situation. Our experimental results show promising results compared with those obtained using other methods.
Bin YAO Lifeng HE Shiying KANG Xiao ZHAO Yuyan CHAO
The Euler number is an important topological property in a binary image, and it can be computed by counting certain bit-quads in the binary image. This paper proposes a further improved bit-quad-based algorithm for computing the Euler number. By scanning image rows two by two and utilizing the information obtained while processing the previous pixels, the number of pixels to be checked for processing a bit-quad can be decreased from 2 to 1.5. Experimental results demonstrated that our proposed algorithm significantly outperforms conventional Euler number computing algorithms.
Lifeng HE Bin YAO Xiao ZHAO Yun YANG Yuyan CHAO Atsushi OHTA
This paper proposes a graph-theory-based Euler number computing algorithm. According to the graph theory and the analysis of a mask's configuration, the Euler number of a binary image in our algorithm is calculated by counting four patterns of the mask. Unlike most conventional Euler number computing algorithms, we do not need to do any processing of the background pixels. Experimental results demonstrated that our algorithm is much more efficient than conventional Euler number computing algorithms.
Lifeng HE Xiao ZHAO Bin YAO Yun YANG Yuyan CHAO
This paper proposes an efficient two-scan labeling algorithm for binary hexagonal images. Unlike conventional labeling algorithms, which process pixels one by one in the first scan, our algorithm processes pixels two by two. We show that using our algorithm, we can check a smaller number of pixels. Experimental results demonstrated that our method is more efficient than the algorithm extended straightly from the corresponding labeling algorithm for rectangle binary images.
Bin YAO Hua WU Yun YANG Yuyan CHAO Atsushi OHTA Haruki KAWANAKA Lifeng HE
The Euler number of a binary image is an important topological property for pattern recognition, and can be calculated by counting certain bit-quads in the image. This paper proposes an efficient strategy for improving the bit-quad-based Euler number computing algorithm. By use of the information obtained when processing the previous bit quad, the number of times that pixels must be checked in processing a bit quad decreases from 4 to 2. Experiments demonstrate that an algorithm with our strategy significantly outperforms conventional Euler number computing algorithms.
Alexis ANDRE Suguru SAITO Masayuki NAKAJIMA
We propose a sketch-based modeling system where all user input is performed from a unique viewpoint. The strokes drawn by the user must not then be restricted to the drawing plane: their orientation in the 3D space is automatically determined by the system. The desired surface is reconstructed from a grid made of two groups of similar lines, that are considered co-planar. The orientation of the two sets of planes is determined by assuming that at the intersection of a representative line of each group, those two lines are perpendicular.