The location and feature representation of an object's parts play key roles in fine-grained visual recognition. To improve the final recognition accuracy without any bounding-box/part annotations, many studies adopt object location networks that propose bounding boxes/part regions using only category labels, and then crop the images into partial images to help the classification network make the final decision. In our work, to propose more informative partial images and effectively extract discriminative features from the original and partial images, we propose a two-stage approach that fuses the original features and partial features by evaluating and ranking the informativeness of the partial images. Experimental results show that our proposed approach achieves excellent performance on two benchmark datasets, which demonstrates its effectiveness.
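As a rough illustration of the fuse-by-ranking idea described above, the following Python sketch scores each partial image, keeps the top-ranked ones, and concatenates their features with the original-image features; the scoring and fusion choices here are assumptions for illustration, not the paper's exact design.

```python
import numpy as np

def rank_and_fuse(original_feat, part_feats, part_scores, top_k=2):
    """Keep the top-k most informative partial images (ranked by an assumed
    informativeness score, e.g. classifier confidence) and concatenate their
    features with the original-image features for the final classifier."""
    order = np.argsort(part_scores)[::-1][:top_k]        # most informative parts first
    selected = [part_feats[i] for i in order]
    return np.concatenate([original_feat] + selected)    # fused descriptor
```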
Farzin MATIN Yoosoo JEONG Hanhoon PARK
Multiscale retinex is one of the most popular image enhancement methods. However, its control parameters, such as the Gaussian kernel sizes, gain, and offset, must be tuned carefully according to the image content. In this letter, we propose a new method that optimizes the parameters using particle swarm optimization and a multi-objective function. The method iteratively evaluates the visual quality (i.e., brightness, contrast, and colorfulness) of the enhanced image with the multi-objective function while subtly adjusting the parameters. Experimental results show that the proposed method achieves better image quality, both qualitatively and quantitatively, than other image enhancement methods.
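A minimal sketch of the parameter search is given below, assuming a hypothetical enhance(image, params) multiscale-retinex routine and an equally weighted sum of brightness, contrast, and colorfulness as the multi-objective function; the bounds, weights, and PSO settings are illustrative assumptions rather than the letter's exact configuration.

```python
import numpy as np

def quality_score(img):
    """Multi-objective score: brightness + contrast + colorfulness.
    The equal weighting is an assumption, not the letter's exact objective."""
    brightness = img.mean()
    contrast = img.std()
    rg = img[..., 0] - img[..., 1]
    yb = 0.5 * (img[..., 0] + img[..., 1]) - img[..., 2]
    colorfulness = np.sqrt(rg.std() ** 2 + yb.std() ** 2)
    return brightness + contrast + colorfulness

def pso_tune(image, enhance, n_particles=20, n_iters=50):
    """Particle swarm optimization over retinex parameters
    [sigma1, sigma2, sigma3, gain, offset]; bounds are illustrative."""
    lo = np.array([5.0, 40.0, 120.0, 0.5, -50.0])
    hi = np.array([30.0, 120.0, 300.0, 3.0, 50.0])
    x = lo + np.random.rand(n_particles, 5) * (hi - lo)   # positions
    v = np.zeros_like(x)                                   # velocities
    pbest, pbest_f = x.copy(), np.full(n_particles, -np.inf)
    gbest, gbest_f = x[0].copy(), -np.inf
    for _ in range(n_iters):
        for i in range(n_particles):
            f = quality_score(enhance(image, x[i]))
            if f > pbest_f[i]:
                pbest_f[i], pbest[i] = f, x[i].copy()
            if f > gbest_f:
                gbest_f, gbest = f, x[i].copy()
        r1, r2 = np.random.rand(n_particles, 5), np.random.rand(n_particles, 5)
        v = 0.7 * v + 1.5 * r1 * (pbest - x) + 1.5 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
    return gbest
```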
Xiaoxuan GUO Renxi GONG Haibo BAO Zhenkun LU
It is well known that large-scale integration of wind power into the power system affects the economic and environmental objectives of power generation scheduling, and also brings new challenges to traditional deterministic power generation scheduling because of the intermittency and randomness of wind power. To deal with these problems, a multi-objective optimization dispatch method for wind-thermal power systems is proposed. The method can be described as follows. A multi-objective interval power generation scheduling model of the wind-thermal power system is first established by describing the wind speed on the wind farm as an interval variable and choosing the minimization of the fuel cost and pollution-gas emission cost of the thermal power units as the objective functions. Then, the optimistic and pessimistic Pareto frontiers of the multi-objective interval power generation scheduling problem are obtained by solving the model with an improved normal boundary intersection (NBI) method combined with a bilevel optimization method. Finally, the optimistic and pessimistic compromise solutions are determined by a distance evaluation method. The calculation results for the 16-unit 174-bus system show that the proposed method obtains uniform optimistic and pessimistic Pareto frontiers and quantifies the impact of wind-speed interval uncertainty on the economic and environmental indicators. In addition, it is verified that the Pareto frontier of the actual scenario lies between the optimistic and pessimistic Pareto frontiers, and the influence of different wind power penetration levels on the optimistic and pessimistic Pareto frontiers is analyzed.
Faster R-CNN uses a region proposal network, which consists of a single-scale convolution filter and fully connected networks, to localize detected regions. However, a single-scale filter is not enough to detect the full regions of characters. In this letter, we propose a simple but effective approach, i.e., utilizing convolution filters of various sizes, to accurately detect Chinese characters of multiple scales in documents. We experimentally verified that our method improves IoU by 4% and the detection rate by 3% compared with the previous single-scale Faster R-CNN method.
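The following PyTorch sketch illustrates the idea of replacing the single-scale filter with parallel filters of several sizes in the proposal head; the kernel sizes, channel counts, and anchor number are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleRPNHead(nn.Module):
    """Region proposal head with parallel convolution filters of several sizes.
    The specific kernel sizes (3, 5, 7) and channel counts are illustrative
    assumptions, not the letter's exact configuration."""
    def __init__(self, in_channels=512, mid_channels=256, num_anchors=9):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, mid_channels, k, padding=k // 2)
            for k in (3, 5, 7)
        ])
        fused = mid_channels * 3
        self.cls_score = nn.Conv2d(fused, num_anchors * 2, 1)   # object / not object
        self.bbox_pred = nn.Conv2d(fused, num_anchors * 4, 1)   # box regression

    def forward(self, feature_map):
        fused = torch.cat([torch.relu(b(feature_map)) for b in self.branches], dim=1)
        return self.cls_score(fused), self.bbox_pred(fused)

# Usage: scores, deltas = MultiScaleRPNHead()(torch.randn(1, 512, 40, 40))
```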
Xinyu ZHU Jun ZHANG Gengsheng CHEN
Recent top-performing object detectors usually depend on a two-stage approach, which benefits from its region proposal and refinement practice but suffers from low detection speed. By contrast, one-stage approaches have the advantage of high efficiency but sacrifice accuracy to some extent. In this paper, we propose a novel single-shot object detection network that inherits the merits of both. Motivated by the idea of semantically enriching the convolutional features within a typical deep detector, we propose two novel modules: 1) by modeling the semantic interactions between channels and the long-range dependencies between spatial positions, the self-attending module generates both channel and position attention and enhances the original convolutional features in a self-guided manner; 2) leveraging the class-discriminative localization ability of a classification-trained CNN, the semantic activating module learns a semantically meaningful convolutional response that augments low-level convolutional features with strong class-specific semantic information. The resulting self-attending and semantic activating network (ASAN) achieves better accuracy than two-stage methods and is able to run in real time. Comprehensive experiments on PASCAL VOC indicate that ASAN achieves state-of-the-art detection performance with high efficiency.
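A hedged sketch of the self-attending idea (channel attention plus position attention applied in a self-guided, residual manner) is shown below in PyTorch; the reduction ratio, projection sizes, and residual fusion are assumptions, not the paper's exact module design.

```python
import torch
import torch.nn as nn

class SelfAttendingModule(nn.Module):
    """Sketch of self-guided feature enhancement with channel and position
    attention; the reduction ratio and fusion by addition are assumptions."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # channel attention (squeeze-and-excitation style)
        self.channel_fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        # position attention over spatial long-range dependencies
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_c = x * self.channel_fc(x)                       # channel-attended features
        q = self.query(x_c).flatten(2).transpose(1, 2)     # (b, hw, c/8)
        k = self.key(x_c).flatten(2)                       # (b, c/8, hw)
        v = self.value(x_c).flatten(2)                     # (b, c, hw)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)      # (b, hw, hw)
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return x + out                                     # residual enhancement
```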
Rui CHEN Ying TONG Ruiyu LIANG
Deep neural networks have achieved great success in visual tracking by learning a generic representation and leveraging large amounts of training data to improve performance. However, most generic object trackers are trained from scratch online and do not benefit from the large number of videos available for offline training. We present a real-time generic object tracker capable of incorporating temporal information into its model, learning from many examples offline, and quickly updating online. During training, the pre-trained weights of the convolutional layers are updated with a lag, and the length of the input video sequences is gradually increased for fast convergence. Furthermore, only the hidden states of the recurrent network are updated online to guarantee real-time tracking speed. The experimental results show that the proposed method tracks objects at 150 fps with a higher predicted overlap rate and is more robust on multiple benchmarks than state-of-the-art trackers.
In this work, we address a joint energy efficiency (EE) and throughput optimization problem in interweave cognitive radio networks (CRNs) subject to scheduling, power, and stability constraints, which is solved through traffic admission control, channel allocation, and power allocation. Specifically, the joint objective is to concurrently optimize the system EE and the throughput of the secondary user (SU) while satisfying the minimum throughput requirement of the primary user (PU), the throughput constraint of the SU, and the scheduling and power control constraints. To achieve these goals, our algorithm independently and simultaneously makes admission and transmission control decisions to maximize a joint utility of EE and throughput under time-varying channel and traffic conditions without a priori knowledge. In particular, the proposed scheduling algorithm runs in polynomial time, and the power control and admission control algorithms involved are simply threshold-based and thus very computationally efficient. Finally, numerical analyses show that our proposal achieves both system stability and optimal utility.
Huyen T. T. TRAN Trang H. HOANG Phu N. MINH Nam PHAM NGOC Truong CONG THANG
Thanks to their ability to bring immersive experiences to users, Virtual Reality (VR) technologies have been gaining popularity in recent years. A key component of VR systems is omnidirectional content, which can provide 360-degree views of scenes. However, at a given time, only a portion of the full omnidirectional content, called the viewport, is displayed, corresponding to the user's current viewing direction. In this work, we first develop Weighted-Viewport PSNR (W-VPSNR), an objective metric for quality assessment of omnidirectional content. The proposed metric takes into account the foveation feature of the human visual system. Then, we build a subjective database consisting of 72 stimuli with spatially varying viewport quality. Using this database, an evaluation of the proposed metric and four conventional metrics is conducted. Experimental results show that the W-VPSNR metric correlates well with the mean opinion scores (MOS) and outperforms the conventional metrics. It is also found that the conventional metrics do not perform well for omnidirectional content.
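A minimal sketch of a weighted-viewport PSNR computation is shown below, assuming a Gaussian, foveation-inspired weight map centred on the gaze point; the paper defines its own weight function, so the weighting here is only illustrative.

```python
import numpy as np

def w_vpsnr(ref, dist, gaze_xy, sigma=0.2, max_val=255.0):
    """Weighted PSNR over the rendered viewport, with a foveation-inspired
    Gaussian weight map centred on the gaze point (illustrative assumption)."""
    h, w = ref.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    gx, gy = gaze_xy
    d2 = ((xs - gx) / w) ** 2 + ((ys - gy) / h) ** 2   # normalized distance to gaze
    weight = np.exp(-d2 / (2 * sigma ** 2))
    if ref.ndim == 3:
        weight = weight[..., None]                      # broadcast over colour channels
    err2 = (ref.astype(np.float64) - dist.astype(np.float64)) ** 2
    wmse = np.average(err2, weights=np.broadcast_to(weight, err2.shape))
    return 10 * np.log10(max_val ** 2 / wmse)
```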
Baojun ZHAO Boya ZHAO Linbo TANG Baoxian WANG
Since the introduction of convolutional neural networks into the object detection field, many computer vision tasks have achieved favorable success. To handle targets of various scales, deep feature pyramids are widely used, much as traditional object detection methods detect objects of different sizes in a Gaussian image pyramid. However, due to the mismatch between the anchors and the feature distributions of the targets, accurate detection of targets at various scales remains a challenge. Considering the difference between the theoretical receptive field and the effective receptive field, we propose a novel anchor generation method that takes the effective receptive field as the standard. The proposed method is evaluated on the PASCAL VOC dataset and shows favorable results.
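The sketch below illustrates anchor generation keyed to the effective receptive field rather than the theoretical one; the ERF ratio and aspect ratios are illustrative assumptions (in practice the ERF would be measured from the trained network).

```python
import numpy as np

def erf_anchors(feat_h, feat_w, stride, theoretical_rf, erf_ratio=0.35,
                aspect_ratios=(0.5, 1.0, 2.0)):
    """Generate anchors sized to the effective receptive field (ERF) instead of
    the theoretical one; erf_ratio is an illustrative assumption."""
    base = theoretical_rf * erf_ratio          # ERF-sized base anchor
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for ar in aspect_ratios:
                w, h = base * np.sqrt(ar), base / np.sqrt(ar)
                anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)
```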
Songlin DU Yuhao XU Tingting HU Takeshi IKENAGA
High-frame-rate and ultra-low-delay matching systems play an important role in various human-machine interactive applications, which demand better performance in matching deformable and out-of-plane rotating objects. Although many algorithms have been proposed for deformation tracking and matching, few of them are suitable for hardware implementation because of their complicated operations and large time consumption. This paper proposes a hardware-oriented template update and recovery method for a high-frame-rate and ultra-low-delay deformation matching system. In the proposed method, the new template is generated in real time by partially updating the template descriptor and adding new keypoints pixel by pixel, simultaneously with the matching process (proposal #1), which avoids a large inter-frame delay. The size and shape of the region of interest (ROI) are made flexible, and the Hamming threshold used for brute-force matching is adjusted according to the pixel position and the flexible ROI (proposal #2), which solves the problem of template drift. The template is recovered from the previous one with a relative center-shifting vector when it is judged to be lost via a region-wise difference check (proposal #3). Evaluation results indicate that the proposed method achieves real-time processing at 784 fps at a resolution of 640×480 on a field-programmable gate array (FPGA) with a delay of 0.808 ms/frame, and achieves satisfactory deformation matching results in comparison with other general methods.
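As a software-level illustration of proposal #2, the sketch below performs brute-force Hamming matching with a threshold that relaxes with distance from the ROI centre; the descriptor size and threshold values are assumptions, and the actual design is a hardware pipeline.

```python
import numpy as np

def match_keypoints(tpl_desc, frame_desc, frame_pos, roi_center,
                    base_thresh=60, relax_per_px=0.05):
    """Brute-force Hamming matching with a position-adaptive threshold:
    keypoints farther from the ROI centre use a relaxed threshold.
    Descriptors are binary arrays of shape (n, 256); thresholds are illustrative."""
    matches = []
    for i, d in enumerate(tpl_desc):
        hamming = np.count_nonzero(frame_desc != d, axis=1)   # distance to every frame keypoint
        j = int(np.argmin(hamming))
        dist_to_center = np.linalg.norm(frame_pos[j] - roi_center)
        thresh = base_thresh + relax_per_px * dist_to_center  # flexible, position-dependent
        if hamming[j] <= thresh:
            matches.append((i, j))
    return matches
```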
Houari SABIRIN Hitoshi NISHIMURA Sei NAITO
A multi-camera setup for a surveillance system enables a larger coverage area, especially when a single camera has limited monitoring capability due to obstacles. For large-scale coverage, multiple cameras are therefore the best option. In this paper, we present a method for detecting multiple objects using several cameras with large overlapping views, as this allows object identities to be synchronized across views. The proposed method uses a graph structure that is robust enough to represent any detected moving objects, defining vertices and edges to capture their relationships. By evaluating these object features, represented as a set of attributes in a graph, we can perform lightweight multiple-object detection using several cameras, as well as object tracking within each camera's field of view and between two cameras. By evaluating each vertex hierarchically as a subgraph, we can further observe the features of the detected object and perform automatic separation of occluding objects. Experimental results show that the proposed method improves the accuracy of object tracking by reducing the occurrences of incorrect identification compared with individual camera-based tracking.
For the multi-objective time series search problem, Hasegawa and Itoh [Theoretical Computer Science, Vol.78, pp.58-66, 2018] presented the best possible online algorithm, the balanced price policy, for any monotone function f: R^k → R. Specifically, the competitive ratio with respect to the monotone function f(c1,...,ck)=(c1+…+ck)/k is referred to as the arithmetic mean component competitive ratio. Hasegawa and Itoh derived an explicit representation of the arithmetic mean component competitive ratio for k=2, but none has been known for any integer k≥3. In this paper, we derive explicit representations of the arithmetic mean component competitive ratio for k=3 and k=4. On the other hand, we show that it is computationally difficult to derive such an explicit representation for arbitrary integers k in a way similar to the cases k=2, 3, and 4.
Fei GUO Yuan YANG Yang XIAO Yong GAO Ningmei YU
Currently, the visual percepts generated by visual prostheses have low resolution, unruly color, and restricted grayscale. This severely restricts the ability of prosthetic implants to support visual tasks in daily scenes. Some studies have explored existing image processing techniques to improve the percepts of objects in prosthetic vision. However, most of them extract moving objects and optimize the visual percepts only in ordinary dynamic scenes, so the application of visual prostheses in highly dynamic daily-life scenes remains greatly limited. Hence, in this study, a novel unsupervised moving object segmentation model is proposed to automatically extract moving objects in highly dynamic scenes. In this model, foreground cues based on spatiotemporal edge features and background cues based on the boundary prior are exploited, and a moving-object proximity map is generated in the dynamic scene according to a manifold ranking function. Moreover, the foreground and background cues are ranked simultaneously, and the moving objects are extracted by integrating the two ranking maps. The evaluation experiment indicates that, compared with other methods, the proposed method uniformly highlights the moving object and keeps good boundaries in highly dynamic scenes. Based on this model, two optimization strategies are proposed to improve the perception of moving objects under simulated prosthetic vision. Experimental results demonstrate that the optimization strategies based on the moving object segmentation model efficiently segment and enhance moving objects in highly dynamic scenes and significantly improve the recognition performance of moving objects for the blind.
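The core manifold ranking step can be sketched as follows, assuming per-region (e.g. superpixel) features and a binary query vector marking seed regions; the Gaussian affinity and parameter values are illustrative assumptions.

```python
import numpy as np

def manifold_rank(features, query, alpha=0.99, sigma=0.1):
    """Given per-region features and a binary query vector marking seed
    (foreground or background) regions, return relevance scores for all regions
    via the closed-form manifold ranking solution (I - alpha*S)^-1 y."""
    n = features.shape[0]
    # pairwise Gaussian affinity between region features
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(1.0 / np.sqrt(W.sum(axis=1) + 1e-12))
    S = D @ W @ D                                   # symmetric normalization
    return np.linalg.solve(np.eye(n) - alpha * S, query)
```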
Chengcheng JIANG Xinyu ZHU Chao LI Gengsheng CHEN
CNNs pre-trained on ImageNet have been widely used in object tracking for feature extraction. However, due to the domain mismatch between image classification and object tracking, target-specific features are submerged by noise, which largely decreases the expressive ability of the convolutional features and results in inefficient tracking. In this paper, we propose a robust tracking algorithm with low-dimensional target-specific feature extraction. First, a novel cascaded PCA module is proposed to explicitly extract low-dimensional target-specific features, which makes the new appearance model more effective and efficient. Next, a fast particle filter process is introduced to further accelerate the whole tracking pipeline by sharing convolutional computation through a ROI-Align layer. Moreover, a classification-score-guided scheme is used to update the appearance model to adapt to target variations while avoiding the model drift caused by object occlusion. Experimental results on OTB100 and Temple Color128 show that the proposed algorithm achieves superior performance among real-time trackers. Moreover, our algorithm is competitive with state-of-the-art trackers in precision while running at real-time speed.
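A minimal sketch of the cascaded PCA idea, reducing flattened convolutional features in two successive stages, is given below; the intermediate and final dimensions are assumptions.

```python
import numpy as np

def pca_project(X, n_components):
    """Return the projection matrix of the top principal components of X (rows = samples)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_components].T                      # (dim, n_components)

def cascaded_pca(conv_feats, dims=(256, 64)):
    """Cascaded PCA sketch: convolutional features of target samples, shape
    (n, c*h*w), are reduced in two successive stages to a low-dimensional
    target-specific representation; the stage dimensions are assumptions."""
    P1 = pca_project(conv_feats, dims[0])
    stage1 = conv_feats @ P1
    P2 = pca_project(stage1, dims[1])
    return (stage1 @ P2), (P1, P2)                  # features and projections for reuse
```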
Masayuki SHIMODA Shimpei SATO Hiroki NAKAHARA
We propose an object detector using a sliding-window method for an event-driven camera, which outputs a subtracted frame (usually with binary values) when changes are detected in the captured images. Since the sliding window skips the unchanged portions of the output, the number of candidate target-object areas decreases dramatically, which means that our system operates faster and with lower power consumption than a system using a straightforward sliding-window approach. Since the event-driven camera outputs binary-precision frames, an all-binarized convolutional neural network (ABCNN) can be used, which allows all convolutional layers to share the same binarized convolutional circuit and thereby reduces the area requirement. We implemented the proposed method on the Xilinx Inc. Zedboard and evaluated it using the PETS 2009 dataset. The results showed that our system outperformed the BCNN system in terms of detection performance, hardware requirements, and computation time. We also showed that an FPGA is a more suitable platform for our system than a mobile GPU. These results indicate that the proposed system is well suited for embedded systems based on stationary cameras (such as security cameras).
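The event-driven sliding window can be sketched as follows: only windows containing enough changed pixels in the binary subtracted frame are passed to the (binarized) CNN; the window size, stride, and change threshold are illustrative assumptions.

```python
import numpy as np

def candidate_windows(event_frame, win=32, stride=8, min_changed=20):
    """Enumerate windows over a binary subtracted frame and keep only those
    containing enough changed pixels, so the CNN is evaluated on a small
    subset of positions. Parameters are illustrative assumptions."""
    h, w = event_frame.shape
    kept = []
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            if event_frame[y:y + win, x:x + win].sum() >= min_changed:
                kept.append((x, y, win, win))       # candidate region for classification
    return kept
```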
Suofei ZHANG Bin KANG Lin ZHOU
Instance-feature-based deep learning methods boost the performance of high-speed object tracking systems by directly comparing the target with its template during training and tracking. However, from the perspective of the human visual system, prior knowledge of the target also plays a key role during tracking. To integrate both semantic knowledge and instance features, we propose a convolutional-network-based object tracking framework that simultaneously outputs bounding boxes based on different prior knowledge as well as the confidences of the corresponding assumptions. Experimental results show that our proposed approach achieves both higher accuracy and higher efficiency than other leading methods on tracking tasks covering most daily objects.
Jaihyun PARK Bonhwa KU Youngsaeng JIN Hanseok KO
Low-frequency side scan sonar can quickly search a wide area, but the images acquired are of low quality. Image super-resolution (SR) methods can mitigate this problem. SR typically uses sparse coding, but accurately estimating the sparse coefficients incurs substantial computational cost. To reduce processing time, we propose a region-selective sparse-coding-based SR system that emphasizes object regions. In particular, the regions containing objects of interest are detected in side-scan-sonar-based underwater images so that the subsequent sparse-coding-based SR process can be applied selectively. The effectiveness of the proposed method is verified by the reduced processing time required for image reconstruction while preserving the same level of visual quality as conventional methods.
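The region-selective idea can be sketched as follows, with a placeholder sparse_sr routine standing in for the sparse-coding reconstruction and nearest-neighbour upscaling standing in for the cheap path outside the detected object region; all parameters are illustrative assumptions.

```python
import numpy as np

def region_selective_sr(lr_img, object_mask, sparse_sr, scale=2, patch=8):
    """Patches inside the detected object region go through the (expensive)
    sparse-coding SR path; the rest are upscaled cheaply. `sparse_sr(block, scale)`
    is a hypothetical placeholder for a sparse-coding reconstruction routine."""
    h, w = lr_img.shape
    out = np.zeros((h * scale, w * scale), dtype=np.float64)
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            block = lr_img[y:y + patch, x:x + patch]
            if object_mask[y:y + patch, x:x + patch].any():
                up = sparse_sr(block, scale)                          # costly, high quality
            else:
                up = np.repeat(np.repeat(block, scale, 0), scale, 1)  # cheap path
            out[y * scale:(y + patch) * scale, x * scale:(x + patch) * scale] = up
    return out
```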
Iku OHAMA Takuya KIDA Hiroki ARIMURA
Latent variable models for relational data enable us to extract the co-cluster structure underlying observed relational data. The Infinite Relational Model (IRM) is a well-known relational model for discovering co-cluster structures with an unknown number of clusters. The IRM and several related models commonly assume that the link probability between two objects depends only on their cluster assignment. However, relational models based on this assumption often lead us to extract many non-informative and unexpected clusters. This is because the cluster structures underlying real-world relationships are often blurred by biases of individual objects. To overcome this problem, we propose a multi-layered framework, which extracts a clear de-blurred co-cluster structure in the presence of object biases. Then, we propose the Multi-Layered Infinite Relational Model (MLIRM) which is a special instance of the proposed framework incorporating the IRM as a co-clustering model. Furthermore, we reveal that some relational models can be regarded as special cases of the MLIRM. We derive an efficient collapsed Gibbs sampler to perform posterior inference for the MLIRM. Experiments conducted using real-world datasets have confirmed that the proposed model successfully extracts clear and interpretable cluster structures from real-world relational data.
Yusuke INOUE Takatsugu ONO Koji INOUE
On-line object tracking (OLOT) has been a core technology in computer vision, and its importance has been increasing rapidly. Because this technology is used in battery-operated products, energy consumption must be minimized. This paper describes an adaptive frame-rate optimization method to satisfy that requirement. An energy trade-off exists between image capturing and object tracking. The method therefore optimizes the frame rate based on the constantly changing object speed to minimize the total energy while taking this trade-off into account. Simulation results show a maximum energy reduction of 50.0% and an average reduction of 35.9% without serious degradation of tracking accuracy.
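A simplified sketch of the frame-rate selection is shown below: capture energy grows with the frame rate, while per-frame tracking energy grows with the per-frame displacement (search area), so slower objects allow lower rates; all energy constants are illustrative assumptions, not figures from the paper.

```python
def choose_frame_rate(object_speed_px_s, rates=(5, 10, 15, 30, 60),
                      e_capture=2.0, e_track_base=1.0, ref_disp=4.0):
    """Pick the frame rate minimizing total energy per second under the
    capture/tracking trade-off; all energy constants are illustrative."""
    best_rate, best_energy = rates[0], float("inf")
    for r in rates:
        disp = object_speed_px_s / r                       # expected motion per frame
        e_track = e_track_base * (disp / ref_disp) ** 2    # search area grows with disp^2
        energy_per_s = r * (e_capture + e_track)           # capture grows with rate
        if energy_per_s < best_energy:
            best_rate, best_energy = r, energy_per_s
    return best_rate
```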
Peng GAO Yipeng MA Chao LI Ke SONG Yan ZHANG Fei WANG Liyi XIAO
Most state-of-the-art discriminative tracking approaches are based on either template appearance models or statistical appearance models. Although template appearance models have shown excellent performance, they perform poorly when the target appearance changes rapidly. In contrast, statistical appearance models are insensitive to fast target state changes, but they yield inferior tracking results in challenging scenarios such as illumination variations and background clutter. In this paper, we propose an adaptive object tracking approach with complementary models based on template and statistical appearance models. The two models are unified via our novel combination strategy. In addition, we introduce an efficient update scheme to improve the performance of our approach. Experimental results demonstrate that our approach achieves superior performance at speeds that far exceed the frame-rate requirement on recent tracking benchmarks.
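The combination strategy can be sketched as merging the two models' response maps over the search region, as below; the normalization and fixed merging factor are assumptions, since the paper's strategy is more elaborate.

```python
import numpy as np

def fuse_responses(template_resp, statistic_resp, gamma=0.6):
    """Merge a template-model response map and a statistical-model response map
    over the search region, then place the target at the peak of the merged map.
    The fixed merging factor gamma is an illustrative assumption."""
    def norm(r):
        r = r - r.min()
        return r / (r.max() + 1e-12)                # normalize so the models are comparable
    merged = gamma * norm(template_resp) + (1 - gamma) * norm(statistic_resp)
    dy, dx = np.unravel_index(np.argmax(merged), merged.shape)
    return (dx, dy), merged
```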