Hitoshi NISHIMURA Satoshi KOMORITA Yasutomo KAWANISHI Hiroshi MURASE
Multiple human tracking is a fundamental problem in understanding the context of a visual scene. Although real-world applications require both accuracy and speed, recent tracking methods based on deep learning focus on accuracy and require a substantial amount of running time. Because human detection accounts for most of the running time, we aim to improve tracking speed by performing human detection only at certain frame intervals. The question is how to maintain accuracy while skipping human detection. In this paper, we propose a method that interpolates the detection results by using optical flow, based on the fact that a person's appearance does not change much between adjacent frames. To maintain the tracking accuracy, we introduce robust interest point detection within the human regions and a tracking termination metric defined by the distribution of the interest points. On the MOT17 and MOT20 datasets in the MOTChallenge, the proposed SDOF-Tracker achieved the best performance in terms of total running time while maintaining the MOTA metric. Our code is available at https://github.com/hitottiez/sdof-tracker.
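The idea of skipping detection and interpolating boxes from optical flow can be illustrated with a minimal sketch. This is not the SDOF-Tracker implementation. The function names, the median-displacement update, and the `min_points` termination rule are assumptions made for illustration, and the per-point flow vectors are assumed to come from some external optical-flow routine.

```python
from statistics import median


def is_detection_frame(frame_idx, interval):
    """Return True when the (expensive) detector should run on this frame."""
    return frame_idx % interval == 0


def propagate_box(box, points, flows, min_points=3):
    """Shift a box by the median flow of the interest points inside it.

    box:    (x1, y1, x2, y2)
    points: list of (x, y) interest points from the previous frame
    flows:  list of (dx, dy), one displacement per point
    Returns the shifted box, or None when fewer than min_points lie
    inside the box (a hypothetical stand-in for a termination metric).
    """
    x1, y1, x2, y2 = box
    inside = [f for p, f in zip(points, flows)
              if x1 <= p[0] <= x2 and y1 <= p[1] <= y2]
    if len(inside) < min_points:
        return None  # too little evidence: stop tracking this person
    dx = median(f[0] for f in inside)
    dy = median(f[1] for f in inside)
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)
```

On frames where `is_detection_frame` is false, each tracked box would be passed through `propagate_box` instead of being re-detected. The median is used here only because it tolerates a few outlier flow vectors.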
Jianfeng XU Satoshi KOMORITA Kei KAWAMURA
We propose a framework for the integration of heterogeneous networks in human pose estimation (HPE) with the aim of balancing accuracy and computational complexity. Although many existing methods can improve the accuracy of HPE by using multiple frames in videos, they also increase the computational complexity. The key difference is that the proposed heterogeneous framework uses different networks for different types of frames, whereas existing methods use the same networks for all frames. In particular, we propose to divide the video frames into two types, key frames and non-key frames, and to adopt three types of networks in our heterogeneous framework: slow networks, fast networks, and transfer networks. For key frames, we use a slow network, which has high accuracy but high computational complexity. For non-key frames that follow a key frame, we propose to warp the heatmap produced by the slow network on the key frame via a transfer network and fuse it with the output of a fast network, which has low accuracy but low computational complexity. Furthermore, when the framework is extended to long-term use, where a large number of non-key frames follow a key frame, the temporal correlation with the key frame decreases. Therefore, when necessary, we use an additional transfer network that warps the heatmap from a neighboring non-key frame. The experimental results on the PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed FSPose achieves a better balance between accuracy and computational complexity than the competing method. Our source code is available at https://github.com/Fenax79/fspose.
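The per-frame choice of networks described above can be sketched as a simple schedule. This is an illustrative sketch only, not the FSPose code. The function name `plan_frame`, the fixed `key_interval`, and the `long_term_gap` threshold that decides when to warp from a neighboring non-key frame are all assumptions.

```python
def plan_frame(frame_idx, key_interval=4, long_term_gap=2):
    """Decide which networks run on a frame and where the heatmap is warped from.

    Assumptions (not from the paper): key frames occur every key_interval
    frames, and a non-key frame more than long_term_gap frames after its
    key frame warps from the previous frame instead of from the key frame.
    """
    offset = frame_idx % key_interval
    if offset == 0:
        # Key frame: accurate but expensive slow network only.
        return {"frame": frame_idx, "networks": ["slow"], "warp_from": None}
    key_idx = frame_idx - offset
    # Non-key frame: cheap fast network plus a transfer network whose
    # warped heatmap is later fused with the fast network's output.
    source = key_idx if offset <= long_term_gap else frame_idx - 1
    return {"frame": frame_idx, "networks": ["fast", "transfer"],
            "warp_from": source}
```

With the default values, frames 1 and 2 warp from key frame 0, while frame 3 warps from its neighbor, frame 2, reflecting the weaker temporal correlation with the key frame.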