Although Siamese trackers have attracted much attention in recent years due to their scalability and efficiency, most ignore background appearance, which limits their ability to recognize arbitrary target objects under various appearance variations, especially in complex scenarios with background clutter and distractors. In this paper, we present a simple yet effective Siamese tracker, in which shifted-windows multi-head self-attention is employed to learn the characteristics of a specific given target object for visual tracking. To validate the effectiveness of the proposed tracker, we use the Swin Transformer as the backbone network and introduce an auxiliary feature enhancement network. Extensive experimental results on two evaluation datasets demonstrate that the proposed tracker outperforms other baselines.
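The shifted-windows self-attention mentioned above is the core operation of the Swin Transformer: attention is computed within non-overlapping local windows, and alternate layers cyclically shift the feature map by half a window so that tokens on window borders can interact. The paper does not include code; the following is a minimal NumPy sketch of only the window partitioning and cyclic shift (the attention itself and the window size of 4 are illustrative assumptions, not the authors' implementation).

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping (win, win, C) windows."""
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "map must tile exactly into windows"
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, C)

def shifted_window_partition(x, win):
    """Cyclically shift the map by win//2 in both spatial axes before
    partitioning, so the next attention layer mixes tokens that sat on
    window borders in the previous layer."""
    shifted = np.roll(x, shift=(-(win // 2), -(win // 2)), axis=(0, 1))
    return window_partition(shifted, win)

# Toy 8x8 single-channel feature map, window size 4 -> 2x2 = 4 windows.
feat = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)
regular = window_partition(feat, 4)
shifted = shifted_window_partition(feat, 4)
print(regular.shape, shifted.shape)  # (4, 4, 4, 1) (4, 4, 4, 1)
```

In a full Swin block, multi-head self-attention would then run independently inside each of the returned windows, with an attention mask hiding the wrap-around tokens introduced by `np.roll`.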
Peng GAO
Qufu Normal University
Xin-Yue ZHANG
Qufu Normal University
Xiao-Li YANG
Qufu Normal University
Jian-Cheng NI
Qufu Normal University
Fei WANG
Harbin Institute of Technology
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Peng GAO, Xin-Yue ZHANG, Xiao-Li YANG, Jian-Cheng NI, Fei WANG, "Robust Visual Tracking Using Hierarchical Vision Transformer with Shifted Windows Multi-Head Self-Attention" in IEICE TRANSACTIONS on Information,
vol. E107-D, no. 1, pp. 161-164, January 2024, doi: 10.1587/transinf.2023EDL8053.
Abstract: Although Siamese trackers have attracted much attention in recent years due to their scalability and efficiency, most ignore background appearance, which limits their ability to recognize arbitrary target objects under various appearance variations, especially in complex scenarios with background clutter and distractors. In this paper, we present a simple yet effective Siamese tracker, in which shifted-windows multi-head self-attention is employed to learn the characteristics of a specific given target object for visual tracking. To validate the effectiveness of the proposed tracker, we use the Swin Transformer as the backbone network and introduce an auxiliary feature enhancement network. Extensive experimental results on two evaluation datasets demonstrate that the proposed tracker outperforms other baselines.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2023EDL8053/_p
@ARTICLE{e107-d_1_161,
author={Peng GAO and Xin-Yue ZHANG and Xiao-Li YANG and Jian-Cheng NI and Fei WANG},
journal={IEICE TRANSACTIONS on Information},
title={Robust Visual Tracking Using Hierarchical Vision Transformer with Shifted Windows Multi-Head Self-Attention},
year={2024},
volume={E107-D},
number={1},
pages={161-164},
abstract={Although Siamese trackers have attracted much attention in recent years due to their scalability and efficiency, most ignore background appearance, which limits their ability to recognize arbitrary target objects under various appearance variations, especially in complex scenarios with background clutter and distractors. In this paper, we present a simple yet effective Siamese tracker, in which shifted-windows multi-head self-attention is employed to learn the characteristics of a specific given target object for visual tracking. To validate the effectiveness of the proposed tracker, we use the Swin Transformer as the backbone network and introduce an auxiliary feature enhancement network. Extensive experimental results on two evaluation datasets demonstrate that the proposed tracker outperforms other baselines.},
keywords={},
doi={10.1587/transinf.2023EDL8053},
ISSN={1745-1361},
month={January},}
TY - JOUR
TI - Robust Visual Tracking Using Hierarchical Vision Transformer with Shifted Windows Multi-Head Self-Attention
T2 - IEICE TRANSACTIONS on Information
SP - 161
EP - 164
AU - Peng GAO
AU - Xin-Yue ZHANG
AU - Xiao-Li YANG
AU - Jian-Cheng NI
AU - Fei WANG
PY - 2024
DO - 10.1587/transinf.2023EDL8053
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E107-D
IS - 1
JA - IEICE TRANSACTIONS on Information
Y1 - January 2024
AB - Although Siamese trackers have attracted much attention in recent years due to their scalability and efficiency, most ignore background appearance, which limits their ability to recognize arbitrary target objects under various appearance variations, especially in complex scenarios with background clutter and distractors. In this paper, we present a simple yet effective Siamese tracker, in which shifted-windows multi-head self-attention is employed to learn the characteristics of a specific given target object for visual tracking. To validate the effectiveness of the proposed tracker, we use the Swin Transformer as the backbone network and introduce an auxiliary feature enhancement network. Extensive experimental results on two evaluation datasets demonstrate that the proposed tracker outperforms other baselines.
ER -