Spatial-Temporal Aggregated Shuffle Attention for Video Instance Segmentation of Traffic Scene

Chongren ZHAO; Yinhui ZHANG; Zifen HE; Yunnan DENG; Ying HUANG; Guangchen CHEN

doi:10.1587/transinf.2022EDP7147

IEICE TRANSACTIONS on Information

Summary:

To address the dispersed and misaligned spatial focus regions in feature pyramid networks, and the insufficient capture of feature dependencies in both the spatial and channel dimensions, this paper proposes spatial-temporal aggregated shuffle attention for video instance segmentation (STASA-VIS). First, a mixed subsampling (MS) module is designed to embed activated features from the low-level target area of the feature pyramid into the high-level layers, thereby aggregating spatial information on the target area. Second, taking advantage of the coherent information across video frames, STASA-VIS uses the first of every 5 video frames as the keyframe, propagates the keyframe feature maps of the pyramid layers forward in the time domain, and fuses them with the mixed-subsampled features of the non-keyframes to achieve temporally consistent feature aggregation. Finally, STASA-VIS embeds shuffle attention in the backbone to capture pixel-level pairwise relationships and channel-dimension dependencies while reducing computation. Experimental results show that STASA-VIS reaches a segmentation accuracy of 41.2% at a test speed of 34 FPS, which exceeds state-of-the-art one-stage video instance segmentation (VIS) methods in accuracy while achieving real-time segmentation.
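
The keyframe rule stated in the summary (the first of every 5 frames is a keyframe, and its features are propagated forward to the following non-keyframes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the 0-based frame indexing, and the assumption that each non-keyframe fuses with the most recent keyframe are hypothetical choices made for illustration only.

```python
from typing import List, Tuple

# From the summary: the first of every 5 video frames is used as a keyframe.
KEYFRAME_INTERVAL = 5


def is_keyframe(frame_index: int, interval: int = KEYFRAME_INTERVAL) -> bool:
    """Return True when a 0-based frame index starts a new group of `interval` frames."""
    return frame_index % interval == 0


def reference_keyframe(frame_index: int, interval: int = KEYFRAME_INTERVAL) -> int:
    """Index of the most recent keyframe at or before `frame_index`.

    Assumption: a non-keyframe fuses with features propagated from this keyframe.
    """
    return (frame_index // interval) * interval


def schedule(num_frames: int, interval: int = KEYFRAME_INTERVAL) -> List[Tuple[int, int, bool]]:
    """List (frame, reference keyframe, is_keyframe) for every frame of a clip."""
    return [
        (t, reference_keyframe(t, interval), is_keyframe(t, interval))
        for t in range(num_frames)
    ]


if __name__ == "__main__":
    for frame, key, flag in schedule(12):
        role = "keyframe" if flag else f"fuses features propagated from frame {key}"
        print(frame, role)
```

Under these assumptions, frames 0, 5, and 10 of a 12-frame clip act as keyframes, and each remaining frame reuses the pyramid features of the keyframe that opens its group.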

Publication: IEICE TRANSACTIONS on Information Vol.E106-D No.2 pp.240-251

Publication Date: 2023/02/01

Publicized: 2022/11/24

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2022EDP7147

Type of Manuscript: PAPER

Category: Image Processing and Video Processing

Authors

Chongren ZHAO
  Kunming University of Science and Technology
Yinhui ZHANG
  Kunming University of Science and Technology
Zifen HE
  Kunming University of Science and Technology
Yunnan DENG
  Kunming University of Science and Technology
Ying HUANG
  Kunming University of Science and Technology
Guangchen CHEN
  Kunming University of Science and Technology

Keywords

traffic scene, video instance segmentation, mixed subsampling, spatial-temporal aggregation, shuffle attention

Cite this

Chongren ZHAO, Yinhui ZHANG, Zifen HE, Yunnan DENG, Ying HUANG, Guangchen CHEN, "Spatial-Temporal Aggregated Shuffle Attention for Video Instance Segmentation of Traffic Scene" in IEICE TRANSACTIONS on Information, vol. E106-D, no. 2, pp. 240-251, February 2023, doi: 10.1587/transinf.2022EDP7147.
Abstract: To address the dispersed and misaligned spatial focus regions in feature pyramid networks, and the insufficient capture of feature dependencies in both the spatial and channel dimensions, this paper proposes spatial-temporal aggregated shuffle attention for video instance segmentation (STASA-VIS). First, a mixed subsampling (MS) module is designed to embed activated features from the low-level target area of the feature pyramid into the high-level layers, thereby aggregating spatial information on the target area. Second, taking advantage of the coherent information across video frames, STASA-VIS uses the first of every 5 video frames as the keyframe, propagates the keyframe feature maps of the pyramid layers forward in the time domain, and fuses them with the mixed-subsampled features of the non-keyframes to achieve temporally consistent feature aggregation. Finally, STASA-VIS embeds shuffle attention in the backbone to capture pixel-level pairwise relationships and channel-dimension dependencies while reducing computation. Experimental results show that STASA-VIS reaches a segmentation accuracy of 41.2% at a test speed of 34 FPS, which exceeds state-of-the-art one-stage video instance segmentation (VIS) methods in accuracy while achieving real-time segmentation.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2022EDP7147/_p

@ARTICLE{e106-d_2_240,
author={Chongren ZHAO and Yinhui ZHANG and Zifen HE and Yunnan DENG and Ying HUANG and Guangchen CHEN},
journal={IEICE TRANSACTIONS on Information},
title={Spatial-Temporal Aggregated Shuffle Attention for Video Instance Segmentation of Traffic Scene},
year={2023},
volume={E106-D},
number={2},
pages={240--251},
abstract={To address the dispersed and misaligned spatial focus regions in feature pyramid networks, and the insufficient capture of feature dependencies in both the spatial and channel dimensions, this paper proposes spatial-temporal aggregated shuffle attention for video instance segmentation (STASA-VIS). First, a mixed subsampling (MS) module is designed to embed activated features from the low-level target area of the feature pyramid into the high-level layers, thereby aggregating spatial information on the target area. Second, taking advantage of the coherent information across video frames, STASA-VIS uses the first of every 5 video frames as the keyframe, propagates the keyframe feature maps of the pyramid layers forward in the time domain, and fuses them with the mixed-subsampled features of the non-keyframes to achieve temporally consistent feature aggregation. Finally, STASA-VIS embeds shuffle attention in the backbone to capture pixel-level pairwise relationships and channel-dimension dependencies while reducing computation. Experimental results show that STASA-VIS reaches a segmentation accuracy of 41.2\% at a test speed of 34 FPS, which exceeds state-of-the-art one-stage video instance segmentation (VIS) methods in accuracy while achieving real-time segmentation.},
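keywords={traffic scene, video instance segmentation, mixed subsampling, spatial-temporal aggregation, shuffle attention},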
doi={10.1587/transinf.2022EDP7147},
ISSN={1745-1361},
month={February}
}

TY - JOUR
TI - Spatial-Temporal Aggregated Shuffle Attention for Video Instance Segmentation of Traffic Scene
T2 - IEICE TRANSACTIONS on Information
SP - 240
EP - 251
AU - Chongren ZHAO
AU - Yinhui ZHANG
AU - Zifen HE
AU - Yunnan DENG
AU - Ying HUANG
AU - Guangchen CHEN
PY - 2023
DO - 10.1587/transinf.2022EDP7147
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E106-D
IS - 2
JA - IEICE TRANSACTIONS on Information
Y1 - February 2023
AB - To address the dispersed and misaligned spatial focus regions in feature pyramid networks, and the insufficient capture of feature dependencies in both the spatial and channel dimensions, this paper proposes spatial-temporal aggregated shuffle attention for video instance segmentation (STASA-VIS). First, a mixed subsampling (MS) module is designed to embed activated features from the low-level target area of the feature pyramid into the high-level layers, thereby aggregating spatial information on the target area. Second, taking advantage of the coherent information across video frames, STASA-VIS uses the first of every 5 video frames as the keyframe, propagates the keyframe feature maps of the pyramid layers forward in the time domain, and fuses them with the mixed-subsampled features of the non-keyframes to achieve temporally consistent feature aggregation. Finally, STASA-VIS embeds shuffle attention in the backbone to capture pixel-level pairwise relationships and channel-dimension dependencies while reducing computation. Experimental results show that STASA-VIS reaches a segmentation accuracy of 41.2% at a test speed of 34 FPS, which exceeds state-of-the-art one-stage video instance segmentation (VIS) methods in accuracy while achieving real-time segmentation.
ER -