Action spotting is a key component in high-level video understanding. The large number of similar frames poses a challenge for recognizing actions in videos. In this paper we use frame saliency to represent the importance of frames for guiding the model to focus on keyframes. We propose the frame saliency weighting module to improve frame saliency and video representation at the same time. Our proposed model contains two encoders, for pre-action and post-action time windows, to encode video context. We validate our design choices and the generality of the proposed method in extensive experiments. On the public SoccerNet-v2 dataset, the method achieves an average mAP of 57.3%, improving over the state of the art. Using embedding features obtained from multiple feature extractors, the average mAP further increases to 75%. We show that reducing the model size by over 90% does not significantly impact performance. Additionally, we use ablation studies to demonstrate the effectiveness of the saliency weighting module. Further, we show that our frame saliency weighting strategy is applicable to existing methods on more general action datasets, such as SoccerNet-v1, ActivityNet v1.3, and UCF101.
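The frame saliency weighting idea summarized above can be illustrated with a minimal sketch: score each frame's embedding with a learned projection, squash the score to a per-frame saliency in (0, 1), and re-weight the features so that later layers emphasize keyframes. This is an assumption-laden illustration, not the paper's actual module; the function name `saliency_weight` and the single linear projection are hypothetical choices for clarity.

```python
import numpy as np

def saliency_weight(features: np.ndarray, w: np.ndarray, b: float):
    """Re-weight per-frame features by a scalar saliency score.

    features: (T, D) array of frame embeddings for a clip of T frames
    w:        (D,) projection vector (stands in for learned parameters)
    b:        scalar bias
    Returns the re-weighted (T, D) features and the (T,) saliency scores.
    """
    logits = features @ w + b                 # (T,) raw per-frame scores
    saliency = 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> saliency in (0, 1)
    return features * saliency[:, None], saliency
```

With zero weights every frame receives the neutral saliency 0.5, so the weighted features are simply halved; training would push saliency toward 1 for keyframes and toward 0 for redundant frames.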
Yuzhi SHI
Chubu University
Takayoshi YAMASHITA
Chubu University
Tsubasa HIRAKAWA
Chubu University
Hironobu FUJIYOSHI
Chubu University
Mitsuru NAKAZAWA
Rakuten Group, Inc.
Yeongnam CHAE
Rakuten Group, Inc.
Björn STENGER
Rakuten Group, Inc.
Yuzhi SHI, Takayoshi YAMASHITA, Tsubasa HIRAKAWA, Hironobu FUJIYOSHI, Mitsuru NAKAZAWA, Yeongnam CHAE, Björn STENGER, "Efficient Action Spotting Using Saliency Feature Weighting" in IEICE TRANSACTIONS on Information,
vol. E107-D, no. 1, pp. 105-114, January 2024, doi: 10.1587/transinf.2022EDP7210.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2022EDP7210/_p
@ARTICLE{e107-d_1_105,
author={Yuzhi SHI and Takayoshi YAMASHITA and Tsubasa HIRAKAWA and Hironobu FUJIYOSHI and Mitsuru NAKAZAWA and Yeongnam CHAE and Björn STENGER},
journal={IEICE TRANSACTIONS on Information},
title={Efficient Action Spotting Using Saliency Feature Weighting},
year={2024},
volume={E107-D},
number={1},
pages={105-114},
abstract={Action spotting is a key component in high-level video understanding. The large number of similar frames poses a challenge for recognizing actions in videos. In this paper we use frame saliency to represent the importance of frames for guiding the model to focus on keyframes. We propose the frame saliency weighting module to improve frame saliency and video representation at the same time. Our proposed model contains two encoders, for pre-action and post-action time windows, to encode video context. We validate our design choices and the generality of the proposed method in extensive experiments. On the public SoccerNet-v2 dataset, the method achieves an average mAP of 57.3%, improving over the state of the art. Using embedding features obtained from multiple feature extractors, the average mAP further increases to 75%. We show that reducing the model size by over 90% does not significantly impact performance. Additionally, we use ablation studies to demonstrate the effectiveness of the saliency weighting module. Further, we show that our frame saliency weighting strategy is applicable to existing methods on more general action datasets, such as SoccerNet-v1, ActivityNet v1.3, and UCF101.},
keywords={},
doi={10.1587/transinf.2022EDP7210},
ISSN={1745-1361},
month={January},}
TY - JOUR
TI - Efficient Action Spotting Using Saliency Feature Weighting
T2 - IEICE TRANSACTIONS on Information
SP - 105
EP - 114
AU - Yuzhi SHI
AU - Takayoshi YAMASHITA
AU - Tsubasa HIRAKAWA
AU - Hironobu FUJIYOSHI
AU - Mitsuru NAKAZAWA
AU - Yeongnam CHAE
AU - Björn STENGER
PY - 2024
DO - 10.1587/transinf.2022EDP7210
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E107-D
IS - 1
JA - IEICE TRANSACTIONS on Information
Y1 - 2024/01
AB - Action spotting is a key component in high-level video understanding. The large number of similar frames poses a challenge for recognizing actions in videos. In this paper we use frame saliency to represent the importance of frames for guiding the model to focus on keyframes. We propose the frame saliency weighting module to improve frame saliency and video representation at the same time. Our proposed model contains two encoders, for pre-action and post-action time windows, to encode video context. We validate our design choices and the generality of the proposed method in extensive experiments. On the public SoccerNet-v2 dataset, the method achieves an average mAP of 57.3%, improving over the state of the art. Using embedding features obtained from multiple feature extractors, the average mAP further increases to 75%. We show that reducing the model size by over 90% does not significantly impact performance. Additionally, we use ablation studies to demonstrate the effectiveness of the saliency weighting module. Further, we show that our frame saliency weighting strategy is applicable to existing methods on more general action datasets, such as SoccerNet-v1, ActivityNet v1.3, and UCF101.
ER -