FastPicker: Adaptive independent two-stage video-to-video summarization for efficient action recognition

Saghir Alfasly, Jian Lu, Chen Xu, Zaid Al-Huda, Qingtang Jiang, Zhaosong Lu, Charles K. Chui

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Video datasets suffer from huge inter-frame redundancy, which prevents deep networks from learning effectively and increases computational costs. Several methods therefore adopt random/uniform frame sampling or key-frame selection techniques. Unfortunately, most learnable frame selection methods are customized for specific models and lack generality, independence, and scalability. In this paper, we propose a novel two-stage video-to-video summarization method, termed FastPicker, which efficiently selects the most discriminative and representative frames for better action recognition. In the first stage, discriminative frames are selected independently based on inter-frame motion computation, whereas in the second stage, representative frames are selected using a novel Transformer-based model. Learnable frame embeddings are proposed to estimate each frame's contribution to the final video-classification certainty; consequently, the frames with the largest contributions are the most representative. The proposed method is carefully evaluated by summarizing several action recognition datasets and using them to train various deep models with several backbones. The experimental results demonstrate a remarkable performance boost on the Kinetics400, Something-Something-v2, ActivityNet-1.3, UCF-101, and HMDB51 datasets; e.g., FastPicker downsizes Kinetics400 by 78.7% while improving human activity recognition.
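The abstract's first stage ranks frames by inter-frame motion. As a rough illustration only (the function name, the mean absolute pixel difference as the motion measure, and the top-k selection rule are assumptions for this sketch, not the paper's exact computation), such a selector might look like:

```python
import numpy as np

def select_discriminative_frames(frames, k):
    """Hypothetical sketch of stage 1: rank frames by inter-frame motion.

    frames: array of shape (T, H, W) holding T grayscale frames.
    Returns the indices of the k frames with the largest motion
    scores, in temporal order.
    """
    frames = np.asarray(frames, dtype=np.float64)
    # Mean absolute pixel difference between consecutive frames,
    # used here as a cheap stand-in for inter-frame motion.
    diffs = np.abs(np.diff(frames, axis=0))   # shape (T-1, H, W)
    motion = diffs.mean(axis=(1, 2))          # one score per transition
    # Attribute each transition's motion to the later frame; the first
    # frame gets score 0, so it is kept only when k is large.
    scores = np.concatenate(([0.0], motion))
    top = np.argsort(scores)[-k:]
    return np.sort(top)
```

In this toy version, frames in static stretches of the video score near zero and are discarded, while frames around large scene changes survive; the paper's second, Transformer-based stage would then pick representative frames from the survivors.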

Original language: English (US)
Pages (from-to): 231-244
Number of pages: 14
Journal: Neurocomputing
Volume: 516
DOIs
State: Published - Jan 7 2023
Externally published: Yes

Bibliographical note

Funding Information:
We thank Ahmed Elazab, Zeyad Qasem, and Murtadha Ahmed for the fruitful discussion. This work was supported in part by the National Natural Science Foundation of China under grants U21A20455, 61972265, 11871348 and 61872429, the Natural Science Foundation of Guangdong Province of China under grant 2020B1515310008, the Educational Commission of Guangdong Province of China under grant 2019KZDZX1007, the Simons Foundation under grant 353185, and ARO under grant W911NF2110218.

Publisher Copyright:
© 2022 Elsevier B.V.

Keywords

  • Action recognition
  • Deep learning
  • Discriminative frame selection
  • Representative frame selection
  • Video-to-video summarization

