Self-distillation for few-shot image captioning

Xianyu Chen; Ming Jiang; Qi Zhao

doi:10.1109/WACV48630.2021.00059

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

11 Scopus citations

Abstract

The development of large-scale image-captioning datasets is expensive, while the abundance of unpaired images and text corpus can potentially help reduce the efforts of manual annotation. In this paper, we study the few-shot image captioning problem that only requires a small amount of annotated image-caption pairs. We propose an ensemble-based self-distillation method that allows image captioning models to be trained with unpaired images and captions. The ensemble consists of multiple base models trained with different data samples in each iteration. For learning from unpaired images, we generate multiple pseudo captions with the ensemble and allocate different weights according to their confidence levels. For learning from unpaired captions, we propose a simple yet effective pseudo feature generation method based on Gradient Descent. The pseudo captions and pseudo features from the ensemble are used to train the base models in future iterations. The proposed method is general over different image captioning models and datasets. Our experiments demonstrate significant performance improvements and meaningful captions generated with only 1% of paired training data. Source code is available at https://github.com/chenxy99/SD-FSIC.
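
The two ideas named in the abstract, confidence-based weighting of pseudo captions and pseudo feature generation by gradient descent, can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). Everything below is an assumption made for illustration: the softmax weighting rule, the frozen linear-softmax stand-in for a captioning model, and the names `confidence_weights`, `cross_entropy`, `pseudo_feature`, and `model_w` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def confidence_weights(caption_log_probs):
    """Assumed weighting rule: softmax over per-caption log-probabilities.

    A more confident (higher log-probability) pseudo caption gets a larger
    weight; the weights sum to 1.
    """
    return softmax(caption_log_probs)


def cross_entropy(model_w, feature, target):
    """Loss of a frozen linear-softmax toy model for one feature vector.

    model_w: one row of weights per vocabulary word (toy stand-in for a captioner).
    target:  distribution over the vocabulary derived from an unpaired caption.
    """
    logits = [sum(w * x for w, x in zip(row, feature)) for row in model_w]
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


def pseudo_feature(model_w, target, steps=200, lr=0.1):
    """Gradient descent on the input feature while the model stays frozen."""
    dim = len(model_w[0])
    feature = [0.0] * dim
    for _ in range(steps):
        logits = [sum(w * x for w, x in zip(row, feature)) for row in model_w]
        probs = softmax(logits)
        # For softmax cross-entropy: d(loss)/d(feature) = W^T (probs - target)
        grad = [
            sum(model_w[v][d] * (probs[v] - target[v]) for v in range(len(model_w)))
            for d in range(dim)
        ]
        feature = [x - lr * g for x, g in zip(feature, grad)]
    return feature
```

In this sketch, `confidence_weights([-1.0, -2.0, -4.0])` gives the first caption the largest weight, and `pseudo_feature` returns a feature vector under which the frozen toy model assigns a lower loss to the target caption than it does for the all-zero starting feature.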

Original language	English (US)
Title of host publication	Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	545-555
Number of pages	11
ISBN (Electronic)	9780738142661
DOIs	https://doi.org/10.1109/WACV48630.2021.00059
State	Published - Jan 2021
Event	2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021 - Virtual, Online, United States
Duration	Jan 5 2021 → Jan 9 2021

Publication series

Name	Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021

Conference

Conference	2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021
Country/Territory	United States
City	Virtual, Online
Period	1/5/21 → 1/9/21

Bibliographical note

Publisher Copyright:
© 2021 IEEE.

Cite this

Chen, X., Jiang, M., & Zhao, Q. (2021). Self-distillation for few-shot image captioning. In Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021 (pp. 545-555). (Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/WACV48630.2021.00059

Self-distillation for few-shot image captioning. / Chen, Xianyu; Jiang, Ming; Zhao, Qi.
Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021. Institute of Electrical and Electronics Engineers Inc., 2021. p. 545-555 (Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021).

Chen, X, Jiang, M & Zhao, Q 2021, Self-distillation for few-shot image captioning. in Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021. Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Institute of Electrical and Electronics Engineers Inc., pp. 545-555, 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021, Virtual, Online, United States, 1/5/21. https://doi.org/10.1109/WACV48630.2021.00059

@inproceedings{8994f7f5469f448aae85823e12ce22d8,

title = "Self-distillation for few-shot image captioning",

abstract = "The development of large-scale image-captioning datasets is expensive, while the abundance of unpaired images and text corpus can potentially help reduce the efforts of manual annotation. In this paper, we study the few-shot image captioning problem that only requires a small amount of annotated image-caption pairs. We propose an ensemble-based self-distillation method that allows image captioning models to be trained with unpaired images and captions. The ensemble consists of multiple base models trained with different data samples in each iteration. For learning from unpaired images, we generate multiple pseudo captions with the ensemble and allocate different weights according to their confidence levels. For learning from unpaired captions, we propose a simple yet effective pseudo feature generation method based on Gradient Descent. The pseudo captions and pseudo features from the ensemble are used to train the base models in future iterations. The proposed method is general over different image captioning models and datasets. Our experiments demonstrate significant performance improvements and meaningful captions generated with only 1\% of paired training data. Source code is available at https://github.com/chenxy99/SD-FSIC.",

author = "Xianyu Chen and Ming Jiang and Qi Zhao",

note = "Publisher Copyright: {\textcopyright} 2021 IEEE.; 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021 ; Conference date: 05-01-2021 Through 09-01-2021",

year = "2021",

month = jan,

doi = "10.1109/WACV48630.2021.00059",

language = "English (US)",

series = "Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "545--555",

booktitle = "Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021",

}

TY - GEN

T1 - Self-distillation for few-shot image captioning

AU - Chen, Xianyu

AU - Jiang, Ming

AU - Zhao, Qi

PY - 2021/1

Y1 - 2021/1

N2 - The development of large-scale image-captioning datasets is expensive, while the abundance of unpaired images and text corpus can potentially help reduce the efforts of manual annotation. In this paper, we study the few-shot image captioning problem that only requires a small amount of annotated image-caption pairs. We propose an ensemble-based self-distillation method that allows image captioning models to be trained with unpaired images and captions. The ensemble consists of multiple base models trained with different data samples in each iteration. For learning from unpaired images, we generate multiple pseudo captions with the ensemble and allocate different weights according to their confidence levels. For learning from unpaired captions, we propose a simple yet effective pseudo feature generation method based on Gradient Descent. The pseudo captions and pseudo features from the ensemble are used to train the base models in future iterations. The proposed method is general over different image captioning models and datasets. Our experiments demonstrate significant performance improvements and meaningful captions generated with only 1% of paired training data. Source code is available at https://github.com/chenxy99/SD-FSIC.

AB - The development of large-scale image-captioning datasets is expensive, while the abundance of unpaired images and text corpus can potentially help reduce the efforts of manual annotation. In this paper, we study the few-shot image captioning problem that only requires a small amount of annotated image-caption pairs. We propose an ensemble-based self-distillation method that allows image captioning models to be trained with unpaired images and captions. The ensemble consists of multiple base models trained with different data samples in each iteration. For learning from unpaired images, we generate multiple pseudo captions with the ensemble and allocate different weights according to their confidence levels. For learning from unpaired captions, we propose a simple yet effective pseudo feature generation method based on Gradient Descent. The pseudo captions and pseudo features from the ensemble are used to train the base models in future iterations. The proposed method is general over different image captioning models and datasets. Our experiments demonstrate significant performance improvements and meaningful captions generated with only 1% of paired training data. Source code is available at https://github.com/chenxy99/SD-FSIC.

UR - http://www.scopus.com/inward/record.url?scp=85116124396&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85116124396&partnerID=8YFLogxK

U2 - 10.1109/WACV48630.2021.00059

DO - 10.1109/WACV48630.2021.00059

M3 - Conference contribution

AN - SCOPUS:85116124396

T3 - Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021

SP - 545

EP - 555

BT - Proceedings - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2021 IEEE Winter Conference on Applications of Computer Vision, WACV 2021

Y2 - 5 January 2021 through 9 January 2021

ER -
