Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing

Haoyu He, Xingjian Shi, Jonas Mueller, Sheng Zha, Mu Li, George Karypis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Knowledge Distillation (KD) offers a natural way to reduce the latency and memory/energy usage of massive pretrained models that have come to dominate Natural Language Processing (NLP) in recent years. While numerous sophisticated variants of KD algorithms have been proposed for NLP applications, the key factors underpinning the optimal distillation performance are often confounded and remain unclear. We aim to identify how different components in the KD pipeline affect the resulting performance and how much the optimal KD pipeline varies across different datasets/tasks, such as the data augmentation policy, the loss function, and the intermediate representation for transferring the knowledge between teacher and student. To tease apart their effects, we propose Distiller, a meta KD framework that systematically combines a broad range of techniques across different stages of the KD pipeline, which enables us to quantify each component’s contribution. Within Distiller, we unify commonly used objectives for distillation of intermediate representations under a universal mutual information (MI) objective and propose a class of MI-α objective functions with better bias/variance trade-off for estimating the MI between the teacher and the student. On a diverse set of NLP datasets, the best Distiller configurations are identified via large-scale hyper-parameter optimization. Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation, MI-α performs the best, and 3) data augmentation provides a large boost for small training datasets or small student networks. Moreover, we find that different datasets/tasks prefer different KD algorithms, and thus propose a simple AutoDistiller algorithm that can recommend a good KD pipeline for a new dataset.
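For illustration, the sketch below shows the general shape of a KD objective that combines a supervised task loss, soft-label distillation from the teacher's logits, and an intermediate-representation matching term of the kind the abstract refers to. It is a minimal PyTorch sketch, not the paper's implementation: the intermediate term uses plain MSE as a stand-in for the MI-α objective (which is defined in the paper itself), and the function name and the weights `temperature`, `alpha`, and `beta` are illustrative assumptions.

```python
# Minimal sketch of a combined knowledge-distillation objective.
# NOTE: illustration only -- the paper's MI-alpha objective for
# intermediate distillation is NOT reproduced here; an MSE term
# stands in as a placeholder, and all hyper-parameter names are assumed.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5, beta=0.1):
    """Weighted sum of task loss, soft-label KD loss, and a simple
    intermediate-representation loss (placeholder for MI-alpha)."""
    # Standard supervised cross-entropy on the ground-truth labels.
    task_loss = F.cross_entropy(student_logits, labels)

    # Soft-label distillation: KL divergence between temperature-softened
    # teacher and student output distributions.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Intermediate-representation matching. A learned projection would be
    # needed when teacher/student hidden sizes differ; here they are
    # assumed to match.
    inter_loss = F.mse_loss(student_hidden, teacher_hidden)

    return task_loss + alpha * kd_loss + beta * inter_loss
```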

Original language: English (US)
Title of host publication: SustaiNLP 2021 - 2nd Workshop on Simple and Efficient Natural Language Processing, Proceedings of SustaiNLP
Editors: Nafise Sadat Moosavi, Iryna Gurevych, Angela Fan, Thomas Wolf, Yufang Hou, Ana Marasovic, Sujith Ravi
Publisher: Association for Computational Linguistics (ACL)
Pages: 119-133
Number of pages: 15
ISBN (Electronic): 9781955917018
State: Published - 2021
Externally published: Yes
Event: 2nd Workshop on Simple and Efficient Natural Language Processing, SustaiNLP 2021 - Virtual, Online
Duration: Nov 10 2021 → …

Publication series

Name: SustaiNLP 2021 - 2nd Workshop on Simple and Efficient Natural Language Processing, Proceedings of SustaiNLP

Conference

Conference: 2nd Workshop on Simple and Efficient Natural Language Processing, SustaiNLP 2021
City: Virtual, Online
Period: 11/10/21 → …

Bibliographical note

Publisher Copyright:
© 2021 Association for Computational Linguistics.
