Distiller: A Systematic Study of Model Distillation Methods in Natural Language Processing

Haoyu He, Xingjian Shi, Jonas Mueller, Sheng Zha, Mu Li, George Karypis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Knowledge Distillation (KD) offers a natural way to reduce the latency and memory/energy usage of massive pretrained models that have come to dominate Natural Language Processing (NLP) in recent years. While numerous sophisticated variants of KD algorithms have been proposed for NLP applications, the key factors underpinning the optimal distillation performance are often confounded and remain unclear. We aim to identify how different components in the KD pipeline affect the resulting performance and how much the optimal KD pipeline varies across different datasets/tasks, such as the data augmentation policy, the loss function, and the intermediate representation for transferring the knowledge between teacher and student. To tease apart their effects, we propose Distiller, a meta KD framework that systematically combines a broad range of techniques across different stages of the KD pipeline, which enables us to quantify each component’s contribution. Within Distiller, we unify commonly used objectives for distillation of intermediate representations under a universal mutual information (MI) objective and propose a class of MI-α objective functions with better bias/variance trade-off for estimating the MI between the teacher and the student. On a diverse set of NLP datasets, the best Distiller configurations are identified via large-scale hyper-parameter optimization. Our experiments reveal the following: 1) the approach used to distill the intermediate representations is the most important factor in KD performance, 2) among different objectives for intermediate distillation, MI-α performs the best, and 3) data augmentation provides a large boost for small training datasets or small student networks. Moreover, we find that different datasets/tasks prefer different KD algorithms, and thus propose a simple AutoDistiller algorithm that can recommend a good KD pipeline for a new dataset.
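For illustration, the sketch below shows the general shape of a KD objective that combines a supervised task loss, soft-label distillation from the teacher's logits, and an intermediate-representation matching term of the kind the abstract refers to. It is a minimal PyTorch sketch, not the paper's implementation: the intermediate term uses plain MSE as a stand-in for the MI-α objective (which is defined in the paper itself), and the function name and the weights `temperature`, `alpha`, and `beta` are illustrative assumptions.

```python
# Minimal sketch of a combined knowledge-distillation objective.
# NOTE: illustration only -- the paper's MI-alpha objective for
# intermediate distillation is NOT reproduced here; an MSE term
# stands in as a placeholder, and all hyper-parameter names are assumed.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_hidden, teacher_hidden,
                      temperature=2.0, alpha=0.5, beta=0.1):
    """Weighted sum of task loss, soft-label KD loss, and a simple
    intermediate-representation loss (placeholder for MI-alpha)."""
    # Standard supervised cross-entropy on the ground-truth labels.
    task_loss = F.cross_entropy(student_logits, labels)

    # Soft-label distillation: KL divergence between temperature-softened
    # teacher and student output distributions.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # Intermediate-representation matching. A learned projection would be
    # needed when teacher/student hidden sizes differ; here they are
    # assumed to match.
    inter_loss = F.mse_loss(student_hidden, teacher_hidden)

    return task_loss + alpha * kd_loss + beta * inter_loss
```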

Original language: English (US)
Title of host publication: SustaiNLP 2021 - 2nd Workshop on Simple and Efficient Natural Language Processing, Proceedings of SustaiNLP
Editors: Nafise Sadat Moosavi, Iryna Gurevych, Angela Fan, Thomas Wolf, Yufang Hou, Ana Marasovic, Sujith Ravi
Publisher: Association for Computational Linguistics (ACL)
Pages: 119-133
Number of pages: 15
ISBN (Electronic): 9781955917018
State: Published - 2021
Externally published: Yes
Event: 2nd Workshop on Simple and Efficient Natural Language Processing, SustaiNLP 2021 - Virtual, Online
Duration: Nov 10 2021 → …

Publication series

Name: SustaiNLP 2021 - 2nd Workshop on Simple and Efficient Natural Language Processing, Proceedings of SustaiNLP

Conference

Conference: 2nd Workshop on Simple and Efficient Natural Language Processing, SustaiNLP 2021
City: Virtual, Online
Period: 11/10/21 → …

Bibliographical note

Publisher Copyright:
© 2021 Association for Computational Linguistics.
