Crossing the “Cookie Theft” Corpus Chasm: Applying What BERT Learns From Outside Data to the ADReSS Challenge Dementia Detection Task

Yue Guo; Changye Li; Carol Roan; Serguei Pakhomov; Trevor Cohen

doi:10.3389/fcomp.2021.642517

Crossing the “Cookie Theft” Corpus Chasm: Applying What BERT Learns From Outside Data to the ADReSS Challenge Dementia Detection Task

Yue Guo, Changye Li, Carol Roan, Serguei Pakhomov, Trevor Cohen

Pharmaceutical Care and Health Systems

Research output: Contribution to journal › Article › peer-review

19 Scopus citations

Abstract

Large amounts of labeled data are a prerequisite to training accurate and reliable machine learning models. However, in the medical domain in particular, this is also a stumbling block as accurately labeled data are hard to obtain. DementiaBank, a publicly available corpus of spontaneous speech samples from a picture description task widely used to study Alzheimer's disease (AD) patients' language characteristics and for training classification models to distinguish patients with AD from healthy controls, is relatively small—a limitation that is further exacerbated when restricting to the balanced subset used in the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) challenge. We build on previous work showing that the performance of traditional machine learning models on DementiaBank can be improved by the addition of normative data from other sources, evaluating the utility of such extrinsic data to further improve the performance of state-of-the-art deep learning based methods on the ADReSS challenge dementia detection task. To this end, we developed a new corpus of professionally transcribed recordings from the Wisconsin Longitudinal Study (WLS), resulting in 1366 additional Cookie Theft Task transcripts, increasing the available training data by an order of magnitude. Using these data in conjunction with DementiaBank is challenging because the WLS metadata corresponding to these transcripts do not contain dementia diagnoses. However, cognitive status of WLS participants can be inferred from results of several cognitive tests including semantic verbal fluency available in WLS data. In this work, we evaluate the utility of using the WLS ‘controls’ (participants without indications of abnormal cognitive status), and these data in conjunction with inferred ‘cases’ (participants with such indications) for training deep learning models to discriminate between language produced by patients with dementia and healthy controls. We find that incorporating WLS data during training a BERT model on ADReSS data improves its performance on the ADReSS dementia detection task, supporting the hypothesis that incorporating WLS data adds value in this context. We also demonstrate that weighted cost functions and additional prediction targets may be effective ways to address issues arising from class imbalance and confounding effects due to data provenance.

Original language	English (US)
Article number	642517
Journal	Frontiers in Computer Science
Volume	3
DOIs	https://doi.org/10.3389/fcomp.2021.642517
State	Published - Apr 16 2021

Bibliographical note

Funding Information:
This research was supported by Administrative Supplement R01 LM011563 S1 from the National Library of Medicine, and R21 AG069792 from the National Institute of Aging. Since 1991, the WLS has been supported principally by the National Institute on Aging (AG-9775, AG-21079, and AG-033285), with additional support from the Vilas Estate Trust, the National Science Foundation, the Spencer Foundation, and the Graduate School of the University of Wisconsin-Madison.

Publisher Copyright:
© Copyright © 2021 Guo, Li, Roan, Pakhomov and Cohen.

Keywords

Alzheimer's disease
BERT
dementia diagnosis
machine learning
natural language processing

Access

10.3389/fcomp.2021.642517

OpenUrl availability

Full text

Cite this

@article{2e0c57c5bb24483b98eca3d38a8fe555,

title = "Crossing the “Cookie Theft” Corpus Chasm: Applying What BERT Learns From Outside Data to the ADReSS Challenge Dementia Detection Task",

abstract = "Large amounts of labeled data are a prerequisite to training accurate and reliable machine learning models. However, in the medical domain in particular, this is also a stumbling block as accurately labeled data are hard to obtain. DementiaBank, a publicly available corpus of spontaneous speech samples from a picture description task widely used to study Alzheimer's disease (AD) patients' language characteristics and for training classification models to distinguish patients with AD from healthy controls, is relatively small—a limitation that is further exacerbated when restricting to the balanced subset used in the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) challenge. We build on previous work showing that the performance of traditional machine learning models on DementiaBank can be improved by the addition of normative data from other sources, evaluating the utility of such extrinsic data to further improve the performance of state-of-the-art deep learning based methods on the ADReSS challenge dementia detection task. To this end, we developed a new corpus of professionally transcribed recordings from the Wisconsin Longitudinal Study (WLS), resulting in 1366 additional Cookie Theft Task transcripts, increasing the available training data by an order of magnitude. Using these data in conjunction with DementiaBank is challenging because the WLS metadata corresponding to these transcripts do not contain dementia diagnoses. However, cognitive status of WLS participants can be inferred from results of several cognitive tests including semantic verbal fluency available in WLS data. In this work, we evaluate the utility of using the WLS {\textquoteleft}controls{\textquoteright} (participants without indications of abnormal cognitive status), and these data in conjunction with inferred {\textquoteleft}cases{\textquoteright} (participants with such indications) for training deep learning models to discriminate between language produced by patients with dementia and healthy controls. We find that incorporating WLS data during training a BERT model on ADReSS data improves its performance on the ADReSS dementia detection task, supporting the hypothesis that incorporating WLS data adds value in this context. We also demonstrate that weighted cost functions and additional prediction targets may be effective ways to address issues arising from class imbalance and confounding effects due to data provenance.",

keywords = "Alzheimer's disease, BERT, dementia diagnosis, machine learning, natural language processing",

author = "Yue Guo and Changye Li and Carol Roan and Serguei Pakhomov and Trevor Cohen",

note = "Funding Information: This research was supported by Administrative Supplement R01 LM011563 S1 from the National Library of Medicine, and R21 AG069792 from the National Institute of Aging. Since 1991, the WLS has been supported principally by the National Institute on Aging (AG-9775, AG-21079, and AG-033285), with additional support from the Vilas Estate Trust, the National Science Foundation, the Spencer Foundation, and the Graduate School of the University of Wisconsin-Madison. Publisher Copyright: {\textcopyright} Copyright {\textcopyright} 2021 Guo, Li, Roan, Pakhomov and Cohen.",

year = "2021",

month = apr,

day = "16",

doi = "10.3389/fcomp.2021.642517",

language = "English (US)",

volume = "3",

journal = "Frontiers in Computer Science",

issn = "2624-9898",

publisher = "Frontiers Media S. A.",

}

TY - JOUR

T1 - Crossing the “Cookie Theft” Corpus Chasm

T2 - Applying What BERT Learns From Outside Data to the ADReSS Challenge Dementia Detection Task

AU - Guo, Yue

AU - Li, Changye

AU - Roan, Carol

AU - Pakhomov, Serguei

AU - Cohen, Trevor

N1 - Funding Information: This research was supported by Administrative Supplement R01 LM011563 S1 from the National Library of Medicine, and R21 AG069792 from the National Institute of Aging. Since 1991, the WLS has been supported principally by the National Institute on Aging (AG-9775, AG-21079, and AG-033285), with additional support from the Vilas Estate Trust, the National Science Foundation, the Spencer Foundation, and the Graduate School of the University of Wisconsin-Madison. Publisher Copyright: © Copyright © 2021 Guo, Li, Roan, Pakhomov and Cohen.

PY - 2021/4/16

Y1 - 2021/4/16

N2 - Large amounts of labeled data are a prerequisite to training accurate and reliable machine learning models. However, in the medical domain in particular, this is also a stumbling block as accurately labeled data are hard to obtain. DementiaBank, a publicly available corpus of spontaneous speech samples from a picture description task widely used to study Alzheimer's disease (AD) patients' language characteristics and for training classification models to distinguish patients with AD from healthy controls, is relatively small—a limitation that is further exacerbated when restricting to the balanced subset used in the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) challenge. We build on previous work showing that the performance of traditional machine learning models on DementiaBank can be improved by the addition of normative data from other sources, evaluating the utility of such extrinsic data to further improve the performance of state-of-the-art deep learning based methods on the ADReSS challenge dementia detection task. To this end, we developed a new corpus of professionally transcribed recordings from the Wisconsin Longitudinal Study (WLS), resulting in 1366 additional Cookie Theft Task transcripts, increasing the available training data by an order of magnitude. Using these data in conjunction with DementiaBank is challenging because the WLS metadata corresponding to these transcripts do not contain dementia diagnoses. However, cognitive status of WLS participants can be inferred from results of several cognitive tests including semantic verbal fluency available in WLS data. In this work, we evaluate the utility of using the WLS ‘controls’ (participants without indications of abnormal cognitive status), and these data in conjunction with inferred ‘cases’ (participants with such indications) for training deep learning models to discriminate between language produced by patients with dementia and healthy controls. We find that incorporating WLS data during training a BERT model on ADReSS data improves its performance on the ADReSS dementia detection task, supporting the hypothesis that incorporating WLS data adds value in this context. We also demonstrate that weighted cost functions and additional prediction targets may be effective ways to address issues arising from class imbalance and confounding effects due to data provenance.

AB - Large amounts of labeled data are a prerequisite to training accurate and reliable machine learning models. However, in the medical domain in particular, this is also a stumbling block as accurately labeled data are hard to obtain. DementiaBank, a publicly available corpus of spontaneous speech samples from a picture description task widely used to study Alzheimer's disease (AD) patients' language characteristics and for training classification models to distinguish patients with AD from healthy controls, is relatively small—a limitation that is further exacerbated when restricting to the balanced subset used in the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) challenge. We build on previous work showing that the performance of traditional machine learning models on DementiaBank can be improved by the addition of normative data from other sources, evaluating the utility of such extrinsic data to further improve the performance of state-of-the-art deep learning based methods on the ADReSS challenge dementia detection task. To this end, we developed a new corpus of professionally transcribed recordings from the Wisconsin Longitudinal Study (WLS), resulting in 1366 additional Cookie Theft Task transcripts, increasing the available training data by an order of magnitude. Using these data in conjunction with DementiaBank is challenging because the WLS metadata corresponding to these transcripts do not contain dementia diagnoses. However, cognitive status of WLS participants can be inferred from results of several cognitive tests including semantic verbal fluency available in WLS data. In this work, we evaluate the utility of using the WLS ‘controls’ (participants without indications of abnormal cognitive status), and these data in conjunction with inferred ‘cases’ (participants with such indications) for training deep learning models to discriminate between language produced by patients with dementia and healthy controls. We find that incorporating WLS data during training a BERT model on ADReSS data improves its performance on the ADReSS dementia detection task, supporting the hypothesis that incorporating WLS data adds value in this context. We also demonstrate that weighted cost functions and additional prediction targets may be effective ways to address issues arising from class imbalance and confounding effects due to data provenance.

KW - Alzheimer's disease

KW - BERT

KW - dementia diagnosis

KW - machine learning

KW - natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85117882916&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85117882916&partnerID=8YFLogxK

U2 - 10.3389/fcomp.2021.642517

DO - 10.3389/fcomp.2021.642517

M3 - Article

AN - SCOPUS:85117882916

SN - 2624-9898

VL - 3

JO - Frontiers in Computer Science

JF - Frontiers in Computer Science

M1 - 642517

ER -

Crossing the “Cookie Theft” Corpus Chasm: Applying What BERT Learns From Outside Data to the ADReSS Challenge Dementia Detection Task

Abstract

Bibliographical note

Keywords

Access

OpenUrl availability

Other files and links

Fingerprint

Cite this