NLP methods for extraction of symptoms from unstructured data for use in prognostic COVID-19 analytic models

Greg M. Silverman; Himanshu S. Sahoo; Nicholas E. Ingraham; Monica Lupei; Michael A. Puskarich; Michael Usher; James Dries; Raymond L. Finzel; Eric Murray; John Sartori; Gyorgy Simon; Rui Zhang; Genevieve B. Melton; Christopher J. Tignanelli; Serguei V.S. Pakhomov

doi:10.1613/JAIR.1.12631

Research output: Contribution to journal › Article › peer-review

17 Scopus citations

Abstract

Statistical modeling of outcomes based on a patient's presenting symptoms (symptomatology) can help deliver high-quality care and allocate essential resources, which is especially important during the COVID-19 pandemic. Patient symptoms are typically found in unstructured notes, and thus not readily available for clinical decision making. In an attempt to fill this gap, this study compared two methods for symptom extraction from Emergency Department (ED) admission notes. Both methods utilized a lexicon derived by expanding the Centers for Disease Control and Prevention's (CDC) Symptoms of Coronavirus list. The first method utilized a word2vec model to expand the lexicon using a dictionary mapping to the Unified Medical Language System (UMLS). The second method utilized the expanded lexicon as a rule-based gazetteer and the UMLS. These methods were evaluated against a manually annotated reference (F1-score of 0.87 for UMLS-based ensemble; and 0.85 for rule-based gazetteer with UMLS). Through analyses of associations of extracted symptoms used as features against various outcomes, salient risks among the population of COVID-19 patients, including increased risk of in-hospital mortality (OR 1.85, p-value < 0.001), were identified for patients presenting with dyspnea. Disparities between English and non-English speaking patients were also identified, the most salient being a concerning finding of opposing risk signals between fatigue and in-hospital mortality (non-English: OR 1.95, p-value = 0.02; English: OR 0.63, p-value = 0.01). While use of symptomatology for modeling of outcomes is not unique, unlike previous studies this study showed that models built using symptoms with the outcome of in-hospital mortality were not significantly different from models using data collected during an in-patient encounter (AUC of 0.9 with 95% CI of [0.88, 0.91] using only vital signs; AUC of 0.87 with 95% CI of [0.85, 0.88] using only symptoms).
These findings indicate that prognostic models based on symptomatology could aid in extending COVID-19 patient care through telemedicine, replacing the need for in-person options. The methods presented in this study have potential for use in development of symptomatology-based models for other diseases, including for the study of Post-Acute Sequelae of COVID-19 (PASC).
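
As a rough illustration of the rule-based gazetteer idea and the F1 evaluation mentioned in the abstract, the sketch below matches a small lexicon against a note and scores the result against a reference set. It is a minimal sketch under stated assumptions, not the study's pipeline: the lexicon entries, the function names `extract_symptoms` and `f1_score`, and the sample note are all hypothetical, and the UMLS mapping, word2vec expansion, and negation handling are left out.

```python
import re

# Hypothetical mini-lexicon mapping surface forms to a canonical symptom.
# The study derived its lexicon from the CDC symptom list; these entries
# are illustrative assumptions only.
LEXICON = {
    "shortness of breath": "dyspnea",
    "sob": "dyspnea",
    "dyspnea": "dyspnea",
    "fatigue": "fatigue",
    "tired": "fatigue",
    "fever": "fever",
    "febrile": "fever",
    "cough": "cough",
}

# Longest surface forms first so multi-word phrases are tried before
# shorter alternatives; \b keeps "febrile" from matching inside "afebrile".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(t) for t in sorted(LEXICON, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def extract_symptoms(note: str) -> set[str]:
    """Return canonical symptoms whose surface forms appear in the note."""
    return {LEXICON[m.group(1).lower()] for m in _PATTERN.finditer(note)}


def f1_score(predicted: set[str], reference: set[str]) -> float:
    """Harmonic mean of precision and recall over two symptom sets."""
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    note = "Pt reports SOB and dry cough for 3 days, feels tired. Afebrile."
    found = extract_symptoms(note)
    reference = {"dyspnea", "cough", "fever"}  # hypothetical annotation
    print(sorted(found), round(f1_score(found, reference), 2))
```

A plain dictionary lookup like this cannot tell "denies cough" from "cough", so a real system would need negation and context handling on top of it.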

Original language	English (US)
Pages (from-to)	429-474
Number of pages	46
Journal	Journal of Artificial Intelligence Research
Volume	72
DOIs	https://doi.org/10.1613/JAIR.1.12631
State	Published - 2021

Bibliographical note

Publisher Copyright:
© 2021 AI Access Foundation. All rights reserved.

Cite this

Silverman, G. M., Sahoo, H. S., Ingraham, N. E., Lupei, M., Puskarich, M. A., Usher, M., Dries, J., Finzel, R. L., Murray, E., Sartori, J., Simon, G., Zhang, R., Melton, G. B., Tignanelli, C. J., & Pakhomov, S. V. S. (2021). NLP methods for extraction of symptoms from unstructured data for use in prognostic COVID-19 analytic models. Journal of Artificial Intelligence Research, 72, 429-474. https://doi.org/10.1613/JAIR.1.12631

Silverman, GM, Sahoo, HS, Ingraham, NE, Lupei, M, Puskarich, MA, Usher, M, Dries, J, Finzel, RL, Murray, E, Sartori, J, Simon, G, Zhang, R, Melton, GB, Tignanelli, CJ & Pakhomov, SVS 2021, 'NLP methods for extraction of symptoms from unstructured data for use in prognostic COVID-19 analytic models', Journal of Artificial Intelligence Research, vol. 72, pp. 429-474. https://doi.org/10.1613/JAIR.1.12631

@article{a1675076acbb4a69a4c9a4a2ac2c9669,

title = "NLP methods for extraction of symptoms from unstructured data for use in prognostic COVID-19 analytic models",

author = "Silverman, {Greg M.} and Sahoo, {Himanshu S.} and Ingraham, {Nicholas E.} and Monica Lupei and Puskarich, {Michael A.} and Michael Usher and James Dries and Finzel, {Raymond L.} and Eric Murray and John Sartori and Gyorgy Simon and Rui Zhang and Melton, {Genevieve B.} and Tignanelli, {Christopher J.} and Pakhomov, {Serguei V.S.}",

year = "2021",

doi = "10.1613/JAIR.1.12631",

language = "English (US)",

volume = "72",

pages = "429--474",

journal = "Journal of Artificial Intelligence Research",

issn = "1076-9757",

publisher = "Morgan Kaufmann Publishers, Inc.",

}

TY - JOUR

T1 - NLP methods for extraction of symptoms from unstructured data for use in prognostic COVID-19 analytic models

AU - Silverman, Greg M.

AU - Sahoo, Himanshu S.

AU - Ingraham, Nicholas E.

AU - Lupei, Monica

AU - Puskarich, Michael A.

AU - Usher, Michael

AU - Dries, James

AU - Finzel, Raymond L.

AU - Murray, Eric

AU - Sartori, John

AU - Simon, Gyorgy

AU - Zhang, Rui

AU - Melton, Genevieve B.

AU - Tignanelli, Christopher J.

AU - Pakhomov, Serguei V.S.

PY - 2021

Y1 - 2021

UR - http://www.scopus.com/inward/record.url?scp=85118309451&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85118309451&partnerID=8YFLogxK

U2 - 10.1613/JAIR.1.12631

DO - 10.1613/JAIR.1.12631

M3 - Article

AN - SCOPUS:85118309451

SN - 1076-9757

VL - 72

SP - 429

EP - 474

JO - Journal of Artificial Intelligence Research

JF - Journal of Artificial Intelligence Research

ER -
