A method for comparing multiple imputation techniques: A case study on the U.S. national COVID cohort collaborative

Elena Casiraghi, Rachel Wong, Margaret Hall, Ben Coleman, Marco Notaro, Michael D. Evans, Jena S. Tronieri, Hannah Blau, Bryan Laraway, Tiffany J. Callahan, Lauren E. Chan, Carolyn T. Bramante, John B. Buse, Richard A. Moffitt, Til Stürmer, Steven G. Johnson, Yu Raymond Shao, Justin Reese, Peter N. Robinson, Alberto Paccanaro, Giorgio Valentini, Jared D. Huling, Kenneth J. Wilkins

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful for assessing associations between patients’ predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases, and removing those cases may introduce severe bias. Several multiple imputation algorithms have been proposed to recover the missing information under an assumed missingness mechanism. Each algorithm presents strengths and weaknesses, and there is currently no consensus on which multiple imputation algorithm works best in a given scenario. Furthermore, the selection of each algorithm's parameters and data-related modeling choices are both crucial and challenging. In this paper we propose a novel framework to numerically evaluate strategies for handling missing data in the context of statistical analysis, with a particular focus on multiple imputation techniques. We demonstrate the feasibility of our approach on a large cohort of type 2 diabetes patients provided by the National COVID Cohort Collaborative (N3C) Enclave, where we explored the influence of various patient characteristics on outcomes related to COVID-19. Our analysis included classic multiple imputation techniques as well as simple complete-case inverse probability weighted (IPW) models. Extensive experiments show that our approach can effectively highlight the most promising and best-performing missing-data handling strategy for our case study. Moreover, our methodology allowed a better understanding of how the different models behave and how their behavior changes as their parameters are modified. Our method is general and can be applied to different research fields and to datasets containing heterogeneous data types.
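
The abstract contrasts multiple imputation with complete-case inverse probability weighting. Below is a minimal, hypothetical Python sketch of those two strategies on toy data; it is not the authors' framework. The toy variables, the use of scikit-learn's IterativeImputer, and the simplified coefficient pooling (point estimates only, rather than full Rubin's rules) are all assumptions made for illustration.

```python
# Illustrative sketch (not the authors' framework): compare a multiple-imputation
# strategy with a complete-case IPW analysis on toy data.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy cohort: two predictors and a binary outcome, with ~30% of BMI values
# set missing completely at random for simplicity.
n = 2000
X = pd.DataFrame({"age": rng.normal(60, 10, n), "bmi": rng.normal(30, 5, n)})
logit = -8 + 0.08 * X["age"] + 0.1 * X["bmi"]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
X.loc[rng.random(n) < 0.3, "bmi"] = np.nan

# Strategy 1: multiple imputation -- impute M times, fit a model on each
# completed dataset, then average the coefficient estimates (full Rubin's
# rules would also combine within- and between-imputation variance).
M = 5
coefs = []
for m in range(M):
    imputer = IterativeImputer(sample_posterior=True, random_state=m)
    X_imp = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)
    coefs.append(LogisticRegression(max_iter=1000).fit(X_imp, y).coef_[0])
print("MI pooled coefficients:", np.mean(coefs, axis=0))

# Strategy 2: complete-case IPW -- model the probability that a record is
# complete, then weight complete cases by the inverse of that probability.
complete = X.notna().all(axis=1)
p_complete = LogisticRegression(max_iter=1000).fit(
    X[["age"]], complete.astype(int)
).predict_proba(X[["age"]])[:, 1]
weights = 1.0 / p_complete[complete]
fit_cc = LogisticRegression(max_iter=1000).fit(
    X.loc[complete], y[complete], sample_weight=weights
)
print("IPW complete-case coefficients:", fit_cc.coef_[0])
```

A full evaluation along the lines the abstract describes would compare such strategies on downstream association estimates rather than on a single fit, but the sketch shows the two families of approaches being contrasted.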

Original language: English (US)
Article number: 104295
Journal: Journal of Biomedical Informatics
Volume: 139
DOIs
State: Published - Mar 2023

Bibliographical note

Publisher Copyright:
© 2023

Keywords

  • COVID-19 severity assessment
  • Clinical informatics
  • Diabetic patients
  • Evaluation framework
  • Multiple Imputation
