TY - JOUR
T1 - Corpus domain effects on distributional semantic modeling of medical terms
AU - Pakhomov, Serguei V.S.
AU - Finley, Greg
AU - McEwan, Reed
AU - Wang, Yan
AU - Melton, Genevieve B.
N1 - Publisher Copyright: © The Author 2016.
PY - 2016
Y1 - 2016
N2 - Motivation: Automatically quantifying semantic similarity and relatedness between clinical terms is an important aspect of text mining from electronic health records, which are increasingly recognized as valuable sources of phenotypic information for clinical genomics and bioinformatics research. A key obstacle to the development of semantic relatedness measures is the limited availability of large quantities of clinical text to researchers and developers outside of major medical centers. Text from general English and biomedical literature is freely available; however, its validity as a substitute for clinical-domain text in representing the semantics of clinical terms remains to be demonstrated. Results: We constructed neural network representations of clinical terms found in a publicly available benchmark dataset manually labeled for semantic similarity and relatedness. Similarity and relatedness measures computed from text corpora in three domains (Clinical Notes, PubMed Central articles and Wikipedia) were compared using the benchmark as reference. We found that measures computed from the full text of biomedical articles in the PubMed Central repository (rho=0.62 for similarity and 0.58 for relatedness) are on par with measures computed from clinical reports (rho=0.60 for similarity and 0.57 for relatedness). We also evaluated the use of neural network-based relatedness measures for query expansion in a clinical document retrieval task and a biomedical term word sense disambiguation task. We found that, with some limitations, biomedical articles may be used in lieu of clinical reports to represent the semantics of clinical terms and that distributional semantic methods are useful for clinical and biomedical natural language processing applications.
UR - http://www.scopus.com/inward/record.url?scp=85016174691&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85016174691&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btw529
DO - 10.1093/bioinformatics/btw529
M3 - Article
C2 - 27531100
AN - SCOPUS:85016174691
SN - 1367-4803
VL - 32
SP - 3635
EP - 3644
JO - Bioinformatics
JF - Bioinformatics
IS - 23
ER -