Abstract
Content analysis is a common and flexible technique for quantifying and making sense of qualitative data in psychological research. However, the practical implementation of content analysis is extremely labor-intensive and subject to human coder errors. Applying natural language processing (NLP) techniques can help address these limitations. We explain and illustrate these techniques for psychological researchers. For this purpose, we first present a study exploring the creation of psychometrically meaningful predictions of human content codes. Using an existing database of human content codes, we build an NLP algorithm that validly predicts those codes at generally acceptable standards. We then conduct a Monte Carlo simulation to model how four dataset characteristics (i.e., sample size, unlabeled proportion of cases, classification base rate, and human coder reliability) influence content classification performance. The simulation indicated that the influence of sample size and unlabeled proportion on model classification performance tended to be curvilinear. In addition, base rate and human coder reliability had a strong effect on classification performance. Finally, using these results, we offer practical recommendations on the dataset characteristics needed to achieve valid prediction of content codes, guiding researchers who wish to use NLP models in place of human coders in content analysis research.
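The abstract describes training an NLP model on an existing database of human content codes so it can predict those codes for new cases. A minimal sketch of that general approach (not the authors' actual pipeline; the texts, labels, and model choice below are illustrative assumptions) using a bag-of-words classifier:

```python
# Illustrative sketch, NOT the authors' method: train a simple text
# classifier on human-coded examples, then predict codes for new text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical human-coded text units (1 = mentions environmental
# sustainability, 0 = does not); real studies would use thousands.
texts = [
    "we recycle all office paper and plastics",
    "quarterly earnings exceeded expectations",
    "reducing our carbon footprint is a priority",
    "the meeting was rescheduled to friday",
]
human_codes = [1, 0, 1, 0]

# TF-IDF features feeding a logistic regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, human_codes)

# Predict a content code for an unlabeled case, as a human coder would.
pred = model.predict(["our report highlights recycling efforts"])
```

In practice, validity would be checked by comparing model predictions against held-out human codes, and performance would depend on the dataset characteristics the simulation examines (sample size, unlabeled proportion, base rate, and coder reliability).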
| Original language | English (US) |
| --- | --- |
| Journal | Psychological Methods |
| DOIs | |
| State | Accepted/In press - 2022 |
Bibliographical note
Publisher Copyright: © 2022 American Psychological Association
Keywords
- Content analysis
- Environmental sustainability
- Machine learning
- Natural language processing
- Text classification
PubMed: MeSH publication types
- Journal Article