Exploring eating disorder topics on Twitter: Machine learning approach

Sicheng Zhou; Yunpeng Zhao; Jiang Bian; Ann F. Haynos; Rui Zhang

doi:10.2196/18273

Research output: Contribution to journal › Article › peer-review

21 Scopus citations

Abstract

Background: Eating disorders (EDs) are a group of mental illnesses that have an adverse effect on both mental and physical health. As social media platforms (eg, Twitter) have become an important data source for public health research, some studies have qualitatively explored the ways in which EDs are discussed on these platforms. Initial results suggest that such research offers a promising method for further understanding this group of diseases. Nevertheless, an efficient computational method is needed to further identify and analyze tweets relevant to EDs on a larger scale.

Objective: This study aims to develop and validate a machine learning–based classifier to identify tweets related to EDs and to explore factors (ie, topics) related to EDs using a topic modeling method.

Methods: We collected potential ED-relevant tweets using keywords from previous studies and annotated these tweets into different groups (ie, ED relevant vs irrelevant and then promotional information vs laypeople discussion). Several supervised machine learning methods, such as convolutional neural network (CNN), long short-term memory (LSTM), support vector machine, and naïve Bayes, were developed and evaluated using the annotated data. We used the classifier with the best performance to identify ED-relevant tweets and applied a topic modeling method, Correlation Explanation (CorEx), to analyze the content of the identified tweets. To validate these machine learning results, we also collected a cohort of ED-relevant tweets on the basis of manually curated rules.

Results: A total of 123,977 tweets were collected during the set period. We randomly annotated 2219 tweets for developing the machine learning classifiers. We developed a CNN-LSTM classifier to identify ED-relevant tweets published by laypeople in 2 steps: first relevant versus irrelevant (F₁ score=0.89) and then promotional versus published by laypeople (F₁ score=0.90). A total of 40,790 ED-relevant tweets were identified using the CNN-LSTM classifier. We also identified another set of tweets (ie, 17,632 ED-relevant and 83,557 ED-irrelevant tweets) posted by laypeople using manually specified rules. Using CorEx on all ED-relevant tweets, the topic model identified 162 topics. Overall, the coherence rate for topic modeling was 77.07% (1264/1640), indicating a high quality of the produced topics. The topics were further reviewed and analyzed by a domain expert.

Conclusions: The developed CNN-LSTM classifier could improve the efficiency of identifying ED-relevant tweets compared with the traditional manual-based method. The CorEx topic model was applied separately to the tweets identified by the machine learning–based classifier and to those identified by the traditional manual approach. Highly overlapping topics were observed between the 2 cohorts of tweets. The produced topics were further reviewed by a domain expert. Some of the topics identified from the potential ED tweets may provide new avenues for understanding this serious set of disorders.
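The 2-step filtering and the reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the study used a CNN-LSTM classifier, whereas the functions `toy_is_relevant` and `toy_is_layperson` below are hypothetical keyword rules standing in for the two trained classification steps, and the example tweets are invented. The sketch only shows how a relevance step feeds a promotional-versus-laypeople step, how an F₁ score is computed from labels and predictions, and that the reported ratio 1264/1640 rounds to 77.07%.

```python
from typing import Callable, List, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """F1 is the harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # with no true positives, F1 is conventionally reported as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def two_step_filter(
    tweets: Sequence[str],
    is_relevant: Callable[[str], bool],
    is_layperson: Callable[[str], bool],
) -> List[str]:
    """Step 1 keeps relevant tweets; step 2 keeps those not judged promotional."""
    relevant = [t for t in tweets if is_relevant(t)]
    return [t for t in relevant if is_layperson(t)]


# Hypothetical stand-ins for the trained classifiers (assumptions, not the paper's models).
def toy_is_relevant(tweet: str) -> bool:
    text = tweet.lower()
    return any(k in text for k in ("eating disorder", "anorexia", "bulimia"))


def toy_is_layperson(tweet: str) -> bool:
    text = tweet.lower()
    return not any(k in text for k in ("http", "sign up", "buy now"))


def percent(numerator: int, denominator: int) -> float:
    """Ratio expressed as a percentage, rounded to 2 decimal places."""
    return round(100.0 * numerator / denominator, 2)


# Invented example tweets, for illustration only.
tweets = [
    "Learning more about eating disorder recovery this week",
    "Sign up for our eating disorder webinar at http://example.org",
    "Great weather for a run today",
]

kept = two_step_filter(tweets, toy_is_relevant, toy_is_layperson)
print(kept)                                   # only the first tweet passes both steps
print(f1_score([1, 1, 0, 0], [1, 0, 0, 1]))   # precision 0.5, recall 0.5
print(percent(1264, 1640))                    # the coherence ratio from the abstract
```

Chaining two binary decisions, rather than training one 3-class model, mirrors the order described in the abstract: the second step only ever sees tweets that the first step accepted.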

Original language	English (US)
Article number	e18273
Journal	JMIR Medical Informatics
Volume	8
Issue number	10
DOIs	https://doi.org/10.2196/18273
State	Published - Oct 2020

Bibliographical note

Funding Information:
This study was supported by the National Center for Complementary and Integrative Health of the National Institutes of Health (NIH) under Award Number R01AT009457 (principal investigator [PI]: RZ), the National Institute of Mental Health under Award Number K23 MH112867 (PI: AH), and the National Science Foundation (NSF) under Award Number 1734134 (PI: JB). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the NSF.

Abbreviations:
API: application programming interface
BTM: Biterm Topic Model
CNN: convolutional neural network
CorEx: Correlation Explanation
ED: eating disorder
GB: gradient boosting trees
LDA: latent Dirichlet allocation
LSTM: long short-term memory
NB: naïve Bayes
NIH: National Institutes of Health
NSF: National Science Foundation
PI: principal investigator
RF: random forest
SVM: support vector machine

Publisher Copyright:
© 2020 JMIR Medical Informatics. All rights reserved.

Keywords

Eating disorders
Public health
Social media
Text classification
Topic modeling

Cite this

@article{e7b6c4473d184f59aee71a8b42ecd0ec,

title = "Exploring eating disorder topics on {Twitter}: Machine learning approach",

abstract = "Background: Eating disorders (EDs) are a group of mental illnesses that have an adverse effect on both mental and physical health. As social media platforms (eg, Twitter) have become an important data source for public health research, some studies have qualitatively explored the ways in which EDs are discussed on these platforms. Initial results suggest that such research offers a promising method for further understanding this group of diseases. Nevertheless, an efficient computational method is needed to further identify and analyze tweets relevant to EDs on a larger scale. Objective: This study aims to develop and validate a machine learning–based classifier to identify tweets related to EDs and to explore factors (ie, topics) related to EDs using a topic modeling method. Methods: We collected potential ED-relevant tweets using keywords from previous studies and annotated these tweets into different groups (ie, ED relevant vs irrelevant and then promotional information vs laypeople discussion). Several supervised machine learning methods, such as convolutional neural network (CNN), long short-term memory (LSTM), support vector machine, and na{\"i}ve Bayes, were developed and evaluated using annotated data. We used the classifier with the best performance to identify ED-relevant tweets and applied a topic modeling method—Correlation Explanation (CorEx)—to analyze the content of the identified tweets. To validate these machine learning results, we also collected a cohort of ED-relevant tweets on the basis of manually curated rules. Results: A total of 123,977 tweets were collected during the set period. We randomly annotated 2219 tweets for developing the machine learning classifiers. We developed a CNN-LSTM classifier to identify ED-relevant tweets published by laypeople in 2 steps: first relevant versus irrelevant (F1 score=0.89) and then promotional versus published by laypeople (F1 score=0.90). 
A total of 40,790 ED-relevant tweets were identified using the CNN-LSTM classifier. We also identified another set of tweets (ie, 17,632 ED-relevant and 83,557 ED-irrelevant tweets) posted by laypeople using manually specified rules. Using CorEx on all ED-relevant tweets, the topic model identified 162 topics. Overall, the coherence rate for topic modeling was 77.07% (1264/1640), indicating a high quality of the produced topics. The topics were further reviewed and analyzed by a domain expert. Conclusions: A developed CNN-LSTM classifier could improve the efficiency of identifying ED-relevant tweets compared with the traditional manual-based method. The CorEx topic model was applied on the tweets identified by the machine learning–based classifier and the traditional manual approach separately. Highly overlapping topics were observed between the 2 cohorts of tweets. The produced topics were further reviewed by a domain expert. Some of the topics identified by the potential ED tweets may provide new avenues for understanding this serious set of disorders.",

keywords = "Eating disorders, Public health, Social media, Text classification, Topic modeling",

author = "Sicheng Zhou and Yunpeng Zhao and Jiang Bian and Haynos, {Ann F.} and Rui Zhang",

note = "Funding Information: This study was supported by the National Center for Complementary and Integrative Health of the National Institutes of Health (NIH) under Award Number R01AT009457 (principal investigator [PI]: RZ), the National Institute of Mental Health under Award Number K23 MH112867 (PI: AH), and the National Science Foundation (NSF) under Award Number 1734134 (PI: JB). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the NSF. Funding Information: API: application programming interface BTM: Biterm Topic Model CNN: convolutional neural network CorEx: Correlation Explanation ED: eating disorder GB: gradient boosting trees LDA: latent Dirichlet allocation LSTM: long short-term memory NB: na{\"i}ve Bayes NIH: National Institutes of Health NSF: National Science Foundation PI: principal investigator RF: random forest SVM: support vector machine Publisher Copyright: {\textcopyright} 2020 JMIR Medical Informatics. All rights reserved.",

year = "2020",

month = oct,

doi = "10.2196/18273",

language = "English (US)",

volume = "8",

journal = "JMIR Medical Informatics",

issn = "2291-9694",

publisher = "JMIR Publications Inc.",

number = "10",

}

TY - JOUR

T1 - Exploring eating disorder topics on Twitter

T2 - Machine learning approach

AU - Zhou, Sicheng

AU - Zhao, Yunpeng

AU - Bian, Jiang

AU - Haynos, Ann F.

AU - Zhang, Rui

N1 - Funding Information: This study was supported by the National Center for Complementary and Integrative Health of the National Institutes of Health (NIH) under Award Number R01AT009457 (principal investigator [PI]: RZ), the National Institute of Mental Health under Award Number K23 MH112867 (PI: AH), and the National Science Foundation (NSF) under Award Number 1734134 (PI: JB). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the NSF. Funding Information: API: application programming interface BTM: Biterm Topic Model CNN: convolutional neural network CorEx: Correlation Explanation ED: eating disorder GB: gradient boosting trees LDA: latent Dirichlet allocation LSTM: long short-term memory NB: naïve Bayes NIH: National Institutes of Health NSF: National Science Foundation PI: principal investigator RF: random forest SVM: support vector machine Publisher Copyright: © 2020 JMIR Medical Informatics. All rights reserved.

PY - 2020/10

Y1 - 2020/10

AB - Background: Eating disorders (EDs) are a group of mental illnesses that have an adverse effect on both mental and physical health. As social media platforms (eg, Twitter) have become an important data source for public health research, some studies have qualitatively explored the ways in which EDs are discussed on these platforms. Initial results suggest that such research offers a promising method for further understanding this group of diseases. Nevertheless, an efficient computational method is needed to further identify and analyze tweets relevant to EDs on a larger scale. Objective: This study aims to develop and validate a machine learning–based classifier to identify tweets related to EDs and to explore factors (ie, topics) related to EDs using a topic modeling method. Methods: We collected potential ED-relevant tweets using keywords from previous studies and annotated these tweets into different groups (ie, ED relevant vs irrelevant and then promotional information vs laypeople discussion). Several supervised machine learning methods, such as convolutional neural network (CNN), long short-term memory (LSTM), support vector machine, and naïve Bayes, were developed and evaluated using annotated data. We used the classifier with the best performance to identify ED-relevant tweets and applied a topic modeling method—Correlation Explanation (CorEx)—to analyze the content of the identified tweets. To validate these machine learning results, we also collected a cohort of ED-relevant tweets on the basis of manually curated rules. Results: A total of 123,977 tweets were collected during the set period. We randomly annotated 2219 tweets for developing the machine learning classifiers. We developed a CNN-LSTM classifier to identify ED-relevant tweets published by laypeople in 2 steps: first relevant versus irrelevant (F1 score=0.89) and then promotional versus published by laypeople (F1 score=0.90). 
A total of 40,790 ED-relevant tweets were identified using the CNN-LSTM classifier. We also identified another set of tweets (ie, 17,632 ED-relevant and 83,557 ED-irrelevant tweets) posted by laypeople using manually specified rules. Using CorEx on all ED-relevant tweets, the topic model identified 162 topics. Overall, the coherence rate for topic modeling was 77.07% (1264/1640), indicating a high quality of the produced topics. The topics were further reviewed and analyzed by a domain expert. Conclusions: A developed CNN-LSTM classifier could improve the efficiency of identifying ED-relevant tweets compared with the traditional manual-based method. The CorEx topic model was applied on the tweets identified by the machine learning–based classifier and the traditional manual approach separately. Highly overlapping topics were observed between the 2 cohorts of tweets. The produced topics were further reviewed by a domain expert. Some of the topics identified by the potential ED tweets may provide new avenues for understanding this serious set of disorders.

KW - Eating disorders

KW - Public health

KW - Social media

KW - Text classification

KW - Topic modeling

UR - http://www.scopus.com/inward/record.url?scp=85097467116&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85097467116&partnerID=8YFLogxK

U2 - 10.2196/18273

DO - 10.2196/18273

M3 - Article

C2 - 33124997

AN - SCOPUS:85097467116

SN - 2291-9694

VL - 8

JO - JMIR Medical Informatics

JF - JMIR Medical Informatics

IS - 10

M1 - e18273

ER -
