Duluth at SemEval-2020 Task 12: Offensive Tweet Identification in English with Logistic Regression

Ted Pedersen

Computer Science (Duluth)

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Scopus citations

Abstract

This paper describes the Duluth systems that participated in SemEval-2020 Task 12, Multilingual Offensive Language Identification in Social Media (OffensEval-2020). We participated in the three English language tasks. Our systems provide a simple Machine Learning baseline using logistic regression. We trained our models on the distantly supervised training data made available by the task organizers and used no other resources. As might be expected we did not rank highly in the comparative evaluation: 79th of 85 in Task A, 34th of 43 in Task B, and 24th of 39 in Task C. We carried out a qualitative analysis of our results and found that the class labels in the gold standard data are somewhat noisy. We hypothesize that the extremely high accuracy (> 90%) of the top ranked systems may reflect methods that learn the training data very well but may not generalize to the task of identifying offensive language in English. This analysis includes examples of tweets that despite being mildly redacted are still offensive.
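
As a rough illustration of the kind of baseline the abstract describes, the sketch below trains a logistic regression classifier over bag-of-words counts in plain Python. The toy texts, labels, function names, and hyperparameters are all hypothetical; this record does not give the paper's actual features, preprocessing, or training setup, so this is not the author's implementation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (an assumption, not the paper's preprocessing)."""
    return text.lower().split()


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(weights, bias, counts):
    """Linear score: bias plus weighted token counts."""
    return bias + sum(weights.get(tok, 0.0) * c for tok, c in counts.items())


def train(texts, labels, epochs=200, lr=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights, bias = {}, 0.0
    feats = [Counter(tokenize(t)) for t in texts]
    for _ in range(epochs):
        for counts, y in zip(feats, labels):
            # Gradient of the log loss with respect to the linear score.
            err = sigmoid(score(weights, bias, counts)) - y
            bias -= lr * err
            for tok, c in counts.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * c
    return weights, bias


def predict_proba(model, text):
    """Probability that `text` belongs to the positive (offensive) class."""
    weights, bias = model
    return sigmoid(score(weights, bias, Counter(tokenize(text))))


# Hypothetical toy data: 1 = offensive, 0 = not offensive.
TRAIN_TEXTS = [
    "you are a total idiot",
    "what an idiot move",
    "have a lovely day",
    "what a lovely surprise",
]
TRAIN_LABELS = [1, 1, 0, 0]
MODEL = train(TRAIN_TEXTS, TRAIN_LABELS)
```

On this toy data, `predict_proba(MODEL, "idiot")` should come out above 0.5 and `predict_proba(MODEL, "lovely day")` below it. A model this simple can only weight surface tokens, which is consistent with the abstract's framing of the approach as a baseline.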

Original language	English (US)
Title of host publication	14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings
Editors	Aurelie Herbelot, Xiaodan Zhu, Alexis Palmer, Nathan Schneider, Jonathan May, Ekaterina Shutova
Publisher	International Committee for Computational Linguistics
Pages	1938-1946
Number of pages	9
ISBN (Electronic)	9781952148316
State	Published - 2020
Event	14th International Workshops on Semantic Evaluation, SemEval 2020 - Barcelona, Spain. Duration: Dec 12 2020 → Dec 13 2020

Publication series

Name	14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings

Conference

Conference	14th International Workshops on Semantic Evaluation, SemEval 2020
Country/Territory	Spain
City	Barcelona
Period	12/12/20 → 12/13/20

Bibliographical note

Publisher Copyright:
© 2020 14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings. All rights reserved.

Cite this

Pedersen, T. (2020). Duluth at SemEval-2020 Task 12: Offensive Tweet Identification in English with Logistic Regression. In A. Herbelot, X. Zhu, A. Palmer, N. Schneider, J. May, & E. Shutova (Eds.), 14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings (pp. 1938-1946). (14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings). International Committee for Computational Linguistics.

@inproceedings{a70dbf69984148df96f7ce9be581b6bb,

title = "Duluth at SemEval-2020 Task 12: Offensive Tweet Identification in English with Logistic Regression",

abstract = "This paper describes the Duluth systems that participated in SemEval-2020 Task 12, Multilingual Offensive Language Identification in Social Media (OffensEval-2020). We participated in the three English language tasks. Our systems provide a simple Machine Learning baseline using logistic regression. We trained our models on the distantly supervised training data made available by the task organizers and used no other resources. As might be expected we did not rank highly in the comparative evaluation: 79th of 85 in Task A, 34th of 43 in Task B, and 24th of 39 in Task C. We carried out a qualitative analysis of our results and found that the class labels in the gold standard data are somewhat noisy. We hypothesize that the extremely high accuracy (> 90%) of the top ranked systems may reflect methods that learn the training data very well but may not generalize to the task of identifying offensive language in English. This analysis includes examples of tweets that despite being mildly redacted are still offensive.",

author = "Ted Pedersen",

note = "Publisher Copyright: {\textcopyright} 2020 14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings. All rights reserved.; 14th International Workshops on Semantic Evaluation, SemEval 2020 ; Conference date: 12-12-2020 Through 13-12-2020",

year = "2020",

language = "English (US)",

series = "14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings",

publisher = "International Committee for Computational Linguistics",

pages = "1938--1946",

editor = "Aurelie Herbelot and Xiaodan Zhu and Alexis Palmer and Nathan Schneider and Jonathan May and Ekaterina Shutova",

booktitle = "14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings",

}

TY - GEN

T1 - Duluth at SemEval-2020 Task 12: Offensive Tweet Identification in English with Logistic Regression

T2 - 14th International Workshops on Semantic Evaluation, SemEval 2020

AU - Pedersen, Ted

N1 - Publisher Copyright: © 2020 14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings. All rights reserved.

PY - 2020

Y1 - 2020

AB - This paper describes the Duluth systems that participated in SemEval-2020 Task 12, Multilingual Offensive Language Identification in Social Media (OffensEval-2020). We participated in the three English language tasks. Our systems provide a simple Machine Learning baseline using logistic regression. We trained our models on the distantly supervised training data made available by the task organizers and used no other resources. As might be expected we did not rank highly in the comparative evaluation: 79th of 85 in Task A, 34th of 43 in Task B, and 24th of 39 in Task C. We carried out a qualitative analysis of our results and found that the class labels in the gold standard data are somewhat noisy. We hypothesize that the extremely high accuracy (> 90%) of the top ranked systems may reflect methods that learn the training data very well but may not generalize to the task of identifying offensive language in English. This analysis includes examples of tweets that despite being mildly redacted are still offensive.

UR - http://www.scopus.com/inward/record.url?scp=85094741507&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85094741507&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85094741507

T3 - 14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings

SP - 1938

EP - 1946

BT - 14th International Workshops on Semantic Evaluation, SemEval 2020 - co-located 28th International Conference on Computational Linguistics, COLING 2020, Proceedings

A2 - Herbelot, Aurelie

A2 - Zhu, Xiaodan

A2 - Palmer, Alexis

A2 - Schneider, Nathan

A2 - May, Jonathan

A2 - Shutova, Ekaterina

PB - International Committee for Computational Linguistics

Y2 - 12 December 2020 through 13 December 2020

ER -
