DisaggregHate It Corpus: A Disaggregated Italian Dataset of Hate Speech

  • Marco Madeddu
  • Simona Frenda
  • Mirko Lai
  • Viviana Patti
  • Valerio Basile

Research output: Contribution to journal › Conference article › peer-review

Abstract

Recent studies in Machine Learning advocate exploiting disagreement between annotators to train models that reflect the different opinions humans hold about a specific phenomenon. This means that datasets whose annotations are aggregated by majority voting are not sufficient. In this paper, we present the DisaggregHate It Corpus, a disaggregated Italian dataset of hate speech that also encodes some information about the annotators. The corpus contains Italian tweets on the topic of racism and was annotated by native Italian university students. We explain how the dataset was gathered following the recommendations of the perspectivist approach [1], encouraging the annotators to provide some socio-demographic information about themselves. To exploit the disagreement in the learning process, we propose two types of soft labels: softmax and standard normalization. We investigate the benefit of using disagreement by creating a baseline binary model trained on the 'hard' labels (aggregated by majority voting) and two regression models, each trained on one of the two types of 'soft' labels. We tested the models in in-domain and out-of-domain settings, evaluating their performance with cross-entropy as the metric, and found that the models trained on the soft labels performed better.
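The abstract does not spell out how the labels are computed, so the following is only a minimal sketch of one common reading of the terms it uses. It assumes that each tweet comes with per-class vote counts, that "standard normalization" means dividing counts by their sum, and that "softmax" is applied directly to the raw counts. The function names and the example votes are hypothetical and are not taken from the paper.

```python
import math


def standard_normalization(counts):
    """Soft label: each class's vote count divided by the total votes."""
    total = sum(counts)
    return [c / total for c in counts]


def softmax(counts):
    """Soft label: exponentiated vote counts, normalized to sum to 1."""
    peak = max(counts)  # subtract the max for numerical stability
    exps = [math.exp(c - peak) for c in counts]
    total = sum(exps)
    return [e / total for e in exps]


def hard_label(counts):
    """Majority vote: index of the class with the most votes."""
    return max(range(len(counts)), key=lambda i: counts[i])


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a target one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Hypothetical item: 2 annotators chose "not hate", 3 chose "hate".
votes = [2, 3]
soft_norm = standard_normalization(votes)  # proportions of votes
soft_sm = softmax(votes)                   # sharper than the proportions
hard = hard_label(votes)                   # majority class index

# Score a hypothetical model prediction against the normalized soft label.
prediction = [0.5, 0.5]
loss = cross_entropy(soft_norm, prediction)
```

Under these assumptions the hard label discards the 2-to-3 split entirely, while both soft labels keep it; softmax over raw counts pushes the distribution further towards the majority class than plain normalization does.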

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 3596
Publication status: Published - 2023
Externally published: Yes
Event: 9th Italian Conference on Computational Linguistics, CLiC-it 2023 - Venice, Italy
Duration: 30 Nov 2023 to 2 Dec 2023

Keywords

  • disagreement
  • hate speech
  • perspectivism
