DisaggregHate It Corpus: A Disaggregated Italian Dataset of Hate Speech

  • Marco Madeddu
  • Simona Frenda
  • Mirko Lai
  • Viviana Patti
  • Valerio Basile

Research output: Contribution to journal, conference article, peer-reviewed

Abstract

Recent studies in Machine Learning advocate for the exploitation of disagreement between annotators to train models in line with the different opinions of humans about a specific phenomenon. This means that datasets where the annotations are aggregated by majority voting are not enough. In this paper, we present an Italian disaggregated dataset concerning hate speech and encoding some information about the annotators: the DisaggregHate It Corpus. The corpus contains Italian tweets that focus on the topic of racism and has been annotated by native Italian university students. We explain how the dataset was gathered by following the recommendations of the perspectivist approach [1], encouraging the annotators to give some socio-demographic information about themselves. To exploit the disagreement in the learning process, we proposed two types of soft labels: softmax and standard normalization. We investigated the benefit of using disagreement by creating a baseline binary model and two regression models, trained respectively on the 'hard' labels (aggregated by majority voting) and on the two types of 'soft' labels. We tested the models in in-domain and out-of-domain settings, evaluating their performance using cross-entropy as a metric, and showing that the models trained on the soft labels performed better.
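The abstract does not spell out how the soft labels are computed, so the following is only a minimal sketch under stated assumptions: each tweet is assumed to carry a per-class count of annotator votes, "standard normalization" is taken to mean dividing each count by the total, and "softmax" is taken to mean applying the softmax function to the same counts. The function names and the example vote counts are hypothetical and are not taken from the paper.

```python
import math


def normalized_soft_label(votes):
    """Standard normalization (assumed): share of annotators per class."""
    total = sum(votes)
    return [v / total for v in votes]


def softmax_soft_label(votes):
    """Softmax over the raw vote counts (assumed formulation)."""
    highest = max(votes)  # subtract the max for numerical stability
    exps = [math.exp(v - highest) for v in votes]
    denom = sum(exps)
    return [e / denom for e in exps]


def hard_label(votes):
    """Majority vote; ties resolve to the lowest class index in this sketch."""
    return max(range(len(votes)), key=lambda i: votes[i])


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Hypothetical tweet: 2 annotators say "not hate", 3 say "hate".
votes = [2, 3]
soft_norm = normalized_soft_label(votes)
soft_max = softmax_soft_label(votes)
majority = hard_label(votes)
```

With these hypothetical counts, the hard label keeps only the majority class, while both soft labels retain the fact that two of the five annotators disagreed. Softmax pushes the distribution further toward the majority class than plain normalization does, which is one reason the two variants can lead to differently behaved models.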

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 3596
Publication status: Published - 2023
Externally published
Event: 9th Italian Conference on Computational Linguistics, CLiC-it 2023 - Venice, Italy
Duration: 30 Nov 2023 to 2 Dec 2023
