Abstract
Recent studies in Machine Learning advocate exploiting disagreement between annotators to train models that reflect the different opinions humans hold about a specific phenomenon. Datasets whose annotations are aggregated by majority voting are therefore not sufficient. In this paper, we present the DisaggregHate It Corpus, an Italian disaggregated hate speech dataset that also encodes some information about the annotators. The corpus contains Italian tweets on the topic of racism, annotated by native Italian university students. We explain how the dataset was gathered following the recommendations of the perspectivist approach [1], which included encouraging the annotators to provide some socio-demographic information about themselves. To exploit disagreement in the learning process, we propose two types of soft labels: softmax and standard normalization. We investigate the benefit of using disagreement by building a baseline binary model trained on the 'hard' labels (aggregated by majority voting) and two regression models, each trained on one of the two types of 'soft' labels. We test the models in in-domain and out-of-domain settings, evaluating their performance with cross-entropy as the metric, and show that the models trained on the soft labels perform better.
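The labelling schemes named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything below is an assumption: votes are encoded as 0 (not hateful) and 1 (hateful), standard normalization divides the per-class vote counts by the number of annotators, softmax is applied to the raw vote counts, and ties in the majority vote fall to class 0. All function names are hypothetical.

```python
import math
from collections import Counter

CLASSES = (0, 1)  # assumed encoding: 0 = not hateful, 1 = hateful


def vote_counts(annotations):
    """Count how many annotators chose each class."""
    counts = Counter(annotations)
    return [counts.get(c, 0) for c in CLASSES]


def hard_label(annotations):
    """Majority vote; ties resolve to the first class (an assumption)."""
    counts = vote_counts(annotations)
    return CLASSES[counts.index(max(counts))]


def standard_normalization(annotations):
    """Soft label as the fraction of annotators per class."""
    counts = vote_counts(annotations)
    total = sum(counts)
    return [c / total for c in counts]


def softmax_label(annotations):
    """Soft label as a softmax over the raw vote counts (an assumption)."""
    counts = vote_counts(annotations)
    peak = max(counts)  # subtract the max for numerical stability
    exps = [math.exp(c - peak) for c in counts]
    norm = sum(exps)
    return [e / norm for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Five hypothetical annotators label one tweet.
votes = [1, 1, 0, 1, 0]
print(hard_label(votes))
print(standard_normalization(votes))
print(softmax_label(votes))
print(cross_entropy(standard_normalization(votes), [0.3, 0.7]))
```

With these assumptions, the hard label discards the two dissenting votes, while both soft labels keep them; the softmax variant pushes the distribution further toward the majority class than standard normalization does.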
| Original language | English |
|---|---|
| Journal | CEUR Workshop Proceedings |
| Volume | 3596 |
| Publication status | Published - 2023 |
| Externally published | Yes |
| Event | 9th Italian Conference on Computational Linguistics, CLiC-it 2023 - Venice, Italy. Duration: 30 Nov 2023 → 2 Dec 2023 |
Keywords
- disagreement
- hate speech
- perspectivism