We present a curated resource of Latin etymologies automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology. We also present the Python pipeline used to generate the data, which can be reused to extract Wiktionary’s etymologies for other languages. The etymology chains cover Latin words and their attested or reconstructed ancestors in languages such as Proto-Indo-European, Proto-Italic, Ancient Greek, Hebrew, Egyptian, and others. To address the structural noise and editorial heterogeneity of Wiktionary etymology data, we have introduced strong rule-based filters throughout the pipeline, especially in the curation stage. After validation, the resulting dataset contains etymological chains for 9,684 lemmas, which can be used to support research in Historical Linguistics, Computational Etymology and language learning, among other applications.
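
As an illustration of the kind of rule-based filtering described above, the following is a minimal sketch and not the authors' pipeline. It assumes the etymology section is available as raw wikitext that uses Wiktionary's `inh` (inherited), `der` (derived) and `bor` (borrowed) templates. The names `extract_etymons`, `Etymon` and `KNOWN_SOURCES`, the allow-list of language codes, and the sample string are all hypothetical choices made for this example.

```python
import re
from typing import List, NamedTuple

# Wiktionary etymology templates: inh = inherited, der = derived, bor = borrowed.
TEMPLATE_RE = re.compile(r"\{\{(inh|der|bor)\|([^{}]*)\}\}")

# Hypothetical allow-list mapping Wiktionary language codes to language names.
KNOWN_SOURCES = {
    "itc-pro": "Proto-Italic",
    "ine-pro": "Proto-Indo-European",
    "grc": "Ancient Greek",
}


class Etymon(NamedTuple):
    relation: str        # "inh", "der" or "bor"
    language: str        # human-readable source language
    form: str            # ancestor form as written in the template
    reconstructed: bool  # True when the form carries a leading asterisk


def extract_etymons(wikitext: str, target_lang: str = "la") -> List[Etymon]:
    """Return the ancestors named in etymology templates, in textual order."""
    etymons = []
    for match in TEMPLATE_RE.finditer(wikitext):
        relation, raw_args = match.groups()
        # Keep positional arguments only; named ones (e.g. t=father) contain "=".
        args = [a.strip() for a in raw_args.split("|") if "=" not in a]
        if len(args) < 3:
            continue  # rule: a template without an ancestor form is noise
        lang, source, form = args[0], args[1], args[2]
        if lang != target_lang:
            continue  # rule: the template must describe the target language
        if source not in KNOWN_SOURCES:
            continue  # rule: unknown source-language codes are dropped
        if form in ("", "-"):
            continue  # rule: "-" or an empty slot means no form is given
        etymons.append(
            Etymon(relation, KNOWN_SOURCES[source], form, form.startswith("*"))
        )
    return etymons


# Illustrative wikitext written for this example, not quoted from Wiktionary.
sample = (
    "From {{inh|la|itc-pro|*patēr}}, from {{der|la|ine-pro|*ph₂tḗr|t=father}}. "
    "Cognate with {{cog|grc|πατήρ}}."
)
for etymon in extract_etymons(sample):
    print(etymon)
```

In this sketch the cognate template is ignored because it does not describe an ancestor, and the ordered list that remains could serve as the input for building one etymological chain per lemma.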

De Torres, J., Passarotti, M. C., Mambrini, F., Pellegrini, M., Moretti, G., A Dataset of Latin Etymologies Extracted from Wiktionary, in Proceedings of the 11th Edition of the Swiss Text Analytics Conference, (Zurich (Switzerland), 10 June 2026), Association for Computational Linguistics, Zurich (Switzerland) 2026: 226-233 [https://hdl.handle.net/10807/338924]

A Dataset of Latin Etymologies Extracted from Wiktionary

Passarotti, Marco Carlo; Mambrini, Francesco; Pellegrini, Matteo; Moretti, Giovanni
2026

English
Proceedings of the 11th Edition of the Swiss Text Analytics Conference
Swiss Text Analytics Conference
Zurich (Switzerland)
10 June 2026
10 June 2026
Association for Computational Linguistics

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/338924