IRIS UniCatt

De Torres, J., Passarotti, M. C., Mambrini, F., Pellegrini, M., Moretti, G., A Dataset of Latin Etymologies Extracted from Wiktionary, in Proceedings of the 11th Edition of the Swiss Text Analytics Conference (Zurich, Switzerland, 10 June 2026), Association for Computational Linguistics, Zurich (Switzerland) 2026: 226-233 [https://hdl.handle.net/10807/338924]

A Dataset of Latin Etymologies Extracted from Wiktionary

de Torres, Javier; Passarotti, Marco Carlo; Mambrini, Francesco; Pellegrini, Matteo; Moretti, Giovanni

2026

Abstract

We present a curated resource of Latin etymologies automatically extracted from Wiktionary, enriched with links to the LiLa Knowledge Base of Latin and modelled as RDF triples using the LemonEty ontology. We also present the Python pipeline used to generate the data, which can be reused to extract Wiktionary’s etymologies for other languages. The etymology chains cover Latin words and their attested or reconstructed ancestors in languages such as Proto-Indo-European, Proto-Italic, Ancient Greek, Hebrew, and Egyptian. To address the structural noise and editorial heterogeneity of Wiktionary etymology data, we introduce strong rule-based filters throughout the pipeline, especially in the curation stage. After validation, the resulting dataset contains etymological chains for 9,684 lemmas, which can support research in Historical Linguistics, Computational Etymology, and language learning, among other applications.
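As a rough illustration of the kind of extraction and rule-based filtering the abstract describes, the sketch below reads Wiktionary's `inh`, `der`, and `bor` etymology templates from a wikitext string and drops entries with a missing or suppressed term. It is not the authors' pipeline: the function name `extract_etymons`, the sample string, and the two filter rules are assumptions made for this example only.

```python
import re

# Matches Wiktionary etymology templates such as {{inh|la|itc-pro|*patēr}}.
TEMPLATE_RE = re.compile(r"\{\{(inh|der|bor)\|([^{}]*)\}\}")


def extract_etymons(wikitext, lang="la"):
    """Return (relation, source_language, term) tuples found in `wikitext`.

    Hypothetical helper for illustration; the published pipeline may differ.
    """
    results = []
    for match in TEMPLATE_RE.finditer(wikitext):
        relation = match.group(1)
        # Keep positional arguments only; named ones (e.g. t=father) are skipped.
        args = [a.strip() for a in match.group(2).split("|") if "=" not in a]
        # Illustrative rule 1: the template must describe the expected language.
        if len(args) < 3 or args[0] != lang:
            continue
        source_lang, term = args[1], args[2]
        # Illustrative rule 2: drop empty terms and the "-" placeholder.
        if term in ("", "-"):
            continue
        results.append((relation, source_lang, term))
    return results


# Made-up sample text in the style of a Wiktionary etymology section.
sample = (
    "From {{inh|la|itc-pro|*patēr}}, from {{inh|la|ine-pro|*ph₂tḗr}}. "
    "{{der|la|grc|-}}"
)
chain = extract_etymons(sample)
print(chain)
```

Each surviving tuple names a relation type, a source language code, and an ancestor form, which is the minimum needed before linking entries to a knowledge base or serialising them as RDF.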

Publication year: 2026
Content language: English
Title of the proceedings volume: Proceedings of the 11th Edition of the Swiss Text Analytics Conference
Event name: Swiss Text Analytics Conference
Event location: Zurich (Switzerland)
Event start date: 10 June 2026
Event end date: 10 June 2026
Volume ISBN: N/A
Publisher: Association for Computational Linguistics
Alternative URL: https://aclanthology.org/2026.swisstext-1.21.pdf
Appears in types: Proceedings of conferences, congresses, study days, etc., workshops (in volume)

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/338924
