IRIS UniCatt

VEUCTOR: Training and selecting best vector space models from online job ads for European countries

Colombo, Emilio; D'Amico, Simone; Mercorio, Fabio; Mezzanzanica, Mario

2026

Abstract

Over the last decade, word embeddings have allowed machines to represent words and sentences as vectors, enabling researchers to reason about text in tasks such as semantic similarity, contextual understanding, and machine translation. However, the construction of embeddings involves domain-specific parameters that affect semantic accuracy and contextual relevance, often leading to unpredictable biases and inconsistent comparisons. This issue is particularly relevant in labor market analysis, where different embeddings yield different results, making the selection of the most appropriate model a key step. This paper addresses these challenges by (i) proposing a methodology to train, select, and align vector space models for a target taxonomy, ensuring comparability across dimensions and languages; (ii) applying this approach to 4.5 million job ads in 28 languages, aligning country-specific embeddings using the ESCO taxonomy; (iii) generating more than 3,000 models over 142 machine days and making the best-performing ones publicly available via VEUCTOR; and (iv) showing that model choice significantly affects labor market analysis, revealing substantial variation in occupational skill bundles across embeddings.
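
As a rough illustration of what aligning two embedding spaces on shared taxonomy concepts can look like, the sketch below uses orthogonal Procrustes alignment. This is a standard technique chosen here only for illustration; the abstract does not state which alignment method the paper uses. The function name `procrustes_align`, the toy data, and the assumption that both spaces have the same dimensionality and share a set of anchor concepts (for instance, terms mapped to the same taxonomy entry) are all hypothetical.

```python
import numpy as np


def procrustes_align(source, target):
    """Return the orthogonal matrix W minimizing ||source @ W - target||_F.

    Rows of `source` and `target` are vectors for the same anchor concepts,
    in the same order. Hypothetical helper, not the paper's procedure.
    """
    # The SVD of the cross-covariance matrix gives the optimal rotation.
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


# Toy data (assumption): 50 shared anchor concepts in an 8-dimensional space.
rng = np.random.default_rng(0)
target = rng.normal(size=(50, 8))

# Build the "source" space as a rotated copy of the target space.
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
source = target @ rotation.T

# Learn the mapping on the anchors and apply it to the source vectors.
w = procrustes_align(source, target)
aligned = source @ w
error = np.linalg.norm(aligned - target)
```

In this toy setting the source space is an exact rotation of the target, so the residual `error` is close to zero. With real embeddings trained on different corpora the residual would be nonzero, and its size could serve as one indicator of how comparable the two spaces are.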

Publication year: 2026
Content language: English
Journal name: INFORMATION SCIENCES
DOI: https://dx.doi.org/10.1016/j.ins.2026.123274
Citation: Colombo, E., D'Amico, S., Mercorio, F., Mezzanzanica, M., VEUCTOR: Training and selecting best vector space models from online job ads for European countries, <<INFORMATION SCIENCES>>, 2026; 741 (2): 1-25. [doi:10.1016/j.ins.2026.123274] [https://hdl.handle.net/10807/331617]
Appears in types: Journal article, Case note

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/331617
