IRIS UniCatt

Calvi, S., Collocations terminologiques et extraction automatique : une étude pilote dans le domaine du commerce électronique, <<ACADEMIC JOURNAL OF MODERN PHILOLOGY>>, 2021; (13): 75-82 [https://hdl.handle.net/10807/229040]

Collocations terminologiques et extraction automatique : une étude pilote dans le domaine du commerce électronique

Calvi, Silvia

2021

Abstract

This article addresses the automatic extraction of collocations, defined in Explanatory and Combinatorial Lexicology as phraseological units composed of two elements: the base and the collocate. It proposes a methodology for automatically extracting collocations from a terminological corpus. The method combines several measures: the syntactic dependencies between the items of the collocation, their frequency, their tendency to co-occur (pointwise mutual information, PMI), and their specificity to the e-commerce domain. After the theoretical framework is explained, the methodology is illustrated with a pilot study of French e-commerce terminology. In the pilot study, data were extracted from a corpus of e-commerce texts drawn from DIACOM-fr, a larger corpus being built at the University of Verona within the project Digital Humanities applied to foreign languages and literatures. Data extraction relied primarily on two tools: Stanza, a Python natural language analysis package developed by the Stanford NLP group, and TermoStat, an automatic term extractor developed at the Observatoire de Linguistique Sens-Texte of the University of Montreal.
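The abstract names PMI as the co-occurrence measure but does not give the computation. As a rough illustration only, and not the article's actual implementation, the sketch below scores (base, collocate) pairs with pointwise mutual information. It assumes the pairs have already been obtained from a dependency parse. The function name `pmi_scores`, the `min_freq` threshold, and the toy French pairs are all hypothetical. Estimating probabilities over the list of extracted pairs, rather than over the whole corpus, is a simplifying assumption.

```python
import math
from collections import Counter


def pmi_scores(pairs, min_freq=2):
    """Return {(base, collocate): PMI} for pairs seen at least min_freq times.

    PMI(b, c) = log2( P(b, c) / (P(b) * P(c)) ), with all probabilities
    estimated from the list of extracted pairs (an assumption of this sketch).
    """
    pair_counts = Counter(pairs)
    base_counts = Counter(base for base, _ in pairs)
    colloc_counts = Counter(colloc for _, colloc in pairs)
    total = len(pairs)

    scores = {}
    for (base, colloc), n in pair_counts.items():
        if n < min_freq:
            continue  # frequency filter: skip rare pairs, where PMI is unreliable
        p_pair = n / total
        p_base = base_counts[base] / total
        p_colloc = colloc_counts[colloc] / total
        scores[(base, colloc)] = math.log2(p_pair / (p_base * p_colloc))
    return scores


# Hypothetical toy data standing in for parser output (noun base, verb collocate).
pairs = [
    ("commande", "passer"), ("commande", "passer"), ("commande", "annuler"),
    ("panier", "valider"), ("panier", "valider"), ("paiement", "effectuer"),
    ("paiement", "valider"), ("commande", "valider"),
]

scores = pmi_scores(pairs)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

The frequency threshold is applied before scoring because PMI tends to overrate pairs that occur only once or twice. A real pipeline would also need the domain-specificity filter mentioned in the abstract, which this sketch leaves out.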

Publication year: 2021
Content language: French
Journal name: ACADEMIC JOURNAL OF MODERN PHILOLOGY
Appears in types: Journal article, Case note

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/229040
