IRIS UniCatt

Disorder named entity recognition (DNER) is a fundamental task of biomedical natural language processing, which has attracted plenty of attention. This task consists in extracting named entities of disorders such as diseases, symptoms, and pathological functions from unstructured text. The European Clinical Case Corpus (E3C) is a freely available multilingual corpus (English, French, Italian, Spanish, and Basque) of semantically annotated clinical case texts. The entities of type disorder in the clinical cases are annotated at both mention and concept level. At mention -level, the annotation identifies the entity text spans, for example, abdominal pain. At concept level, the entity text spans are associated with their concept identifiers in Unified Medical Language System, for example, C0000737. This corpus can be exploited as a benchmark for training and assessing information extraction systems. Within the context of the present work, multiple experiments have been conducted in order to test the appropriateness of the mention-level annotation of the E3C corpus for training DNER models. In these experiments, traditional machine learning models like conditional random fields and more recent multilingual pre-trained models based on deep learning were compared with standard baselines. With regard to the multilingual pre-trained models, they were fine-tuned (i) on each language of the corpus to test per-language performance, (ii) on all languages to test multilingual learning, and (iii) on all languages except the target language to test cross-lingual transfer learning. Results show the appropriateness of the E3C corpus for training a system capable of mining disorder entities from clinical case texts. Researchers can use these results as the baselines for this corpus to compare their own models. The implemented models have been made available through the European Language Grid platform for quick and easy access.

Zanoli, R., Lavelli, A., Verdi Do Amarante, D., Toti, D., Assessment of the E3C corpus for the recognition of disorders in clinical texts, <<NATURAL LANGUAGE ENGINEERING>>, 2024; (30): 851-869. [doi:10.1017/S1351324923000335] [https://hdl.handle.net/10807/249356]

Assessment of the E3C corpus for the recognition of disorders in clinical texts

Zanoli R.;Lavelli A.;Verdi Do Amarante D.;Toti, Daniele

2023

Abstract

Disorder named entity recognition (DNER) is a fundamental task of biomedical natural language processing, which has attracted plenty of attention. This task consists in extracting named entities of disorders such as diseases, symptoms, and pathological functions from unstructured text. The European Clinical Case Corpus (E3C) is a freely available multilingual corpus (English, French, Italian, Spanish, and Basque) of semantically annotated clinical case texts. The entities of type disorder in the clinical cases are annotated at both mention and concept level. At mention -level, the annotation identifies the entity text spans, for example, abdominal pain. At concept level, the entity text spans are associated with their concept identifiers in Unified Medical Language System, for example, C0000737. This corpus can be exploited as a benchmark for training and assessing information extraction systems. Within the context of the present work, multiple experiments have been conducted in order to test the appropriateness of the mention-level annotation of the E3C corpus for training DNER models. In these experiments, traditional machine learning models like conditional random fields and more recent multilingual pre-trained models based on deep learning were compared with standard baselines. With regard to the multilingual pre-trained models, they were fine-tuned (i) on each language of the corpus to test per-language performance, (ii) on all languages to test multilingual learning, and (iii) on all languages except the target language to test cross-lingual transfer learning. Results show the appropriateness of the E3C corpus for training a system capable of mining disorder entities from clinical case texts. Researchers can use these results as the baselines for this corpus to compare their own models. The implemented models have been made available through the European Language Grid platform for quick and easy access.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno pubblicazione
	
				2023
			
	Lingua del contenuto
	
				Inglese
			
	Nome del periodico
	
				NATURAL LANGUAGE ENGINEERING
			
	DOI del contributo
	
				https://dx.doi.org/10.1017/S1351324923000335
			
	Citazione
	
				Zanoli, R., Lavelli, A., Verdi Do Amarante, D., Toti, D., Assessment of the E3C corpus for the recognition of disorders in clinical texts, <<NATURAL LANGUAGE ENGINEERING>>, 2024;  (30): 851-869. [doi:10.1017/S1351324923000335] [https://hdl.handle.net/10807/249356]
			
	Appare nelle tipologie:
	
				Articolo in rivista, Nota a sentenza

File in questo prodotto:

File	Dimensione	Formato
2023_E3C_NLE.pdf accesso aperto Tipologia file ?: Versione Editoriale (PDF) Licenza: Creative commons Dimensione 1.02 MB Formato Adobe PDF Visualizza/Apri	1.02 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10807/249356

Citazioni

ND

4

2

social impact