Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework

Toti, Daniele
2012

Abstract

This paper describes a methodology for discovering and resolving protein name abbreviations in the full text of scientific articles. The methodology is implemented in the PRAISED framework, whose ultimate purpose is to build a publicly available abbreviation repository. Three processing steps lie at the core of the framework: i) an abbreviation identification phase, carried out via domain-independent metrics, which identifies all possible abbreviations within a scientific text; ii) an abbreviation resolution phase, which applies a number of syntactic and semantic criteria to match each abbreviation with its potential explanation; and iii) a dictionary-based protein name identification phase, which selects only those abbreviations belonging to the protein science domain. A local copy of the UniProt database serves as the source repository of known proteins. © 2012, Jagiellonian University, Medical College, Kraków, Poland. All rights reserved.
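
The three phases named in the abstract can be illustrated with a minimal Python sketch. It is not the PRAISED implementation: the uppercase-ratio test, the initial-letter matching rule, the tiny `KNOWN_PROTEINS` set (standing in for a local UniProt copy), and every function name are assumptions made only for illustration.

```python
import re

# Hypothetical stand-in for a dictionary built from a local UniProt copy.
KNOWN_PROTEINS = {"tumor necrosis factor", "epidermal growth factor receptor"}


def find_candidates(text):
    """Phase i (assumed heuristic): short parenthesised tokens that are
    at least half uppercase are treated as candidate abbreviations."""
    candidates = []
    for match in re.finditer(r"\(([A-Za-z0-9-]{2,10})\)", text):
        token = match.group(1)
        upper = sum(ch.isupper() for ch in token)
        if upper / len(token) >= 0.5:
            candidates.append((token, match.start()))
    return candidates


def resolve(token, text, position):
    """Phase ii (assumed rule): accept the words just before the parenthesis
    if their initials spell the token, one word per character."""
    words = re.findall(r"[A-Za-z0-9-]+", text[:position])
    letters = [ch.lower() for ch in token if ch.isalnum()]
    window = words[-len(letters):]
    if len(window) == len(letters) and all(
        word[0].lower() == letter for word, letter in zip(window, letters)
    ):
        return " ".join(window)
    return None


def is_protein(expansion):
    """Phase iii: dictionary lookup against the known protein names."""
    return expansion.lower() in KNOWN_PROTEINS


def extract_protein_abbreviations(text):
    """Run the three phases and keep only protein-related abbreviations."""
    result = {}
    for token, position in find_candidates(text):
        expansion = resolve(token, text, position)
        if expansion is not None and is_protein(expansion):
            result[token] = expansion
    return result


sample = (
    "Tumor necrosis factor (TNF) binds its receptor. "
    "The polymerase chain reaction (PCR) was used. "
    "Epidermal growth factor receptor (EGFR) was measured."
)
print(extract_protein_abbreviations(sample))
```

In this sketch, PCR is identified and resolved but then discarded by the dictionary step, which mirrors the role the abstract assigns to the protein name identification phase.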
English
Toti, D., Atzeni, P., Polticelli, F., Automatic Protein Abbreviations Discovery and Resolution from Full-Text Scientific Papers: The PRAISED Framework, «BIO-ALGORITHMS AND MED-SYSTEMS», 2012; 8 (1): 13-51. [doi:10.2478/bams-2012-0002] [http://hdl.handle.net/10807/163322]

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/163322