Automatic discovery and resolution of protein abbreviations from full-text scientific papers: A light-weight approach towards data extraction from unstructured biological sources

Toti, Daniele
2011

Abstract

We propose a methodology for discovering and resolving a wide range of protein name abbreviations from the full-text versions of scientific articles, as implemented in our PRAISED framework. Three processing steps lie at the core of our approach: an abbreviation identification phase, carried out via largely domain-independent metrics based on lexical clues and exclusion rules, whose purpose is to identify all possible abbreviations within a scientific text; an abbreviation resolution phase, which takes into account a number of syntactical and semantic criteria and corresponding optimization techniques, in order to match an abbreviation with its potential explanation; and a dictionary-based protein name identification, which is meant to eventually sort out those abbreviations actually belonging to the biological domain. We have tested our implementation against the well-known Medstract Gold Standard Corpus and a relevant subset of real scientific papers extracted from the PubMed database, obtaining significant results in terms of recall, precision and overall correctness. In comparison to other methods, our approach retains its effectiveness without compromising performance, while addressing the complexity of full-text papers instead of the simpler abstracts more generally used. At the same time, computational overhead is kept to a minimum and its light-weight approach further enhances customization and scalability.
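
The three phases described in the abstract can be illustrated with a minimal sketch. This is not the PRAISED implementation: every function name, the exclusion list, the tiny dictionary and the word window are assumptions made for illustration, and a simple subsequence check stands in for the syntactical and semantic criteria the abstract mentions.

```python
import re

# Hypothetical stand-in for a protein name dictionary (assumption, not from the paper).
PROTEIN_DICTIONARY = {"tumor necrosis factor", "epidermal growth factor receptor"}

# Illustrative exclusion rule: parenthesised tokens that are not abbreviations.
EXCLUDED = {"I", "II", "III", "IV", "V"}

# Lexical clue: a short token without spaces, enclosed in parentheses.
PAREN = re.compile(r"\(([^()\s]{2,10})\)")


def identify_abbreviations(text):
    """Phase 1: return (token, offset) pairs for tokens that look like abbreviations."""
    found = []
    for match in PAREN.finditer(text):
        token = match.group(1)
        if any(c.isupper() for c in token) and token not in EXCLUDED:
            found.append((token, match.start()))
    return found


def is_subsequence(short, long):
    """True if the characters of `short` appear in `long` in the same order."""
    remaining = iter(long)
    return all(ch in remaining for ch in short)


def resolve_abbreviation(abbr, text, position, max_words=8):
    """Phase 2: find the shortest run of preceding words that explains `abbr`."""
    words = text[:position].split()[-max_words:]
    letters = abbr.lower()
    for k in range(1, len(words) + 1):
        candidate = " ".join(words[-k:]).lower()
        # The candidate must start with the abbreviation's first character
        # and contain all of its characters in order.
        if candidate[0] == letters[0] and is_subsequence(letters, candidate):
            return candidate
    return None


def extract_protein_abbreviations(text):
    """Phase 3: keep only resolved abbreviations whose long form is in the dictionary."""
    results = {}
    for abbr, position in identify_abbreviations(text):
        long_form = resolve_abbreviation(abbr, text, position)
        if long_form is not None and long_form in PROTEIN_DICTIONARY:
            results[abbr] = long_form
    return results


sample = (
    "Signalling by the tumor necrosis factor (TNF) was compared with "
    "magnetic resonance imaging (MRI) data from group (II)."
)
print(extract_protein_abbreviations(sample))
```

In this sketch, the first phase drops `II` through the exclusion list, the second phase resolves both `TNF` and `MRI`, and the dictionary lookup keeps only `TNF`, since its long form is the only one listed as a protein name.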
2011
English
SEBD 2011 - Proceedings of the 19th Italian Symposium on Advanced Database Systems
19th Italian Symposium on Advanced Database Systems, SEBD 2011
Maratea, Italy
Paper
26 June 2011
29 June 2011
Università della Basilicata
Atzeni, P., Polticelli, F., Toti, D., Automatic discovery and resolution of protein abbreviations from full-text scientific papers: A light-weight approach towards data extraction from unstructured biological sources, Paper, in SEBD 2011 - Proceedings of the 19th Italian Symposium on Advanced Database Systems, (Maratea, Italy, 26-29 June 2011), Università della Basilicata, Maratea 2011: 317-324 [http://hdl.handle.net/10807/165879]

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/165879