An automatic identification and resolution system for protein-related abbreviations in scientific papers

Toti, Daniele
2011

Abstract

We propose a methodology to identify and resolve protein-related abbreviations found in the full texts of scientific papers, as part of a semi-automatic process implemented in our PRAISED framework. The identification of biological acronyms is carried out via an effective syntactical approach, by taking advantage of lexical clues and using mostly domain-independent metrics, resulting in considerably high levels of recall as well as extremely low execution time. The subsequent abbreviation resolution uses both syntactical and semantic criteria in order to match an abbreviation with its potential explanation, as discovered among a number of contiguous words proportional to the abbreviation's length. We have tested our system against the Medstract Gold Standard corpus and a relevant set of manually annotated PubMed papers, obtaining significant results and high performance levels, while at the same time allowing for great customization, lightness and scalability. © 2011 Springer-Verlag.
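
The abstract does not spell out the matching procedure, so the following is only a minimal sketch of the general idea: spot candidate abbreviations through a lexical clue, then look for an explanation within a window of preceding words whose size is proportional to the abbreviation's length. The parenthesis clue, the `window_factor` parameter, the in-order letter matching rule and all function names are assumptions made for illustration (loosely in the spirit of the well-known Schwartz and Hearst heuristic); they are not the PRAISED implementation, and the semantic criteria mentioned in the abstract are not modelled here.

```python
import re

# Assumed lexical clue: a short token enclosed in parentheses.
CANDIDATE = re.compile(r"\(([A-Za-z0-9-]{2,10})\)")


def is_subsequence(chars, s):
    """True if the characters in `chars` occur in `s` in the same order."""
    it = iter(s)
    return all(c in it for c in chars)


def find_candidates(text):
    """Yield (abbreviation, offset of its opening parenthesis)."""
    for m in CANDIDATE.finditer(text):
        token = m.group(1)
        # Require at least one uppercase letter to skip words like "(see)".
        if any(c.isupper() for c in token):
            yield token, m.start()


def resolve(abbrev, text_before, window_factor=2):
    """Return the shortest run of words ending just before the abbreviation
    that starts with its first character and contains all of its characters
    in order, or None if no such run exists in the window."""
    words = text_before.split()
    # Window size is proportional to the abbreviation's length.
    window = words[-window_factor * len(abbrev):]
    letters = [c.lower() for c in abbrev if c.isalnum()]
    # Grow the span backwards from the word closest to the abbreviation.
    for start in range(len(window) - 1, -1, -1):
        span = window[start:]
        if not span[0].lower().startswith(letters[0]):
            continue
        if is_subsequence(letters, " ".join(span).lower()):
            return " ".join(span)
    return None


def extract_pairs(text):
    """Return a list of (abbreviation, explanation or None) pairs."""
    return [(abbr, resolve(abbr, text[:pos]))
            for abbr, pos in find_candidates(text)]
```

For instance, calling `extract_pairs` on a sentence containing "heat shock protein (HSP)" pairs `HSP` with the three words that precede it. A purely syntactic rule like this one will miss explanations whose letters are reordered or whose first word does not share the abbreviation's initial, which is the kind of case where additional semantic criteria would be needed.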
Year: 2011
Language: English
Series: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Conference: 9th European Conference on Evolutionary Computation, Machine Learning, and Data Mining in Bioinformatics, EvoBIO 2011
Conference location: Torino, Italy
Document type: Paper
Conference start date: 27 April 2011
Conference end date: 29 April 2011
ISBN: 978-3-642-20388-6
Publisher: Springer Verlag
Atzeni, P., Polticelli, F., Toti, D., An automatic identification and resolution system for protein-related abbreviations in scientific papers, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6623 (Torino, Italy, 27-29 April 2011), Springer Verlag, 2011, pp. 171-176. doi:10.1007/978-3-642-20389-3_18 [http://hdl.handle.net/10807/163936]

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/163936