We report and comment on the experimental results of the PRAISED system, which implements an automatic method for discovering and resolving a wide range of protein name abbreviations in the full-text versions of scientific articles. The system was recently proposed as part of a framework for creating and maintaining a publicly accessible abbreviation repository. Testing was carried out against the widely used Medstract Gold Standard Corpus and a relevant subset of real scientific papers extracted from the PubMed database. On the Medstract corpus, the system obtained high recall, precision and overall correctness. On the full-text papers, results inevitably varied because of the complex and often chaotic nature of the domain; even so, we observed encouraging levels of recall and extremely fast execution times. The major strength of the system lies in its ability to cope with the unstructured nature of scientific publications and to save time and effort by extracting protein-related information automatically, while keeping computational overhead to a minimum thanks to its lightweight approach. Copyright © 2011 ACM.
Atzeni, P., Polticelli, F., Toti, D., Experimentation of an automatic resolution method for protein abbreviations in full-text papers, Paper, in 2011 ACM Conference on Bioinformatics, Computational Biology and Biomedicine, BCB 2011, (Chicago, IL, USA, 01-03 August 2011), ACM Press, N/A 2011: 465-467. 10.1145/2147805.2147871 [http://hdl.handle.net/10807/163937]
Experimentation of an automatic resolution method for protein abbreviations in full-text papers
Toti, Daniele
2011