Passarotti, M. C., Entry "Linguistic Corpora, History of", in International Encyclopedia of Language and Linguistics, Elsevier, London (UK) 2026:1 692-696. https://dx.doi.org/10.1016/B978-0-323-95504-1.00333-1 [https://hdl.handle.net/10807/339730]
Linguistic Corpora, History of
Passarotti, Marco Carlo
2026
Abstract
This article provides an overview of the history of linguistic corpora, i.e., collections of written or spoken texts used for linguistic research. Starting from the earliest concordances and lexicographical efforts of the pre-electronic era, the article details the birth of modern linguistic corpora in the mid-20th century and the subsequent creation of large-scale corpora made possible by technological advances. It also describes the exploitation of annotated corpora by supervised machine learning algorithms for natural language processing, as well as the use of huge raw corpora for building Large Language Models in an unsupervised fashion.



