IRIS UniCatt

Towards a Historical Text Re-use Detection

Philip R. Burns; Martin Müller; Emily Franzini; Greta Franzini

2014

Abstract

Text re-use describes the spoken and written repetition of information. Historical text re-use, with its longer time span, embraces a larger set of morphological, linguistic, syntactic, semantic and copying variations, which complicates text re-use detection. Furthermore, it increases the chances of redundancy in a Digital Library. In Natural Language Processing it is crucial to remove these redundancies before applying any kind of machine learning technique to the text. In the Humanities, these redundancies foreground textual criticism and allow scholars to identify lines of transmission. This chapter investigates two aspects of the historical text re-use detection process, based on seven English editions of the Holy Bible. First, we measure the performance of several techniques. For this purpose, when a verse (such as Genesis, Chapter 1, Verse 1) is present in two editions, one verse is always understood as a paraphrase of the other. It is worth noting that paraphrasing is considered a hyponym of text re-use. Depending on the intention with which the new version was created, verses tend to differ significantly in wording, but not in meaning. Second, this chapter explains and evaluates a way of extracting paradigmatic relations. For historical languages, however, the lack of language resources (for example, WordNet) makes non-literal text re-use and paraphrases much more difficult to identify. The differences between editions take the form of replacements, corrections, varying writing styles, and so on. For this reason, we introduce both the aforementioned and other correlated steps as a method to identify text re-use, including language acquisition to detect changes that we call paradigmatic relations. The chapter concludes with the recommendation to move from a "single run" detection to an iterative process by using the acquired relations to run a new task.
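The idea of comparing the same verse across two editions and collecting word-level replacements can be illustrated with a minimal Python sketch. This is not the method evaluated in the chapter: the tokenizer, the Jaccard overlap score, the use of `difflib.SequenceMatcher`, and the function names `tokenize`, `jaccard` and `replacement_candidates` are all assumptions made for illustration, and the two input strings are toy examples rather than quotations from the seven editions studied.

```python
from difflib import SequenceMatcher

PUNCT = ".,;:!?()\"'"

def tokenize(verse):
    # Naive tokenizer (assumption): split on whitespace, strip punctuation, lowercase.
    tokens = (w.strip(PUNCT).lower() for w in verse.split())
    return [t for t in tokens if t]

def jaccard(a, b):
    # Share of distinct tokens that the two verses have in common.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0

def replacement_candidates(a, b):
    # Collect one-for-one token substitutions between two aligned verses.
    # Such pairs are only candidates for paradigmatic relations.
    pairs = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(a[i1:i2], b[j1:j2]))
    return pairs

# Toy strings written for illustration only (hypothetical data).
old = tokenize("Thou shalt love, thy neighbour.")
new = tokenize("You shall love your neighbor.")

print(jaccard(old, new))
print(replacement_candidates(old, new))
```

In this sketch a low overlap score alongside many one-for-one substitutions is the kind of case the abstract describes, where wording differs but meaning does not. Pairs collected this way could then feed a second detection pass, in the spirit of the iterative process the abstract recommends.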

Publication year: 2014
Content language: English
Volume title: Text Mining, Theory and Applications of Natural Language Processing
Volume ISBN: 978-3-319-12655-5
Publisher: Springer International Publishing
Contribution DOI: https://dx.doi.org/10.1007/978-3-319-12655-5_11
Citation: Büchler, M., Burns, P. R., Müller, M., Franzini, E., Franzini, G., Towards a Historical Text Re-use Detection, in Chris Biemann, A. M. (ed.), Text Mining, Theory and Applications of Natural Language Processing, Springer International Publishing, Cham 2014: 221-238. 10.1007/978-3-319-12655-5_11 [http://hdl.handle.net/10807/127337]
Appears in types: Contribution to an edited book: chapter or essay; preface/afterword; short introduction; catalogue, repertory or corpus entries

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/127337
