Mitigating exposure bias in large language model distillation: an imitation learning approach

Pozzi, Andrea; Incremona, Alessandro; Tessera, Daniele; Toti, Daniele
2025

Abstract

 Knowledge distillation is recognized as a valuable model compression strategy that alleviates the computational burden of large language models while preserving performance. This strategy involves training a smaller model utilizing both real data and predictions from a more cumbersome model. Traditional distillation methods, however, are often compromised by exposure bias, which results from reliance on next-step prediction training loss. This bias emerges when models are tested in free-running mode, differing from their training regime and leading to a progressive drift in input distributions between testing and training phases. An analogous issue, known as ‘distributional shift’, has been effectively addressed in imitation learning through various methodologies. Therefore, this paper specifically tailors an imitation learning-based solution to a traditional knowledge distillation framework which inherently considers both real data and the teacher’s predictions as dual sources of expert demonstrations. The effectiveness of this approach is demonstrated over five different test datasets, where it outperforms traditional benchmarks across all evaluation metrics. Specifically, it achieves superior results in perplexity, multi-token generation, and G-Eval score, indicating improvements in both predictive accuracy and alignment with human judgment in text quality. These results underscore the potential of this approach to effectively address exposure bias in large language model distillation.
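The abstract only summarizes the method, so as a hedged illustration of the general idea it references (DAgger-style imitation learning applied to distillation: let the student roll out its own prefixes in free-running mode, then label the states it actually visits with the teacher's predictions, so training covers the input distribution seen at test time), here is a minimal toy sketch. All names (`teacher_policy`, `student_policy`, `dagger_distillation`) and the lookup-table "models" are hypothetical stand-ins for illustration only, not the paper's implementation.

```python
# Toy stand-in for the large teacher LM's next-token prediction:
# a fixed rule (repeat the last token, or emit "a" on an empty prefix).
def teacher_policy(prefix):
    return prefix[-1] if prefix else "a"

# Toy student "model": a lookup table from prefix (tuple) to next token.
student = {}

def student_policy(prefix):
    return student.get(tuple(prefix), "a")

def dagger_distillation(expert_sequences, rounds=3, seq_len=4):
    """DAgger-style aggregation: seed with expert (ground-truth) data,
    then repeatedly roll out the student, label every prefix it visits
    with the teacher's prediction, and retrain on the aggregate."""
    dataset = []
    # Expert demonstrations: (prefix, next-token) pairs from real data.
    for seq in expert_sequences:
        for t in range(len(seq)):
            dataset.append((tuple(seq[:t]), seq[t]))
    for _ in range(rounds):
        # Free-running rollout: the student conditions on its OWN
        # outputs (the regime where exposure bias arises).
        prefix = []
        for _ in range(seq_len):
            # Teacher labels the student-visited state.
            dataset.append((tuple(prefix), teacher_policy(prefix)))
            prefix.append(student_policy(prefix))
        # "Retrain": for this toy student, memorize the labels
        # (later entries overwrite earlier ones).
        for state, label in dataset:
            student[state] = label
    return dataset

data = dagger_distillation([["a", "a", "b", "b"]])
```

In a real setting the retraining step would minimize a distillation loss (e.g. KL divergence against the teacher's token distribution) over the aggregated prefixes rather than memorize labels; the point of the sketch is only the data-aggregation loop over student-visited states.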
Language: English
Pozzi, A., Incremona, A., Tessera, D., Toti, D., Mitigating exposure bias in large language model distillation: an imitation learning approach, <<NEURAL COMPUTING & APPLICATIONS>>, 2025; (N/A): N/A-N/A. [doi:10.1007/s00521-025-11162-0] [https://hdl.handle.net/10807/312759]
Files in this record:

s00521-025-11162-0.pdf
  Open access
  File type: Published version (PDF)
  License: Creative Commons
  Size: 652.03 kB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/312759
Citations
  • PMC: n/a
  • Scopus: 0
  • Web of Science: n/a