IRIS UniCatt

Di Pumpo, M., Riccardi, M. T., De Vita, V., Damiani, G., Evaluation of a large language model (ChatGPT) versus human researchers in assessing risk-of-bias and community engagement levels: a systematic review use-case analysis, <<EUROPEAN JOURNAL OF PUBLIC HEALTH>>, 2025; (n/a): N/A-N/A. [doi:10.1093/eurpub/ckaf072] [https://hdl.handle.net/10807/325581]

Evaluation of a large language model (ChatGPT) versus human researchers in assessing risk-of-bias and community engagement levels: a systematic review use-case analysis

Di Pumpo, Marcello; Riccardi, Maria Teresa; De Vita, Vittorio; Damiani, Gianfranco

2025

Abstract

Large language models (LLMs) like OpenAI's ChatGPT (generative pretrained transformers) offer great benefits to systematic review production and quality assessment. A careful assessment and comparison with standard practice is highly needed. Two custom GPT models were developed to compare an LLM's performance in "Risk-of-bias (ROB)" assessment and "Levels of engagement reached (LOER)" classification with human judgments. Inter-rater agreement was calculated. For overall judgments, ROB GPT assigned slightly more "low risk" (27.8% vs 22.2%) and "some concerns" (58.3% vs 52.8%) ratings than the research team, whose share of "high risk" judgments was about double (25.0% vs 13.9%). For total judgments, the research team assigned slightly more "low risk" (59.7% vs 55.1%) and almost double the "high risk" ratings (11.1% vs 5.6%) compared with ROB GPT, which assigned more "some concerns" ratings (39.4% vs 29.2%) (P = .366). In the LOER analysis, the GPT classified 91.7% of studies at the "Collaborate" level vs 25.0% by researchers, 5.6% vs 61.1% as "Shared leadership", and 2.8% vs 13.9% as "Involve"; it classified no studies in the first two engagement levels, vs 8.3% and 13.9%, respectively, by researchers (P = .169). A mixed-effect ordinal logistic regression showed an odds ratio (OR) = 0.97 [95% confidence interval (CI) 0.647-1.446, P = .874] for ROB and an OR = 1.00 (95% CI 0.397-2.543, P = .992) for LOER compared to researchers. Partial agreement on some judgments was observed. Further evaluation of these promising tools is needed to enable their effective yet reliable introduction in scientific practice.
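The abstract reports that inter-rater agreement was calculated but does not name the statistic. As an illustration only, the sketch below computes unweighted Cohen's kappa, one common choice for two raters; whether the authors used it is an assumption. The function name `cohens_kappa` and the six ratings are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical risk-of-bias ratings for six studies (NOT data from the paper).
gpt_ratings = ["low", "some concerns", "some concerns", "high", "low", "some concerns"]
human_ratings = ["low", "some concerns", "high", "high", "some concerns", "some concerns"]

kappa = cohens_kappa(gpt_ratings, human_ratings)
print(f"Cohen's kappa: {kappa:.3f}")
```

Because risk-of-bias and engagement levels are ordered categories, a weighted kappa that penalizes near-misses less than distant disagreements may be a better fit; the unweighted form is shown here only because it is the simplest to follow.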

Publication year: 2025

Content language: English

Journal name: EUROPEAN JOURNAL OF PUBLIC HEALTH

DOI: https://dx.doi.org/10.1093/eurpub/ckaf072

Citation: Di Pumpo, M., Riccardi, M. T., De Vita, V., Damiani, G., Evaluation of a large language model (ChatGPT) versus human researchers in assessing risk-of-bias and community engagement levels: a systematic review use-case analysis, <<EUROPEAN JOURNAL OF PUBLIC HEALTH>>, 2025; (n/a): N/A-N/A. [doi:10.1093/eurpub/ckaf072] [https://hdl.handle.net/10807/325581]

Appears in types: Journal article, Case note

Files in this item:

File	Size	Format
Evaluation of a large language model-2025.pdf (open access; license: Creative Commons)	677.81 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/325581
