IRIS UniCatt

Model-Based Clustering of Categorical Data Based on the Hamming Distance

Argiento, R.; Filippi-Mazzola, E.; Paci, Lucia

2024

Abstract

A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference is derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, which eases computation compared with customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performance is assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
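To make the central idea of the abstract concrete, the sketch below shows one way a Hamming distance can induce a probability mass function on categorical vectors: mass decays exponentially with the number of attributes that differ from a central vector. This is an illustrative assumption, not the article's own definition or code. The function names, the single shared scale `sigma`, and the per-attribute level counts `n_levels` are all hypothetical choices made for the example.

```python
import math
from itertools import product


def hamming_distance(x, c):
    """Count the positions at which two equal-length categorical vectors differ."""
    if len(x) != len(c):
        raise ValueError("vectors must have the same length")
    return sum(xi != ci for xi, ci in zip(x, c))


def hamming_pmf(x, center, sigma, n_levels):
    """Illustrative pmf proportional to exp(-hamming_distance(x, center) / sigma).

    Assumptions (not taken from the article): one scale sigma > 0 shared by
    all attributes, and n_levels[j] categories for attribute j. Because the
    Hamming distance is a sum over attributes, the pmf factorizes and each
    attribute has the normalizing constant 1 + (m_j - 1) * exp(-1 / sigma).
    """
    weight = math.exp(-1.0 / sigma)  # relative mass of a mismatching level
    prob = 1.0
    for xj, cj, mj in zip(x, center, n_levels):
        norm = 1.0 + (mj - 1) * weight
        prob *= (1.0 if xj == cj else weight) / norm
    return prob


if __name__ == "__main__":
    n_levels = (3, 2, 4)   # hypothetical number of categories per attribute
    center = (0, 1, 2)     # hypothetical central (modal) vector
    sigma = 0.5            # hypothetical scale parameter

    # The probabilities over the whole sample space should sum to one.
    space = product(*(range(m) for m in n_levels))
    total = sum(hamming_pmf(x, center, sigma, n_levels) for x in space)
    print(f"total mass: {total:.6f}")
    print("mass at center:", hamming_pmf(center, center, sigma, n_levels))
```

Under these assumptions, a small `sigma` concentrates the mass on the central vector, while a large `sigma` pushes the distribution toward uniform over the sample space. A mixture of such kernels, each with its own center, is the kind of structure the abstract describes.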

Publication year: 2024

Content language: English

Journal name: JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION

DOI: https://dx.doi.org/10.1080/01621459.2024.2402568

Citation: Argiento, R., Filippi-Mazzola, E., Paci, L., Model-Based Clustering of Categorical Data Based on the Hamming Distance, <<JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION>>, 2025; 120 (550): 1178-1188. [doi:10.1080/01621459.2024.2402568] [https://hdl.handle.net/10807/301459]

Appears in types: Journal article, Case note

Files in this item:

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/10807/301459
