A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference has been derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, facilitating the computation with respect to customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performances are assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.

Argiento, R., Filippi-Mazzola, E., Paci, L., Model-Based Clustering of Categorical Data Based on the Hamming Distance, <<JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION>>, N/A; (N/A): 1-23. [doi:10.1080/01621459.2024.2402568] [https://hdl.handle.net/10807/301459]

Model-Based Clustering of Categorical Data Based on the Hamming Distance

Paci, Lucia
2024

Abstract

A model-based approach is developed for clustering categorical data with no natural ordering. The proposed method exploits the Hamming distance to define a family of probability mass functions to model the data. The elements of this family are then considered as kernels of a finite mixture model with an unknown number of components. Conjugate Bayesian inference has been derived for the parameters of the Hamming distribution model. The mixture is framed in a Bayesian nonparametric setting, and a transdimensional blocked Gibbs sampler is developed to provide full Bayesian inference on the number of clusters, their structure, and the group-specific parameters, facilitating the computation with respect to customary reversible jump algorithms. The proposed model encompasses a parsimonious latent class model as a special case when the number of components is fixed. Model performances are assessed via a simulation study and reference datasets, showing improvements in clustering recovery over existing approaches. Supplementary materials for this article are available online, including a standardized description of the materials available for reproducing the work.
2024
Inglese
Argiento, R., Filippi-Mazzola, E., Paci, L., Model-Based Clustering of Categorical Data Based on the Hamming Distance, <<JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION>>, N/A; (N/A): 1-23. [doi:10.1080/01621459.2024.2402568] [https://hdl.handle.net/10807/301459]
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/10807/301459
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 0
  • ???jsp.display-item.citation.isi??? 0
social impact