A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses.

Publication, Journal Article
Liu, P; Yuan, H; Ning, Y; Chakraborty, B; Liu, N; Peres, MA
Published in: BMC Med Res Methodol
December 18, 2024

BACKGROUND: Traditional clustering techniques are typically restricted to either continuous or categorical variables, yet most real-world clinical data are of mixed type. This study introduces a clustering technique designed specifically for datasets containing both continuous and categorical variables, with the aim of offering better clustering compatibility, adaptability, and interpretability than other mixed type techniques.

METHODS: This paper proposed a modified Gower distance that incorporates feature importance as weights to maintain equal contributions between continuous and categorical features. The algorithm (DAFI) was evaluated using five simulated datasets with varying proportions of important features and real-world datasets from the 2011-2014 National Health and Nutrition Examination Survey (NHANES). Its effectiveness was assessed through comparisons with 13 clustering techniques. Clustering performance was measured using the adjusted Rand index (ARI) for accuracy in the simulation studies and the silhouette score for cohesion and separation in NHANES. Additionally, multivariable logistic regression was used to estimate the association between periodontitis (PD) and cardiovascular diseases (CVDs), adjusting for clusters in NHANES.

RESULTS: In the simulation studies, the DAFI-Gower algorithm consistently performed better than the baseline methods according to the adjusted Rand index in the settings investigated, especially on datasets with more redundant features. In NHANES, 3,760 people were analyzed. DAFI-Gower achieved the highest silhouette score (0.79). Four distinct clusters with diverse health profiles were identified. By incorporating feature importance, we found that cluster formations were more strongly influenced by CVD-related factors. The association between periodontitis and cardiovascular diseases, after adjusting for clusters, was statistically significant (adjusted OR 1.95, 95% CI 1.50 to 2.55, p = 0.012), highlighting severe periodontitis as a potential risk factor for cardiovascular diseases.

CONCLUSIONS: DAFI performed better than classic clustering baselines on both simulated and real-world datasets. It effectively captures cluster characteristics by considering feature importance, which is crucial in clinical settings where many variables may be similar or irrelevant. We envisage that DAFI offers an effective solution for mixed type clustering.
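To make the idea of an importance-weighted Gower distance concrete, the following is a minimal sketch in Python. It implements the generic weighted Gower distance (range-scaled absolute difference for continuous features, simple mismatch for categorical features, combined as a weighted average). It is not the authors' DAFI implementation, whose exact modification is not given in the abstract. The function names, the toy records, and the importance weights are all hypothetical and chosen only for illustration.

```python
def feature_ranges(rows, numeric_idx):
    """Range (max - min) of each numeric column, used to scale differences into [0, 1]."""
    ranges = {}
    for k in numeric_idx:
        col = [float(r[k]) for r in rows]
        ranges[k] = max(col) - min(col)
    return ranges


def weighted_gower(a, b, numeric_idx, ranges, weights):
    """Generic weighted Gower distance between two mixed-type records.

    Numeric features contribute |a_k - b_k| / range_k; categorical features
    contribute 0 on a match and 1 on a mismatch. Per-feature distances are
    averaged using the supplied weights.
    """
    total, weight_sum = 0.0, 0.0
    for k, w in enumerate(weights):
        if k in numeric_idx:
            span = ranges[k]
            d = abs(float(a[k]) - float(b[k])) / span if span > 0 else 0.0
        else:
            d = 0.0 if a[k] == b[k] else 1.0
        total += w * d
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0


# Toy records (age, systolic BP, smoker, sex): made-up values, not NHANES data.
rows = [
    (40.0, 120.0, "no", "F"),
    (60.0, 140.0, "yes", "F"),
    (50.0, 130.0, "no", "M"),
]
numeric_idx = {0, 1}
ranges = feature_ranges(rows, numeric_idx)

# Equal weights reproduce the classic (unweighted) Gower distance.
equal_w = [1.0, 1.0, 1.0, 1.0]
# Hypothetical importance weights; in practice these would come from a
# feature-importance procedure, which this sketch does not attempt to reproduce.
importance_w = [0.1, 0.5, 0.3, 0.1]

d_equal = weighted_gower(rows[0], rows[1], numeric_idx, ranges, equal_w)
d_weighted = weighted_gower(rows[0], rows[1], numeric_idx, ranges, importance_w)
print(d_equal, d_weighted)
```

On these toy records the first two rows differ maximally on age, blood pressure, and smoking status but match on sex, so the equal-weight distance is 0.75 and the hypothetically weighted distance is about 0.9. Down-weighting a feature on which the records agree makes the pair look further apart, which illustrates how feature weights can change which variables drive cluster formation.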

Duke Scholars

Published In

BMC Med Res Methodol

DOI

10.1186/s12874-024-02427-8

EISSN

1471-2288

Publication Date

December 18, 2024

Volume

24

Issue

1

Start / End Page

305

Location

England

Related Subject Headings

  • Periodontitis
  • Nutrition Surveys
  • Middle Aged
  • Male
  • Logistic Models
  • Humans
  • General & Internal Medicine
  • Female
  • Empirical Research
  • Computer Simulation
 

Citation

APA: Liu, P., Yuan, H., Ning, Y., Chakraborty, B., Liu, N., & Peres, M. A. (2024). A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses. BMC Med Res Methodol, 24(1), 305. https://doi.org/10.1186/s12874-024-02427-8
Chicago: Liu, Pinyan, Han Yuan, Yilin Ning, Bibhas Chakraborty, Nan Liu, and Marco Aurélio Peres. “A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses.” BMC Med Res Methodol 24, no. 1 (December 18, 2024): 305. https://doi.org/10.1186/s12874-024-02427-8.
ICMJE / NLM: Liu P, Yuan H, Ning Y, Chakraborty B, Liu N, Peres MA. A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses. BMC Med Res Methodol. 2024 Dec 18;24(1):305.
MLA: Liu, Pinyan, et al. “A modified and weighted Gower distance-based clustering analysis for mixed type data: a simulation and empirical analyses.” BMC Med Res Methodol, vol. 24, no. 1, Dec. 2024, p. 305. PubMed, doi:10.1186/s12874-024-02427-8.