Nonparametric cluster significance testing with reference to a unimodal null distribution

Erika S. Helgeson; David M. Vock; Eric Bair

doi:10.1111/biom.13376

Biostatistics

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Cluster analysis is an unsupervised learning strategy that is exceptionally useful for identifying homogeneous subgroups of observations in data sets of unknown structure. However, it is challenging to determine if the identified clusters represent truly distinct subgroups rather than noise. Existing approaches for addressing this problem tend to define clusters based on distributional assumptions, ignore the inherent correlation structure in the data, or are not suited for high-dimension low-sample size (HDLSS) settings. In this paper, we propose a novel method to evaluate the significance of identified clusters by comparing the explained variation due to the clustering from the original data to that produced by clustering a unimodal reference distribution that preserves the covariance structure in the data. The reference distribution is generated using kernel density estimation, and thus, does not require that the data follow a particular distribution. By utilizing sparse covariance estimation, the method is adapted for the HDLSS setting. The approach can be used to test the null hypothesis that the data cannot be partitioned into clusters and to determine the optimal number of clusters. Simulation examples, theoretical evaluations, and applications to temporomandibular disorder research and cancer microarray data illustrate the utility of the proposed method.
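The comparison described in the abstract can be illustrated with a rough sketch. It is not the authors' algorithm: it assumes a 2-means cluster index (within-cluster sum of squares divided by total sum of squares) as the measure of explained variation, and it swaps in a simple Gaussian smoothed bootstrap, shrunk so its covariance approximately matches the sample covariance, as the kernel-density reference. The function names, the bandwidth `h`, and the defaults are hypothetical; the paper's bandwidth selection and sparse covariance estimation are not reproduced, and unimodality of the reference is not checked.

```python
import numpy as np


def cluster_index(x, rng, n_starts=5, n_iter=50):
    """2-means cluster index: within-cluster SS / total SS (smaller = tighter clusters)."""
    total_ss = ((x - x.mean(axis=0)) ** 2).sum()
    best_within = np.inf
    for _ in range(n_starts):
        # Start from two distinct observations chosen at random.
        centers = x[rng.choice(len(x), size=2, replace=False)]
        for _ in range(n_iter):
            dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = dist.argmin(axis=1)
            if labels.min() == labels.max():  # one cluster emptied; abandon this start
                break
            new_centers = np.stack([x[labels == k].mean(axis=0) for k in (0, 1)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best_within = min(best_within, dist.min(axis=1).sum())
    return best_within / total_ss


def smoothed_bootstrap(x, rng, h=1.0):
    """Sample len(x) points from a Gaussian KDE of x, shrunk toward the mean so the
    covariance approximately matches the sample covariance. h is an assumed bandwidth."""
    n, p = x.shape
    mean = x.mean(axis=0)
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    resampled = x[rng.integers(0, n, size=n)]
    noise = rng.multivariate_normal(np.zeros(p), cov, size=n)
    return mean + (resampled - mean + h * noise) / np.sqrt(1.0 + h ** 2)


def cluster_p_value(x, n_ref=99, h=1.0, seed=0):
    """Monte Carlo p-value: share of reference data sets that cluster at least as
    tightly as the observed data (with the usual add-one correction)."""
    rng = np.random.default_rng(seed)
    observed = cluster_index(x, rng)
    reference = np.array(
        [cluster_index(smoothed_bootstrap(x, rng, h), rng) for _ in range(n_ref)]
    )
    return (1 + np.sum(reference <= observed)) / (n_ref + 1)
```

Calling `cluster_p_value(data)` on an `(n, p)` NumPy array returns a value in (0, 1]; a small value suggests the observed 2-means split explains more variation than would be expected from the covariance-matched reference.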

Original language	English (US)
Pages (from-to)	1215-1226
Number of pages	12
Journal	Biometrics
Volume	77
Issue number	4
DOIs	https://doi.org/10.1111/biom.13376
State	Published - Dec 2021

Bibliographical note

Publisher Copyright:
© 2020 The International Biometric Society

Cite this

@article{c4165e49a3be4cfeb9a7e1d957516241,

title = "Nonparametric cluster significance testing with reference to a unimodal null distribution",

abstract = "Cluster analysis is an unsupervised learning strategy that is exceptionally useful for identifying homogeneous subgroups of observations in data sets of unknown structure. However, it is challenging to determine if the identified clusters represent truly distinct subgroups rather than noise. Existing approaches for addressing this problem tend to define clusters based on distributional assumptions, ignore the inherent correlation structure in the data, or are not suited for high-dimension low-sample size (HDLSS) settings. In this paper, we propose a novel method to evaluate the significance of identified clusters by comparing the explained variation due to the clustering from the original data to that produced by clustering a unimodal reference distribution that preserves the covariance structure in the data. The reference distribution is generated using kernel density estimation, and thus, does not require that the data follow a particular distribution. By utilizing sparse covariance estimation, the method is adapted for the HDLSS setting. The approach can be used to test the null hypothesis that the data cannot be partitioned into clusters and to determine the optimal number of clusters. Simulation examples, theoretical evaluations, and applications to temporomandibular disorder research and cancer microarray data illustrate the utility of the proposed method.",

author = "Helgeson, {Erika S.} and Vock, {David M.} and Bair, Eric",

note = "Publisher Copyright: {\textcopyright} 2020 The International Biometric Society",

year = "2021",

month = dec,

doi = "10.1111/biom.13376",

language = "English (US)",

volume = "77",

pages = "1215--1226",

journal = "Biometrics",

issn = "0006-341X",

publisher = "Wiley-Blackwell",

number = "4",

}

TY - JOUR

T1 - Nonparametric cluster significance testing with reference to a unimodal null distribution

AU - Helgeson, Erika S.

AU - Vock, David M.

AU - Bair, Eric

PY - 2021/12

Y1 - 2021/12

N2 - Cluster analysis is an unsupervised learning strategy that is exceptionally useful for identifying homogeneous subgroups of observations in data sets of unknown structure. However, it is challenging to determine if the identified clusters represent truly distinct subgroups rather than noise. Existing approaches for addressing this problem tend to define clusters based on distributional assumptions, ignore the inherent correlation structure in the data, or are not suited for high-dimension low-sample size (HDLSS) settings. In this paper, we propose a novel method to evaluate the significance of identified clusters by comparing the explained variation due to the clustering from the original data to that produced by clustering a unimodal reference distribution that preserves the covariance structure in the data. The reference distribution is generated using kernel density estimation, and thus, does not require that the data follow a particular distribution. By utilizing sparse covariance estimation, the method is adapted for the HDLSS setting. The approach can be used to test the null hypothesis that the data cannot be partitioned into clusters and to determine the optimal number of clusters. Simulation examples, theoretical evaluations, and applications to temporomandibular disorder research and cancer microarray data illustrate the utility of the proposed method.

AB - Cluster analysis is an unsupervised learning strategy that is exceptionally useful for identifying homogeneous subgroups of observations in data sets of unknown structure. However, it is challenging to determine if the identified clusters represent truly distinct subgroups rather than noise. Existing approaches for addressing this problem tend to define clusters based on distributional assumptions, ignore the inherent correlation structure in the data, or are not suited for high-dimension low-sample size (HDLSS) settings. In this paper, we propose a novel method to evaluate the significance of identified clusters by comparing the explained variation due to the clustering from the original data to that produced by clustering a unimodal reference distribution that preserves the covariance structure in the data. The reference distribution is generated using kernel density estimation, and thus, does not require that the data follow a particular distribution. By utilizing sparse covariance estimation, the method is adapted for the HDLSS setting. The approach can be used to test the null hypothesis that the data cannot be partitioned into clusters and to determine the optimal number of clusters. Simulation examples, theoretical evaluations, and applications to temporomandibular disorder research and cancer microarray data illustrate the utility of the proposed method.

UR - http://www.scopus.com/inward/record.url?scp=85092153659&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85092153659&partnerID=8YFLogxK

U2 - 10.1111/biom.13376

DO - 10.1111/biom.13376

M3 - Article

C2 - 32969032

AN - SCOPUS:85092153659

SN - 0006-341X

VL - 77

SP - 1215

EP - 1226

JO - Biometrics

JF - Biometrics

IS - 4

ER -
