Significant DBSCAN+: Statistically Robust Density-based Clustering

Yiqun Xie; Xiaowei Jia; Shashi Shekhar; Han Bao; Xun Zhou

doi:10.1145/3474842

Significant DBSCAN+: Statistically Robust Density-based Clustering

Yiqun Xie, Xiaowei Jia, Shashi Shekhar, Han Bao, Xun Zhou

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single "hotspot"of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed-up the proposed approach. Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10-20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.

Original language	English (US)
Article number	3474842
Journal	ACM Transactions on Intelligent Systems and Technology
Volume	12
Issue number	5
DOIs	https://doi.org/10.1145/3474842
State	Published - Oct 2021
Externally published	Yes

Bibliographical note

Funding Information:
Yiqun Xie is supported in part by NSF grants 2105133 and 2126474, Google’s AI for Social Good Impact Scholars program, and the Dean’s Research Initiative Award at the University of Maryland; Xiaowei Jia is supported in part by USGS award G21AC10207, and Pitt Momentum Fund Award; Shashi Shekhar is supported in part by NSF grants 1901099, 1737633, IIS-1320580, IIS-0940818, IIS-1218168, and 1916518, USDOD grant HM0476-20-1-0009, USDOE (ARPA-E) award DE-AR0000795, NIH grants UL1 TR002494, KL2TR002492, and TL1 TR002493, USDA grant 2017-51181-27222, and the Minnesota Supercomputing Institute; and Han Bao and Xun Zhou are supported in part by the Safety Research using Simulation University Transportation Center (SAFER-SIM), which is funded by US-DOT’s University Transportation Centers Program through award 69A3551747131. Authors’ addresses: Y. Xie, University of Maryland, 1124 Lefrak Hall, 7251 Preinkert Dr., College Park, MD 20742; email: xie@umd.edu; X. Jia, University of Pittsburgh, Sennott Square Building, Room 6135, 210 S. Bouquet Street Pittsburgh, PA 15260-9150; email: xiaowei.jia@upitt.edu; S. Shekhar, University of Minnesota, 4-192 Keller Hall, 200 Union Street SE, Minneapolis, MN 55455; email: shekhar@umn.edu, H. Bao, University of Iowa, 14 MacLean Hall, Iowa City, IA 52242; email: han-bao@uiowa.edu; X. Zhou, University of Iowa, 108 John Pappajohn Business Building, Iowa City, IA 52242; email: xun-zhou@uiowa.edu. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2021 Association for Computing Machinery. 2157-6904/2021/11-ART62 $15.00 https://doi.org/10.1145/3474842

Publisher Copyright:
© 2021 Association for Computing Machinery.

Keywords

Clustering
DBSCAN
statistical robustness

Access

10.1145/3474842

OpenUrl availability

Full text

Cite this

@article{417b269e50d64a7a81f6a32154d60753,

title = "Significant DBSCAN+: Statistically Robust Density-based Clustering",

abstract = "Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single {"}hotspot{"}of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed-up the proposed approach. Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10-20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.",

keywords = "Clustering, DBSCAN, statistical robustness",

author = "Yiqun Xie and Xiaowei Jia and Shashi Shekhar and Han Bao and Xun Zhou",

note = "Funding Information: Yiqun Xie is supported in part by NSF grants 2105133 and 2126474, Google{\textquoteright}s AI for Social Good Impact Scholars program, and the Dean{\textquoteright}s Research Initiative Award at the University of Maryland; Xiaowei Jia is supported in part by USGS award G21AC10207, and Pitt Momentum Fund Award; Shashi Shekhar is supported in part by NSF grants 1901099, 1737633, IIS-1320580, IIS-0940818, IIS-1218168, and 1916518, USDOD grant HM0476-20-1-0009, USDOE (ARPA-E) award DE-AR0000795, NIH grants UL1 TR002494, KL2TR002492, and TL1 TR002493, USDA grant 2017-51181-27222, and the Minnesota Supercomputing Institute; and Han Bao and Xun Zhou are supported in part by the Safety Research using Simulation University Transportation Center (SAFER-SIM), which is funded by US-DOT{\textquoteright}s University Transportation Centers Program through award 69A3551747131. Authors{\textquoteright} addresses: Y. Xie, University of Maryland, 1124 Lefrak Hall, 7251 Preinkert Dr., College Park, MD 20742; email: xie@umd.edu; X. Jia, University of Pittsburgh, Sennott Square Building, Room 6135, 210 S. Bouquet Street Pittsburgh, PA 15260-9150; email: xiaowei.jia@upitt.edu; S. Shekhar, University of Minnesota, 4-192 Keller Hall, 200 Union Street SE, Minneapolis, MN 55455; email: shekhar@umn.edu, H. Bao, University of Iowa, 14 MacLean Hall, Iowa City, IA 52242; email: han-bao@uiowa.edu; X. Zhou, University of Iowa, 108 John Pappajohn Business Building, Iowa City, IA 52242; email: xun-zhou@uiowa.edu. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. {\textcopyright} 2021 Association for Computing Machinery. 2157-6904/2021/11-ART62 $15.00 https://doi.org/10.1145/3474842 Publisher Copyright: {\textcopyright} 2021 Association for Computing Machinery.",

year = "2021",

month = oct,

doi = "10.1145/3474842",

language = "English (US)",

volume = "12",

journal = "ACM Transactions on Intelligent Systems and Technology",

issn = "2157-6904",

publisher = "Association for Computing Machinery (ACM)",

number = "5",

}

TY - JOUR

T1 - Significant DBSCAN+

T2 - Statistically Robust Density-based Clustering

AU - Xie, Yiqun

AU - Jia, Xiaowei

AU - Shekhar, Shashi

AU - Bao, Han

AU - Zhou, Xun

N1 - Funding Information: Yiqun Xie is supported in part by NSF grants 2105133 and 2126474, Google’s AI for Social Good Impact Scholars program, and the Dean’s Research Initiative Award at the University of Maryland; Xiaowei Jia is supported in part by USGS award G21AC10207, and Pitt Momentum Fund Award; Shashi Shekhar is supported in part by NSF grants 1901099, 1737633, IIS-1320580, IIS-0940818, IIS-1218168, and 1916518, USDOD grant HM0476-20-1-0009, USDOE (ARPA-E) award DE-AR0000795, NIH grants UL1 TR002494, KL2TR002492, and TL1 TR002493, USDA grant 2017-51181-27222, and the Minnesota Supercomputing Institute; and Han Bao and Xun Zhou are supported in part by the Safety Research using Simulation University Transportation Center (SAFER-SIM), which is funded by US-DOT’s University Transportation Centers Program through award 69A3551747131. Authors’ addresses: Y. Xie, University of Maryland, 1124 Lefrak Hall, 7251 Preinkert Dr., College Park, MD 20742; email: xie@umd.edu; X. Jia, University of Pittsburgh, Sennott Square Building, Room 6135, 210 S. Bouquet Street Pittsburgh, PA 15260-9150; email: xiaowei.jia@upitt.edu; S. Shekhar, University of Minnesota, 4-192 Keller Hall, 200 Union Street SE, Minneapolis, MN 55455; email: shekhar@umn.edu, H. Bao, University of Iowa, 14 MacLean Hall, Iowa City, IA 52242; email: han-bao@uiowa.edu; X. Zhou, University of Iowa, 108 John Pappajohn Business Building, Iowa City, IA 52242; email: xun-zhou@uiowa.edu. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2021 Association for Computing Machinery. 2157-6904/2021/11-ART62 $15.00 https://doi.org/10.1145/3474842 Publisher Copyright: © 2021 Association for Computing Machinery.

PY - 2021/10

Y1 - 2021/10

N2 - Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single "hotspot"of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed-up the proposed approach. Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10-20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.

AB - Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single "hotspot"of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed-up the proposed approach. Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10-20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.

KW - Clustering

KW - DBSCAN

KW - statistical robustness

UR - http://www.scopus.com/inward/record.url?scp=85122095666&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85122095666&partnerID=8YFLogxK

U2 - 10.1145/3474842

DO - 10.1145/3474842

M3 - Article

AN - SCOPUS:85122095666

SN - 2157-6904

VL - 12

JO - ACM Transactions on Intelligent Systems and Technology

JF - ACM Transactions on Intelligent Systems and Technology

IS - 5

M1 - 3474842

ER -

Significant DBSCAN+: Statistically Robust Density-based Clustering

Abstract

Bibliographical note

Keywords

Access

OpenUrl availability

Other files and links

Fingerprint

Cite this