Significant DBSCAN+: Statistically Robust Density-based Clustering

Yiqun Xie, Xiaowei Jia, Shashi Shekhar, Han Bao, Xun Zhou

Research output: Contribution to journalArticlepeer-review

4 Scopus citations

Abstract

Cluster detection is important and widely used in a variety of applications, including public health, public safety, transportation, and so on. Given a collection of data points, we aim to detect density-connected spatial clusters with varying geometric shapes and densities, under the constraint that the clusters are statistically significant. The problem is challenging, because many societal applications and domain science studies have low tolerance for spurious results, and clusters may have arbitrary shapes and varying densities. As a classical topic in data mining and learning, a myriad of techniques have been developed to detect clusters with both varying shapes and densities (e.g., density-based, hierarchical, spectral, or deep clustering methods). However, the vast majority of these techniques do not consider statistical rigor and are susceptible to detecting spurious clusters formed as a result of natural randomness. On the other hand, scan statistic approaches explicitly control the rate of spurious results, but they typically assume a single "hotspot"of over-density and many rely on further assumptions such as a tessellated input space. To unite the strengths of both lines of work, we propose a statistically robust formulation of a multi-scale DBSCAN, namely Significant DBSCAN+, to identify significant clusters that are density connected. As we will show, incorporation of statistical rigor is a powerful mechanism that allows the new Significant DBSCAN+ to outperform state-of-the-art clustering techniques in various scenarios. We also propose computational enhancements to speed-up the proposed approach. Experiment results show that Significant DBSCAN+ can simultaneously improve the success rate of true cluster detection (e.g., 10-20% increases in absolute F1 scores) and substantially reduce the rate of spurious results (e.g., from thousands/hundreds of spurious detections to none or just a few across 100 datasets), and the acceleration methods can improve the efficiency for both clustered and non-clustered data.

Original languageEnglish (US)
Article number3474842
JournalACM Transactions on Intelligent Systems and Technology
Volume12
Issue number5
DOIs
StatePublished - Oct 2021
Externally publishedYes

Bibliographical note

Funding Information:
Yiqun Xie is supported in part by NSF grants 2105133 and 2126474, Google’s AI for Social Good Impact Scholars program, and the Dean’s Research Initiative Award at the University of Maryland; Xiaowei Jia is supported in part by USGS award G21AC10207, and Pitt Momentum Fund Award; Shashi Shekhar is supported in part by NSF grants 1901099, 1737633, IIS-1320580, IIS-0940818, IIS-1218168, and 1916518, USDOD grant HM0476-20-1-0009, USDOE (ARPA-E) award DE-AR0000795, NIH grants UL1 TR002494, KL2TR002492, and TL1 TR002493, USDA grant 2017-51181-27222, and the Minnesota Supercomputing Institute; and Han Bao and Xun Zhou are supported in part by the Safety Research using Simulation University Transportation Center (SAFER-SIM), which is funded by US-DOT’s University Transportation Centers Program through award 69A3551747131. Authors’ addresses: Y. Xie, University of Maryland, 1124 Lefrak Hall, 7251 Preinkert Dr., College Park, MD 20742; email: xie@umd.edu; X. Jia, University of Pittsburgh, Sennott Square Building, Room 6135, 210 S. Bouquet Street Pittsburgh, PA 15260-9150; email: xiaowei.jia@upitt.edu; S. Shekhar, University of Minnesota, 4-192 Keller Hall, 200 Union Street SE, Minneapolis, MN 55455; email: shekhar@umn.edu, H. Bao, University of Iowa, 14 MacLean Hall, Iowa City, IA 52242; email: han-bao@uiowa.edu; X. Zhou, University of Iowa, 108 John Pappajohn Business Building, Iowa City, IA 52242; email: xun-zhou@uiowa.edu. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2021 Association for Computing Machinery. 2157-6904/2021/11-ART62 $15.00 https://doi.org/10.1145/3474842

Publisher Copyright:
© 2021 Association for Computing Machinery.

Keywords

  • Clustering
  • DBSCAN
  • statistical robustness

Fingerprint

Dive into the research topics of 'Significant DBSCAN+: Statistically Robust Density-based Clustering'. Together they form a unique fingerprint.

Cite this