Clustering very large data sets using a low memory matrix factored representation

David Littau, Daniel Boley

Research output: Contribution to journal › Article › peer-review


Abstract

A scalable method to cluster data sets too large to fit in memory is presented. This method does not depend on random subsampling but scans every individual data sample in a deterministic way. The original data are represented in factored form as the product of two matrices, one or both of which are very sparse. This factored form avoids the need to multiply the two matrices together, because it uses a variant of the Principal Direction Divisive Partitioning (PDDP) algorithm that does not depend on computing distances between individual samples. The resulting clustering algorithm is Piecemeal PDDP (PMPDDP), in which the original data are broken into sections small enough to fit in memory and clustered. The cluster centers are used to create approximations to the original data items, so that each original data item is represented by a linear combination of these centers. We evaluate the performance of PMPDDP on three real data sets and observe that the cluster quality of PMPDDP is comparable to that of PDDP for the data sets examined.
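To illustrate the factored representation described in the abstract, the sketch below builds an approximation A ≈ Z · C, where C stacks the cluster centers found in each in-memory section and Z is a very sparse coefficient matrix. This is a minimal sketch under simplifying assumptions, not the authors' implementation: k-means (via scikit-learn) stands in for the per-section PDDP clustering, each sample is approximated by a single center rather than a general linear combination, and the names factored_representation, section_size, and centers_per_section are hypothetical.

import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans

def factored_representation(A, section_size=1000, centers_per_section=10):
    # Approximate each row of A by one cluster center from its own
    # section, giving A ~= Z @ C with a very sparse Z (one nonzero
    # per row).  Only one section needs to be held in memory at a time.
    # k-means is used here as a stand-in for PDDP (an assumption).
    n_samples = A.shape[0]
    center_blocks, rows, cols, vals = [], [], [], []
    n_centers = 0
    for start in range(0, n_samples, section_size):
        section = A[start:start + section_size]
        k = min(centers_per_section, section.shape[0])
        km = KMeans(n_clusters=k, n_init=10).fit(section)
        center_blocks.append(km.cluster_centers_)
        rows.extend(range(start, start + section.shape[0]))
        cols.extend((n_centers + km.labels_).tolist())
        vals.extend([1.0] * section.shape[0])
        n_centers += k
    C = np.vstack(center_blocks)          # small dense matrix of centers
    Z = sparse.csr_matrix((vals, (rows, cols)), shape=(n_samples, n_centers))
    return Z, C

# The product Z @ C is never formed explicitly: a PDDP-style principal
# direction requires only matrix-vector products, and (Z @ C) @ v can be
# evaluated as Z @ (C @ v), keeping memory use proportional to Z and C.

In this sketch, only one section of the data plus the growing center matrix C and sparse coefficient matrix Z reside in memory at any time, which reflects the low-memory motivation of the factored form; the subsequent clustering then operates on Z and C rather than on the full data matrix.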

Original language: English (US)
Pages (from-to): 114-135
Number of pages: 22
Journal: Computational Intelligence
Volume: 25
Issue number: 2
DOIs
State: Published - May 2009

Keywords

  • Clustering
  • Large data sets
  • Low-memory factorization
  • PDDP
