CONTOUR: An efficient algorithm for discovering discriminating subsequences

Jianyong Wang; Yuzhou Zhang; Lizhu Zhou; George Karypis; Charu C. Aggarwal

doi:10.1007/s10618-008-0100-7

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

10 Scopus citations

Abstract

In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One approach to discovering the subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set of frequent subsequences. Then, a subset of interesting subsequences can be further identified. Unfortunately, it is very time consuming to mine the complete set of frequent subsequences for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently mines a subset of high-quality subsequences directly in order to cluster the input sequences. We mainly focus on how to design some effective search space pruning methods to accelerate the mining process and discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.

Original language	English (US)
Pages (from-to)	1-29
Number of pages	29
Journal	Data Mining and Knowledge Discovery
Volume	18
Issue number	1
DOIs	https://doi.org/10.1007/s10618-008-0100-7
State	Published - Feb 2009

Bibliographical note

Funding Information:
Acknowledgements: Jianyong Wang was supported in part by the National Basic Research Program of China under Grant No. 2006CB303103, the Program for Selected Talents (i.e., “Gu Gan Ren Cai”) at Tsinghua University, the Program for New Century Excellent Talents in University under Grant No. NCET-07-0491, and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China. George Karypis was supported by NSF EIA-9986042, ACI-0133464, IIS-0431135, NIH RLM008713A, NIH T32GM008347, the Digital Technology Center, University of Minnesota, and the Minnesota Supercomputing Institute. This paper is a major value-added version of a conference paper that appeared in the 2007 SIAM International Conference on Data Mining (SIAM SDM’07).

Keywords

Clustering
Discriminating subsequence
Sequence mining
Summarization subsequence

Cite this

@article{405c5bada4854edebd93af7da5512684,

title = "CONTOUR: An efficient algorithm for discovering discriminating subsequences",

abstract = "In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One approach to discovering the subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set of frequent subsequences. Then, a subset of interesting subsequences can be further identified. Unfortunately, it is very time consuming to mine the complete set of frequent subsequences for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently mines a subset of high-quality subsequences directly in order to cluster the input sequences. We mainly focus on how to design some effective search space pruning methods to accelerate the mining process and discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.",

keywords = "Clustering, Discriminating subsequence, Sequence mining, Summarization subsequence",

author = "Jianyong Wang and Yuzhou Zhang and Lizhu Zhou and George Karypis and Aggarwal, {Charu C.}",

note = "Funding Information: Acknowledgements: Jianyong Wang was supported in part by the National Basic Research Program of China under Grant No. 2006CB303103, the Program for Selected Talents (i.e., “Gu Gan Ren Cai”) at Tsinghua University, the Program for New Century Excellent Talents in University under Grant No. NCET-07-0491, and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China. George Karypis was supported by NSF EIA-9986042, ACI-0133464, IIS-0431135, NIH RLM008713A, NIH T32GM008347, the Digital Technology Center, University of Minnesota, and the Minnesota Supercomputing Institute. This paper is a major value-added version of a conference paper that appeared in the 2007 SIAM International Conference on Data Mining (SIAM SDM{\textquoteright}07).",

year = "2009",

month = feb,

doi = "10.1007/s10618-008-0100-7",

language = "English (US)",

volume = "18",

pages = "1--29",

journal = "Data Mining and Knowledge Discovery",

issn = "1384-5810",

publisher = "Springer Netherlands",

number = "1",

}

TY - JOUR

T1 - CONTOUR

T2 - An efficient algorithm for discovering discriminating subsequences

AU - Wang, Jianyong

AU - Zhang, Yuzhou

AU - Zhou, Lizhu

AU - Karypis, George

AU - Aggarwal, Charu C.

N1 - Funding Information: Acknowledgements: Jianyong Wang was supported in part by the National Basic Research Program of China under Grant No. 2006CB303103, the Program for Selected Talents (i.e., “Gu Gan Ren Cai”) at Tsinghua University, the Program for New Century Excellent Talents in University under Grant No. NCET-07-0491, and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China. George Karypis was supported by NSF EIA-9986042, ACI-0133464, IIS-0431135, NIH RLM008713A, NIH T32GM008347, the Digital Technology Center, University of Minnesota, and the Minnesota Supercomputing Institute. This paper is a major value-added version of a conference paper that appeared in the 2007 SIAM International Conference on Data Mining (SIAM SDM’07).

PY - 2009/2

Y1 - 2009/2

AB - In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One approach to discovering the subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set of frequent subsequences. Then, a subset of interesting subsequences can be further identified. Unfortunately, it is very time consuming to mine the complete set of frequent subsequences for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently mines a subset of high-quality subsequences directly in order to cluster the input sequences. We mainly focus on how to design some effective search space pruning methods to accelerate the mining process and discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.

KW - Clustering

KW - Discriminating subsequence

KW - Sequence mining

KW - Summarization subsequence

UR - http://www.scopus.com/inward/record.url?scp=57849094586&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=57849094586&partnerID=8YFLogxK

U2 - 10.1007/s10618-008-0100-7

DO - 10.1007/s10618-008-0100-7

M3 - Article

AN - SCOPUS:57849094586

SN - 1384-5810

VL - 18

SP - 1

EP - 29

JO - Data Mining and Knowledge Discovery

JF - Data Mining and Knowledge Discovery

IS - 1

ER -
