Induction of comprehensible models for gene expression datasets by subgroup discovery methodology

Dragan Gamberger; Nada Lavrač; Filip Železný; Jakub Tolar

doi:10.1016/j.jbi.2004.07.007


Administration (TMED)

Research output: Contribution to journal › Article › peer-review

39 Scopus citations

Abstract

Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by a high risk of overfitting the data due to the abundance of attributes (simultaneously measured gene expression values) and shortage of available examples (observations). To avoid this pitfall and achieve predictor robustness, state-of-the-art approaches construct complex classifiers that combine relatively weak contributions of up to thousands of genes (attributes) to classify a disease. The complexity of such classifiers limits their transparency and consequently the biological insights they can provide. The goal of this study is to apply to this domain the methodology of constructing simple yet robust logic-based classifiers amenable to direct expert interpretation. On two well-known, publicly available gene expression classification problems, the paper shows the feasibility of this approach, employing a recently developed subgroup discovery methodology. Some of the discovered classifiers allow for novel biological interpretations.

Original language	English (US)
Pages (from-to)	269-284
Number of pages	16
Journal	Journal of Biomedical Informatics
Volume	37
Issue number	4
DOIs	https://doi.org/10.1016/j.jbi.2004.07.007
State	Published - Aug 2004

Bibliographical note

Funding Information:
This work was supported by the Croatian Ministry of Science, Education and Sport, the Slovenian Ministry of Education, Science and Sport, and the Czech Ministry of Education through the project MSM 212300013.

Keywords

Comprehensible classification
Disease markers
Gene expression measurements
Machine learning
Subgroup discovery


Cite this

@article{e89861dc6a3040c48d7c921ef3389646,

title = "Induction of comprehensible models for gene expression datasets by subgroup discovery methodology",

abstract = "Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by a high risk of overfitting the data due to the abundance of attributes (simultaneously measured gene expression values) and shortage of available examples (observations). To avoid this pitfall and achieve predictor robustness, state-of-the-art approaches construct complex classifiers that combine relatively weak contributions of up to thousands of genes (attributes) to classify a disease. The complexity of such classifiers limits their transparency and consequently the biological insights they can provide. The goal of this study is to apply to this domain the methodology of constructing simple yet robust logic-based classifiers amenable to direct expert interpretation. On two well-known, publicly available gene expression classification problems, the paper shows the feasibility of this approach, employing a recently developed subgroup discovery methodology. Some of the discovered classifiers allow for novel biological interpretations.",

keywords = "Comprehensible classification, Disease markers, Gene expression measurements, Machine learning, Subgroup discovery",

author = "Dragan Gamberger and Nada Lavra{\v c} and Filip {\v Z}elezn{\'y} and Jakub Tolar",

note = "Funding Information: This work was supported by the Croatian Ministry of Science, Education and Sport, the Slovenian Ministry of Education, Science and Sport, and the Czech Ministry of Education through the project MSM 212300013.",

year = "2004",

month = aug,

doi = "10.1016/j.jbi.2004.07.007",

language = "English (US)",

volume = "37",

pages = "269--284",

journal = "Journal of Biomedical Informatics",

issn = "1532-0464",

publisher = "Academic Press Inc.",

number = "4",

}

TY - JOUR

T1 - Induction of comprehensible models for gene expression datasets by subgroup discovery methodology

AU - Gamberger, Dragan

AU - Lavrač, Nada

AU - Železný, Filip

AU - Tolar, Jakub

N1 - Funding Information: This work was supported by the Croatian Ministry of Science, Education and Sport, the Slovenian Ministry of Education, Science and Sport, and the Czech Ministry of Education through the project MSM 212300013.

PY - 2004/8

Y1 - 2004/8

N2 - Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by a high risk of overfitting the data due to the abundance of attributes (simultaneously measured gene expression values) and shortage of available examples (observations). To avoid this pitfall and achieve predictor robustness, state-of-the-art approaches construct complex classifiers that combine relatively weak contributions of up to thousands of genes (attributes) to classify a disease. The complexity of such classifiers limits their transparency and consequently the biological insights they can provide. The goal of this study is to apply to this domain the methodology of constructing simple yet robust logic-based classifiers amenable to direct expert interpretation. On two well-known, publicly available gene expression classification problems, the paper shows the feasibility of this approach, employing a recently developed subgroup discovery methodology. Some of the discovered classifiers allow for novel biological interpretations.

AB - Finding disease markers (classifiers) from gene expression data by machine learning algorithms is characterized by a high risk of overfitting the data due to the abundance of attributes (simultaneously measured gene expression values) and shortage of available examples (observations). To avoid this pitfall and achieve predictor robustness, state-of-the-art approaches construct complex classifiers that combine relatively weak contributions of up to thousands of genes (attributes) to classify a disease. The complexity of such classifiers limits their transparency and consequently the biological insights they can provide. The goal of this study is to apply to this domain the methodology of constructing simple yet robust logic-based classifiers amenable to direct expert interpretation. On two well-known, publicly available gene expression classification problems, the paper shows the feasibility of this approach, employing a recently developed subgroup discovery methodology. Some of the discovered classifiers allow for novel biological interpretations.

KW - Comprehensible classification

KW - Disease markers

KW - Gene expression measurements

KW - Machine learning

KW - Subgroup discovery

UR - http://www.scopus.com/inward/record.url?scp=4744364732&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=4744364732&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2004.07.007

DO - 10.1016/j.jbi.2004.07.007

M3 - Article

C2 - 15465480

AN - SCOPUS:4744364732

SN - 1532-0464

VL - 37

SP - 269

EP - 284

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

IS - 4

ER -
