TY - JOUR
T1 - Gene selection for sample classifications in microarray experiments
AU - Tsai, Chen A.N.
AU - Chen, Chun Houh
AU - Lee, Te Chang
AU - Ho, I. Ching
AU - Yang, Ueng Cheng
AU - Chen, James J.
PY - 2004/10
Y1 - 2004/10
AB - DNA microarray technology provides useful tools for profiling global gene expression patterns in different cell/tissue samples. One major challenge is the large number of genes relative to the number of samples. Using all genes can degrade the performance of a classification rule because of the noise from nondiscriminatory genes. Selection of an optimal subset from the original gene set therefore becomes an important preliminary step in sample classification. In this study, we propose a family-wise error (FWE) rate approach to selecting discriminatory genes for two-sample or multiple-sample classification. The FWE approach controls the probability of one or more false positives at a prespecified level. A public colon cancer data set is used to evaluate the performance of the proposed approach with two classification methods: k nearest neighbors (k-NN) and support vector machine (SVM). The gene sets selected by the proposed procedure appear to perform better than or comparably to several results reported in the literature, using univariate analysis without performing a multivariate search. In addition, we apply the FWE approach to a toxicogenomic data set with nine treatments (a control and eight metals: As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV), totaling 55 samples, for multisample classification. Two gene sets are considered: the gene set ΩF formed by the ANOVA F-test, and the gene set ΩT formed by the union of one-versus-all t-tests. Prediction accuracies are evaluated using internal and external cross-validation. Using SVM classification, the overall accuracies for classifying the 55 samples into the nine treatments are above 80% for internal cross-validation, with ΩF having slightly higher accuracy rates than ΩT. The overall prediction accuracies are above 70% for external cross-validation, where the two gene sets ΩT and ΩF perform equally well.
UR - http://www.scopus.com/inward/record.url?scp=8144223371&partnerID=8YFLogxK
U2 - 10.1089/dna.2004.23.607
DO - 10.1089/dna.2004.23.607
M3 - Article
C2 - 15585118
AN - SCOPUS:8144223371
SN - 1044-5498
VL - 23
SP - 607
EP - 614
JO - DNA and Cell Biology
JF - DNA and Cell Biology
IS - 10
ER -