In 2008, more than 100,000 new cases of colon cancer and 40,000 cases of rectal cancer were reported in the United States. To reduce the number of deaths from these diseases, researchers have been striving to find a set of genes that can accurately predict the prognosis of colorectal cancer. Working with a gene expression microarray dataset of about 55,000 genes, collected from 122 colorectal cancer patients, this research developed a method for identifying an optimal set of features through several stages of feature selection. These stages included coarse feature reduction, fine feature selection, and classification using a Genetic Algorithm/Support Vector Machine (GA/SVM) hybrid. However, microarray data with these dimensions are feature-rich and case-poor, which creates a risk of overfitting. To combat this issue, a noise perturbation scheme was introduced, on the assumption that genes that survive this noise have a strong relation to colorectal cancer. The feature reduction methods produced chromosomes (candidate gene subsets in the GA) containing genes with a known relation to cancer. However, the perturbation analysis, which was designed to confirm these genes, proved inconclusive. This research succeeded in developing a feature reduction method that suggests a set of genes with potential ties to colorectal cancer, warranting further investigation into this relationship.
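The noise perturbation idea can be illustrated with a minimal sketch. It is not the procedure used in this research: the abstract does not specify the noise model or the selection criterion, so the Gaussian noise, the simple mean-separation score (standing in for the GA/SVM stage), the toy data, and all function names below are assumptions made for illustration only.

```python
import random
import statistics


def separation_score(values, labels):
    """Absolute difference of the two class means, divided by the summed class spreads."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = statistics.pstdev(a) + statistics.pstdev(b) + 1e-9  # avoid division by zero
    return abs(statistics.mean(a) - statistics.mean(b)) / spread


def top_k_features(data, labels, k):
    """Indices of the k features with the highest separation score."""
    n_features = len(data[0])
    scores = [
        separation_score([row[j] for row in data], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def survival_rates(data, labels, k, noise_sd, n_trials, seed=0):
    """Fraction of noisy trials in which each feature remains among the top k."""
    rng = random.Random(seed)
    counts = [0] * len(data[0])
    for _ in range(n_trials):
        # Assumed noise model: independent Gaussian noise added to every value.
        noisy = [[x + rng.gauss(0.0, noise_sd) for x in row] for row in data]
        for j in top_k_features(noisy, labels, k):
            counts[j] += 1
    return [c / n_trials for c in counts]


# Hypothetical toy data: 20 cases, 5 features; only feature 0 depends on the label.
toy_rng = random.Random(42)
labels = [0] * 10 + [1] * 10
data = [
    [toy_rng.gauss(5.0 * y, 1.0)] + [toy_rng.gauss(0.0, 1.0) for _ in range(4)]
    for y in labels
]

rates = survival_rates(data, labels, k=1, noise_sd=0.5, n_trials=50)
print(rates)
```

In this toy setting, a feature whose survival rate stays high under repeated perturbation is the one genuinely tied to the class label; on real microarray data, with tens of thousands of features and few cases, such a separation may be far less clear.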