Journal: Medical Physics

Effect of finite sample size on feature selection and classification: A simulation study

Abstract

Purpose: The small number of samples available for training and testing is often the limiting factor in finding the most effective features and designing an optimal computer-aided diagnosis (CAD) system. Training on a limited set of samples introduces bias and variance in the performance of a CAD system relative to that of a system trained with an infinite sample size. In this work, the authors conducted a simulation study to evaluate the performance of various combinations of classifiers and feature selection techniques and their dependence on the class distribution, dimensionality, and training sample size. Understanding these relationships will facilitate the development of effective CAD systems under the constraint of limited available samples.

Methods: Three feature selection techniques, stepwise feature selection (SFS), sequential floating forward search (SFFS), and principal component analysis (PCA), and two commonly used classifiers, Fisher's linear discriminant analysis (LDA) and the support vector machine (SVM), were investigated. Samples were drawn from multidimensional feature spaces of multivariate Gaussian distributions with equal or unequal covariance matrices and unequal means, and with equal covariance matrices and unequal means estimated from a clinical data set. Classifier performance was quantified by the area under the receiver operating characteristic curve, Az. The mean Az values obtained by the resubstitution and hold-out methods were evaluated for training sample sizes ranging from 15 to 100 per class. The number of simulated features available for selection was chosen to be 50, 100, and 200.

Results: The relative performance of the different combinations of classifier and feature selection method was found to depend on the feature space distributions, the dimensionality, and the available training sample sizes. The LDA and the SVM with radial kernel performed similarly for most of the conditions evaluated in this study, although the SVM classifier showed a slightly higher hold-out performance than LDA for some conditions and vice versa for other conditions. PCA was comparable to or better than SFS and SFFS for LDA at small sample sizes, but inferior for the SVM with polynomial kernel. For the class distributions simulated from clinical data, PCA did not show advantages over the other two feature selection methods. Under this condition, the SVM with radial kernel performed better than LDA when few training samples were available, while LDA performed better when a large number of training samples were available.

Conclusions: None of the investigated feature selection-classifier combinations provided consistently superior performance under the studied conditions for different sample sizes and feature space distributions. In general, the SFFS method was comparable to the SFS method, while PCA may have an advantage for Gaussian feature spaces with unequal covariance matrices. The performance of the SVM with radial kernel was better than, or comparable to, that of the SVM with polynomial kernel under most conditions studied.
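
The finite-sample effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code and does not reproduce their experimental design: it omits feature selection and the SVM, and it assumes two Gaussian classes with identity covariance, 10 features, and a class separation `delta` of 1.5, all of which are illustrative choices. It trains Fisher's LDA on a limited training set and compares the mean Az from resubstitution (scoring the training samples) with the mean Az from hold-out (scoring independent test samples). The function names `auc`, `fit_lda`, and `simulate` are hypothetical.

```python
import numpy as np


def auc(scores_neg, scores_pos):
    """Area under the ROC curve via the Mann-Whitney statistic (ties count 1/2)."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


def fit_lda(x0, x1):
    """Fisher's linear discriminant direction: w = S_pooled^{-1} (m1 - m0)."""
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    n0, n1 = len(x0), len(x1)
    pooled = ((n0 - 1) * np.cov(x0, rowvar=False)
              + (n1 - 1) * np.cov(x1, rowvar=False)) / (n0 + n1 - 2)
    return np.linalg.solve(pooled, m1 - m0)


def simulate(n_train, n_features=10, delta=1.5, n_trials=100, n_test=500, seed=0):
    """Return (mean resubstitution Az, mean hold-out Az) over n_trials.

    Assumed setup (illustrative only): both classes have identity covariance;
    class 0 has zero mean and class 1 has a mean at Euclidean distance delta.
    """
    rng = np.random.default_rng(seed)
    mean1 = np.full(n_features, delta / np.sqrt(n_features))
    resub, holdout = [], []
    for _ in range(n_trials):
        # n_train samples per class for training, n_test per class for testing
        tr0 = rng.standard_normal((n_train, n_features))
        tr1 = rng.standard_normal((n_train, n_features)) + mean1
        te0 = rng.standard_normal((n_test, n_features))
        te1 = rng.standard_normal((n_test, n_features)) + mean1
        w = fit_lda(tr0, tr1)
        resub.append(auc(tr0 @ w, tr1 @ w))    # optimistic estimate
        holdout.append(auc(te0 @ w, te1 @ w))  # pessimistic estimate
    return float(np.mean(resub)), float(np.mean(holdout))


if __name__ == "__main__":
    for n in (15, 30, 100):
        r, h = simulate(n)
        print(f"n_train per class = {n:3d}: resubstitution Az = {r:.3f}, hold-out Az = {h:.3f}")
```

Under these assumptions, the resubstitution estimate should sit above the hold-out estimate, and the gap between the two should narrow as the training sample size per class grows from 15 toward 100, which is the kind of bias the study quantifies across classifiers and feature selection methods.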
