The Role of Balanced Training and Testing Data Sets for Binary Classifiers in Bioinformatics

Abstract

Training and testing of conventional machine learning models on binary classification problems depend on the proportions of the two outcomes in the relevant data sets. This may be especially important in practical terms when real-world applications of the classifier are either highly imbalanced or occur in unknown proportions. Intuitively, it may seem sensible to train machine learning models on data similar to the target data in terms of proportions of the two binary outcomes. However, we show that this is not the case using the example of prediction of deleterious and neutral phenotypes of human missense mutations in human genome data, for which the proportion of the binary outcome is unknown. Our results indicate that using balanced training data (50% neutral and 50% deleterious) results in the highest balanced accuracy (the average of True Positive Rate and True Negative Rate), Matthews correlation coefficient, and area under ROC curves, no matter what the proportions of the two phenotypes are in the testing data. Besides balancing the data by undersampling the majority class, other techniques in machine learning include oversampling the minority class, interpolating minority-class data points and various penalties for misclassifying the minority class. However, these techniques are not commonly used in either the missense phenotype prediction problem or in the prediction of disordered residues in proteins, where the imbalance problem is substantial. The appropriate approach depends on the amount of available data and the specific problem at hand.
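
The two less familiar metrics named in the abstract, and the undersampling step used to balance the data, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`confusion_counts`, `balanced_accuracy`, `matthews_corrcoef`, `undersample_majority`) and the toy labels are hypothetical and chosen only for illustration. Labels are assumed to be coded 1 for the positive class (for example, deleterious) and 0 for the negative class (for example, neutral).

```python
import math
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def balanced_accuracy(y_true, y_pred):
    """Average of True Positive Rate and True Negative Rate.

    Assumes both classes occur at least once in y_true.
    """
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def undersample_majority(labels, seed=0):
    """Return sorted indices of a 50/50 subset.

    The subset is built by randomly dropping items from the majority class
    until both classes have the same number of items.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))


# Toy data (made up): 10 positives, 90 negatives, and a trivial
# classifier that always predicts the majority (negative) class.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
print(matthews_corrcoef(y_true, y_pred))  # 0.0

# Indices of a balanced training subset: 10 positives and 10 negatives.
balanced_idx = undersample_majority(y_true, seed=0)
print(len(balanced_idx))                  # 20
```

On this imbalanced toy set the always-negative classifier scores 90% plain accuracy, yet its balanced accuracy is 0.5 and its Matthews correlation coefficient is 0.0, which is why these metrics are more informative when the class proportions are skewed or unknown.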

Bibliographic details

  • Volume (issue): 8(7)
  • Pages: e67863
  • Total pages: 12
  • Original format: PDF
