Journal: Journal of Chemical Information and Modeling

Comparing the Influence of Simulated Experimental Errors on 12 Machine Learning Algorithms in Bioactivity Modeling Using 12 Diverse Data Sets


Abstract

To date, no systematic study has assessed the effect of random experimental errors on the predictive power of QSAR models. To address this gap, we have benchmarked the noise sensitivity of 12 learning algorithms on 12 data sets (15,840 models in total), namely: Support Vector Machines (SVM) with radial and polynomial (Poly) kernels, Gaussian Process (GP) with radial and polynomial kernels, Relevance Vector Machines (radial kernel), Random Forest (RF), Gradient Boosting Machines (GBM), Bagged Regression Trees, Partial Least Squares, and k-Nearest Neighbors. Model performance on the test set was used as a proxy to monitor the relative noise sensitivity of these algorithms as a function of the level of simulated noise added to the bioactivities in the training set. The noise was simulated by sampling from Gaussian distributions with increasingly large variances, ranging from zero to the range of pIC50 values covered by a given data set. General trends were identified by designing a full-factorial experiment, which was analyzed with a normal linear model. Overall, GBM displayed low noise tolerance, although its performance was comparable to that of RF, SVM Radial, SVM Poly, GP Poly, and GP Radial at low noise levels. Of practical relevance, we show that the bag fraction parameter has a marked influence on the noise sensitivity of GBM, suggesting that low values (e.g., 0.1-0.2) should be set for this parameter when modeling noisy data. The remaining 11 algorithms display comparable noise tolerance, as a smooth and linear degradation of model performance is observed as the noise level increases. However, SVM Poly and GP Poly display significant noise sensitivity at high noise levels in some cases. Overall, these results provide a practical guide for making informed decisions about which algorithm and parameter values to use according to the noise level present in the data.
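The noise-injection protocol described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' code. It assumes a toy one-dimensional descriptor, synthetic pIC50-like values, and a hand-written k-Nearest Neighbors regressor standing in for the benchmarked algorithms. The function names (`add_label_noise`, `knn_predict`, `noise_sweep`) and the `noise_fraction` parameter are hypothetical, and setting the variance to a fraction of the activity range is one reading of the abstract's wording.

```python
import math
import random


def add_label_noise(y, noise_fraction, rng):
    """Return a copy of y with zero-mean Gaussian noise added.

    Assumption: the noise variance is noise_fraction * (max(y) - min(y)),
    so a fraction of 0 adds no noise and a fraction of 1 uses a variance
    equal to the full range of the activity values.
    """
    value_range = max(y) - min(y)
    sd = math.sqrt(noise_fraction * value_range)
    return [v + rng.gauss(0.0, sd) for v in y]


def knn_predict(train_x, train_y, query, k=5):
    """Predict by averaging the labels of the k nearest training points."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: abs(train_x[i] - query))[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(pred, true):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


def noise_sweep(train_x, train_y, test_x, test_y, fractions, seed=0):
    """Train on increasingly noisy labels; score on the clean test set."""
    rng = random.Random(seed)
    results = {}
    for fraction in fractions:
        noisy_y = add_label_noise(train_y, fraction, rng)
        preds = [knn_predict(train_x, noisy_y, x) for x in test_x]
        results[fraction] = rmse(preds, test_y)
    return results


# Synthetic data: a single descriptor x and a linear "pIC50" response.
data_rng = random.Random(42)
xs = [data_rng.uniform(0.0, 10.0) for _ in range(300)]
ys = [4.0 + 0.5 * x for x in xs]
train_x, train_y = xs[:200], ys[:200]
test_x, test_y = xs[200:], ys[200:]

sweep = noise_sweep(train_x, train_y, test_x, test_y,
                    fractions=[0.0, 0.25, 0.5, 1.0])
for fraction, error in sweep.items():
    print(f"noise fraction {fraction:.2f}: test RMSE {error:.3f}")
```

Only the training labels are perturbed. The test labels stay clean, so the test-set RMSE acts as the proxy for noise sensitivity, as in the abstract. To probe the bag fraction effect reported for GBM, the same sweep could be repeated with a gradient boosting implementation while varying its subsampling parameter.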
