Journal of Chemical Information and Modeling

Assessment of Machine Learning Reliability Methods for Quantifying the Applicability Domain of QSAR Regression Models


Abstract

The vastness of chemical space and the relatively small coverage by experimental data recording molecular properties require us to identify subspaces, or domains, for which we can confidently apply QSAR models. The prediction of QSAR models in these domains is reliable, and potential subsequent investigations of such compounds would find that the predictions closely match the experimental values. Standard approaches in QSAR assume that predictions are more reliable for compounds that are "similar" to those in subspaces with denser experimental data. Here, we report on a study of an alternative set of techniques recently proposed in the machine learning community. These methods quantify prediction confidence through estimation of the prediction error at the point of interest. Our study includes 20 public QSAR data sets with continuous response and assesses the quality of 10 reliability scoring methods by observing their correlation with prediction error. We show that these new alternative approaches can outperform standard reliability scores that rely only on similarity to compounds in the training set. The results also indicate that the quality of reliability scoring methods is sensitive to data set characteristics and to the regression method used in QSAR. We demonstrate that at the cost of increased computational complexity these dependencies can be leveraged by integration of scores from various reliability estimation approaches. The reliability estimation techniques described in this paper have been implemented in an open source add-on package (https://bitbucket.org/biolab/orange-reliability) to the Orange data mining suite.
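The evaluation idea in the abstract (score each prediction for reliability, then check how well the score tracks the actual prediction error) can be illustrated with a small sketch. This is not the paper's method and does not use the orange-reliability package: the toy one-dimensional data, the k-nearest-neighbour regressor, the mean-distance score, and the choice of Pearson correlation are all assumptions made for illustration, and the names `mean_knn_distance`, `knn_predict`, and `pearson` are hypothetical.

```python
import math
import random


def mean_knn_distance(x, train_X, k=3):
    """Similarity-based score (assumed form): mean Euclidean distance
    from x to its k nearest training points. Larger = less reliable."""
    dists = sorted(math.dist(x, t) for t in train_X)
    return sum(dists[:k]) / k


def knn_predict(x, train_X, train_y, k=3):
    """Toy regressor: average response of the k nearest training points."""
    nearest = sorted(zip(train_X, train_y), key=lambda p: math.dist(x, p[0]))[:k]
    return sum(y for _, y in nearest) / k


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


random.seed(0)

# Synthetic "training set": descriptors confined to [0, 1], response y = x^2 + noise.
train_X = [(random.random(),) for _ in range(50)]
train_y = [x[0] ** 2 + random.gauss(0, 0.01) for x in train_X]

# Synthetic "test set": descriptors spread over [0, 3], so many points lie
# outside the region covered by training data.
test_X = [(3 * i / 29,) for i in range(30)]
test_y = [x[0] ** 2 for x in test_X]

# Reliability score and absolute prediction error for every test point.
scores = [mean_knn_distance(x, train_X) for x in test_X]
errors = [abs(knn_predict(x, train_X, train_y) - y) for x, y in zip(test_X, test_y)]

# Quality of the score = its correlation with the observed error.
r = pearson(scores, errors)
print(f"correlation between reliability score and |error|: {r:.3f}")
```

On this toy data the distance-based score should correlate positively with the error, because test points far from the training region are predicted poorly. Real QSAR descriptors are high-dimensional, and the abstract reports that score quality depends on both the data set and the regression method, so a result on synthetic data like this says nothing about any particular method's ranking.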
