Data Mining/Machine Learning Techniques for Drug Discovery: Computational and Experimental Pipeline Development

Abstract

Medicine is a precious commodity that saves, prolongs, or improves the quality of life. However, discovering medicinal active ingredients is challenging and is one of the major bottlenecks in developing new pharmaceuticals. The steady development of new therapeutic targets and compounds exacerbates the problem, as the scale of the drug discovery endeavor grows to an unmanageable size. For example, the National Institutes of Health houses the National Library of Medicine (NLM), which contains an ever-growing archive of genes, proteins, and therapeutic targets as well as candidate compounds. Manual inspection of all compounds and biological targets cannot match the rate at which new information is created and deposited. New methods of data processing and drug candidate consideration are needed.

The work presented used and processed data from the NLM to identify new candidates for consideration. The drug discovery pipeline central to this work created models from existing compound-target interaction data that correlated structure to activity. The models were then used to identify the next candidates to test. Compound structural information was captured using the Signature molecular descriptor, while models were created using principal component analysis, genetic algorithms, and support vector machines. The models identified new candidates for activity validation experiments in a virtual high-throughput screen of the 72 million compounds in the PubChem Compound database of the National Library of Medicine. The models were retrained to determine whether improvement was possible and what might affect the improvement resulting from retraining. After the activity validation experiments, the activity and structure of the candidates and of compounds from the training set were compared to identify structure-activity relationships for additional avenues of inquiry.

Seven case studies were conducted to test the robustness of the pipeline in response to changing dataset size and active fraction: Cathepsin L, Factor XIIa, Factor XIa, C1s, SENP8, and PK-M2 with two different datasets. Results from all seven case studies showed that model retraining was beneficial and that the pipeline was more effective at low active fractions. Recommendations for future use include retraining models when possible, extrapolating incrementally, and applying the pipeline to datasets with small active fractions while avoiding large datasets with high active fractions, to maximize pipeline effectiveness and utility.
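The screen, validate, and retrain cycle described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the thesis's implementation: a simple perceptron stands in for the PCA, genetic algorithm, and support vector machine models, plain fragment counts stand in for the Signature molecular descriptor, and the `assay` function, fragment names, compound ids, and all data are hypothetical.

```python
import random

def score(weights, bias, counts):
    """Linear score of one compound computed from its fragment counts."""
    return bias + sum(weights.get(f, 0.0) * n for f, n in counts.items())

def train_perceptron(data, epochs=20):
    """Fit a perceptron on (counts, label) pairs; label is +1 (active) or -1."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for counts, label in data:
            if label * score(weights, bias, counts) <= 0:  # misclassified
                for f, n in counts.items():
                    weights[f] = weights.get(f, 0.0) + label * n
                bias += label
    return weights, bias

def screen(model, library, top_k):
    """Rank untested compounds by model score and return the top_k ids."""
    weights, bias = model
    ranked = sorted(library,
                    key=lambda cid: score(weights, bias, library[cid]),
                    reverse=True)
    return ranked[:top_k]

def make_compound(rng):
    """Hypothetical compound: counts of four made-up structural fragments."""
    return {f: rng.randint(0, 3)
            for f in ("frag_a", "frag_b", "frag_c", "frag_d")}

def assay(counts):
    """Hypothetical stand-in for a wet-lab activity validation experiment."""
    return 1 if counts["frag_a"] >= 2 else -1

rng = random.Random(0)
training = [(c, assay(c)) for c in (make_compound(rng) for _ in range(30))]
library = {f"cmpd_{i}": make_compound(rng) for i in range(200)}

model = train_perceptron(training)
hit_counts = []
for _ in range(2):  # two screen -> validate -> retrain rounds
    picks = screen(model, library, top_k=10)
    tested = [library.pop(cid) for cid in picks]  # remove tested compounds
    labels = [assay(c) for c in tested]
    hit_counts.append(labels.count(1))            # actives found this round
    training.extend(zip(tested, labels))          # grow the training set
    model = train_perceptron(training)            # retrain on old + new data

print("actives found per round:", hit_counts)
```

Each round moves the tested candidates from the screening library into the training set before retraining, which mirrors the incremental extrapolation the abstract recommends.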

Bibliographic record

  • Author

    Chen, Jonathan Jun Feng

  • Author affiliation

    The University of Akron

  • Degree-granting institution: The University of Akron
  • Subjects: Biochemistry; Chemical engineering; Bioinformatics; Biology; Computer science
  • Degree: Ph.D.
  • Year: 2018
  • Pages: 265 p.
  • Total pages: 265
  • Original format: PDF
  • Language: English
