Journal: Molecules

Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level

Abstract

Effective computational prediction of syntheses for complex or novel molecules can greatly benefit organic and medicinal chemistry. Retrosynthetic analysis is a method that chemists use to plan synthetic routes to target compounds: the target is incrementally converted into simpler compounds until commercially available starting compounds are reached. However, predictions based on small chemical datasets often have low accuracy because there are too few samples. To address this limitation, we introduced transfer learning to retrosynthetic analysis. Transfer learning is a machine learning approach that trains a model on one task and then applies the model to a related but different task, which can mitigate the problem of scarce data. The large, unclassified USPTO-380K dataset was first used to pretrain the models so that they acquired basic chemical knowledge, such as the chirality of compounds, reaction types, and the SMILES representation of chemical structures. Both USPTO-380K and USPTO-50K (the latter also used by Liu et al.) were originally derived from Lowe’s patent mining work. Liu et al. further processed these data and divided the reaction examples into 10 categories, but we did not. The acquired knowledge was then transferred to the small, classified USPTO-50K dataset for continued training and retrosynthetic reaction tests, and the accuracy of the pretrained models was compared with that of models without pretraining. The transfer learning concept was combined with either the sequence-to-sequence (seq2seq) model or the Transformer model for prediction and verification. Both models are based on an encoder-decoder architecture and were originally built for language translation tasks. They treat a reaction as a translation between the SMILES representations of reactants and products, while also taking into account other relevant chemical information (chirality, reaction types, and conditions). The results showed that pretraining significantly improved the accuracy of retrosynthetic analysis for both the seq2seq and Transformer models. The top-1 accuracy of the Transformer-transfer-learning model (the rate at which the first prediction matches the actual result) increased from 52.4% to 60.7%. The model’s top-20 accuracy (the rate at which the top 20 predictions contain the actual result) was 88.9%, a fairly good result for retrosynthetic analysis. In summary, this study shows that transferring learning between models trained on different chemical datasets is feasible. Introducing transfer learning significantly improved prediction accuracy and was especially helpful for reaction prediction and retrosynthetic analysis based on small datasets.
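
The top-1 and top-20 figures in the abstract follow the usual top-k definition: a target counts as correct if the actual result appears among the first k ranked predictions. A minimal sketch of that metric is given below. The function name `top_k_accuracy` and the toy SMILES strings are illustrative assumptions and are not taken from the paper, and the comparison is an exact string match, whereas a real evaluation would typically canonicalize the SMILES strings first.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    references: Sequence[str],
    k: int,
) -> float:
    """Return the fraction of targets whose reference is in the first k predictions.

    ranked_predictions[i] holds the model outputs for target i, best first.
    references[i] is the recorded (actual) result for target i.
    Matching is by exact string equality (an assumption of this sketch).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_predictions) != len(references):
        raise ValueError("need exactly one reference per prediction list")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")

    hits = sum(
        reference in predictions[:k]
        for predictions, reference in zip(ranked_predictions, references)
    )
    return hits / len(references)


# Hypothetical toy data: four targets, two ranked reactant predictions each.
TOY_PREDICTIONS = [
    ["CC(=O)O.OCC", "CC(=O)Cl.OCC"],
    ["Brc1ccccc1.OB(O)c1ccccc1", "Ic1ccccc1.OB(O)c1ccccc1"],
    ["CCN.CC(=O)O", "CCN.CC(=O)Cl"],
    ["CCO", "CCBr"],
]
TOY_REFERENCES = [
    "CC(=O)O.OCC",              # matched by the first prediction
    "Ic1ccccc1.OB(O)c1ccccc1",  # matched by the second prediction
    "CCN.CC(=O)Cl",             # matched by the second prediction
    "CCI",                      # not matched by any prediction
]

if __name__ == "__main__":
    for k in (1, 2):
        score = top_k_accuracy(TOY_PREDICTIONS, TOY_REFERENCES, k)
        print(f"top-{k} accuracy: {score:.2f}")
```

On this toy set, one of the four targets is matched by its first prediction and three are matched within the first two, which illustrates why top-k accuracy can only stay the same or rise as k grows.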
