Journal: Computational and Structural Biotechnology Journal

Identifying the minimum amplicon sequence depth to adequately predict classes in eDNA-based marine biomonitoring using supervised machine learning


Abstract

Environmental DNA (eDNA) metabarcoding is a powerful approach for biomonitoring and impact assessments. Amplicon-based eDNA sequence data characteristically vary widely in sequencing depth (total reads per sample), influenced among other things by the number of samples analyzed simultaneously per sequencing run. The random forest (RF) machine learning algorithm has been successfully employed to accurately classify unknown samples into monitoring categories. To apply RF to eDNA data while avoiding sequencing-depth artifacts, sequence data are normalized across samples using rarefaction, a process that inherently loses information. The aim of this study was to inform future sampling designs in terms of the relationship between sampling depth and RF accuracy. We analyzed three published bacterial amplicon datasets and one new one using an RF, based initially on the maximal rarefied data available (minimum mean of >30,000 reads across all datasets), to establish our baseline performance. We then evaluated RF classification success on increasingly rarefied datasets. We found that extreme to moderate rarefaction (50–5000 sequences per sample) was sufficient to achieve prediction performance commensurate with the full data, depending on the classification task. We did not find that the number of classes, the data balance across classes, or the total number of sequences or samples was associated with predictive accuracy. We identified the ability of the training data to adequately characterize the classes being mapped as the most important criterion, and we discuss how this finding can inform future sampling design for eDNA-based biomonitoring to reduce costs and computation time.
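To make the rarefaction step concrete, here is a minimal sketch, assuming the common definition of rarefaction as random subsampling of reads without replacement to a fixed depth per sample. The function name `rarefy`, the toy counts, and the chosen depths are illustrative assumptions and are not taken from the study, whose actual pipeline is not described in the abstract.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=None):
    """Subsample a per-taxon read-count vector to exactly `depth` reads,
    drawing reads without replacement.

    Returns None when the sample holds fewer than `depth` reads
    (such samples are usually dropped before analysis).
    """
    total = sum(counts)
    if total < depth:
        return None
    rng = random.Random(seed)
    # Expand the counts into one entry per read, labelled by taxon index.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    picked = Counter(rng.sample(reads, depth))
    return [picked.get(taxon, 0) for taxon in range(len(counts))]


# Hypothetical toy sample: read counts for five taxa (not data from the study).
sample = [1200, 300, 45, 5, 0]

# Increasingly shallow depths, mirroring the idea of evaluating a
# classifier on progressively rarefied data.
for depth in (1500, 500, 50):
    print(depth, rarefy(sample, depth, seed=0))
```

Each rarefied vector sums to the requested depth, so all samples become comparable regardless of their original sequencing depth. Rare taxa tend to drop to zero as the depth shrinks, which is the information loss the abstract refers to. In a workflow like the one described, the rarefied tables at each depth would then be used to train and evaluate a random forest classifier.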
