Journal: Biology Direct

A machine learning framework to determine geolocations from metagenomic profiling

Abstract

Studies of metagenomic data from environmental microbial samples have found that microbial communities appear to be geolocation-specific, and that the microbiome abundance profile can serve as a differentiating feature for identifying a sample's geolocation. In this paper, we present a machine learning framework to determine geolocations from the metagenomic profiling of microbial samples. Our method was applied to the multi-source microbiome data from the MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints.

First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets and trained the prediction models on the training set. Prediction performance was evaluated on the validation set. Using logistic regression with L2 normalization, the model reaches a prediction accuracy of 86%, averaged over 100 random splits of the training and validation datasets.

The testing data consist of samples from cities that do not occur in the training data. To predict these previously unsampled “mystery” cities, we first defined biological coordinates for the sampled cities based on the similarity of their microbial samples. We then applied an affine transformation to the map so that the distance between cities measures their biological difference rather than their geographical distance. After that, we used Kriging interpolation to derive the probabilities that a given testing sample originated from unsampled cities, based on its predicted probabilities for the sampled cities. Results show that this method can successfully assign high probabilities to the true cities of origin of the testing samples.

Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset.
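
The validation procedure described in the abstract (repeated random training/validation splits, a logistic regression classifier, accuracy averaged over the splits) can be illustrated with a small sketch. It is not the authors' code or data. It assumes that "L2 normalization" refers to an L2 penalty on the weights, it uses synthetic abundance profiles in place of MetaSUB data, and every function name and parameter value (`make_profiles`, `fit`, `mean_accuracy`, the penalty strength, 10 splits instead of 100) is hypothetical. It uses only the Python standard library so that it runs without extra dependencies; in practice a library implementation would be the natural choice.

```python
import math
import random


def make_profiles(rng, n_per_city=30, n_taxa=4, n_cities=3):
    """Synthetic relative-abundance profiles (assumption: one enriched taxon per city)."""
    X, y = [], []
    for city in range(n_cities):
        for _ in range(n_per_city):
            raw = [rng.random() + (2.0 if t == city else 0.0) for t in range(n_taxa)]
            total = sum(raw)
            X.append([v / total for v in raw])  # abundances sum to 1
            y.append(city)
    return X, y


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def predict_proba(W, b, x):
    """Per-city probabilities for one abundance profile x."""
    return softmax([sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))])


def fit(X, y, n_cities, l2=0.01, lr=1.0, epochs=100):
    """Multinomial logistic regression with an L2 penalty, trained by batch gradient descent."""
    n, d = len(X), len(X[0])
    W = [[0.0] * d for _ in range(n_cities)]
    b = [0.0] * n_cities
    for _ in range(epochs):
        # Gradient of the L2 penalty term (biases are not penalized).
        gW = [[l2 * W[k][j] for j in range(d)] for k in range(n_cities)]
        gb = [0.0] * n_cities
        for x, label in zip(X, y):
            p = predict_proba(W, b, x)
            for k in range(n_cities):
                err = (p[k] - (1.0 if k == label else 0.0)) / n
                gb[k] += err
                for j in range(d):
                    gW[k][j] += err * x[j]
        for k in range(n_cities):
            b[k] -= lr * gb[k]
            for j in range(d):
                W[k][j] -= lr * gW[k][j]
    return W, b


def accuracy(W, b, X, y):
    hits = 0
    for x, label in zip(X, y):
        p = predict_proba(W, b, x)
        if p.index(max(p)) == label:
            hits += 1
    return hits / len(y)


def mean_accuracy(n_splits=10, val_fraction=0.25, seed=0):
    """Average validation accuracy over repeated random train/validation splits."""
    rng = random.Random(seed)
    X, y = make_profiles(rng)
    scores = []
    for _ in range(n_splits):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        cut = int(len(idx) * val_fraction)
        val, train = idx[:cut], idx[cut:]
        W, b = fit([X[i] for i in train], [y[i] for i in train], n_cities=3)
        scores.append(accuracy(W, b, [X[i] for i in val], [y[i] for i in val]))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(f"mean validation accuracy: {mean_accuracy():.3f}")
```

Because the synthetic cities are deliberately easy to separate, the accuracy this sketch prints says nothing about the 86% reported for the real data; it only shows how the averaging over random splits is organized. The `predict_proba` output corresponds to the per-city probabilities that the abstract later feeds into the Kriging step, which is not sketched here.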