Synthetic and Systems Biotechnology

The statistical power of k-mer based aggregative statistics for alignment-free detection of horizontal gene transfer

Abstract

Alignment-based database search and sequence comparison are commonly used to detect horizontal gene transfer (HGT). However, with the rapid increase in sequencing depth, hundreds of thousands of contigs are routinely assembled in metagenomics studies, and these overwhelm the known reference sequences that alignment-based HGT analysis depends on. Detecting HGT with k-mer statistics is therefore an attractive alternative. These alignment-free statistics have demonstrated high performance and efficiency in whole-genome and transcriptome comparisons. To adapt k-mer statistics for HGT detection, we developed two aggregative statistics, $T_{sum}^{S}$ and $T_{sum}^{*}$, which subsample metagenome contigs by their representative regions and summarize the regional $D_2^{S}$ and $D_2^{*}$ metrics by their upper bounds. We systematically studied the power of the aggregative statistics at different k-mer sizes using simulations. Our analysis showed that, in general, the power of $T_{sum}^{S}$ and $T_{sum}^{*}$ increases with sequencing coverage and reaches a maximum of >80% at k = 6, with 5% Type-I error and a coverage ratio >0.2x. The statistical power of $T_{sum}^{S}$ and $T_{sum}^{*}$ was evaluated with realistic simulations of HGT mechanism, sequencing depth, read length, and base error. We expect these statistics to be useful distance metrics for identifying HGT in metagenomic studies.
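To make the regional metrics concrete, the following is a minimal sketch of the $D_2^{S}$ and $D_2^{*}$ statistics as they are commonly defined in the alignment-free literature. It is not the authors' implementation. It assumes an i.i.d. background model estimated from each sequence's own base frequencies with a pseudocount. The function names, and the sliding-window maximum in `max_window_statistics`, are hypothetical illustrations of summarizing regions by an upper bound; the abstract does not give the exact definition of $T_{sum}^{S}$ and $T_{sum}^{*}$.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers, skipping words with non-ACGT characters."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if all(c in ALPHABET for c in word):
            counts[word] += 1
    return counts


def base_freqs(seq):
    """Base frequencies for the assumed i.i.d. background (pseudocount of 1)."""
    counts = Counter(c for c in seq if c in ALPHABET)
    total = sum(counts[b] + 1 for b in ALPHABET)
    return {b: (counts[b] + 1) / total for b in ALPHABET}


def word_prob(word, freqs):
    """Probability of a word under the i.i.d. background."""
    p = 1.0
    for c in word:
        p *= freqs[c]
    return p


def d2_statistics(x, y, k):
    """Return (D2S, D2*) between sequences x and y for word length k."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    nx, ny = sum(cx.values()), sum(cy.values())
    if nx == 0 or ny == 0:
        raise ValueError("both sequences must contain at least one valid k-mer")
    fx, fy = base_freqs(x), base_freqs(y)
    d2s = d2star = 0.0
    for letters in product(ALPHABET, repeat=k):
        w = "".join(letters)
        px, py = word_prob(w, fx), word_prob(w, fy)
        # Centered counts: observed minus expected under the background.
        xt = cx[w] - nx * px
        yt = cy[w] - ny * py
        norm = sqrt(xt * xt + yt * yt)
        if norm > 0:
            d2s += xt * yt / norm
        d2star += xt * yt / sqrt(nx * px * ny * py)
    return d2s, d2star


def max_window_statistics(contig, reference, k, window, step):
    """Hypothetical aggregate: largest D2S and D2* over contig windows."""
    best_s = best_star = float("-inf")
    last_start = max(len(contig) - window, 0)
    for start in range(0, last_start + 1, step):
        s, star = d2_statistics(contig[start:start + window], reference, k)
        best_s = max(best_s, s)
        best_star = max(best_star, star)
    return best_s, best_star
```

Both statistics are similarity scores, so larger values mean the two k-mer profiles deviate from their backgrounds in the same direction. Calling `max_window_statistics(contig, genome, 6, 500, 250)` would score one contig against one genome at k = 6; the window and step sizes here are placeholders, not values taken from the paper.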