首页> 外文期刊>LIPIcs : Leibniz International Proceedings in Informatics >Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line
【24h】

Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line

机译:弹性简并文本中的字典匹配及其在在线搜索VCF文件中的应用

获取原文
           

摘要

An elastic-degenerate string is a sequence of n sets of strings of total length N. It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm^2+N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N * ceil[m/w]) with pre-processing time and space O(m * ceil[m/w]), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M. A straightforward generalization of the existing bit-vector algorithm would require time O(N * ceil[M/w]) with pre-processing time and space O(M * ceil[M/w]), which is prohibitive in practice. We present a new on-line O(N * ceil[M/w])-time algorithm with pre-processing time and space O(M). We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population's variants are considered.
机译:弹性简并字符串是n个总长度为N的字符串集的序列。已经引入它以紧凑形式表示紧密相关序列的多个序列比对。对于长度为m的标准图案,可以使用预处理时间和空间O(m)在时间O(nm ^ 2 + N)上在线求解弹性简并文本中的图案匹配(Grossi等人,CPM 2017)。一种快速的位向量算法也需要时间O(N * ceil [m / w])和预处理时间和空间O(m * ceil [m / w]),其中w是计算机字的大小。呈现。在本文中,我们对总长度为M的一组模式考虑了相同的问题。现有位向量算法的直接概括将需要时间O(N * ceil [M / w]),且预处理时间和空间为O (M * ceil [M / w]),实际上这是禁止的。我们提出了一种具有预处理时间和空间O(M)的新的在线O(N * ceil [M / w])-时间算法。我们使用合成和真实数据展示了实验结果,证明了算法的性能。我们进一步展示了我们的算法在管道中的实际应用,该管道用于发现和验证人类基因组中的最小缺失词(MAW),表明考虑到种群的变体时,许多先前发现的MAW实际上是假阳性的。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号