首页> 外文期刊>IEICE Transactions on Information and Systems >Scalable Detection of Frequent Substrings by Grammar-Based Compression
【24h】

Scalable Detection of Frequent Substrings by Grammar-Based Compression

机译:通过基于语法的压缩可伸缩地检测频繁的子字符串

获取原文
获取原文并翻译 | 示例
           

摘要

A scalable pattern discovery by compression is proposed. A string is representable by a context-free grammar deriving the string de-terministically. In this framework of grammar-based compression, the aim of the algorithm is to output as small a grammar as possible. Beyond that, the optimization problem is approximately solvable. In such approximation algorithms, the compressor based on edit-sensitive parsing (ESP) is especially suitable for detecting maximal common substrings as well as long frequent substrings. Based on ESP, we design a linear time algorithm to find all frequent patterns in a string approximately and prove several lower bounds to guarantee the length of extracted patterns. We also examine the performance of our algorithm by experiments in biological sequences and other compressible real world texts. Compared to other practical algorithms, our algorithm is faster and more scalable with large and repetitive strings.
机译:提出了通过压缩的可伸缩模式发现。字符串可以由确定性派生该字符串的上下文无关语法表示。在这种基于语法的压缩框架中,算法的目的是输出尽可能小的语法。除此之外,优化问题是可以解决的。在这样的近似算法中,基于编辑敏感分析(ESP)的压缩器特别适合于检测最大的公共子字符串以及长的频繁子字符串。基于ESP,我们设计了一种线性时间算法,以近似地查找字符串中的所有频繁模式,并证明几个下限以保证提取的模式的长度。我们还通过在生物序列和其他可压缩的真实世界文本中进行实验来检验算法的性能。与其他实用算法相比,我们的算法在使用大型重复字符串的情况下速度更快且可扩展性更高。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号