Journal: Journal of Chemical Information and Modeling

Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval

Abstract

Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components in order of decreasing frequency, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new efficient, lossless compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrieval performance, as shown on six benchmark data sets of druglike molecules.
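To make the contrast between modulo folding and lossless run-length coding concrete, here is a minimal Python sketch. It is not the MOV or MOL coding described in the abstract: it applies a plain Elias gamma code to the gaps between sorted "on" indices of a sparse binary fingerprint, without the frequency reordering or statistical model the paper relies on. The function names, the toy indices, and the folded length of 8 are all assumptions made for illustration.

```python
def elias_gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, returned as a bit string."""
    if n < 1:
        raise ValueError("Elias gamma codes only positive integers")
    binary = bin(n)[2:]  # binary form; the leading bit is always 1
    # Prefix with (length - 1) zeros so the decoder knows how many bits follow.
    return "0" * (len(binary) - 1) + binary


def encode_fingerprint(indices) -> str:
    """Losslessly encode distinct 'on' bit indices as gamma-coded gaps."""
    bits = []
    previous = -1
    for index in sorted(indices):
        bits.append(elias_gamma_encode(index - previous))  # gap is always >= 1
        previous = index
    return "".join(bits)


def decode_fingerprint(bits: str) -> list:
    """Invert encode_fingerprint and recover the sorted 'on' indices."""
    indices = []
    previous = -1
    pos = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":  # count the unary length prefix
            zeros += 1
            pos += 1
        gap = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        previous += gap
        indices.append(previous)
    return indices


def fold_modulo(indices, length: int) -> list:
    """Lossy folding: distinct indices can collide after the modulo step."""
    return sorted({index % length for index in indices})


def tanimoto(a, b) -> float:
    """Tanimoto similarity between two sets of 'on' indices."""
    set_a, set_b = set(a), set(b)
    union = len(set_a | set_b)
    return len(set_a & set_b) / union if union else 1.0


# Hypothetical toy fingerprints (indices of the bits that are set).
mol_a = [0, 3, 4, 10]
mol_b = [3, 4, 7]

code_a = encode_fingerprint(mol_a)
code_b = encode_fingerprint(mol_b)

# Decoding recovers the original indices, so the similarity is exact.
exact = tanimoto(decode_fingerprint(code_a), decode_fingerprint(code_b))

# Folding to 8 bits maps index 10 onto index 2 and cannot be undone.
folded = tanimoto(fold_modulo(mol_a, 8), fold_modulo(mol_b, 8))
```

In this toy case `code_a` is the 10-bit string `1011100110`, and decoding it returns the original indices, which is why the Tanimoto value computed after decoding matches the uncompressed one. The folded value is only an approximation, and bit collisions will generally shift it as fingerprints become denser.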
