Lossless Compression of Chemical Fingerprints Using Integer Entropy Codes Improves Storage and Retrieval

Pierre Baldi; Ryan W. Benz; Daniel S. Hirschberg


Abstract

Many modern chemoinformatics systems for small molecules rely on large fingerprint vector representations, where the components of the vector record the presence or number of occurrences in the molecular graphs of particular combinatorial features, such as labeled paths or labeled trees. These large fingerprint vectors are often compressed to much shorter fingerprint vectors using a lossy compression scheme based on a simple modulo procedure. Here, we combine statistical models of fingerprints with integer entropy codes, such as Golomb and Elias codes, to encode the indices or the run lengths of the fingerprints. After reordering the fingerprint components by decreasing frequency, the indices are monotone-increasing and the run lengths are quasi-monotone-increasing, and both exhibit power-law distribution trends. We take advantage of these statistical properties to derive new, efficient, lossless compression algorithms for monotone integer sequences: monotone value (MOV) coding and monotone length (MOL) coding. In contrast to lossy systems that use 1024 or more bits of storage per molecule, we can achieve lossless compression of long chemical fingerprints based on circular substructures in slightly over 300 bits per molecule, close to the Shannon entropy limit, using a MOL Elias Gamma code for run lengths. The improvement in storage comes at a modest computational cost. Furthermore, because the compression is lossless, uncompressed similarity (e.g., Tanimoto) between molecules can be computed exactly from their compressed representations, leading to significant improvements in retrieval performance, as shown on six benchmark data sets of druglike molecules.
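To make the idea of encoding run lengths with an integer entropy code concrete, here is a minimal Python sketch. It uses the plain Elias gamma code on the gaps between consecutive set bits of a sparse binary fingerprint; it is not the paper's MOL or MOV scheme, whose details the abstract does not give. The function names, the bit-string representation, and the toy fingerprints are all assumptions made for illustration.

```python
from typing import List


def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma is defined for integers >= 1")
    binary = bin(n)[2:]  # binary form; the leading bit is always 1
    # (len - 1) zeros announce the length, then the binary digits follow.
    return "0" * (len(binary) - 1) + binary


def encode_fingerprint(on_bits: List[int]) -> str:
    """Encode strictly increasing indices of set bits as gamma-coded gaps."""
    pieces, prev = [], -1
    for idx in on_bits:
        pieces.append(gamma_encode(idx - prev))  # gap >= 1 by assumption
        prev = idx
    return "".join(pieces)


def decode_fingerprint(code: str) -> List[int]:
    """Invert encode_fingerprint; assumes a well-formed code string."""
    on_bits, prev, pos = [], -1, 0
    while pos < len(code):
        zeros = 0
        while code[pos] == "0":  # count the unary length prefix
            zeros += 1
            pos += 1
        gap = int(code[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        prev += gap
        on_bits.append(prev)
    return on_bits


def tanimoto(a: List[int], b: List[int]) -> float:
    """Tanimoto similarity of two binary fingerprints given as index lists."""
    sa, sb = set(a), set(b)
    union = len(sa | sb)
    return len(sa & sb) / union if union else 1.0


# Hypothetical toy fingerprints (indices of set bits), not real molecules.
fp_a = [0, 3, 4, 10]
fp_b = [0, 4, 7, 10]
code_a = encode_fingerprint(fp_a)
code_b = encode_fingerprint(fp_b)

# Lossless round trip: similarity on decoded data equals the exact value.
similarity = tanimoto(decode_fingerprint(code_a), decode_fingerprint(code_b))
```

Because decoding recovers the original indices exactly, the Tanimoto value computed after the round trip is the same as the one computed on the uncompressed fingerprints, which is the property the abstract relies on. Short gaps get short codewords, so a code like this benefits when the reordering makes small run lengths common.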


Bibliographic record

Source: Journal of Chemical Information and Modeling, 2007, Issue 6, 12 pages
Authors: Pierre Baldi; Ryan W. Benz; Daniel S. Hirschberg
Original format: PDF
Language: English
Chinese Library Classification: Chemistry

