【24h】

Experiments on Adaptive Set Intersections for Text Retrieval Systems

机译:文本检索系统自适应集相交的实验

获取原文
获取原文并翻译 | 示例

摘要

In [3] we introduced an adaptive algorithm for computing the intersection of k sorted sets within a factor of at most 8k comparisons of the information-theoretic lower bound under a model that deals with an encoding of the shortest proof of the answer. This adaptive algorithm performs better for "burstier" inputs than a straightforward worst-case optimal method. Indeed, we have shown that, subject to a reasonable measure of instance difficulty, the algorithm adapts optimally up to a constant factor. This paper explores how this algorithm behaves under actual data distributions, compared with standard algorithms. We present experiments for searching 114 megabytes of text from the World Wide Web using 5,000 actual user queries from a commercial search engine. From the experiments, it is observed that the theoretically optimal adaptive algorithm is not always the optimal in practice, given the distribution of WWW text data. We then proceed to study several improvement techniques for the standard algorithms. These techniques combine improvements suggested by the observed distribution of the data as well as the theoretical results from. We perform controlled experiments on these techniques to determine which ones result in improved performance, resulting in an algorithm that outperforms existing algorithms in most cases.
机译:在[3]中,我们引入了一种自适应算法,用于在处理最短答案的编码的模型下,计算在信息理论下限的最多8k比较中,k个排序集的交集。与“最坏的情况”最优方法相比,该自适应算法在“突发”输入中的性能更好。确实,我们已经表明,在对实例难度进行合理衡量的前提下,该算法可以最优化地适应恒定因子。与标准算法相比,本文探讨了该算法在实际数据分布下的行为。我们提出了使用来自商业搜索引擎的5,000个实际用户查询从Internet搜索114兆字节文本的实验。从实验中观察到,给定WWW文本数据的分布,理论上最佳的自适应算法在实践中并不总是最佳的。然后,我们继续研究标准算法的几种改进技术。这些技术结合了观察到的数据分布以及从中得出的理论结果所提出的改进。我们对这些技术进行了受控实验,以确定哪些技术可提高性能,从而使该算法在大多数情况下均优于现有算法。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号