【24h】

Processing Data-Stream Join Aggregates Using Skimmed Sketches

机译:使用撇取草图处理数据流联接聚合

获取原文
获取原文并翻译 | 示例

摘要

There is a growing interest in on-line algorithms for analyzing and querying data streams, that examine each stream element only once and have at their disposal, only a limited amount of memory. Providing (perhaps approximate) answers to aggregate queries over such streams is a crucial requirement for many application environments; examples include large IP network installations where performance data from different parts of the network needs to be continuously collected and analyzed. In this paper, we present the skimmed-sketch algorithm for estimating the join size of two streams. (Our techniques also readily extend to other join-aggregate queries.) To the best of our knowledge, our skimmed-sketch technique is the first comprehensive join-size estimation algorithm to provide tight error guarantees while: (1) achieving the lower bound on the space required by any join-size estimation method in a streaming environment, (2) handling streams containing general update operations (inserts and deletes), (3) incurring a low logarithmic processing time per stream element, and (4) not assuming any a-priori knowledge of the frequency distribution for domain values. Our skimmed-sketch technique achieves all of the above by first skimming the dense frequencies from random hash-sketch summaries of the two streams. It then computes the subjoin size involving only dense frequencies directly, and uses the skimmed sketches only to approximate subjoin sizes for the non-dense frequencies. Results from our experimental study with real-life as well as synthetic data streams indicate that our skimmed-sketch algorithm provides significantly more accurate estimates for join sizes compared to earlier sketch-based techniques.
机译:对用于分析和查询数据流的在线算法的兴趣日益浓厚,该算法仅检查每个流元素一次,只有有限数量的内存可供使用。在许多应用程序环境中,提供(可能是近似的)答案以通过此类流进行聚合查询是至关重要的要求。示例包括大型IP网络安装,其中需要不断收集和分析来自网络不同部分的性能数据。在本文中,我们提出了略读草图算法来估计两个流的连接大小。 (我们的技术也可以很容易地扩展到其他联接聚合查询。)据我们所知,我们的略读草图技术是第一个提供严格的错误保证的综合联接大小估计算法,同时:(1)在流环境中任何联接大小估计方法所需的空间,(2)处理包含常规更新操作(插入和删除)的流,(3)每个流元素的对数处理时间短,并且(4)不假定任何域值频率分布的先验知识。我们的略读技术通过首先从两个流的随机哈希略写摘要中略过密集的频率,从而实现了上述所有目的。然后,它直接计算仅涉及密集频率的子连接点大小,并仅使用撇取的草图来近似计算非密集频率的子连接点大小。我们对现实生活以及合成数据流进行实验研究的结果表明,与早期的基于草图的技术相比,我们的“略过草图”算法可为连接尺寸提供更准确的估算。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利
获取原文

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号