Conference: Network and Parallel Computing

Scale-Adaptable Recrawl Strategies for DHT-Based Distributed Web Crawling System



Abstract

A large-scale distributed Web crawling system built on voluntarily contributed personal computing resources allows small companies to build their own search engines at very low cost. The biggest challenge for such a system is implementing functionality equivalent to that of traditional search engines in a fluctuating distributed environment. One such function is incremental crawling, which requires recrawling each Web site according to how frequently its content is updated. However, recrawl intervals calculated solely from the change frequency of Web sites may not match the system's real-time capacity, which leads to inefficient use of resources. Building on our previous work on a DHT-based Web crawling system, this paper proposes two scale-adaptable recrawl strategies to address this issue. The proposed methods are evaluated through simulations based on real Web datasets and show satisfactory results.
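The abstract does not describe the two proposed strategies, so the following is only a minimal illustration of the mismatch it mentions, not the authors' method. It assumes each site has a recrawl interval in hours derived from its change frequency, that the system's capacity is expressed as crawls per hour, and that intervals are rescaled by one common factor. The function names, the site names, and the uniform-scaling rule are all hypothetical.

```python
from typing import Dict


def required_rate(intervals: Dict[str, float]) -> float:
    """Crawls per hour implied by per-site recrawl intervals (in hours)."""
    return sum(1.0 / hours for hours in intervals.values())


def scale_intervals(intervals: Dict[str, float], capacity: float) -> Dict[str, float]:
    """Rescale every interval by one common factor so the implied load
    equals the given capacity (crawls per hour).

    Assumption for illustration only: uniform scaling. The paper's actual
    strategies are not specified in the abstract.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    factor = required_rate(intervals) / capacity
    return {site: hours * factor for site, hours in intervals.items()}


# Hypothetical intervals computed from change frequency alone.
intervals = {"a.example": 1.0, "b.example": 2.0, "c.example": 4.0}

demand = required_rate(intervals)            # load the intervals ask for
adjusted = scale_intervals(intervals, 0.875)  # capacity is half of the demand
```

When the demand exceeds the capacity, the factor is greater than 1 and every interval is lengthened; when the system has spare capacity, the factor is below 1 and sites are recrawled more often, so contributed resources are not left idle.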
