International Symposium on Computational and Business Intelligence

CrawlPart: Creating Crawl Partitions in Parallel Crawlers


Abstract

With the ever-proliferating size and scale of the WWW [1], efficient ways of exploring its content are of increasing importance. How can we efficiently retrieve information from it through crawling? In this "era of tera" and of multi-core processors, we ought to consider multi-threaded processes as a serving solution. Better still, how can we improve crawling performance by using parallel crawlers that work independently? This paper is devoted to the fundamental advantages of, and challenges arising from, the design of parallel crawlers [4]. It focuses mainly on URL distribution among the various parallel crawling processes. How to distribute URLs from the URL frontier to the concurrently executing crawling threads is an orthogonal problem. The paper addresses this problem by designing a framework that partitions the URL frontier into several URL queues and orders the URLs within each of the distributed sets.
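To make the idea concrete, the following is a minimal Python sketch of one way such a frontier-partitioning framework could work. The hash-on-hostname assignment of URLs to queues and the per-queue priority ordering are illustrative assumptions, not the partitioning and ordering functions defined in the paper.

    # Minimal sketch: partition a URL frontier into ordered per-crawler queues.
    # The hostname-hash assignment and priority ordering are assumed policies.
    import heapq
    from urllib.parse import urlparse

    class CrawlPartitioner:
        """Splits a URL frontier into several ordered URL queues."""

        def __init__(self, num_crawlers):
            self.num_crawlers = num_crawlers
            # One priority queue (min-heap) per crawling process/thread.
            self.queues = [[] for _ in range(num_crawlers)]

        def _partition_of(self, url):
            # Send all URLs of a host to the same partition so each
            # crawling process can work independently (assumed policy).
            host = urlparse(url).netloc
            return hash(host) % self.num_crawlers

        def add(self, url, priority=0):
            # Lower priority value = crawled earlier within its partition.
            part = self._partition_of(url)
            heapq.heappush(self.queues[part], (priority, url))

        def next_url(self, crawler_id):
            # Each crawler pops only from its own ordered queue.
            queue = self.queues[crawler_id]
            return heapq.heappop(queue)[1] if queue else None

    if __name__ == "__main__":
        frontier = CrawlPartitioner(num_crawlers=3)
        for u in ["http://example.com/a", "http://example.org/b",
                  "http://example.com/c"]:
            frontier.add(u)
        print(frontier.next_url(0), frontier.next_url(1), frontier.next_url(2))

In this sketch, assigning every URL of a given host to the same partition keeps the crawling processes independent and avoids duplicate fetches across partitions, at the cost of possible load imbalance between queues; the within-queue ordering then decides which URL each crawler fetches next.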
