Crawling the Infinite Web: Five Levels Are Enough

Abstract

A large number of publicly available Web pages are generated dynamically upon request and contain links to other dynamically generated pages. This usually produces Web sites that can create arbitrarily many pages. In this article, several probabilistic models for browsing "infinite" Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 "clicks" away from the start page, to reach 90% of the pages that users actually visit.
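
The abstract does not spell out the article's models, so the following is only a toy illustration of the kind of estimate it describes, not a reconstruction of the authors' method. It assumes a hypothetical user who starts at the start page (level 0) and, after each page view, goes one level deeper with a fixed probability `q` or otherwise stops. Under that assumption, page views at level d are proportional to q^d, so the share of page views within the first `depth` levels is 1 - q^(depth + 1). The names `coverage_at_depth`, `depth_for_coverage`, and `q` are introduced here for illustration.

```python
def coverage_at_depth(q: float, depth: int) -> float:
    """Share of page views at levels 0..depth in the toy model.

    Assumption (not from the article): after every page view the user
    continues one level deeper with probability q, otherwise stops.
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    if depth < 0:
        raise ValueError("depth must be non-negative")
    # Geometric series: sum_{d=0..depth} q^d / sum_{d=0..inf} q^d
    return 1.0 - q ** (depth + 1)


def depth_for_coverage(q: float, target: float = 0.9) -> int:
    """Smallest depth whose coverage reaches the target share."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    depth = 0
    while coverage_at_depth(q, depth) < target:
        depth += 1
    return depth


if __name__ == "__main__":
    for q in (0.5, 0.6):
        print(q, depth_for_coverage(q))
```

In this toy setting, `q = 0.5` needs 3 levels below the start page to cover 90% of page views, and `q = 0.6` needs 4. The real value of such a parameter would have to be fitted to observed page-view data, which is what the article reports doing for its own models.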
