Venue: IEEE International Conference on Big Data

Modeling Updates of Scholarly Webpages Using Archived Data

Abstract

The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors’ homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA) and estimate when their actual updates occurred. Next, we apply maximum likelihood estimation to obtain their mean update frequency (λ) values. Our evaluation shows that λ values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimates of updates than the baseline models while using a fraction of the resources. Based on this, we demonstrate the utility of archived data for optimizing the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions.
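The abstract does not spell out the likelihood being maximized, so the following is only a minimal sketch of one common formulation, not the paper's method. It assumes that page updates follow a homogeneous Poisson process with rate λ, and that comparing consecutive archived snapshots reveals only whether at least one change happened in each gap. The function name `estimate_lambda` and its inputs are hypothetical and used for illustration only.

```python
import math


def estimate_lambda(intervals, changed, iterations=100):
    """Maximum likelihood estimate of a Poisson change rate.

    Assumption (not from the paper): a gap of length t shows a change with
    probability 1 - exp(-lam * t) and no change with probability exp(-lam * t).

    intervals: positive gap lengths between consecutive snapshots (e.g. days)
    changed:   booleans, True if the content differed across that gap
    """
    if len(intervals) != len(changed):
        raise ValueError("intervals and changed must have the same length")
    if any(t <= 0 for t in intervals):
        raise ValueError("all intervals must be positive")

    changed_gaps = [t for t, c in zip(intervals, changed) if c]
    unchanged_total = sum(t for t, c in zip(intervals, changed) if not c)

    if not changed_gaps:
        return 0.0        # no change ever observed
    if unchanged_total == 0:
        return math.inf   # every gap changed: the likelihood keeps growing

    def term(lam, t):
        x = lam * t
        # Guard against overflow in expm1 for very large exponents.
        return 0.0 if x > 700 else t / math.expm1(x)

    def score(lam):
        # Derivative of the log-likelihood with respect to lam.
        return sum(term(lam, t) for t in changed_gaps) - unchanged_total

    # The score decreases in lam, so bracket the root and then bisect.
    lo, hi = 0.0, 1.0
    while score(hi) > 0:
        hi *= 2
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Ten one-day gaps, four of which showed a change.
rate = estimate_lambda([1.0] * 10, [True] * 4 + [False] * 6)
print(f"estimated updates per day: {rate:.4f}")
```

When all gaps have the same length t, this estimate reduces to the closed form λ = -ln(1 - n/N) / t, where n of the N gaps showed a change, which gives a quick sanity check on the numerical solver.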
