Modeling Updates of Scholarly Webpages Using Archived Data

Abstract

The vastness of the web imposes a prohibitive cost on building large-scale search engines with limited resources. Crawl frontiers thus need to be optimized to improve the coverage and freshness of crawled content. In this paper, we propose an approach for modeling the dynamics of change in the web using archived copies of webpages. To evaluate its utility, we conduct a preliminary study on the scholarly web using 19,977 seed URLs of authors’ homepages obtained from their Google Scholar profiles. We first obtain archived copies of these webpages from the Internet Archive (IA), and estimate when their actual updates occurred. Next, we apply maximum likelihood to estimate their mean update frequency (λ) values. Our evaluation shows that λ values derived from a short history of archived data provide a good estimate of the true update frequency in the short term, and that our method provides better estimates of updates at a fraction of the resources required by the baseline models. Based on this, we demonstrate the utility of archived data to optimize the crawling strategy of web crawlers, and uncover important challenges that inspire future research directions.
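The abstract does not spell out the estimator, so the following is only a minimal sketch of the maximum likelihood step under an assumed model: updates arrive as a homogeneous Poisson process, in which case the maximum likelihood estimate of λ is the number of observed updates divided by the length of the observation window. The function names `estimate_lambda` and `prob_changed`, the per-day unit, and the sample timestamps are all hypothetical and are not taken from the paper.

```python
import math
from datetime import datetime


def estimate_lambda(update_times, window_start, window_end):
    """Return the MLE of a homogeneous Poisson rate, in updates per day.

    Assumption: update_times are the estimated update instants of one page,
    and the page was observed over the whole window.
    """
    days = (window_end - window_start).total_seconds() / 86400.0
    if days <= 0:
        raise ValueError("observation window must have positive length")
    # Count only the updates that fall inside the observation window.
    count = sum(1 for t in update_times if window_start <= t <= window_end)
    return count / days


def prob_changed(lam, days_since_last_crawl):
    """Probability of at least one update in the interval, under the same model."""
    return 1.0 - math.exp(-lam * days_since_last_crawl)


# Hypothetical update timestamps for one homepage (illustrative only).
updates = [datetime(2020, 1, 5), datetime(2020, 1, 17), datetime(2020, 1, 29)]
lam = estimate_lambda(updates, datetime(2020, 1, 1), datetime(2020, 1, 31))
print(lam)
print(prob_changed(lam, 7))
```

With three updates in a 30-day window, the estimate is 0.1 updates per day. A crawler could then rank pages by `prob_changed` to decide which ones to revisit first, though the scheduling policy itself is outside what the abstract describes.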

Bibliographic details

Source
IEEE International Conference on Big Data | 2020 | pp. 1868-1877 | 10 pages
Authors
Yasith Jayawardana; Alexander C. Nwala; Gavindya Jayawardena; Jian Wu; Sampath Jayarathna; Michael L. Nelson; C. Lee Giles
Original format: PDF
Keywords
Big Data; Search engines; Portable document format; Frequency estimation; Data models; Internet; History

Similar literature

1. Scholarly Archives for Miscellaneous Videos and Educational Data (SAMVED) [J]. Srinidhi S. Kashyap, Deepthi S. Narayan, B. Subhash Reddy. SRELS Journal of Information Management, 2019, No. 4.
2. Bringing It All Together: Integrating Text, Audio, Metadata, GIS, and Scholarly Criticism in a Holocaust Oral History Archive [J]. English Eben. Chicago Colloquium on Digital Humanities and Computer Science. Journal, 2010, No. 2.
3. Semantic Space models for classification of consumer webpages on metadata attributes [J]. Chen G, Warren J, Riddle P. Journal of Biomedical Informatics, 2010, No. 5.
4. Metadata Synthesis and Updates on Collections Harvested Using the Open Archive Initiative Protocol for Metadata Harvesting [C]. Sarantos Kapidakis. International Conference on Theory and Practice of Digital Libraries, 2018.
5. The Graphic Design Archive: Generating Higher Levels of Scholarship Through a User Experience Approach to Scholarly Image Database Design [D]. Shay, Jenna. 2018.
6. Updates from 2018: Being indexed in Embase, becoming an affiliated journal of the World Federation for Medical Education, implementing an optional open data policy, adopting principles of transparency and best practice in scholarly publishing, and appreciation to reviewers [O]. Sun Huh. 2018.
7. Preserving the scholarly record with WebCite(R) (www.webcitation.org): an archiving system for long-term digital preservation of cited webpages [O]. G. Eysenbach. 2008.
8. Lunar Data Node: Apollo Data Restoration and Archiving Update [R]. Williams, D. R., Hills, H. K., Guiness, E. A. 2013.