Conference: Image and printing in a web 2.0 world II

DOM-Based Print-Link Detection for Web Article Extraction

Abstract

Web article pages usually have hyperlinks (or links) that lead to print-friendly web pages containing mainly the article content. Content extraction from these print-friendly pages is generally easier and more reliable, but the many variations in how print-links are represented in HTML make robust print-link detection more difficult than it first appears. First, the link can be text-based, image-based, or both. For example, there is a lexicon of phrases used to indicate print-friendly pages, such as "print", "print article", "print-friendly version", etc. In addition, some links use printer-resembling image icons, with or without a print phrase present. To complicate matters further, not all links contain a valid URL; in some cases the print-friendly page is generated dynamically, either by client-side JavaScript or by the server, so no URL is available for extraction. We estimate that more than 90% of Web article pages have print-links, of which about 35% have valid print-friendly URLs, which is a sizable share. Our solution to the print-link extraction problem has two stages: (1) detection of the print-link, and (2) retrieval of the print-friendly page URL from the link attributes, including a test of its validity. Experimental results based on roughly 2000 web article pages suggest that our solution is capable of achieving over 99% precision and 97% recall.
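The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `PrintLinkFinder`, the functions `resolve_print_url` and `find_print_urls`, the three-entry `PRINT_PHRASES` lexicon, and the "alt or src contains print" icon heuristic are all assumptions made for illustration. The validity test here is purely syntactic (it rejects empty, fragment-only, and `javascript:` hrefs); the paper's actual test may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical lexicon: the abstract names only these three phrases as examples.
PRINT_PHRASES = ("print", "print article", "print-friendly version")


class PrintLinkFinder(HTMLParser):
    """Stage 1: collect hrefs of <a> elements that look like print links."""

    def __init__(self):
        super().__init__()
        self.candidates = []          # href values (possibly None) of detected links
        self._in_anchor = False
        self._href = None
        self._text = []
        self._has_print_icon = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._in_anchor = True
            self._href = attrs.get("href")
            self._text = []
            self._has_print_icon = False
        elif tag == "img" and self._in_anchor:
            # Assumed heuristic: an icon counts as "printer-resembling" if its
            # alt text or file name mentions "print". This can misfire
            # (e.g. "sprint.png"); real icon detection would need more care.
            hint = " ".join(filter(None, (attrs.get("alt"), attrs.get("src"))))
            if "print" in hint.lower():
                self._has_print_icon = True

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            # Normalize case and whitespace before the lexicon lookup.
            text = " ".join("".join(self._text).lower().split())
            if text in PRINT_PHRASES or self._has_print_icon:
                self.candidates.append(self._href)
            self._in_anchor = False


def resolve_print_url(href, base_url):
    """Stage 2: return an absolute http(s) URL, or None if href is unusable."""
    if not href:
        return None
    href = href.strip()
    if not href or href.startswith("#") or href.lower().startswith("javascript:"):
        return None  # page is generated dynamically; no URL to extract
    absolute = urljoin(base_url, href)
    parsed = urlparse(absolute)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return absolute


def find_print_urls(html, base_url):
    """Run both stages and return the valid print-friendly URLs."""
    finder = PrintLinkFinder()
    finder.feed(html)
    finder.close()
    resolved = (resolve_print_url(h, base_url) for h in finder.candidates)
    return [url for url in resolved if url]
```

Calling `find_print_urls(page_html, page_url)` on an article page returns the list of candidate print-friendly URLs. A link that is detected in stage 1 but carries only a `javascript:` handler is dropped in stage 2, which mirrors the abstract's point that many print-links have no extractable URL.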
