
As We May Perceive: Finding the Boundaries of Compound Documents on the Web



Abstract

This paper considers the problem of identifying compound documents (cDocs) on the Web: groups of web pages that in aggregate constitute semantically coherent information entities. Examples of cDocs are a news article consisting of several HTML pages, or a set of pages describing the specifications, price, and reviews of a digital camera. Being able to identify cDocs would be useful in many applications, including web and intranet search, user navigation, automated collection generation, and information extraction. In the past, several heuristic approaches have been proposed to identify cDocs [1][5]. However, heuristics fail to capture the variety of types, styles, and goals of information on the web, and do not account for the fact that the definition of a cDoc often depends on the context. This paper presents an experimental evaluation of three machine learning-based algorithms for cDoc discovery. These algorithms are responsive to the varying structure of cDocs and adaptive to their application-specific nature. Building on our previous work [4], this paper proposes a different scenario for discovering cDocs and, in this new setting, compares the local machine-learned clustering algorithm from [4] with a global, purely graph-based approach [3] and a Conditional Markov Network approach previously applied to the noun coreference task [6]. The results show that the approach of [4] outperforms the other algorithms, suggesting that global relational characteristics of web sites are too noisy for cDoc identification purposes.

