As We May Perceive: Finding the Boundaries of Compound Documents on the Web



Abstract

This paper considers the problem of identifying compound documents (cDocs) on the Web: groups of web pages that in aggregate constitute semantically coherent information entities. Examples of cDocs are a news article consisting of several HTML pages, or a set of pages describing the specifications, price, and reviews of a digital camera. Being able to identify cDocs would be useful in many applications, including web and intranet search, user navigation, automated collection generation, and information extraction. In the past, several heuristic approaches have been proposed to identify cDocs [1][5]. However, heuristics fail to capture the variety of types, styles, and goals of information on the web, and do not account for the fact that the definition of a cDoc often depends on the context. This paper presents an experimental evaluation of three machine learning-based algorithms for cDoc discovery. These algorithms are responsive to the varying structure of cDocs and adaptive to their application-specific nature. Building on our previous work [4], this paper proposes a different scenario for discovering cDocs and, in this new setting, compares the local machine-learned clustering algorithm from [4] with a global, purely graph-based approach [3] and a Conditional Markov Network approach previously applied to the noun coreference task [6]. The results show that the approach of [4] outperforms the other algorithms, suggesting that global relational characteristics of web sites are too noisy for cDoc identification purposes.
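
As an illustration only of what grouping web pages into cDocs can look like, the Python sketch below merges pages whose pairwise score reaches a threshold and returns the resulting groups. It is not the algorithm of [4], [3], or [6]. The function name group_pages, the page identifiers, the scores, and the 0.5 threshold are all hypothetical; in a real system the scores would presumably come from a learned model.

```python
from collections import defaultdict


def group_pages(pages, pair_scores, threshold=0.5):
    """Group pages into candidate cDocs (hypothetical helper, not from the paper).

    pair_scores maps (page_a, page_b) to an assumed "same cDoc" score in [0, 1].
    Pairs scoring at or above the threshold are merged with union-find.
    """
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent links to the set representative, halving the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for p in pages:
        groups[find(p)].append(p)
    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


# Made-up example data: a three-page news article, two camera pages, one stray page.
pages = ["news/p1", "news/p2", "news/p3", "camera/specs", "camera/reviews", "about"]
pair_scores = {
    ("news/p1", "news/p2"): 0.9,
    ("news/p2", "news/p3"): 0.8,
    ("camera/specs", "camera/reviews"): 0.7,
    ("news/p3", "about"): 0.1,
}
clusters = group_pages(pages, pair_scores)
print(clusters)
```

With these made-up scores, the three news pages form one group, the two camera pages another, and the "about" page stays on its own. Raising or lowering the threshold changes where the cDoc boundaries fall, which mirrors the abstract's point that the definition of a cDoc depends on context.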


Bibliographic details

Source
Proceedings of the 17th International World Wide Web Conference (WWW08) | 2008 | 2 pages
Author
Pavel Dmitriev
Original format: PDF
Keywords
WWW; Compound Documents; Machine Learning


Similar literature

1. African American Ethnic and Class-Based Identities on the World Wide Web: Moderating the Effects of Self-Perceived Information Seeking/Finding and Web Self-Efficacy [J] . Jennifer R. Warren, Michael L. Hecht, Eura Jung. Communication Research . 2010, No. 5

2. Weblogs for market research: finding more relevant opinion documents using system fusion [J] . Deanna Osman, John Yearwood, Peter Vamplew. On-line review . 2009, No. 5

3. Understanding web documents: finding pagelets for transformation using structural patterns [J] . Reza Ferrydiansyah, Bambang Parmanto. International Journal of Web Engineering and Technology . 2008, No. 3

4. Online Learning for Web Query Generation: Finding Documents Matching a Minority Concept on the Web [C] . Rayid Ghani, Rosie Jones, Dunja Mladenic. First Asia-Pacific Conference on Web Intelligence: Research and Development WI 2001, Oct 23-26, 2001, Maebashi City, Japan . 2001

5. Finding the boundaries of compound documents on the Web [D] . Dmitriev, Pavel Alexandrovich. 2008

6. Desktop document delivery using portable document format (PDF) files and the Web. [O] . J P Shipman, W L Gembala, J M Reeder, 1998

7. ACC/AHA/NASPE 2002 Guideline Update for Implantation of Cardiac Pacemakers and Antiarrhythmia Devices - Summary Article: A Report of the American College of Cardiology/American Heart Association Task Force on Practice Guidelines (ACC/AHA/NASPE Committee to Update the 1998 Pacemaker Guidelines). J Am Coll Cardiol 2002;40:1703–19. [O] . Gregoratos Gabriel, Abrams Jonathan, Epstein Andrew E, 2002
