Bootstrapping Information Extraction from Semi-structured Web Pages

Abstract

We consider the problem of extracting structured records from semi-structured web pages with no human supervision required for each target web site. Previous work on this problem has either required significant human effort for each target site or used brittle heuristics to identify semantic data types. Our method only requires annotation for a few pages from a few sites in the target domain. Thus, after a tiny investment of human effort, our method allows automatic extraction from potentially thousands of other sites within the same domain. Our approach extends previous methods for detecting data fields in semi-structured web pages by matching those fields to domain schema columns using robust models of data values and contexts. Annotating 2-5 pages for 4-6 web sites yields an extraction accuracy of 83.8% on job offer sites and 91.1% on vacation rental sites. These results significantly outperform a baseline approach.
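The abstract describes matching detected data fields to domain schema columns using models of data values and contexts. As a rough illustration of the value-matching idea only, the sketch below scores each detected field against annotated example values with a bag-of-tokens cosine similarity. This is not the authors' model: the feature set, the scoring function, the function names (`value_features`, `cosine`, `match_fields`), and the sample job-offer data are all assumptions made for illustration, and context features are omitted entirely.

```python
from collections import Counter
import math
import re


def value_features(values):
    """Bag of coarse features over a field's values (hypothetical feature set)."""
    feats = Counter()
    for v in values:
        # Split into alphabetic runs, digit runs, and single punctuation marks.
        for tok in re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", v):
            if tok.isdigit():
                feats["<num>"] += 1      # collapse all numbers into one feature
            elif tok.isalpha():
                feats[tok.lower()] += 1  # case-insensitive word feature
            else:
                feats[tok] += 1          # punctuation such as "$" or ","
    return feats


def cosine(a, b):
    """Cosine similarity between two sparse feature counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def match_fields(annotated, detected):
    """Map each detected field to the schema column with the most similar values."""
    models = {col: value_features(vals) for col, vals in annotated.items()}
    result = {}
    for field, vals in detected.items():
        feats = value_features(vals)
        result[field] = max(models, key=lambda col: cosine(feats, models[col]))
    return result


# Illustrative data: values annotated on a few known sites (assumed).
annotated = {
    "salary": ["$50,000 per year", "$72,500 per year"],
    "location": ["Pittsburgh, PA", "Austin, TX"],
}
# Unlabeled fields detected on a new site in the same domain (assumed).
detected = {
    "field_0": ["$61,000 per year", "$48,000 per year"],
    "field_1": ["Boston, MA", "Austin, TX"],
}
mapping = match_fields(annotated, detected)
print(mapping)
```

With this toy data, the salary-like field shares currency and number features with the annotated salary column, while the location-like field shares place-name tokens with the annotated location column, so each is assigned accordingly. A realistic system would need richer value models and the surrounding page context that the abstract mentions.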

Bibliographic details

Source: European Conference on Machine Learning and Knowledge Discovery in Databases, 2008, 16 pages
Authors: Andrew Carlson; Charles Schafer
Original format: PDF
Chinese Library Classification: TP311.13

Similar literature


1. Automatic Extraction of Objects and their Attributes from Semi-Structured Web Tables for E-commerce Tasks [J]. Yerzhan Baiburin, Aliya Nugumanova. Indian Journal of Science and Technology, 2015, No. 30.

2. Business information extraction from semi-structured webpages [J]. Nahk Hyun Sung, Yong Sik Chang. Expert Systems with Applications, 2004, No. 4.

3. Automatic information extraction from semi-structured Web pages by pattern discovery [J]. Chia-Hui Chang, Chun-Nan Hsu, Shao-Cheng Lui. Decision Support Systems, 2003, No. 1.

4. Bootstrapping Information Extraction from Semi-structured Web Pages [C]. Andrew Carlson, Charles Schafer. European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2008), 2008.

5. Entity information extraction using structured and semi-structured resources [D]. Avirup Sil. 2014.

6. TagLine: Information Extraction for Semi-Structured Text in Medical Progress Notes [O]. Dezon K. Finch, James A. McCart, Stephen L. Luther. 2014.

7. Bootstrapping Information Extraction from Semi-structured Web Pages [O]. Andrew Carlson, Charles Schafer. 2010.
