Conference: IEEE International Conference on Data Mining

Unsupervised Learning of Tree Alignment Models for Information Extraction


Abstract

We propose an algorithm for extracting fields from HTML search results. The output of the algorithm is a database table, a data structure that lends itself better to high-level data mining and information exploitation. Our algorithm combines tree and string alignment algorithms with domain-specific feature extraction to match semantically related data across search results. Applications of our approach include hidden web crawling, semantic tagging, and federated search. We build on earlier research on the use of tree alignment for information extraction. In contrast to previous approaches that rely on hand-tuned parameters, our algorithm uses a variant of Support Vector Machines (SVMs) to learn a parameterized, site-independent tree alignment model. This model can then be used to deduce the common structural and textual elements of a set of HTML parse trees. We report preliminary results on our system's performance on data from websites with a variety of layouts.
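
To make the idea of a parameterized tree alignment concrete, here is a minimal sketch of the classic Simple Tree Matching dynamic program with optional per-tag weights. It is not the authors' model: the abstract does not give their alignment function, features, or learning procedure. The names `Node`, `simple_tree_matching`, and `weight` are hypothetical, and the weights are hand-set placeholders standing in for the parameters the paper learns with an SVM variant.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A node of a simplified HTML parse tree (hypothetical structure)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node,
                         weight: Optional[Dict[str, float]] = None) -> float:
    """Score the best order-preserving, top-down alignment of trees a and b.

    Two nodes can only be aligned if their tags are equal. A matched pair
    contributes weight[tag] (default 1.0). The weights are illustrative
    placeholders, not learned parameters.
    """
    if a.tag != b.tag:
        return 0.0
    w = 1.0 if weight is None else weight.get(a.tag, 1.0)
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score aligning the first i children of a
    # with the first j children of b.
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1],
                                        b.children[j - 1], weight)
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return w + dp[m][n]


# Two hypothetical search-result records with slightly different layouts.
record_1 = Node("li", [Node("a"), Node("span"), Node("span")])
record_2 = Node("li", [Node("a"), Node("span")])

unweighted = simple_tree_matching(record_1, record_2)
weighted = simple_tree_matching(record_1, record_2, {"a": 2.0})
```

In this sketch the two records align on the `li` root, the `a` link, and one `span`, so the unweighted score is 3.0; raising the weight of `a` to 2.0 gives 4.0. A learned model would replace the hand-set dictionary with parameters estimated from data.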
