Journal: Information Systems Frontiers

Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules

Abstract

With the growing popularity of the World Wide Web, the volume of data held in web databases has increased tremendously. This deep web content, hidden behind query interfaces, is of much higher quality than content on the surface web. To obtain deep web data, users must fill in query conditions in an HTML query interface and click the submit button. Many applications built on deep web content, such as named entity attribute collection, topic-focused crawling, and heterogeneous data integration, depend on understanding the schema of these query interfaces. The schema needs to cover the mappings between input elements and labels, the data types of valid input values, and the range constraints on those values. In addition, to extract the hidden data, the schema needs to include form-submission information such as cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, the text surrounding each element is collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse the candidate labels. The elements, candidate labels, and new lines in a query interface are then streamlined into its Interface Expression (IEXP). By combining the user's view and the designer's view, with the aid of semantic information, we build heuristic rules to extract the schema from the IEXP of query interfaces in the ICQ dataset. These rules are constructed from (1) the characteristics of labels and elements, and (2) the spatial, group, and range relationships among labels and elements. Supplemented with the form-submission information, the extracted schemas are stored in XML format so that they can be used in further applications, such as schema matching and merging for federated query interface integration. Experimental results on the TEL-8 dataset show that HSE performs effectively.
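The label-cleansing step mentioned in the abstract can be illustrated with a small sketch. The abstract does not define the string similarity function or how the dynamic threshold is computed, so everything below is an assumption made for illustration: `similarity` uses Python's standard `difflib.SequenceMatcher` as a stand-in, the threshold is taken to be the mean score of the candidates for one element, and the names `similarity`, `cleanse_candidates`, and `departure_city` are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the function proposed in the paper."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cleanse_candidates(element_name: str, candidates: list[str]) -> list[str]:
    """Keep candidate labels that score at or above a per-element threshold.

    Assumption: the "dynamic" threshold is the mean similarity of this
    element's candidates, so it adapts to each element instead of being fixed.
    """
    if not candidates:
        return []
    scores = [similarity(element_name, c) for c in candidates]
    threshold = mean(scores)
    return [c for c, s in zip(candidates, scores) if s >= threshold]


# Hypothetical input element name and the texts found around it.
kept = cleanse_candidates("departure_city", ["Departure city", "Search", "Help"])
print(kept)
```

A threshold derived from the candidates themselves avoids tuning a single cutoff for every interface, which is one plausible reading of "dynamic"; the paper may define it differently.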