Schema Extraction for Deep Web Query Interfaces Using Heuristics Rules

Jou Chichang

Abstract

As the World Wide Web has grown, the volume of data held in web databases has increased tremendously. This deep web content, hidden behind query interfaces, is of much higher quality than content on the surface web. To obtain deep web data, users must fill in query conditions in an HTML query interface and click the submit button. Many applications that rely on deep web content, such as named entity attribute collection, topic-focused crawling, and heterogeneous data integration, depend on understanding the schema of these query interfaces. The schema needs to cover the mappings between input elements and labels, the data types of valid input values, and the range constraints on those values. In addition, to extract the hidden data, the schema needs to include form submission information, such as cookies and action types. We design and implement a Heuristics-based deep web query interface Schema Extraction system (HSE). In HSE, the text surrounding each element is collected as candidate labels. We propose a string similarity function and use a dynamic similarity threshold to cleanse the candidate labels. The elements, candidate labels, and new lines in the query interface are then streamlined to produce its Interface Expression (IEXP). By combining the user's view and the designer's view, with the aid of semantic information, we build heuristic rules to extract the schema from the IEXP of query interfaces in the ICQ dataset. These rules use (1) the characteristics of labels and elements and (2) the spatial, group, and range relationships among labels and elements. Supplemented with form submission information, the extracted schemas are stored in XML format so that they can be used in further applications, such as schema matching and merging for federated query interface integration. Experimental results on the TEL-8 dataset indicate that HSE extracts schemas effectively.
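The abstract does not give the string similarity function, the threshold rule, or the XML layout, so the following is only a rough sketch of the two ideas it names: cleansing candidate labels with a similarity score and a dynamic threshold, and storing an extracted schema as XML. Everything in it is an assumption made for illustration: `difflib.SequenceMatcher` stands in for the paper's similarity function, candidates are compared against the element's `name` attribute, the threshold rule in `dynamic_threshold` is invented, and the `interface`/`field`/`label` tag names are hypothetical.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the function proposed in the paper."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dynamic_threshold(name: str, base: float = 0.5, floor: float = 0.3) -> float:
    """Hypothetical rule: relax the threshold for longer element names,
    on the assumption that long names are more often abbreviated."""
    return max(floor, base - 0.02 * max(0, len(name) - 6))


def cleanse(name: str, candidates: list[str]) -> list[str]:
    """Keep candidate labels similar enough to the element name, best first."""
    threshold = dynamic_threshold(name)
    scored = [(similarity(name, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in kept]


def schema_to_xml(action: str, method: str, fields: list[dict]) -> str:
    """Serialize a schema to XML; the tag and attribute names are hypothetical."""
    root = ET.Element("interface", action=action, method=method)
    for f in fields:
        field = ET.SubElement(root, "field", name=f["name"], type=f["type"])
        ET.SubElement(field, "label").text = f["label"]
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    labels = cleanse("author", ["Search", "Author", "Click here to submit"])
    fields = [{"name": "author", "type": "text", "label": labels[0]}]
    print(schema_to_xml("/search", "get", fields))
```

In this sketch, only "Author" survives cleansing for an element named `author`; the surviving label is then attached to the field in the XML output. A real system would also need the range constraints and cookie information the abstract mentions, which are omitted here.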


Bibliographic details

Source: Information Systems Frontiers, 2019, Issue 1, pp. 163-174 (12 pages)
Author: Jou Chichang
Affiliation: Tamkang Univ, Dept Informat Management, 151 Ying Zhuan Rd, Tamsui 25137, Taiwan
Original format: PDF
Language: English
Keywords: Deep web; Query interface; Schema extraction; XML; Heuristic rules; String similarity


