Conference: International Conference on Intelligent Information Processing and Web Mining IIS
Filtering Web Documents for a Thematic Warehouse Case Study: eDot a Food Risk Data Warehouse (extended)

Abstract

Ordinary sources, such as databases and general-purpose document collections, seem insufficient to meet the needs and requirements of a new generation of warehouses: thematic data warehouses. Because more and more thematic data is available online, the web can be considered a useful data source for populating thematic data warehouses. To do so, the warehouse data supplier must be able to filter heterogeneous web content and keep only the documents that correspond to the warehouse topic. Building efficient automatic tools that characterize web documents dealing with a given theme is therefore essential to addressing the warehouse data acquisition problem. In this paper, we present our filtering approach, implemented in an automatic tool called "eDot-Filter". This tool filters crawled documents to keep only those dealing with food risk. These documents are then stored in a thematic warehouse called "eDot". Our filtering approach is based on "WeQueL", a declarative web query language that improves the expressive power of keyword-based queries.
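
For orientation only, the plain keyword-based filtering that the abstract takes as its baseline can be sketched as follows. This is not WeQueL and not the eDot-Filter implementation, neither of which is specified here. The term list, the `min_hits` threshold, the function names, and the example URLs are all hypothetical and chosen purely for illustration.

```python
import re

# Hypothetical topic vocabulary; the source does not list the actual terms.
FOOD_RISK_TERMS = {"salmonella", "listeria", "contamination", "food", "pathogen"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def matches_topic(text, terms=FOOD_RISK_TERMS, min_hits=2):
    """Return True if at least `min_hits` distinct topic terms occur in the text."""
    hits = terms.intersection(tokenize(text))
    return len(hits) >= min_hits


def filter_documents(documents, terms=FOOD_RISK_TERMS, min_hits=2):
    """Keep only the (url, text) pairs whose text matches the topic."""
    return [(url, text) for url, text in documents
            if matches_topic(text, terms, min_hits)]


# Illustrative crawled pages (made-up URLs and text).
crawled = [
    ("http://example.org/a", "Listeria contamination found in soft cheese."),
    ("http://example.org/b", "Quarterly earnings rose for the retailer."),
]
kept = filter_documents(crawled)
```

A filter of this kind only counts term occurrences. The abstract's claim is that a declarative query language can express richer conditions than such keyword matching.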
