Conference: IEEE International Congress on Big Data

Computing fuzzy rough approximations in large scale information systems

Abstract

Rough set theory is a popular and powerful machine learning tool. It is especially suitable for dealing with information systems that exhibit inconsistencies, i.e., objects that have the same values for the conditional attributes but a different value for the decision attribute. In line with the emerging granular computing paradigm, rough set theory groups objects together based on the indiscernibility of their attribute values. Fuzzy rough set theory extends rough set theory to data with continuous attributes and detects degrees of inconsistency in the data. Key to this is turning the indiscernibility relation into a gradual relation, acknowledging that objects can be similar to a certain extent. In very large datasets with millions of objects, computing the gradual indiscernibility relation (in other words, the soft granules) is very demanding in terms of both runtime and memory. It is, however, required for computing the lower and upper approximations of concepts in the fuzzy rough set analysis pipeline. Current non-distributed implementations in R are limited by memory capacity. For example, we found that a state-of-the-art non-distributed implementation in R could not handle 30,000 rows and 10 attributes on a node with 62 GB of memory. This is clearly insufficient to scale fuzzy rough set analysis to massive datasets. In this paper, we present a parallel and distributed solution based on the Message Passing Interface (MPI) to compute fuzzy rough approximations in very large information systems. Our results show that our parallel approach scales with problem size to information systems with millions of objects. To the best of our knowledge, no other parallel and distributed solution to this problem has been proposed in the literature so far.
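
To make the quantities in the abstract concrete, the following is a minimal single-machine sketch of fuzzy rough lower and upper approximations. It is not the paper's MPI implementation. Several choices are assumptions made only for illustration: the per-attribute similarity `1 - |a(x) - a(y)| / range(a)`, the minimum used to combine attributes, the Łukasiewicz implicator and t-norm, and all function names (`dense_relation_bytes`, `similarity_block`, `fuzzy_rough_approximations`). The sketch computes the gradual indiscernibility relation one block of rows at a time, so the full n × n matrix is never held in memory.

```python
import numpy as np


def dense_relation_bytes(n_objects, bytes_per_entry=8):
    """Bytes needed to store one dense n x n relation of 8-byte floats."""
    return bytes_per_entry * n_objects * n_objects


def similarity_block(X_block, X, ranges):
    """Gradual indiscernibility between the rows of X_block and all rows of X.

    Assumed relation: per-attribute similarity 1 - |a(x) - a(y)| / range(a),
    combined over attributes with the minimum.
    """
    R = np.ones((X_block.shape[0], X.shape[0]))
    for j in range(X.shape[1]):
        # Normalised distance on attribute j, shape (block rows, all rows).
        d = np.abs(X_block[:, j][:, None] - X[:, j][None, :]) / ranges[j]
        np.minimum(R, 1.0 - d, out=R)
    return R


def fuzzy_rough_approximations(X, A, block_size=1024):
    """Lower and upper approximation memberships of the fuzzy concept A.

    X: (n, m) array of continuous conditional attributes.
    A: (n,) array of memberships in [0, 1] to the concept (e.g. one class).
    """
    X = np.asarray(X, dtype=float)
    A = np.asarray(A, dtype=float)
    n = X.shape[0]
    ranges = X.max(axis=0) - X.min(axis=0)
    ranges[ranges == 0] = 1.0  # constant attributes: avoid division by zero
    lower = np.empty(n)
    upper = np.empty(n)
    for start in range(0, n, block_size):
        stop = min(start + block_size, n)
        R = similarity_block(X[start:stop], X, ranges)
        # Lower: inf over y of the Lukasiewicz implicator min(1, 1 - R + A).
        lower[start:stop] = np.min(np.minimum(1.0, 1.0 - R + A[None, :]), axis=1)
        # Upper: sup over y of the Lukasiewicz t-norm max(0, R + A - 1).
        upper[start:stop] = np.max(np.maximum(0.0, R + A[None, :] - 1.0), axis=1)
    return lower, upper


if __name__ == "__main__":
    # Objects 0 and 1 share attribute values but differ in the decision,
    # which is the kind of inconsistency the abstract describes.
    X = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
    A = [1.0, 0.0, 0.0]
    lower, upper = fuzzy_rough_approximations(X, A, block_size=2)
    print("lower:", lower)
    print("upper:", upper)
    print("bytes for one dense 30,000 x 30,000 relation:",
          dense_relation_bytes(30_000))
```

Each row block depends only on its own rows and the full attribute table, so the blocks are independent of one another and could in principle be assigned to separate processes; whether the paper partitions the work this way is not stated in the abstract. As a back-of-the-envelope figure, one dense double-precision relation over 30,000 objects takes 8 × 30,000² bytes, about 7.2 GB. If an implementation materialised one such matrix per attribute (an assumption, not a claim about the R package), 10 attributes would need about 72 GB.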