
A model-based approach for distributed data mining.


Abstract

Most data mining algorithms assume that the data have been pooled in a centralized repository before analysis. In a growing number of cases, however, the data are distributed and cannot be shared because of local constraints such as privacy concerns or bandwidth limits. In this thesis, we study how a model-based approach can be applied to data mining in a distributed environment.

First, we demonstrate how a model-based approach can be applied to web data clustering and visualization. In particular, we extend the latent class model (LCM) by also modeling the topological relationship of the latent classes, and we study how distributed learning of the LCM can be performed by merging local LCMs.

As the major contribution of this thesis, we propose a distributed model-based data mining approach called learning from abstraction. Each source first computes a local data abstraction using hierarchical clustering algorithms, and the local abstractions are then aggregated for global analysis. A Gaussian mixture model (GMM) is adopted as the representation of the local data abstractions. The GMM and generative topographic mapping (GTM) are the global models we study for two applications: distributed data clustering and distributed manifold discovery, respectively. An EM-like algorithm is derived for learning both global models based solely on the model parameters of the local abstractions. We tested the proposed approach under different scenarios that vary the size of the data sets and the distribution of the data over the data sources, using a number of synthetic and benchmark data sets. The experimental results show that accurate global models can still be learned from properly abstracted (privacy-protected) data, and that the proposed approach is much more efficient (scalable) than learning the model directly from the raw data. Its performance is also found to be robust against heterogeneous data distributions among the local data sources.

While the proposed learning-from-abstraction approach is effective for distributed model-based data mining, how to obtain the right trade-off between the abstraction levels of the local data sources and the global model accuracy remains open. The problem is challenging because the local data sets may be inter-correlated to different extents, so the best abstraction strategy for one data source depends on how the other sources set their abstraction levels. We formulate this optimal abstraction task as a game and compute the Nash equilibrium as its solution. In addition, we investigate an iterative version of the game in which the Nash equilibrium is computed by actively exploring the right level of detail from the local sources on a need-to-know basis. In other words, under the game-theoretic approach, the local sources can self-organize to determine their own optimal granularity of abstraction, protecting local data privacy as far as possible while still achieving good global model accuracy.

Future research directions include (1) studying alternative data privacy measures, (2) extending the proposed approach to a peer-to-peer computing environment, (3) performing a theoretical study of the optimality of the proposed iterative game, (4) optimizing the local data abstraction, and (5) studying how the game-theoretic distributed data mining approach can be further enhanced for an untrusted and more dynamic environment.

Keywords: model-based approach, clustering, manifold discovery, privacy-preserving data mining, distributed data mining.
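
The idea of learning a global mixture from local model parameters alone can be illustrated with a minimal sketch. This is not the thesis's algorithm, which is not reproduced in this abstract. It is a generic EM-like reduction, assuming each source shares only its sample count and the weights, means, and covariances of its local Gaussian components. The function names (`pool_local_abstractions`, `fit_global_gmm`) and the farthest-point initialization are hypothetical choices made for illustration.

```python
import numpy as np

def pool_local_abstractions(sources):
    """Concatenate local GMM parameters; sources = [(n_points, weights, means, covs), ...].
    Each local component is weighted by the number of points it summarizes."""
    w = np.concatenate([n * np.asarray(wt, float) for n, wt, _, _ in sources])
    means = np.concatenate([np.asarray(mu, float) for _, _, mu, _ in sources])
    covs = np.concatenate([np.asarray(cv, float) for _, _, _, cv in sources])
    return w / w.sum(), means, covs

def expected_log_gauss(mu, sigma, m, s):
    """Closed form of E_{x ~ N(mu, sigma)}[log N(x | m, s)]."""
    d = mu.shape[0]
    s_inv = np.linalg.inv(s)
    diff = mu - m
    _, logdet = np.linalg.slogdet(s)
    return -0.5 * (d * np.log(2 * np.pi) + logdet
                   + diff @ s_inv @ diff + np.trace(s_inv @ sigma))

def fit_global_gmm(w, means, covs, k, n_iter=30):
    """EM-like fit of a k-component global GMM to weighted local Gaussians.
    No raw data are used, only the local parameters (an illustrative assumption)."""
    n, d = means.shape
    # Deterministic farthest-point initialization (hypothetical choice).
    chosen = [int(np.argmax(w))]
    while len(chosen) < k:
        d2 = ((means[:, None, :] - means[chosen][None, :, :]) ** 2).sum(-1).min(axis=1)
        chosen.append(int(np.argmax(d2)))
    pi = np.full(k, 1.0 / k)
    m = means[chosen].copy()
    s = covs[chosen].copy() + 1e-6 * np.eye(d)
    for _ in range(n_iter):
        # E-step: soft-assign each local component to the global components.
        log_r = np.array([[np.log(pi[j]) + expected_log_gauss(means[i], covs[i], m[j], s[j])
                           for j in range(k)] for i in range(n)])
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: moment-match each global component to its assigned local ones.
        wr = w[:, None] * r
        nk = wr.sum(axis=0) + 1e-12  # small floor avoids division by zero
        pi = nk / nk.sum()
        for j in range(k):
            m[j] = wr[:, j] @ means / nk[j]
            diff = means - m[j]
            outer = diff[:, :, None] * diff[:, None, :]
            s[j] = np.einsum('i,ijk->jk', wr[:, j], covs + outer) / nk[j] + 1e-6 * np.eye(d)
    return pi, m, s

# Two hypothetical sources, each summarizing its data with two local components.
eye = np.eye(2)
source_a = (100, [0.5, 0.5], [[0.0, 0.0], [10.0, 10.0]], [eye, eye])
source_b = (300, [0.5, 0.5], [[0.2, -0.1], [9.8, 10.1]], [eye, eye])
w, means, covs = pool_local_abstractions([source_a, source_b])
pi, m, s = fit_global_gmm(w, means, covs, k=2)
```

In this sketch the covariance update adds the spread between local means to the local covariances, so the global components account for both within-abstraction and between-abstraction variance. Coarser local abstractions (fewer components per source) reveal less about the raw data but also give the global fit less to work with, which is the trade-off the abstract discusses.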

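Bibliographic record

  • Author

    Zhang, Xiaofeng

  • Author affiliation

    Hong Kong Baptist University (Hong Kong)

  • Degree-granting institution Hong Kong Baptist University (Hong Kong)
  • Subject Statistics; Computer Science
  • Degree Ph.D.
  • Year 2008
  • Pages 136
  • Original format PDF
  • Language of text English
  • Chinese Library Classification Statistics; Automation and computer technology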
