
Particle swarm optimizer: Applications in high-dimensional data clustering.

Abstract

Clustering high-dimensional data is an important but difficult task in various data mining applications. A fundamental starting point for data mining is the assumption that a data object, such as a text document, can be represented as a high-dimensional feature vector. Traditional clustering algorithms struggle with high-dimensional data because the quality of their results deteriorates under the curse of dimensionality. As the number of features increases, the data become very sparse and distance measures computed over the whole feature space become meaningless. Usually, in a high-dimensional data set, some features may be irrelevant or redundant for the clusters, and different sets of features may be relevant for different clusters. Thus, clusters can often be found in different feature subsets rather than in the whole feature space. Clustering such data sets is called subspace clustering or projected clustering, and it aims at finding clusters in different feature subspaces. On the other hand, the performance of many subspace/projected clustering algorithms drops quickly as the size of the subspaces in which the clusters are found grows. Many of them also require domain knowledge supplied by the user to select and tune their settings, such as the maximum distance between dimensional values, thresholds on input parameters and the minimum density, all of which are difficult to set.

Developing effective particle swarm optimization (PSO) algorithms for clustering high-dimensional data is the main focus of this thesis. First, in order to improve the performance of the conventional PSO algorithm, we analyze the main causes of premature convergence and propose a novel PSO algorithm, called InformPSO, based on principles of adaptive diffusion and hybrid mutation. Inspired by the physics of information diffusion, we design a function that achieves better particle diversity by taking into account the particles' distribution and the number of evolutionary generations, and by adjusting their "social cognitive" abilities. Based on genetic self-organization and chaotic evolution, we build clonal selection into InformPSO to implement local evolution of the best candidate particle, gBest, and use a logistic sequence to control the random drift of gBest. These techniques greatly help the algorithm break away from local optima. The global convergence of the algorithm is proved using Markov chain theory. Experiments on unimodal and multimodal benchmark functions show that, compared with other PSO variants, InformPSO converges faster, reaches better optima, is more robust, and prevents premature convergence more effectively.
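The abstract does not give InformPSO's update equations. For reference, the sketch below shows the canonical PSO velocity and position update in Python, with a hypothetical logistic-map perturbation of gBest added only to illustrate the kind of chaos-controlled drift described above; the function name, parameter values, and perturbation scale are illustrative and are not taken from the thesis.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=30, iters=200,
                 w=0.72, c1=1.49, c2=1.49, bounds=(-5.0, 5.0), seed=0):
    """Canonical PSO sketch with a chaotic perturbation of gBest.

    InformPSO's adaptive diffusion, hybrid mutation and clonal selection
    are NOT reproduced here; this only shows the baseline update plus a
    logistic-map drift of the global best, kept when it improves f.
    """
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))   # particle positions
    v = np.zeros((n_particles, dim))              # particle velocities
    pbest = x.copy()
    pbest_val = np.array([f(p) for p in x])
    g = pbest[pbest_val.argmin()].copy()          # global best, gBest
    g_val = pbest_val.min()
    z = 0.7                                       # logistic-map state in (0, 1)

    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([f(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        if pbest_val.min() < g_val:
            g, g_val = pbest[pbest_val.argmin()].copy(), pbest_val.min()

        # Chaotic drift of gBest: z <- 4z(1-z); shift every coordinate by a
        # small chaos-driven offset and keep the move only if it improves f.
        z = 4.0 * z * (1.0 - z)
        trial = np.clip(g + (z - 0.5) * 0.1 * (hi - lo), lo, hi)
        trial_val = f(trial)
        if trial_val < g_val:
            g, g_val = trial, trial_val
    return g, g_val
```

For example, pso_minimize(lambda p: float(np.sum(p**2)), dim=10) should drive the 10-dimensional sphere function close to zero.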
Then, special treatments of the objective functions and encoding schemes are proposed to tailor PSO to two problems commonly encountered in high-dimensional data clustering. The first problem is the variable weighting problem in soft projected clustering with a known number of clusters k. With the number of clusters k preset, the problem aims at finding a set of variable weights for each cluster and is formulated as a nonlinear continuous optimization problem subject to bound constraints. A new algorithm, called PSOVW, is proposed to obtain optimal variable weights for the clusters. In PSOVW, we design a suitable k-means weighting objective function, in which a change in the variable weights is reflected exponentially. We also transform the original constrained variable weighting problem into a problem with bound constraints, using a non-normalized representation of the variable weights, and we employ a particle swarm optimizer to minimize the objective function in order to obtain a global optimum of the variable weighting problem in clustering. Our experimental results on both synthetic and real data show that the proposed algorithm greatly improves cluster quality. In addition, the results of the new algorithm are much less dependent on the initial cluster centroids.

The latter problem aims at automatically determining the number of clusters k as well as identifying the clusters; it too is formulated as a nonlinear optimization problem with bound constraints. For the automatic determination of k, which is troublesome for most clustering algorithms, a PSO algorithm called autoPSO is proposed. A special encoding of particles is introduced into autoPSO to represent partitions with different numbers of clusters in the same population. The DB index is employed as the objective function to measure the quality of partitions with similar or different numbers of clusters. autoPSO is evaluated on both synthetic high-dimensional datasets and handcrafted low-dimensional datasets, and its performance is compared with that of other selected clustering techniques. Experimental results indicate the promising potential of autoPSO for clustering high-dimensional data without a preset number of clusters k.
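The abstract characterizes PSOVW's objective only as a k-means-type weighting function in which a change of variable weights is reflected exponentially. Objectives of this kind in soft projected clustering are often written with each weight raised to a power beta; the sketch below assumes that general form, so the exponent, array shapes and function name are assumptions for illustration rather than the exact objective used in the thesis. A particle would then encode the non-normalized k x d weight matrix W, kept inside box bounds, and be scored by a function like this.

```python
import numpy as np

def weighted_dispersion(X, labels, centroids, W, beta=8.0):
    """Sum over clusters l of W[l, j]**beta * (X[i, j] - centroids[l, j])**2.

    X: (n, d) data, labels: (n,) integer cluster assignments in 0..k-1,
    centroids: (k, d) cluster centres, W: (k, d) per-cluster variable weights.
    The exponent beta is an assumed way of making the objective respond
    exponentially to changes in the weights.
    """
    cost = 0.0
    for l in range(centroids.shape[0]):
        members = X[labels == l]
        if members.size == 0:
            continue
        sq_err = (members - centroids[l]) ** 2      # (n_l, d) squared errors
        cost += np.sum((W[l] ** beta) * sq_err)     # weight each dimension
    return cost
```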
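autoPSO evaluates candidate partitions of varying k with the DB index. Assuming this refers to the standard Davies-Bouldin index, a direct implementation looks roughly like the following; decoding a variable-length particle into the labels array is omitted, and at least two non-empty clusters are required.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index (lower is better):
    DB = (1/k) * sum_i max_{j != i} (S_i + S_j) / ||c_i - c_j||,
    where c_i is cluster i's centroid and S_i is the mean distance of its
    points to c_i. Assumes labels contains at least two distinct clusters.
    """
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    scatter = np.array([
        np.linalg.norm(X[labels == k] - cents[i], axis=1).mean()
        for i, k in enumerate(ks)
    ])
    db = 0.0
    for i in range(len(ks)):
        db += max(
            (scatter[i] + scatter[j]) / np.linalg.norm(cents[i] - cents[j])
            for j in range(len(ks)) if j != i
        )
    return db / len(ks)
```

In an autoPSO-style search, each particle's decoded partition would be scored with this index and the swarm would minimize it across partitions with different values of k.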

Bibliographic record

  • Author

    Lu, Yanping.

  • Author affiliation

    Universite de Sherbrooke (Canada).

  • Degree grantor: Universite de Sherbrooke (Canada).
  • Subject: Computer Science.
  • Degree: Ph.D.
  • Year: 2009
  • Pages: 96 p.
  • Total pages: 96
  • Format: PDF
  • Language: eng
