Journal: Expert Systems with Applications

Text feature selection with a robust weight scheme and dynamic dimension reduction to text document clustering


Abstract

This paper proposes three feature selection algorithms that combine a feature weight scheme with dynamic dimension reduction for the text document clustering problem. Text document clustering is a new trend in text mining: text documents are separated into several coherent clusters according to carefully selected informative features, using a proper evaluation function that usually depends on term frequency. Informative features in each document are selected using feature selection methods. The genetic algorithm (GA), harmony search (HS) algorithm, and particle swarm optimization (PSO) algorithm, regarded here as the most successful feature selection methods, are each built on a novel weighting scheme, namely length feature weight (LFW), which depends on term frequency and on the appearance of features in other documents. A new dynamic dimension reduction (DDR) method is also provided to reduce the number of features used in clustering and thus improve the performance of the algorithms. Finally, k-means, a popular clustering method, is used to cluster the set of text documents based on the terms (or features) obtained by dynamic reduction. The algorithms are evaluated on seven text mining benchmark datasets of different sizes and complexities. Analysis with k-means shows that PSO with LFW and dynamic reduction produces the best outcomes for almost all datasets tested. This paper provides the text mining community with new alternatives for clustering text documents using cohesive and informative features. (C) 2017 Elsevier Ltd. All rights reserved.
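
The overall pipeline shape described in the abstract (weight the terms, keep a reduced set of features, then cluster with k-means) can be illustrated with a minimal sketch. The abstract does not define LFW, the GA/HS/PSO search, or DDR, so this sketch does not implement them: plain TF-IDF stands in for the weighting scheme, and a simple "keep the m highest-weighted terms" rule stands in for both the metaheuristic selection and the dimension reduction. All function names (`tfidf_matrix`, `select_top_features`, `kmeans`) and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i].

    ASSUMPTION: TF-IDF is a stand-in; the paper's LFW formula is not given here.
    """
    tokenised = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenised for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenised for t in set(toks))
    rows = []
    for toks in tokenised:
        tf = Counter(toks)
        rows.append([tf[t] * math.log(n / df[t]) for t in vocab])
    return vocab, rows


def select_top_features(vocab, rows, m):
    """Keep the m terms with the highest total weight across all documents.

    ASSUMPTION: a greedy stand-in for the GA/HS/PSO search and for DDR.
    """
    totals = [sum(r[j] for r in rows) for j in range(len(vocab))]
    ranked = sorted(range(len(vocab)), key=lambda j: -totals[j])
    keep = sorted(ranked[:m])
    return [vocab[j] for j in keep], [[r[j] for j in keep] for r in rows]


def kmeans(rows, k, iters=20):
    """Lloyd's k-means, initialised deterministically from the first k rows."""
    centroids = [list(r) for r in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(r, centroids[c])))
            for r in rows
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus (hypothetical): two documents about pets, two about finance.
docs = [
    "cat dog cat pet",
    "stock market stock trade",
    "dog pet cat",
    "market trade stock price",
]
vocab, rows = tfidf_matrix(docs)
terms, reduced = select_top_features(vocab, rows, m=2)
labels = kmeans(reduced, k=2)
print(terms, labels)
```

Initialising from the first k rows keeps the sketch deterministic; a real run would more likely use random or k-means++ initialisation with several restarts.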
