Journal of computer sciences

Improving the Performance of Multivariate Bernoulli Model based Documents Clustering Algorithms using Transformation Techniques | Science Publications


Abstract

Problem statement: Document clustering is one of the most important areas of data mining and is currently the subject of significant global research, since it supports web intelligence, web mining, web search engine design and related applications. Generative models based on the multivariate Bernoulli and multinomial distributions have been widely used for text classification. Approach: This study explores the suitability of a multivariate Bernoulli model based probabilistic algorithm for text clustering. In a multivariate Bernoulli model, a document is represented as a binary vector over the space of words, where 1 indicates that a word occurs in the document and 0 that it does not. The number of occurrences is not considered, so word frequency information is lost. In this work, we propose an FFT based transformation technique for improving the clustering performance of the multivariate Bernoulli model based probabilistic algorithm. The transformation technique converts the actual term frequency count data into a time domain signal, so that the weight of each word's frequency is distributed throughout each row of records. If the multivariate Bernoulli model is then applied by distinguishing values less than zero from values greater than zero, performance improves, since there is no information loss in this kind of data representation. Results: Bernoulli model-based clustering and an improved version of the same are implemented and evaluated using suitable metrics, and the results are shown. Conclusion: The transformation technique significantly improves the document clustering performance of the multivariate Bernoulli model.
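The representation issue described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact transformation, so the reading used here (apply an FFT to each document's term-count row, then binarize on the sign of the real part) is an assumption, and the function names `binarize_presence`, `binarize_fft_sign` and `bernoulli_log_likelihood` are hypothetical names chosen for illustration.

```python
import numpy as np

def binarize_presence(counts):
    """Standard multivariate Bernoulli input: 1 if the word occurs, else 0."""
    return (np.asarray(counts) > 0).astype(int)

def binarize_fft_sign(counts):
    """Assumed reading of the FFT-based variant (not confirmed by the abstract):
    transform each document row, then encode values > 0 as 1 and the rest as 0."""
    spectrum = np.fft.fft(np.asarray(counts, dtype=float), axis=1)
    return (spectrum.real > 0).astype(int)

def bernoulli_log_likelihood(x, theta, eps=1e-9):
    """Log-likelihood of each binary row of x under per-word probabilities theta."""
    theta = np.clip(theta, eps, 1 - eps)  # avoid log(0)
    return x @ np.log(theta) + (1 - x) @ np.log(1 - theta)

# Two toy documents over a 4-word vocabulary: same words, different frequencies.
counts = np.array([
    [3, 0, 1, 0],
    [1, 0, 3, 0],
])

presence = binarize_presence(counts)      # rows are identical: frequencies are lost
transformed = binarize_fft_sign(counts)   # rows differ in this toy example

# Score the presence vectors under a uniform Bernoulli component (theta = 0.5).
log_lik = bernoulli_log_likelihood(presence, np.full(4, 0.5))
```

In this toy case both documents map to the same presence vector, while the sign pattern after the transform differs between them, which is the kind of retained frequency information the abstract appeals to. Whether this matches the authors' actual procedure cannot be determined from the abstract alone.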