Journal: International Journal of Data Mining & Knowledge Management Process
Arabic Text Summarization Based on Latent Semantic Analysis to Enhance Arabic Documents Clustering

Abstract

Arabic document clustering is an important task for obtaining good results with traditional Information Retrieval (IR) systems, especially given the rapid growth in the number of online documents in Arabic. Document clustering aims to automatically group similar documents into one cluster using different similarity/distance measures. This task is often affected by document length: useful information in a document is often accompanied by a large amount of noise, so it is necessary to eliminate this noise while keeping the useful information in order to boost clustering performance. In this paper, we propose to evaluate the impact of text summarization using the Latent Semantic Analysis model on Arabic document clustering in order to address the problems cited above, using five similarity/distance measures: Euclidean Distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation Coefficient and Averaged Kullback-Leibler Divergence. Each measure is evaluated twice: without and with stemming. Our experimental results indicate that the proposed approach effectively addresses the problems of noisy information and document length, and thus significantly improves clustering performance.
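The five measures named in the abstract can be illustrated on plain term-weight vectors. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the Jaccard coefficient is taken in its extended (vector) form, and the averaged Kullback-Leibler divergence uses one common weighted-average formulation, since the abstract does not give exact definitions. Vectors are assumed to be non-negative, non-zero, and non-constant.

```python
import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance between the two term vectors.
    return float(np.linalg.norm(a - b))

def cosine_similarity(a, b):
    # Cosine of the angle between the vectors (1.0 = same direction).
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def jaccard_coefficient(a, b):
    # Extended Jaccard for real-valued vectors (assumed form).
    dot = a @ b
    return float(dot / (a @ a + b @ b - dot))

def pearson_correlation(a, b):
    # Cosine similarity of the mean-centred vectors.
    ac, bc = a - a.mean(), b - b.mean()
    return float(ac @ bc / (np.linalg.norm(ac) * np.linalg.norm(bc)))

def averaged_kl_divergence(a, b):
    # Assumed formulation: each vector is normalised to a distribution,
    # and every term contributes a weighted KL term against the
    # per-term weighted average m.
    p, q = a / a.sum(), b / b.sum()
    total = 0.0
    for pt, qt in zip(p, q):
        if pt + qt == 0:
            continue  # term absent from both documents
        pi1, pi2 = pt / (pt + qt), qt / (pt + qt)
        m = pi1 * pt + pi2 * qt
        if pt > 0:
            total += pi1 * pt * np.log(pt / m)
        if qt > 0:
            total += pi2 * qt * np.log(qt / m)
    return float(total)

# Hypothetical term-weight vectors for two short documents.
doc_a = np.array([1.0, 2.0, 0.0, 3.0])
doc_b = np.array([2.0, 1.0, 1.0, 0.0])
for measure in (euclidean_distance, cosine_similarity, jaccard_coefficient,
                pearson_correlation, averaged_kl_divergence):
    print(measure.__name__, measure(doc_a, doc_b))
```

Euclidean distance and averaged KL divergence are dissimilarities (smaller means closer), while the other three are similarities, so a clustering routine would need to convert one kind into the other before comparing them on equal terms.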