Journal: International Journal of Computer Systems Science & Engineering

Term expansion on the categorization of summarized documents

Abstract

Several studies have examined the use of document summaries as feature-vector inputs in text categorization tasks. This kind of approach often performs poorly when the coverage rate of the summarization is low and the resulting feature vectors are oversimplified. In this study we therefore propose expanding the terms in the summaries with a supervised distributional clustering method to improve categorization performance. In the training stage, document summaries are used both to train the classifiers (KNN and Naive Bayes) and to build term clusters, with KL divergence as the dissimilarity measure. In the test stage, a new document is classified by feeding the expanded feature vector of its summary into the trained classifiers. That is, each term in the feature vector is expanded with related terms from the same cluster in order to alleviate the term mismatch problem. Three experiments were conducted. The results show that the proposed approach can effectively resolve the term mismatch problem and improve categorization accuracy. In short, the approach makes the idea of using automatic summarization in place of feature selection in text categorization tasks more practical and feasible.
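The expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how terms are represented or clustered, so the sketch assumes that each term is represented by a smoothed class distribution estimated from labelled summaries, that dissimilarity is the symmetric KL divergence, and that clusters are formed by a simple greedy pass with a hypothetical threshold. All function names and the toy data are illustrative.

```python
import math
from collections import Counter


def class_distribution(term, docs, labels, classes, alpha=1.0):
    """Smoothed P(class | term), estimated from labelled summaries (assumed representation)."""
    counts = Counter(lab for toks, lab in zip(docs, labels) if term in toks)
    total = sum(counts.values()) + alpha * len(classes)
    return [(counts[c] + alpha) / total for c in classes]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrised KL divergence, used here as the dissimilarity measure."""
    return kl_divergence(p, q) + kl_divergence(q, p)


def cluster_terms(dists, threshold):
    """Greedy one-pass clustering (a simplification chosen for this sketch):
    a term joins the first cluster whose first member is within `threshold`."""
    clusters = []
    for term, dist in dists.items():
        for cluster in clusters:
            if symmetric_kl(dist, dists[cluster[0]]) <= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])
    return clusters


def expand(tokens, clusters):
    """Add every term that shares a cluster with a term in the summary."""
    term_to_cluster = {t: c for c in clusters for t in c}
    expanded = set(tokens)
    for t in tokens:
        expanded.update(term_to_cluster.get(t, []))
    return expanded


# Toy training summaries (hypothetical data).
docs = [["stock", "market"], ["shares", "market"],
        ["goal", "match"], ["striker", "match"]]
labels = ["finance", "finance", "sport", "sport"]
classes = ["finance", "sport"]

vocab = sorted({t for toks in docs for t in toks})
dists = {t: class_distribution(t, docs, labels, classes) for t in vocab}
clusters = cluster_terms(dists, threshold=0.1)

# A test summary that contains only "shares" now also matches "stock" and "market".
expanded = expand(["shares"], clusters)
print(sorted(expanded))
```

The expanded term set would then be turned into a feature vector and passed to the trained KNN or Naive Bayes classifier. A test summary containing only "shares" can in this way match training summaries that mention "stock", which is the term mismatch the abstract refers to.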
