Published in: ACM Transactions on Information Systems

ELSA: A Multilingual Document Summarization Algorithm Based on Frequent Itemsets and Latent Semantic Analysis

Abstract

Sentence-based summarization aims at extracting concise summaries of collections of textual documents. Summaries consist of a worthwhile subset of document sentences. The most effective multilingual strategies rely either on Latent Semantic Analysis (LSA) or on frequent itemset mining. LSA-based summarizers pick the document sentences that cover the most important concepts. Concepts are modeled as combinations of single-document terms and are derived from a term-by-sentence matrix by exploiting Singular Value Decomposition (SVD). Itemset-based summarizers pick the sentences that contain the largest number of frequent itemsets, which represent combinations of frequently co-occurring terms. The main drawbacks of existing approaches are (i) the inability of LSA to consider the correlation between combinations of multiple-document terms and the underlying concepts, (ii) the inherent redundancy of frequent itemsets, because similar itemsets may be related to the same concept, and (iii) the inability of itemset-based summarizers to correlate itemsets with the underlying document concepts. To overcome the issues of both approaches, we propose a new summarization approach that exploits frequent itemsets to describe all of the latent concepts covered by the documents under analysis, and LSA to reduce the potentially redundant set of itemsets to a compact set of uncorrelated concepts. The summarizer selects the sentences that cover the latent concepts with minimal redundancy. We tested the summarization algorithm on both multilingual and English-language benchmark document collections. The proposed approach performed significantly better than both itemset- and LSA-based summarizers, and better than most of the other state-of-the-art approaches.
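
The combination described in the abstract can be illustrated with a small sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the function names `frequent_itemsets` and `summarize` are hypothetical, itemsets are mined by brute force (up to pairs) rather than with a dedicated miner, the itemset-by-sentence matrix is binary, and sentences are picked with a simple "best sentence per concept" rule, which stands in for whatever selection strategy the paper actually uses.

```python
from itertools import combinations

import numpy as np


def frequent_itemsets(sentences, min_support=2, max_size=2):
    """Brute-force mining (illustrative only): return every term set of up to
    max_size terms that occurs in at least min_support sentences."""
    term_sets = [set(s.lower().split()) for s in sentences]
    vocab = sorted(set().union(*term_sets))
    itemsets = []
    for size in range(1, max_size + 1):
        for cand in combinations(vocab, size):
            support = sum(1 for ts in term_sets if ts.issuperset(cand))
            if support >= min_support:
                itemsets.append(frozenset(cand))
    return itemsets, term_sets


def summarize(sentences, n_sentences=2, n_concepts=2, min_support=2):
    """Hypothetical pipeline: itemset-by-sentence matrix -> SVD -> selection."""
    itemsets, term_sets = frequent_itemsets(sentences, min_support)
    if not itemsets:
        return []
    # A[i, j] = 1 if sentence j contains every term of itemset i (assumed weighting).
    A = np.array([[1.0 if ts >= iset else 0.0 for ts in term_sets]
                  for iset in itemsets])
    # Rows of Vt describe how strongly each sentence expresses each latent concept.
    _, S, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(n_concepts, len(S))
    weights = np.abs(Vt[:k]) * S[:k, None]
    chosen = []
    # Visit concepts by decreasing singular value; take the best unused sentence.
    for c in range(k):
        for j in np.argsort(-weights[c]):
            if int(j) not in chosen:
                chosen.append(int(j))
                break
        if len(chosen) == n_sentences:
            break
    return [sentences[j] for j in sorted(chosen)]


docs = [
    "storm floods city streets",
    "storm floods coastal roads",
    "storm floods river banks",
    "team wins final match",
    "team wins cup",
    "weather is mild",
]
print(summarize(docs, n_sentences=2, n_concepts=2))
```

On this toy input, the redundant itemsets {storm}, {floods}, and {floods, storm} all have identical rows in the matrix, so SVD folds them into a single concept, and the same happens for the team-related itemsets. The sketch therefore returns one storm sentence and one team sentence rather than two sentences about the same topic.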
