Journal: Data & Knowledge Engineering

TopicBank: Collection of coherent topics using multiple model training with their further use for topic model validation

Abstract

Probabilistic topic modeling of a text collection is a tool for unsupervised learning of the inherent thematic structure of the collection. Given only the text of documents as input, the topic model aims to reveal latent topics as probability distributions over words. Topic models have two shortcomings: they are unstable, in the sense that topics may depend on the random initialization, and incomplete, in the sense that each new run of the model on the same collection may discover some new topics. As a result, data exploration with topic modeling usually requires a large number of experiments: many topic models must be examined and their parameters tuned in search of the model that best describes the data. To deal with the instability and incompleteness of topic models, we propose to gradually accumulate interpretable topics in a "topic bank" using multiple model training. To add topics to the bank, we learn a child level in a hierarchical topic model, then analyze the coherence of the child subtopics and their relationships with the parent bank topics, so that irrelevant and duplicate subtopics are excluded rather than added to the bank. We then introduce a new approach to topic model evaluation: comparing the topics found by a model with those collected beforehand in the bank. Our experiments with several datasets and topic models show that the proposed method does help in finding a model with more interpretable topics.
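
The accumulate-then-compare idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how coherence is scored or how topics are compared, and the hierarchical child-level step is simplified here to a flat duplicate check. The function names (`update_bank`, `bank_score`), the use of Jensen-Shannon divergence as the topic distance, and all threshold values are assumptions chosen for illustration; coherence scores are assumed to be computed elsewhere.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two word distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is a mixture, so b > 0 when a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def update_bank(bank, candidates, coherences, min_coherence=0.5, min_distance=0.1):
    """Add candidate topics that are coherent and not near-duplicates of bank topics.

    Thresholds are hypothetical; the source does not give concrete values.
    """
    for topic, score in zip(candidates, coherences):
        if score < min_coherence:
            continue  # skip topics judged incoherent
        if any(js_divergence(topic, kept) < min_distance for kept in bank):
            continue  # skip topics too close to one already in the bank
        bank.append(topic)
    return bank


def bank_score(model_topics, bank, max_distance=0.1):
    """Fraction of bank topics that have a close match among the model's topics."""
    if not bank:
        return 0.0
    found = sum(
        1 for b in bank
        if any(js_divergence(b, t) < max_distance for t in model_topics)
    )
    return found / len(bank)


# Two training runs over a toy three-word vocabulary.
bank = []
update_bank(bank, [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]], coherences=[0.9, 0.8])
update_bank(
    bank,
    [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.1, 0.8, 0.1]],
    coherences=[0.9, 0.2, 0.9],
)
score = bank_score([[0.7, 0.2, 0.1]], bank)
```

In this toy run the second batch contributes only one topic: the first candidate duplicates a bank topic and the second falls below the coherence threshold. A model that recovers one of the three banked topics scores one third under this assumed metric.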
