TopicBank: Collection of coherent topics using multiple model training with their further use for topic model validation

Alekseev Vasiliy; Egorov Evgeny; Vorontsov Konstantin; Goncharov Alexey; Nurumov Kaidar; Buldybayev Timur


Abstract

Probabilistic topic modeling of a text collection is a tool for unsupervised learning of the inherent thematic structure of the collection. Given only the text of documents as input, the topic model aims to reveal latent topics as probability distributions over words. The shortcomings of topic models are that they are unstable, in the sense that topics may depend on the random initialization, and incomplete, in the sense that each new run of the model on the same collection may discover some new topics. This means that data exploration using topic modeling usually requires many experiments: looking over many topic models and tuning their parameters in search of the model that best describes the data. To deal with the instability and incompleteness of topic models, we propose to gradually accumulate interpretable topics in a "topic bank" using multiple model training. To add topics to the bank, we learn a child level in a hierarchical topic model, then analyze the coherence of the child subtopics and their relationships with the parent bank topics, so that irrelevant and duplicate subtopics are excluded rather than added to the bank. We then introduce a new approach to topic model evaluation: comparing the topics found by the model with the ones collected beforehand in a bank. Our experiments with several datasets and topic models show that the proposed method does help in finding a model with more interpretable topics.
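
The accumulate-and-compare idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`hellinger`, `update_bank`, `bank_score`) and all thresholds are hypothetical, coherence values are assumed to be computed elsewhere, and the hierarchical parent-child analysis described above is replaced here by a plain distance threshold between word distributions.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def update_bank(bank, new_topics, coherences, min_coherence=0.5, min_distance=0.3):
    """Add topics that are coherent enough and not near-duplicates of bank topics.

    Assumption: a simple distance threshold stands in for the paper's
    analysis of relationships between child subtopics and parent bank topics.
    """
    for topic, coherence in zip(new_topics, coherences):
        if coherence < min_coherence:
            continue  # treated as not interpretable enough
        if any(hellinger(topic, kept) < min_distance for kept in bank):
            continue  # treated as a duplicate of a topic already in the bank
        bank.append(topic)
    return bank

def bank_score(bank, model_topics, max_distance=0.3):
    """Fraction of bank topics that have a close counterpart in the model."""
    if not bank:
        return 0.0
    found = sum(
        any(hellinger(b, t) < max_distance for t in model_topics) for b in bank
    )
    return found / len(bank)

# Toy example: topics are distributions over a 4-word vocabulary,
# produced by two hypothetical training runs.
run1 = [[0.7, 0.1, 0.1, 0.1], [0.25, 0.25, 0.25, 0.25]]
run2 = [[0.68, 0.12, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]

bank = []
update_bank(bank, run1, coherences=[0.9, 0.2])   # second topic: low coherence
update_bank(bank, run2, coherences=[0.8, 0.85])  # first topic: near-duplicate

# Evaluate a hypothetical model that recovers only one of the bank topics.
score = bank_score(bank, [[0.7, 0.1, 0.1, 0.1]])
print(len(bank), score)
```

In this toy run the bank ends up with two topics, and the evaluated model recovers one of them. The choice of distance and thresholds is an illustration only; any coherence measure and any topic similarity could be substituted.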


Bibliographic details

Source
Data & Knowledge Engineering | 2021, Issue 9 | 101921.1-101921.13 | 13 pages
Authors
Alekseev Vasiliy; Egorov Evgeny; Vorontsov Konstantin; Goncharov Alexey; Nurumov Kaidar; Buldybayev Timur;
Author affiliations

Natl Res Univ Moscow Inst Phys & Technol, 9 Institutskiy, Dolgoprudnyi, Russia;

Natl Res Univ Moscow Inst Phys & Technol, 9 Institutskiy, Dolgoprudnyi, Russia;

Natl Res Univ Moscow Inst Phys & Technol, 9 Institutskiy, Dolgoprudnyi, Russia;

Natl Res Univ Moscow Inst Phys & Technol, 9 Institutskiy, Dolgoprudnyi, Russia;

Informat Analyt Ctr JSC, Nur-Sultan, Kazakhstan;

Informat Analyt Ctr JSC, Nur-Sultan, Kazakhstan;

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
Topic modeling; Multiple model training; Topic coherence; Stability; Regularization;


