Journal: Neurocomputing
Bag-of-concepts: Comprehending document representation through clustering words in distributed representation

Abstract

Two document representation methods are mainly used in solving text mining problems. Known for its intuitive and simple interpretability, the bag-of-words method represents a document as a vector of its word frequencies. However, this method suffers from the curse of dimensionality and fails to preserve accurate proximity information as the number of unique words increases. Furthermore, it assumes every word to be independent, disregarding the impact of semantically similar words on preserving document proximity. On the other hand, doc2vec, a basic neural network model, creates low-dimensional vectors that successfully preserve proximity information. However, it loses interpretability, because the meaning behind each feature is indescribable. This paper proposes the bag-of-concepts method as an alternative document representation method that overcomes the weaknesses of these two methods. The proposed method creates concepts by clustering word vectors generated from word2vec, and uses the frequencies of these concept clusters to represent document vectors. Through these data-driven concepts, the proposed method effectively incorporates the impact of semantically similar words on preserving document proximity. With an appropriate weighting scheme, such as concept frequency-inverse document frequency, the proposed method provides better document representation than previously suggested methods, and also offers intuitive interpretability behind the generated document vectors. Text mining models subsequently built on the proposed method, such as decision trees, can also provide interpretable and intuitive reasons why certain collections of documents differ from others. (C) 2017 Elsevier B.V. All rights reserved.
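
The pipeline described in the abstract (cluster word vectors into concepts, count concept occurrences per document, then apply a CF-IDF weighting) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the 2-D toy vectors stand in for real word2vec output, plain Lloyd's k-means with a deterministic start stands in for whatever clustering the paper actually uses, and the CF-IDF formula is written by analogy with TF-IDF rather than taken from the paper. The function names are hypothetical.

```python
import math
from collections import Counter

# Toy 2-D "word vectors" (assumed); real ones would come from a trained word2vec model.
WORD_VECTORS = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2), "kitten": (0.95, 0.05),
    "stock": (0.1, 0.9), "market": (0.2, 0.8), "bond": (0.05, 0.95),
}

def kmeans(vectors, k, iters=20):
    """Lloyd's k-means; centroids start at the first k words in sorted order."""
    words = sorted(vectors)
    centroids = [vectors[w] for w in words[:k]]
    assign = {}
    for _ in range(iters):
        # Assignment step: each word goes to its nearest centroid.
        assign = {
            w: min(range(k), key=lambda c: math.dist(vectors[w], centroids[c]))
            for w in words
        }
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [vectors[w] for w in words if assign[w] == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign

def concept_counts(doc, word_to_concept, k):
    """Count how often each concept (cluster id) occurs in a tokenized document."""
    counts = Counter(word_to_concept[w] for w in doc if w in word_to_concept)
    return [counts[c] for c in range(k)]

def cf_idf(count_rows):
    """Assumed weighting: (concept share in doc) * log(N / docs containing concept)."""
    n_docs = len(count_rows)
    k = len(count_rows[0])
    df = [sum(1 for row in count_rows if row[c] > 0) for c in range(k)]
    weighted = []
    for row in count_rows:
        total = sum(row) or 1  # avoid division by zero for empty documents
        weighted.append([
            (row[c] / total) * math.log(n_docs / df[c]) if df[c] else 0.0
            for c in range(k)
        ])
    return weighted

K = 2
DOCS = [
    ["cat", "dog", "kitten", "cat"],
    ["stock", "market", "bond"],
    ["dog", "market", "stock", "stock"],
]

word_to_concept = kmeans(WORD_VECTORS, K)
counts = [concept_counts(doc, word_to_concept, K) for doc in DOCS]
weights = cf_idf(counts)

if __name__ == "__main__":
    print("concept of each word:", word_to_concept)
    print("concept counts per document:", counts)
    print("CF-IDF vectors:", weights)
```

Each dimension of the resulting document vector corresponds to a cluster whose member words can be listed, which is the kind of interpretability the abstract contrasts with doc2vec features.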
