Venue: IEEE International Conference on Machine Learning and Applications

Unsupervised Topic Model Based Text Network Construction for Learning Word Embeddings



Abstract

Distributed word embeddings have proven remarkably effective at capturing word-level semantic and syntactic regularities in language for many natural language processing tasks. One recently proposed semi-supervised representation learning method, Predictive Text Embedding (PTE), utilizes both semantically labeled and unlabeled data in information networks to learn text embeddings that achieve state-of-the-art performance compared to other embedding methods. However, PTE uses supervised label information to construct one of its networks, and many other possible ways of constructing such information networks remain untested. We present two unsupervised methods for constructing a large-scale semantic information network from documents using topic models, which have emerged as a powerful technique for finding useful structure in unstructured text collections by learning distributions over words. The first method uses Latent Dirichlet Allocation (LDA) to build a topic model over the text and constructs a word-topic network with edge weights proportional to the word-topic probability distributions. The second method trains an unsupervised neural network to learn the word-document distribution, with a single hidden layer representing a topic distribution. The two weight matrices of the neural network are directly reinterpreted as the edge weights of heterogeneous text networks, which can then be used to train word embeddings that form an effective low-dimensional representation preserving the semantic closeness of words and documents for NLP tasks. We conduct extensive experiments to evaluate the effectiveness of our methods.
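The first method described in the abstract can be illustrated with a minimal sketch. Here a small hand-written `phi` matrix stands in for the word-topic distribution produced by a fitted LDA model; the vocabulary, threshold, and function name are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the LDA-based construction: each sufficiently probable
# (word, topic) pair becomes a weighted edge in a bipartite network.
# Assumption: `phi` is a toy stand-in for real LDA output.

vocab = ["network", "embedding", "topic", "word"]

# phi[t][w] = P(word w | topic t); each row of a real LDA model sums to 1.
phi = [
    [0.50, 0.30, 0.15, 0.05],  # topic 0
    [0.05, 0.10, 0.45, 0.40],  # topic 1
]

def build_word_topic_network(phi, vocab, threshold=0.0):
    """Return weighted bipartite edges (word, topic_id, weight),
    keeping only edges whose probability exceeds the threshold."""
    edges = []
    for t, dist in enumerate(phi):
        for w, p in zip(vocab, dist):
            if p > threshold:
                edges.append((w, t, p))
    return edges

edges = build_word_topic_network(phi, vocab, threshold=0.1)
```

In practice the threshold (or a top-k cutoff per topic) controls network sparsity; the resulting edge list is the kind of heterogeneous text network a PTE-style embedding trainer consumes.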
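The second method's key step, reinterpreting the two weight matrices of a one-hidden-layer network as network edges, can be sketched as follows. The matrices `W1` and `W2` are toy values standing in for trained weights, and all names are illustrative assumptions.

```python
# Sketch of the second method's reinterpretation step: the hidden layer
# plays the role of topics, so input->hidden weights become word-topic
# edges and hidden->output weights become topic-document edges.
# Assumption: W1, W2 are toy stand-ins for trained weights.

words = ["graph", "vector"]
docs = ["d1", "d2", "d3"]

# W1[w][t]: word -> hidden(topic) weights; W2[t][d]: topic -> document.
W1 = [[0.8, 0.1],
      [0.2, 0.9]]
W2 = [[0.7, 0.2, 0.1],
      [0.1, 0.3, 0.6]]

def weight_matrices_to_edges(W1, W2, words, docs):
    """Return (word-topic edges, topic-document edges) as weighted triples."""
    word_topic = [(words[i], t, w)
                  for i, row in enumerate(W1) for t, w in enumerate(row)]
    topic_doc = [(t, docs[j], w)
                 for t, row in enumerate(W2) for j, w in enumerate(row)]
    return word_topic, topic_doc

wt_edges, td_edges = weight_matrices_to_edges(W1, W2, words, docs)
```

Together the two edge lists form a heterogeneous text network linking words, topics, and documents, which can then be fed to an embedding learner in place of PTE's label-derived network.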


