Canadian Conference on Artificial Intelligence

General Topic Annotation in Social Networks: A Latent Dirichlet Allocation Approach

Abstract

In this article, we present a novel document annotation method that can be applied to corpora of short documents such as social media texts. The method first applies Latent Dirichlet Allocation (LDA) to a corpus to infer a set of topical word clusters, and each document is automatically assigned one or more of these clusters. Further annotation is done by projecting the topics extracted and assigned by LDA onto a small set of generic categories. The mapping from topical clusters to generic categories is done manually; the categories are then used to annotate the general topics of the documents automatically. Notably, the number of topical clusters that must be mapped manually is far smaller than the number of postings that would normally have to be annotated by hand to build training and test sets. We show that the accuracy of annotation done through this method is about 80%, which is comparable to inter-human agreement in similar tasks. Additionally, with LDA the corpus entries are represented by low-dimensional vectors, which lead to good classification results. This lower-dimensional representation can be fed into many machine learning algorithms that cannot be applied to conventional high-dimensional text representations.
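
The projection step described in the abstract can be illustrated with a minimal sketch. It assumes that per-document topic distributions have already been inferred by some LDA implementation (for example, the output of scikit-learn's `LatentDirichletAllocation.transform`); the distributions, the category names, the `TOPIC_TO_CATEGORY` table, and the `annotate` helper below are all hypothetical. Summing probability mass per category and taking the largest is one plausible projection rule, not necessarily the one used by the authors.

```python
from collections import defaultdict

# Hypothetical manual mapping from LDA topic index to a generic category.
# In the method described above, this small table is the only part that
# is built by hand.
TOPIC_TO_CATEGORY = {
    0: "sports",
    1: "sports",
    2: "politics",
    3: "technology",
}


def annotate(doc_topic_probs, topic_to_category):
    """Project one document's topic distribution onto generic categories.

    Sums the probability mass of all topics mapped to each category and
    returns (best_category, per_category_mass). Topics without a mapping
    are ignored. This aggregation rule is an assumption for illustration.
    """
    mass = defaultdict(float)
    for topic, prob in enumerate(doc_topic_probs):
        category = topic_to_category.get(topic)
        if category is not None:
            mass[category] += prob
    if not mass:
        return None, {}
    best = max(mass, key=mass.get)
    return best, dict(mass)


# Made-up topic distributions standing in for real LDA output.
docs = {
    "post_a": [0.40, 0.35, 0.15, 0.10],
    "post_b": [0.05, 0.10, 0.25, 0.60],
}

labels = {name: annotate(probs, TOPIC_TO_CATEGORY)[0]
          for name, probs in docs.items()}
print(labels)
```

Because only the topic-to-category table is written by hand, the manual effort scales with the number of topics rather than with the number of postings, which is the saving the abstract points to.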