Conference: Canadian Conference on Artificial Intelligence

General Topic Annotation in Social Networks: A Latent Dirichlet Allocation Approach


Abstract

In this article, we present a novel document annotation method that can be applied to corpora of short documents such as social media texts. The method first applies Latent Dirichlet Allocation (LDA) to the corpus to infer a set of topical word clusters, and each document is automatically assigned one or more of these topic clusters. Further annotation is done by projecting the topics extracted and assigned by LDA onto a small set of generic categories. The mapping from topical clusters to generic categories is done manually; the categories are then used to automatically annotate the general topics of the documents. Notably, the number of topical clusters that must be mapped manually is far smaller than the number of postings that would normally have to be annotated by hand to build training and test sets. We show that the accuracy of the annotation produced by this method is about 80%, which is comparable to inter-human agreement on similar tasks. Additionally, LDA represents each corpus entry as a low-dimensional vector, which leads to good classification results. This lower-dimensional representation can be fed into many machine learning algorithms that cannot be applied to conventional high-dimensional text representations.
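The projection step described in the abstract can be illustrated with a minimal sketch. It assumes the per-document topic weights have already been inferred by an LDA implementation of your choice. The topic ids, category names, weights, the `annotate` helper, the sum-per-category rule, and the 0.3 threshold are all hypothetical choices made for illustration; the abstract does not specify how the authors aggregate topics into categories.

```python
from collections import defaultdict

# Hypothetical manual mapping from LDA topic ids to generic categories.
# In the described method this is the only step done by hand.
TOPIC_TO_CATEGORY = {0: "sports", 1: "sports", 2: "politics", 3: "technology"}


def annotate(doc_topic_weights, topic_to_category, threshold=0.3):
    """Project one document's topic weights onto generic categories.

    Weights of topics mapped to the same category are summed (an assumed
    aggregation rule). Categories whose total reaches `threshold` are
    returned, ordered from highest to lowest total weight.
    """
    totals = defaultdict(float)
    for topic_id, weight in enumerate(doc_topic_weights):
        category = topic_to_category.get(topic_id)
        if category is not None:  # unmapped topics are ignored
            totals[category] += weight
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [category for category, total in ranked if total >= threshold]


# Made-up topic distributions standing in for LDA output on two postings.
docs = {
    "post_a": [0.45, 0.25, 0.20, 0.10],
    "post_b": [0.05, 0.05, 0.50, 0.40],
}

annotations = {
    doc_id: annotate(weights, TOPIC_TO_CATEGORY)
    for doc_id, weights in docs.items()
}
print(annotations)
```

With these illustrative numbers, `post_a` receives the single category `sports`, while `post_b` receives both `politics` and `technology`. The manual effort scales with the number of topics (four here) rather than with the number of postings.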
