Venue: International Conference on Applications of Natural Language to Information Systems

Using Word Embeddings for Computing Distances Between Texts and for Authorship Attribution

Abstract

In this paper, word embeddings are used for supervised authorship attribution. Whereas previous methods have examined, for instance, character n-grams, syntax and, most importantly, token frequencies, the method presented here focuses on the semantic relationships between words. Instead of authors' word choices, the semantic networks of entities as perceived by authors may thereby come more closely into focus. We find that these networks can be used reliably for authorship attribution. The method is generally applicable as a tool for comparing different texts and/or authors through word embeddings that have been trained separately. This is achieved not by comparing vectors directly, but by comparing the sets of most similar words for the words shared between texts, and then aggregating and averaging these similarities per text pair. On two literary corpora (German and English), we compute embeddings for each text separately. The resulting similarities are then used to identify the author of an unknown text.
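
The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how neighbour sets are compared or how the author is finally chosen, so the Jaccard overlap, the neighbourhood size `k`, the highest-mean decision rule, the function names and the toy vectors are all assumptions made for illustration. Each text is represented as a plain dictionary from word to vector, standing in for an embedding trained on that text alone.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_neighbors(emb, word, k):
    """Set of the k words most similar to `word` inside ONE embedding space."""
    scored = [(cosine(emb[word], vec), other)
              for other, vec in emb.items() if other != word]
    scored.sort(reverse=True)
    return {other for _, other in scored[:k]}

def text_similarity(emb_a, emb_b, k=2):
    """Average neighbour-set overlap over the words shared by two texts.

    Vectors from emb_a and emb_b are never compared with each other,
    so the two spaces may have been trained separately.
    Jaccard overlap is an assumption, not taken from the paper.
    """
    shared = set(emb_a) & set(emb_b)
    if not shared:
        return 0.0
    total = 0.0
    for word in shared:
        na = top_neighbors(emb_a, word, k)
        nb = top_neighbors(emb_b, word, k)
        union = na | nb
        total += len(na & nb) / len(union) if union else 0.0
    return total / len(shared)

def attribute_author(unknown_emb, known, k=2):
    """Pick the author whose texts are, on average, most similar.

    `known` maps an author name to a list of per-text embeddings.
    The highest-mean rule is a hypothetical choice for this sketch.
    """
    scores = {
        author: sum(text_similarity(unknown_emb, e, k) for e in embs) / len(embs)
        for author, embs in known.items()
    }
    return max(scores, key=scores.get)

# Toy per-text embeddings (invented values). emb_b has the same neighbourhood
# structure as emb_a but its axes are swapped, as if trained independently.
emb_a = {"king": [1.0, 0.0], "queen": [0.9, 0.1], "sea": [0.0, 1.0], "ship": [0.1, 0.9]}
emb_b = {"king": [0.0, 1.0], "queen": [0.1, 0.9], "sea": [1.0, 0.0], "ship": [0.9, 0.1]}
emb_c = {"king": [1.0, 0.0], "queen": [0.0, 1.0], "sea": [0.9, 0.1], "ship": [0.1, 0.9]}

sim_ab = text_similarity(emb_a, emb_b, k=1)
sim_ac = text_similarity(emb_a, emb_c, k=1)
guess = attribute_author(emb_b, {"author_x": [emb_a], "author_y": [emb_c]}, k=1)
```

In the toy data, the vector for "king" in `emb_a` is orthogonal to the one in `emb_b`, so a direct vector comparison would suggest the texts are unrelated, while the neighbour sets agree for every shared word. That is the reason for comparing neighbourhoods rather than coordinates when the embeddings come from separate training runs.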
