International Conference on Computational Linguistics
Sub-Word Similarity based Search for Embeddings: Inducing Rare-Word Embeddings for Word Similarity Tasks and Language Modelling

Abstract

Training good word embeddings requires large amounts of data. Out-of-vocabulary words will still be encountered at test time, leaving these words without embeddings. To overcome this lack of embeddings for rare words, existing methods leverage morphological features to generate embeddings. While the existing methods use computationally intensive rule-based (Soricut and Och, 2015) or tool-based (Botha and Blunsom, 2014) morphological analysis to generate embeddings, our system applies a computationally simpler sub-word search on words that have existing embeddings. Embeddings of the sub-word search results are then combined using string similarity functions to generate rare-word embeddings. We augmented pre-trained word embeddings with these novel embeddings and evaluated them on a rare-word similarity task, obtaining up to a threefold improvement in correlation over the original set of embeddings. Applying our technique to embeddings trained on larger datasets led to performance on par with the existing state of the art for this task. Additionally, while analysing augmented embeddings in a log-bilinear language model, we observed up to a 50% reduction in rare-word perplexity in comparison to other, more complex language models.
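The idea of searching for sub-word matches among words that already have embeddings, then combining their vectors with string-similarity weights, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the character-trigram search, the use of `difflib.SequenceMatcher` as the similarity function, the `top_k` cut-off, and the helper names `char_ngrams` and `induce_embedding`.

```python
from difflib import SequenceMatcher


def char_ngrams(word, n=3):
    """Return the set of character n-grams of a word padded with boundary marks."""
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def induce_embedding(rare_word, embeddings, n=3, top_k=5):
    """Build a vector for rare_word from in-vocabulary words that share a sub-word.

    embeddings maps each known word to a list of floats (all the same length).
    Returns None when no known word shares an n-gram with rare_word.
    """
    target = char_ngrams(rare_word, n)
    scored = []
    for word, vec in embeddings.items():
        # Sub-word search (assumed form): keep words sharing at least one n-gram.
        if target & char_ngrams(word, n):
            # String similarity (assumed choice): difflib's matching ratio in [0, 1].
            sim = SequenceMatcher(None, rare_word, word).ratio()
            scored.append((sim, word, vec))

    # Keep the most similar candidates; ties are broken alphabetically.
    scored.sort(key=lambda item: (-item[0], item[1]))
    best = scored[:top_k]
    total = sum(sim for sim, _, _ in best)
    if not best or total == 0:
        return None

    # Similarity-weighted average of the candidates' embeddings.
    dim = len(best[0][2])
    return [sum(sim * vec[i] for sim, _, vec in best) / total for i in range(dim)]


if __name__ == "__main__":
    toy = {"happy": [1.0, 0.0], "happiness": [0.0, 1.0], "zebra": [5.0, 5.0]}
    print(induce_embedding("unhappy", toy))
```

In this toy setup, "unhappy" shares trigrams with "happy" and "happiness" but not with "zebra", so only the first two contribute, and the closer string match receives the larger weight. A real system would operate on pre-trained vectors and might use a different search procedure and similarity function.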