Conference: International Conference on Computational Linguistics

Unsupervised Multiword Segmentation of Large Corpora using Prediction-Driven Decomposition of n-grams



Abstract

We present a new, efficient unsupervised approach to the segmentation of corpora into multiword units. Our method involves an initial decomposition of common n-grams into segments that maximize the within-segment predictability of words, followed by further refinement of these segments into a multiword lexicon. Evaluating on four large, distinct corpora, we show that this method creates segments that correspond well to known multiword expressions; our model is particularly strong with regard to longer (3+ word) multiword units, which are often ignored or minimized in related work.
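The abstract does not give the scoring function, so the following is only a toy sketch of what "decomposing an n-gram into segments that maximize within-segment predictability" could look like. It assumes, purely for illustration, that predictability is the log-probability of each word given the preceding words in the same segment (estimated from raw counts), with segment-initial words scored by their unigram probability. The function names (`count_ngrams`, `best_decomposition`, and so on) and the tiny corpus are hypothetical and are not taken from the paper; the later lexicon-refinement step is not covered.

```python
import math
from collections import Counter
from itertools import product


def count_ngrams(sentences, max_n):
    """Count every contiguous n-gram (1 <= n <= max_n) in tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def word_logprob(counts, total_tokens, context, word):
    """log P(word | context) from raw counts (assumed estimator, no smoothing).

    An empty context falls back to the unigram probability.
    """
    if not context:
        return math.log(counts[(word,)] / total_tokens)
    return math.log(counts[context + (word,)] / counts[context])


def segment_score(counts, total_tokens, segment):
    """Sum of log-probabilities of each word given the earlier words in the segment."""
    return sum(
        word_logprob(counts, total_tokens, tuple(segment[:i]), segment[i])
        for i in range(len(segment))
    )


def best_decomposition(counts, total_tokens, ngram):
    """Try every way of splitting `ngram` and return the highest-scoring one.

    `ngram` must occur in the counted corpus and be no longer than max_n.
    Exhaustive search is exponential in len(ngram); fine for a short demo.
    """
    best, best_score = None, -math.inf
    for breaks in product([False, True], repeat=len(ngram) - 1):
        segments, current = [], [ngram[0]]
        for word, is_break in zip(ngram[1:], breaks):
            if is_break:
                segments.append(tuple(current))
                current = [word]
            else:
                current.append(word)
        segments.append(tuple(current))
        score = sum(segment_score(counts, total_tokens, s) for s in segments)
        if score > best_score:
            best, best_score = segments, score
    return best


# Hypothetical toy corpus: "new york" follows many different words,
# while "in" is followed by many different continuations.
corpus = [f"{w} new york".split()
          for w in ["near", "from", "to", "by", "at", "past", "in"]]
corpus += [f"in {rest}".split()
           for rest in ["old towns", "big cities", "small villages",
                        "red cars", "blue boats", "tall trees"]]

counts = count_ngrams(corpus, max_n=3)
total = sum(len(s) for s in corpus)
result = best_decomposition(counts, total, ("in", "new", "york"))
print(result)  # [('in',), ('new', 'york')]
```

In this toy corpus "new" is less likely after "in" than it is overall, so the best split separates "in" from "new york", whereas "york" is fully predictable after "new" and stays attached. A real system would need smoothing and a search that scales beyond exhaustive enumeration.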
