Conference on Empirical Methods in Natural Language Processing

Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees


Abstract

Efficient methods for storing and querying language models are critical for scaling to large corpora and high Markov orders. In this paper we propose methods for modeling extremely large corpora without imposing a Markov condition. At its core, our approach uses a succinct index - a compressed suffix tree - which provides near optimal compression while supporting efficient search. We present algorithms for on-the-fly computation of probabilities under a Kneser-Ney language model. Our technique is exact and although slower than leading LM toolkits, it shows promising scaling properties, which we demonstrate through ∞-order modeling over the full Wikipedia collection.
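For readers unfamiliar with the quantity being computed, the sketch below illustrates interpolated Kneser-Ney smoothing in the simplest (bigram) case, using explicit count tables over a toy corpus. It is only an illustration of the probability formula: the paper's contribution is to obtain the same counts on demand from a compressed suffix tree over the raw text, with no precomputed n-gram tables and no fixed Markov order. The function name kneser_ney_bigram, the toy corpus, and the discount value are illustrative assumptions, not taken from the paper.

# Minimal illustrative sketch of interpolated Kneser-Ney smoothing for
# bigrams, computed from explicit count tables.  The paper computes the
# same kind of probability on the fly from a compressed suffix tree and
# supports arbitrary (unbounded) orders; this sketch is not that method.

from collections import Counter, defaultdict


def kneser_ney_bigram(tokens, discount=0.75):
    """Return a function prob(word, context) giving the interpolated
    Kneser-Ney bigram probability P(word | context)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    # Continuation statistics: distinct left contexts preceding each word,
    # and distinct words following each context.
    left_contexts = defaultdict(set)   # word    -> {contexts seen before it}
    right_words = defaultdict(set)     # context -> {words seen after it}
    for (u, v) in bigram_counts:
        left_contexts[v].add(u)
        right_words[u].add(v)
    num_bigram_types = len(bigram_counts)

    def prob(word, context):
        # Lower-order continuation probability P_cont(word).
        p_cont = len(left_contexts[word]) / num_bigram_types

        # For simplicity the unigram count stands in for the number of
        # bigrams starting with `context`.
        context_count = unigram_counts[context]
        if context_count == 0:
            return p_cont  # unseen context: back off to continuation prob

        # Absolute discounting of the bigram count plus interpolation weight.
        discounted = max(bigram_counts[(context, word)] - discount, 0.0)
        lam = discount * len(right_words[context]) / context_count
        return discounted / context_count + lam * p_cont

    return prob


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ate the fish".split()
    p = kneser_ney_bigram(corpus)
    print(p("cat", "the"), p("fish", "the"), p("dog", "the"))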
