Conference on Empirical Methods in Natural Language Processing

Compact, Efficient and Unlimited Capacity: Language Modeling with Compressed Suffix Trees


Abstract

Efficient methods for storing and querying language models are critical for scaling to large corpora and high Markov orders. In this paper we propose methods for modeling extremely large corpora without imposing a Markov condition. At its core, our approach uses a succinct index - a compressed suffix tree - which provides near optimal compression while supporting efficient search. We present algorithms for on-the-fly computation of probabilities under a Kneser-Ney language model. Our technique is exact and although slower than leading LM toolkits, it shows promising scaling properties, which we demonstrate through ∞-order modeling over the full Wikipedia collection.
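For readers unfamiliar with the quantity being computed, the sketch below illustrates interpolated Kneser-Ney smoothing in the simplest (bigram) case, using explicit count tables over a toy corpus. It is only an illustration of the probability formula: the paper's contribution is to obtain the same counts on demand from a compressed suffix tree over the raw text, with no precomputed n-gram tables and no fixed Markov order. The function name kneser_ney_bigram, the toy corpus, and the discount value are illustrative assumptions, not taken from the paper.

# Minimal illustrative sketch of interpolated Kneser-Ney smoothing for
# bigrams, computed from explicit count tables.  The paper computes the
# same kind of probability on the fly from a compressed suffix tree and
# supports arbitrary (unbounded) orders; this sketch is not that method.

from collections import Counter, defaultdict


def kneser_ney_bigram(tokens, discount=0.75):
    """Return a function prob(word, context) giving the interpolated
    Kneser-Ney bigram probability P(word | context)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    # Continuation statistics: distinct left contexts preceding each word,
    # and distinct words following each context.
    left_contexts = defaultdict(set)   # word    -> {contexts seen before it}
    right_words = defaultdict(set)     # context -> {words seen after it}
    for (u, v) in bigram_counts:
        left_contexts[v].add(u)
        right_words[u].add(v)
    num_bigram_types = len(bigram_counts)

    def prob(word, context):
        # Lower-order continuation probability P_cont(word).
        p_cont = len(left_contexts[word]) / num_bigram_types

        # For simplicity the unigram count stands in for the number of
        # bigrams starting with `context`.
        context_count = unigram_counts[context]
        if context_count == 0:
            return p_cont  # unseen context: back off to continuation prob

        # Absolute discounting of the bigram count plus interpolation weight.
        discounted = max(bigram_counts[(context, word)] - discount, 0.0)
        lam = discount * len(right_words[context]) / context_count
        return discounted / context_count + lam * p_cont

    return prob


if __name__ == "__main__":
    corpus = "the cat sat on the mat the cat ate the fish".split()
    p = kneser_ney_bigram(corpus)
    print(p("cat", "the"), p("fish", "the"), p("dog", "the"))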
