Unsupervised Multiword Segmentation of Large Corpora using Prediction-Driven Decomposition of n-grams

Abstract

We present a new, efficient unsupervised approach to the segmentation of corpora into multiword units. Our method involves initial decomposition of common n-grams into segments which maximize within-segment predictability of words, and then further refinement of these segments into a multiword lexicon. Evaluating on four large, distinct corpora, we show that this method creates segments which correspond well to known multiword expressions; our model is particularly strong with regard to longer (3+ word) multiword units, which are often ignored or minimized in relevant work.
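
The abstract does not spell out how predictability is computed, so the following is only a toy sketch of the general idea of splitting an n-gram where a word is poorly predicted by its left neighbour. It is not the authors' algorithm: the bigram model, the add-one smoothing, the split rule, the example corpus, and the names `train_counts` and `segment_ngram` are all assumptions made for illustration.

```python
from collections import Counter


def train_counts(sentences):
    """Count unigrams and adjacent-word bigrams in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams


def segment_ngram(ngram, unigrams, bigrams):
    """Split an n-gram wherever a word is no better predicted by the
    previous word than by its own (smoothed) unigram probability.

    Assumption: predictability is a smoothed bigram probability, so each
    split decision depends only on the immediately preceding word.
    """
    total = sum(unigrams.values())
    vocab = len(unigrams)
    segments = [[ngram[0]]]
    for prev, word in zip(ngram, ngram[1:]):
        # Probability of the word starting a new segment (add-one smoothed).
        p_alone = (unigrams[word] + 1) / (total + vocab)
        # Probability of the word continuing the current segment.
        p_given_prev = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
        if p_given_prev > p_alone:
            segments[-1].append(word)
        else:
            segments.append([word])
    return [tuple(seg) for seg in segments]


# Hypothetical toy corpus, invented for this example.
corpus = [
    ["i", "love", "new", "york"],
    ["new", "york", "is", "big"],
    ["dogs", "love", "new", "york"],
    ["dogs", "bark"],
    ["i", "love", "dogs"],
    ["big", "dogs", "bark"],
]
unigrams, bigrams = train_counts(corpus)
print(segment_ngram(("new", "york", "dogs", "bark"), unigrams, bigrams))
# prints [('new', 'york'), ('dogs', 'bark')]
```

In this toy corpus "york" is never followed by "dogs", so the smoothed conditional probability (1/11) falls below the unigram probability of "dogs" (5/28) and the n-gram is split at that point. The refinement of segments into a multiword lexicon that the abstract mentions is not modelled here.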

Bibliographic details

Source
International Conference on Computational Linguistics | 2014 | pp. 753-761 | 9 pages
Authors
Julian Brooke; Vivian Tsang; Graeme Hirst; Fraser Shein
Original format: PDF
