ACM Transactions on Asian Language Information Processing

A Bayesian Alignment Approach to Transliteration Mining


Abstract

In this article we present a technique for mining transliteration pairs. It uses a set of simple features derived from a many-to-many bilingual forced alignment at the grapheme level to classify candidate transliteration word pairs as correct transliterations or not. We use a nonparametric Bayesian method for the alignment process, as this method rewards the reuse of parameters, resulting in compact models that align in a consistent manner and tend not to over-fit. Our approach uses the generative model resulting from aligning the training data to force-align the test data. We rely on the simple assumption that correct transliteration pairs will be well modeled and generated easily, whereas incorrect pairs, being more random in character, will be more costly to model and generate. Our generative model generates a word pair by concatenating bilingual grapheme sequence pairs. The many-to-many generation process is essential for handling many languages with non-Roman scripts, but it is hard to train well using maximum likelihood techniques, as these tend to over-fit the data. Our approach works on the principle that generation using only grapheme sequence pairs that are already in the model results in a high-probability derivation. If the model is instead forced to introduce a new parameter in order to explain part of the candidate pair, the derivation probability is substantially reduced, and it is severely reduced if the new parameter corresponds to a sequence pair composed of a large number of graphemes. The features we extract from the alignment of the test data are based not only on the scores from the generative model, but also on the relative proportions of each sequence that are hard to generate. The features are used in conjunction with a support vector machine classifier, trained on known positive examples together with synthetic negative examples, to determine whether a candidate word pair is a correct transliteration pair.
In our experiments, we used all data tracks from the 2010 Named-Entity Workshop (NEWS'10) and took the performance of the best system for each language pair as a reference point. Our results show that the new features we propose are strongly predictive, enabling our approach to achieve levels of performance on this task that are comparable to the state of the art.
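
The scoring principle described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a Dirichlet-process style predictive probability with a Poisson-length, uniform-spelling base measure, and every constant (`ALPHA`, `LAMBDA`, `VOCAB_SRC`, `VOCAB_TGT`, `MAX_LEN`), every function name, and the toy counts are hypothetical. It shows how a derivation built from sequence pairs already in the model scores far higher than one that needs new, long pairs, and how features of the kind the abstract mentions (a model score and the proportion of each word that is hard to generate) could be read off a forced alignment.

```python
import math
from collections import Counter

ALPHA = 1.0      # concentration parameter (assumed value)
LAMBDA = 2.0     # expected sequence length in the base measure (assumed)
VOCAB_SRC = 26   # assumed source alphabet size
VOCAB_TGT = 50   # assumed target alphabet size
MAX_LEN = 3      # longest grapheme sequence allowed on either side (assumed)


def poisson(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)


def base_prob(src, tgt):
    """Assumed base measure: cost grows quickly with sequence length."""
    return (poisson(len(src), LAMBDA) * VOCAB_SRC ** -len(src)
            * poisson(len(tgt), LAMBDA) * VOCAB_TGT ** -len(tgt))


def pair_prob(src, tgt, counts, total):
    """Predictive probability: reused pairs are cheap, new pairs are costly."""
    return (counts[(src, tgt)] + ALPHA * base_prob(src, tgt)) / (total + ALPHA)


def force_align(src, tgt, counts):
    """Viterbi search for the most probable many-to-many derivation.

    Returns (log_prob, segments), or None if no derivation exists
    within MAX_LEN. Segments are (src_sequence, tgt_sequence) tuples.
    """
    total = sum(counts.values())
    best = {(0, 0): (0.0, [])}
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if (i, j) not in best:
                continue
            score, segs = best[(i, j)]
            for di in range(1, MAX_LEN + 1):
                for dj in range(1, MAX_LEN + 1):
                    ni, nj = i + di, j + dj
                    if ni > len(src) or nj > len(tgt):
                        continue
                    s, t = src[i:ni], tgt[j:nj]
                    cand = score + math.log(pair_prob(s, t, counts, total))
                    if (ni, nj) not in best or cand > best[(ni, nj)][0]:
                        best[(ni, nj)] = (cand, segs + [(s, t)])
    return best.get((len(src), len(tgt)))


def features(src, tgt, counts):
    """Illustrative features: a length-normalized score and unseen ratios."""
    result = force_align(src, tgt, counts)
    if result is None:
        return None
    log_prob, segs = result
    unseen = [(s, t) for s, t in segs if counts[(s, t)] == 0]
    return {
        "log_prob_per_char": log_prob / (len(src) + len(tgt)),
        "src_unseen_ratio": sum(len(s) for s, _ in unseen) / len(src),
        "tgt_unseen_ratio": sum(len(t) for _, t in unseen) / len(tgt),
    }


# Toy model counts, standing in for a model learned from training data.
counts = Counter({("to", "トー"): 4, ("kyo", "キョー"): 3, ("n", "ン"): 3})
good = features("tokyo", "トーキョー", counts)   # plausible transliteration
bad = features("tokyo", "ロンドン", counts)      # implausible pair
```

In this toy setting the plausible pair is derived entirely from known sequence pairs, while the implausible pair forces the model to pay the base-measure cost for every segment. Feature dictionaries like `good` and `bad` are the kind of input that could then be passed to a support vector machine classifier.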
