Journal: International Journal of Computer Systems Science & Engineering

Using cross ambiguity model improves the effect of Vietnamese word segmentation


Abstract

Ambiguity is widespread in Vietnamese sentences and reduces the accuracy of word segmentation. In this paper, we propose a Vietnamese word segmentation method based on conditional random fields (CRFs) and a cross ambiguity model, combined with Vietnamese lexical features so that essential characteristics of Vietnamese are incorporated into the CRF. In total, 5377 ambiguity fragments were extracted from the training corpus. Statistical features, ambiguity-field internal features, and ambiguity contextual features were selected and fed into the maximum entropy model and the cross ambiguity model, which were then incorporated into the segmentation model. The training corpus was divided evenly into ten parts for the cross-validation experiment; the segmentation accuracy reached 96.55%. Compared with the Vietnamese segmentation tool VnTokenizer, the experimental results suggest that the proposed method performs well and is precise: precision and recall are higher than VnTokenizer's by 1.34% and 0.63%, respectively, and the alignment error rate (AER) is lower by 0.98%.
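
The abstract does not say how the ambiguity fragments were extracted. As a rough illustration only, the sketch below assumes the usual notion of cross (overlapping) ambiguity: two lexicon words that share syllables without one containing the other. The function name `cross_ambiguity_fragments`, the `max_len` parameter, and the toy lexicon are hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple


def cross_ambiguity_fragments(
    syllables: List[str], lexicon: Set[str], max_len: int = 4
) -> List[Tuple[int, int]]:
    """Return (start, end) syllable spans covered by two crossing lexicon words.

    Assumption: lexicon entries are lowercase, space-separated syllable strings.
    """
    n = len(syllables)
    # Collect every multi-syllable lexicon word that occurs in the sentence.
    spans = []
    for i in range(n):
        for j in range(i + 2, min(n, i + max_len) + 1):
            if " ".join(syllables[i:j]).lower() in lexicon:
                spans.append((i, j))
    # Two words cross when they overlap and neither contains the other.
    fragments = set()
    for a_start, a_end in spans:
        for b_start, b_end in spans:
            if a_start < b_start < a_end < b_end:
                fragments.add((a_start, b_end))
    return sorted(fragments)


# Toy example: "học sinh" (student) and "sinh học" (biology) share a syllable.
toy_lexicon = {"học sinh", "sinh học"}
sentence = "học sinh học sinh học".split()
print(cross_ambiguity_fragments(sentence, toy_lexicon))
# → [(0, 3), (1, 4), (2, 5)]
```

Each returned span marks a stretch of syllables with more than one plausible segmentation. In a pipeline like the one described, such spans would be the candidates passed to a disambiguation model; the features and classifier used for that step are the paper's own and are not reproduced here.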

Bibliographic details

  • Author affiliations

    Kunming Univ Sci & Technol, Sch Informat Engn & Automat, Kunming 650500, Peoples R China;

    Kunming Univ Sci & Technol, Sch Informat Engn & Automat, Kunming 650500, Peoples R China;

    Kunming Univ Sci & Technol, Sch Informat Engn & Automat, Kunming 650500, Peoples R China|Yunnan Coll, Key Lab Pattern Recognit & Intelligent Comp, Kunming 650500, Peoples R China;

    Kunming Univ Sci & Technol, Sch Informat Engn & Automat, Kunming 650500, Peoples R China|Yunnan Coll, Key Lab Pattern Recognit & Intelligent Comp, Kunming 650500, Peoples R China;

    Kunming Univ Sci & Technol, Sch Informat Engn & Automat, Kunming 650500, Peoples R China|Yunnan Coll, Key Lab Pattern Recognit & Intelligent Comp, Kunming 650500, Peoples R China;

    Kunming Univ Sci & Technol, Sch Informat Engn & Automat, Kunming 650500, Peoples R China|Yunnan Coll, Key Lab Pattern Recognit & Intelligent Comp, Kunming 650500, Peoples R China;

  • Original format: PDF
  • Language of text: English
  • Keywords

    Vietnamese corpus; CRFs; Vietnamese segmentation; Maximum Entropy; Cross ambiguity model; VnTokenizer;

