Conference: International Conference on Knowledge and Smart Technology

Dzongkha Word Segmentation using Deep Learning

Abstract

Natural Language Processing (NLP) has been applied to machine translation, chatbots, speech recognition, question answering systems, document summarization, and so on. The Dzongkha language of Bhutan, however, has not been considered in NLP systems, presumably because the language is complex and is written as a string of syllables without explicit word boundaries. Dzongkha word segmentation is therefore an essential first step in building NLP applications for the language. The novelty of our research lies in applying Deep Learning to the task of Dzongkha word segmentation, which avoids the need for manual feature engineering. The segmentation problem is formulated as a syllable tagging task. We also incorporate a window approach, in which the tag of a syllable depends on its surrounding syllables. Two sets of experiments were designed, with four models of varying context sizes in each set. We evaluated our models using the syllable-tagged corpus prepared by the Dzongkha Development Commission. The model with context size 2 achieved the highest F-score of 94.40%, with 94.47% precision and 94.35% recall.
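
The formulation described in the abstract (segmentation as syllable tagging, with a window of surrounding syllables as context) can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' method: the abstract does not state the tag set, so a two-tag B/I scheme (B = a syllable that begins a word, I = a syllable inside a word) is used; "context size k" is taken to mean k syllables on each side; the `<pad>` symbol, the function names, and the placeholder syllables `s1`..`s4` are all hypothetical. The neural tagger itself is omitted; the sketch only shows how training pairs and word boundaries could be derived.

```python
from typing import List, Tuple

PAD = "<pad>"  # hypothetical padding symbol for positions outside the sentence


def words_to_tags(words: List[List[str]]) -> Tuple[List[str], List[str]]:
    """Flatten segmented words into a syllable sequence with B/I tags.

    Assumed tag scheme: "B" marks the first syllable of a word,
    "I" marks every following syllable of the same word.
    """
    syllables: List[str] = []
    tags: List[str] = []
    for word in words:
        for i, syllable in enumerate(word):
            syllables.append(syllable)
            tags.append("B" if i == 0 else "I")
    return syllables, tags


def context_windows(syllables: List[str], k: int) -> List[List[str]]:
    """Return, for each syllable, the window of k syllables on each side.

    Each window has length 2*k + 1 and is what a tagger would see
    when predicting the tag of the syllable at its center.
    """
    padded = [PAD] * k + syllables + [PAD] * k
    return [padded[i:i + 2 * k + 1] for i in range(len(syllables))]


def tags_to_words(syllables: List[str], tags: List[str]) -> List[List[str]]:
    """Rebuild words from predicted tags: every "B" starts a new word."""
    words: List[List[str]] = []
    for syllable, tag in zip(syllables, tags):
        if tag == "B" or not words:
            words.append([syllable])
        else:
            words[-1].append(syllable)
    return words


# Placeholder sentence of three words; s1..s4 stand in for real syllables.
segmented = [["s1"], ["s2", "s3"], ["s4"]]
syllables, tags = words_to_tags(segmented)
windows = context_windows(syllables, k=2)
restored = tags_to_words(syllables, tags)
```

Under these assumptions, each window paired with the tag of its center syllable forms one training example, and applying `tags_to_words` to the predicted tags recovers the word boundaries.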