Venue: International Conference on Computational Linguistics

Token Level Identification of Linguistic Code Switching

Abstract

Native speakers of Arabic typically mix dialectal Arabic and Modern Standard Arabic in the same utterance. This phenomenon is known as linguistic code switching (LCS). Identifying these LCS points in written text, where there is no accompanying speech signal, is a very challenging task. In this paper, we address automatic identification of LCS points in Arabic social media text by identifying dialectal words at the token level. We present an unsupervised approach that employs a set of dictionaries, sound-change rules, and language models to tackle this problem. We tune and test the performance of our approach against human-annotated Egyptian and Levantine discussion fora datasets. Two types of token-level annotation are obtained for each dataset: context-sensitive and context-insensitive. We achieve a token-level F_{β=1} score of 74% and 72.4% on the context-sensitive development and test datasets, respectively. On the context-insensitive annotated data, we achieve a token-level F_{β=1} score of 84.4% and 84.9% on the development and test datasets, respectively.
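The token-level F score reported in the abstract is the standard F measure with β = 1, the harmonic mean of precision and recall. As a minimal sketch of how such a score could be computed from aligned gold and predicted token tags: the tag names "DA" and "MSA", the choice of the dialectal tag as the positive class, and the function name `token_f_beta` are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Sequence


def token_f_beta(gold: Sequence[str], pred: Sequence[str],
                 positive: str = "DA", beta: float = 1.0) -> float:
    """Token-level F_beta for one label, given aligned gold and predicted tags.

    `positive` is the label treated as the target class (hypothetical "DA"
    for dialectal Arabic); every other label counts as negative.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be the same length")

    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives

    if tp == 0:
        # No correct positive predictions: precision or recall is zero
        # (or undefined), so report 0.0.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example with invented tags (not data from the paper).
gold = ["MSA", "DA", "DA", "MSA", "DA"]
pred = ["MSA", "DA", "MSA", "DA", "DA"]
score = token_f_beta(gold, pred)
```

With β = 1 the formula reduces to 2PR / (P + R), so precision and recall are weighted equally.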
