International conference on computational linguistics

The creation of large-scale annotated corpora of minority languages using UniParser and the EANC platform


Abstract

This paper describes two tools for creating morphologically annotated linguistic corpora: UniParser and the EANC platform. The EANC platform is the database and search framework originally developed for the Eastern Armenian National Corpus and later adopted for other languages. UniParser is an automated morphological analysis tool developed specifically for creating corpora of languages that have relatively few native speakers and for which developing a parser from scratch is not feasible. It is designed for use with the EANC platform and generates XML output in the EANC format. UniParser and the EANC platform have already been used to create corpora of several languages (Albanian, Kalmyk, Lezgian, and Ossetic), of which the Ossetic corpus is the largest (5 million tokens, with 10 million planned for 2013), and are currently being employed in the construction of corpora of Buryat and Modern Greek. The paper describes the general architecture of the EANC platform and UniParser, using the Ossetic corpus as an example of the advantages and disadvantages of the approach.
