The creation of large-scale annotated corpora of minority languages using UniParser and the EANC platform

Abstract

This paper is devoted to two tools for creating morphologically annotated linguistic corpora: UniParser and the EANC platform. The EANC platform is a database and search framework originally developed for the Eastern Armenian National Corpus and later adopted for other languages. UniParser is an automated morphological analysis tool developed specifically for creating corpora of languages with relatively few native speakers, for which developing parsers from scratch is not feasible. It has been designed for use with the EANC platform and generates XML output in the EANC format. UniParser and the EANC platform have already been used to create corpora of several languages (Albanian, Kalmyk, Lezgian, and Ossetic), of which the Ossetic corpus is the largest (5 million tokens, with 10 million planned for 2013), and are currently being employed in the construction of corpora of Buryat and Modern Greek. The paper describes the general architecture of the EANC platform and UniParser, using the Ossetic corpus as an example of the advantages and disadvantages of the described approach.

Bibliographic details

Source
International conference on computational linguistics | 2012 | pp. 83-91 | 9 pages
Authors
Timofey ARKHANGELSKIY; Oleg BELYAEV; Arseniy VYDRIN
Original format: PDF
Keywords
corpus linguistics; automated morphological analysis; language documentation; Iranian languages; Ossetic
