Home > Foreign-language journals > Quality Control, Transactions > Corpulyzer: A Novel Framework for Building Low Resource Language Corpora

Corpulyzer: A Novel Framework for Building Low Resource Language Corpora


Abstract

The rapid proliferation of artificial intelligence has led to the development of sophisticated, cutting-edge systems in natural language processing and computational linguistics. These systems rely heavily on high-quality datasets and corpora to train deep-learning algorithms and produce precise models. Preparing a high-quality, gold-standard corpus for natural language processing at large scale is challenging because it requires huge computational resources, accurate language identification models, and precise content parsing tools. The task is even harder for regional languages because web content is scarce. In this article, we propose Corpulyzer (Corpus Analyzer), a novel generic framework for building low-resource language corpora. The framework consists of a corpus generation module and a corpus analyzer module. We demonstrate its efficacy by creating a high-quality, large-scale corpus for the Urdu language as a case study. First, leveraging data from the Common Crawl Corpus (CCC), we prepare a list of seed URLs by filtering for Urdu-language webpages. Next, we use Corpulyzer to crawl the World Wide Web (WWW) over a period of four years (2016–2020). We build an Urdu web corpus, “UrduWeb20”, that consists of 8.0 million Urdu webpages crawled from 6,590 websites. In addition, we propose a Low-Resource Language (LRL) website scoring algorithm and a content-size filter for language-focused crawling to make optimal use of computational resources. Moreover, we analyze UrduWeb20 using a variety of traditional metrics, such as web traffic rank, URL depth, duplicate documents, and vocabulary distribution, along with our newly defined content-richness metrics. Furthermore, we compare different characteristics of our corpus with three CCC datasets.
In general, we observe that, in contrast to CCC, which crawls a limited number of webpages from highly ranked Urdu websites, Corpulyzer performs in-depth crawling of Urdu content-rich websites. Finally, we make the Corpulyzer framework available to the research community for corpus building.
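
As a rough illustration of two of the traditional metrics named in the abstract, URL depth and duplicate documents, the sketch below shows one common way such measurements could be computed. It is not taken from Corpulyzer: the function names `url_depth` and `duplicate_groups`, the definition of depth as the count of non-empty path segments, and the use of a SHA-256 digest over whitespace-normalized text for exact-duplicate detection are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlparse


def url_depth(url: str) -> int:
    """Count the non-empty path segments of a URL (a site root has depth 0).

    Assumption: this is one common definition of URL depth, not necessarily
    the one used by the paper.
    """
    path = urlparse(url).path
    return sum(1 for segment in path.split("/") if segment)


def normalized_digest(text: str) -> str:
    """Hash the text after collapsing runs of whitespace to single spaces."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicate_groups(pages: dict[str, str]) -> list[list[str]]:
    """Group URLs whose page text is identical after whitespace normalization.

    `pages` maps URL -> extracted text. Only groups with more than one URL
    are returned. This detects exact duplicates only, not near-duplicates.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for url, text in pages.items():
        groups[normalized_digest(text)].append(url)
    return [urls for urls in groups.values() if len(urls) > 1]
```

With these hypothetical helpers, `url_depth` could be applied to every crawled URL to build a depth histogram, and `duplicate_groups` could be run over the extracted page text to estimate the share of exact duplicates in a corpus.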
