Corpulyzer: A Novel Framework for Building Low Resource Language Corpora

Bilal Tahir; Muhammad Amir Mehmood

Abstract

The rapid proliferation of artificial intelligence has led to sophisticated systems in natural language processing and computational linguistics. These systems rely heavily on high-quality datasets and corpora to train deep-learning models. Preparing a high-quality, gold-standard corpus for natural language processing at large scale is challenging because it requires huge computational resources, accurate language identification models, and precise content parsing tools. The task is harder still for regional languages because web content is scarce. In this article, we propose Corpulyzer (Corpus Analyzer), a generic framework for building low-resource language corpora. The framework consists of a corpus generation module and a corpus analyzer module. We demonstrate its efficacy by creating a high-quality, large-scale corpus for the Urdu language as a case study. First, leveraging data from the Common Crawl Corpus (CCC), we prepare a list of seed URLs by filtering for Urdu-language webpages. Next, we use Corpulyzer to crawl the World Wide Web (WWW) over a period of four years (2016–2020). We build the Urdu web corpus "UrduWeb20", which consists of 8.0 million Urdu webpages crawled from 6,590 websites. In addition, we propose a Low-Resource Language (LRL) website scoring algorithm and a content-size filter for language-focused crawling to achieve optimal use of computational resources. Moreover, we analyze UrduWeb20 using a variety of traditional metrics, such as web traffic rank, URL depth, duplicate documents, and vocabulary distribution, along with our newly defined content-richness metrics. Furthermore, we compare the characteristics of our corpus with three CCC datasets. In general, we observe that, whereas CCC crawls a limited number of webpages from highly ranked Urdu websites, Corpulyzer crawls content-rich Urdu websites in depth. Finally, we make the Corpulyzer framework available to the research community for corpus building.
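
The abstract names a language filter for seed URLs, a content-size filter, and a website scoring algorithm, but gives no details of any of them. The sketch below is only an illustration of how such crawl-time filters might fit together, not the authors' method. Everything in it is an assumption: the function names `urdu_ratio`, `keep_page`, and `site_scores`, the thresholds `min_ratio` and `min_chars`, and the use of a Unicode script check in place of a real language identification model (a script check cannot distinguish Urdu from Arabic or Persian).

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: a script check over the Arabic Unicode block (U+0600 to U+06FF)
# stands in for a real language identifier. It cannot separate Urdu from
# Arabic or Persian, so it is only a coarse first-pass filter.
ARABIC_BLOCK = (0x0600, 0x06FF)


def urdu_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    in_block = sum(ARABIC_BLOCK[0] <= ord(ch) <= ARABIC_BLOCK[1] for ch in letters)
    return in_block / len(letters)


def keep_page(text: str, min_ratio: float = 0.5, min_chars: int = 200) -> bool:
    """Hypothetical content-size and language filter; both thresholds are made up."""
    return len(text) >= min_chars and urdu_ratio(text) >= min_ratio


def site_scores(pages, **filter_kwargs):
    """Score each host by the share of its sampled pages that pass keep_page.

    `pages` is an iterable of (url, extracted_text) pairs. This score is a
    placeholder for illustration, not the LRL scoring algorithm of the paper.
    """
    kept = defaultdict(int)
    seen = defaultdict(int)
    for url, text in pages:
        host = urlparse(url).netloc
        seen[host] += 1
        kept[host] += keep_page(text, **filter_kwargs)  # True counts as 1
    return {host: kept[host] / seen[host] for host in seen}
```

Under these assumptions, a crawler could sample a few pages per host, call `site_scores` on them, and spend its crawl budget on the hosts with the highest scores instead of following every link.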


Bibliographic details

Source
Quality Control, Transactions | 2021, Issue 1 | pp. 8546-8563 | 18 pages
Authors
Bilal Tahir; Muhammad Amir Mehmood
Author affiliations

Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan;

Al-Khawarizmi Institute of Computer Science, University of Engineering and Technology, Lahore, Pakistan;

Indexing information
Original format: PDF
Text language: English
Keywords
Buildings; Tools; Task analysis; Social networking (online); Crawlers; Computational modeling; Vocabulary;

Date added to database: 2022-08-18 22:58:53

