Conference: International Conference on Advanced Intelligent Systems and Informatics

ARARSS: A System for Constructing and Updating Arabic Textual Resources

Abstract

The growth of electronically readable Arabic content on the web has made it a rich source from which to build new corpora or update existing ones. Such corpora benefit Arabic corpus linguistics, computational linguistics, and natural language processing. In this paper, we present ARARSS, a tool that automatically constructs and updates textual corpora from Rich Site Summary (RSS) feeds. ARARSS collects texts, categorized according to user needs, together with their metadata (for example, location, time, and topic) as provided by the RSS sources. We used ARARSS to construct a Modern Standard Arabic corpus comprising 117,819 texts and more than 28 million words. ARARSS is an open-source tool, freely available for download (http://corpus.kacst.edu.sa/more_info.jsp) along with the constructed corpus.
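
The collection step described in the abstract can be illustrated with a minimal sketch. It is not the ARARSS implementation: the sample feed, the function name `update_corpus`, and the choice of the RSS `category`, `pubDate`, and `link` elements as metadata are all assumptions made for illustration. The sketch parses an RSS 2.0 document with Python's standard library, files each item under its category, and skips items whose link has already been seen, so that repeated runs update the corpus rather than duplicate it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real run would download the XML from a news source.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>عنوان أول</title>
      <link>http://example.com/1</link>
      <category>sports</category>
      <pubDate>Mon, 06 Jan 2014 10:00:00 GMT</pubDate>
      <description>نص الخبر الأول</description>
    </item>
    <item>
      <title>عنوان ثان</title>
      <link>http://example.com/2</link>
      <category>economy</category>
      <pubDate>Mon, 06 Jan 2014 11:00:00 GMT</pubDate>
      <description>نص الخبر الثاني</description>
    </item>
    <item>
      <title>عنوان ثالث</title>
      <link>http://example.com/3</link>
      <category>sports</category>
      <pubDate>Mon, 06 Jan 2014 12:00:00 GMT</pubDate>
      <description>نص الخبر الثالث</description>
    </item>
  </channel>
</rss>"""


def update_corpus(rss_xml, corpus, seen_links):
    """Add unseen RSS items to `corpus`, grouped by category.

    `corpus` maps a category name to a list of records; `seen_links`
    is the set of item links already collected. Returns the number of
    newly added items. (Hypothetical helper, not part of ARARSS.)
    """
    added = 0
    for item in ET.fromstring(rss_xml).iter("item"):
        link = item.findtext("link", default="")
        if link in seen_links:
            continue  # already collected on an earlier run
        seen_links.add(link)
        category = item.findtext("category", default="uncategorized")
        corpus.setdefault(category, []).append({
            "title": item.findtext("title", default=""),
            "text": item.findtext("description", default=""),
            "link": link,
            "published": item.findtext("pubDate", default=""),
        })
        added += 1
    return added


corpus, seen_links = {}, set()
first_run = update_corpus(SAMPLE_RSS, corpus, seen_links)
second_run = update_corpus(SAMPLE_RSS, corpus, seen_links)
for category in sorted(corpus):
    print(category, len(corpus[category]))
```

Running the function twice on the same feed adds items only on the first pass, which is the behavior an updating step needs. A fuller tool would also fetch the linked article, since an RSS `description` is often only a summary.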
