Journal: Modern Applied Science

The Design and the Construction of the Traditional Arabic Lexicons Corpus (The TAL-Corpus)



Abstract

Arabic lexicography is a well-established and deep-rooted art of Arabic literature. Computational lexicography harnesses the computational and storage power of modern computers to accelerate long-term efforts in lexicographic projects. A collection of 23 machine-readable dictionaries, all freely available on the web, was used to build the Corpus of Traditional Arabic Lexicons (the TAL-Corpus). The purpose of constructing the TAL-Corpus is to collect and organize the long-established tradition of Arabic lexicons, which can also be used to create new corpus-based Arabic dictionaries. The compilation of the TAL-Corpus followed standard design and development criteria that informed the major decisions in corpus creation. The corpus-building process involved extracting information from disparate formats and merging the traditional Arabic lexicons. As a result, the TAL-Corpus contains more than 14 million words and over 2 million word types (distinct words). The TAL-Corpus was then used to create a morphological database. This database was constructed automatically using a new algorithm informed by Arabic linguistic theory. The algorithm processed the text of the TAL-Corpus and extracted 2,781,796 entries. These entries were stored in the morphological database, where each entry represents a word-root pair (i.e. an Arabic word and its root). A comparative evaluation of the TAL-Corpus against three other Arabic corpora showed that its vocabulary scored higher on lexical diversity. Moreover, its coverage was computed by comparing its words and lemmas against their equivalents in the other corpora; it scored about 67% when comparing words and 82% when comparing lemmas.
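The corpus statistics mentioned in the abstract (word count, word types, lexical diversity, coverage) can be illustrated with a minimal sketch. The abstract does not say which diversity measure or coverage formula the authors used, so the type-token ratio and the set-overlap coverage below are assumptions chosen for illustration, as are the function names, the whitespace tokenizer, and the transliterated toy data.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: plain whitespace tokenization; the paper's tokenizer is not described.
    return text.split()


def corpus_stats(tokens: list[str]) -> dict[str, float]:
    # Tokens are running words; types are distinct word forms.
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    # Type-token ratio, one common (assumed) proxy for lexical diversity.
    ttr = n_types / n_tokens if n_tokens else 0.0
    return {"tokens": n_tokens, "types": n_types, "ttr": ttr}


def coverage(corpus_vocab: set[str], reference_vocab: set[str]) -> float:
    # Assumed definition: share of the reference vocabulary also found in the corpus.
    if not reference_vocab:
        return 0.0
    return len(corpus_vocab & reference_vocab) / len(reference_vocab)


# Hypothetical toy data (transliterated), not taken from the TAL-Corpus.
sample = "kataba alkatib kitaban wa kataba altalib darsan"
tokens = tokenize(sample)
stats = corpus_stats(tokens)

reference = {"kataba", "darsan", "qalam", "altalib"}
cov = coverage(set(tokens), reference)

print(stats["tokens"], stats["types"], round(stats["ttr"], 3), cov)
```

The same coverage function could be applied to lemma sets instead of word sets, which is presumably how the separate word-level and lemma-level figures arise, though the abstract does not spell this out.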
