
Arabic Language Processing for Text Classification. Contributions to Arabic Root Extraction Techniques, Building An Arabic Corpus, and to Arabic Text Classification Techniques.


Abstract

The impact and dynamics of Internet-based resources for Arabic-speaking users are growing in significance, depth and breadth at a faster pace than ever, and this calls for updated mechanisms for the computational processing of Arabic texts. Arabic is a complex language and as such requires in-depth investigation to analyze and improve the available automatic processing techniques, such as root extraction methods and text classification techniques, and to develop labeled text collections, whether with single or multiple labels.

This thesis proposes new ideas and methods to improve available automatic processing techniques for Arabic texts. Any automatic processing technique requires data in order to be used, critically reviewed and assessed, so an attempt to develop a labeled Arabic corpus is also proposed here. The thesis is composed of three parts: (1) Arabic corpus development; (2) proposing, improving and implementing root extraction techniques; and (3) proposing and investigating the effect of different pre-processing methods on single-label text classification methods for Arabic.

This thesis first develops an Arabic corpus prepared for testing root extraction methods as well as single-label text classification techniques. It also enhances a rule-based root extraction method by handling irregular cases (which appear in about 34% of texts). It proposes and implements two expanded algorithms as well as an adjustment for a weight-based method. It also adds the algorithm that handles irregular cases to all of these methods and compares the performance of the proposed methods with that of the original ones. The thesis thus develops a root extraction system that handles foreign Arabized words by constructing a list of about 7,000 foreign words.
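As a rough illustration of why a list of foreign Arabized words matters for rule-based extraction, the following is a minimal sketch of a toy affix stripper with an exception list. It is not the thesis's algorithm: the function name `light_stem`, the tiny affix inventories and the two-entry word list are all hypothetical stand-ins, and the sketch produces a stem rather than a root.

```python
# Toy rule-based affix stripper with a foreign-word exception list.
# All names and lists here are hypothetical illustrations, not the
# thesis's actual rules or its list of about 7,000 foreign words.

PREFIXES = ("ال", "و")          # definite article, conjunction
SUFFIXES = ("ون", "ات", "ة")    # common plural and feminine endings
FOREIGN_WORDS = {"تلفزيون", "كمبيوتر"}  # "television", "computer"


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, unless the word is foreign."""
    # Foreign Arabized words do not follow Arabic morphology, so they are
    # returned unchanged instead of being passed through the rules.
    if word in FOREIGN_WORDS:
        return word
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    print(light_stem("المعلمون"))  # regular word: prefix and suffix removed
    print(light_stem("تلفزيون"))   # foreign word: left untouched
```

Without the exception list, the suffix rule would wrongly cut the final two letters of "تلفزيون", because they happen to match a plural ending. That is the kind of error a foreign-word list is meant to prevent.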
The technique with the best accuracy in extracting the correct stem and root for the respective words in texts, which is an enhanced rule-based method, is used in the third part of this thesis. This thesis finally proposes and implements a variant term frequency-inverse document frequency weighting method, and investigates the effect of different choices of features in document representation on single-label text classification performance (words, stems or roots, as well as extending these choices with their respective phrases). This thesis applies forty-seven classifiers to all proposed representations and compares their performance. One challenge for researchers in Arabic text processing is that root extraction techniques reported in the literature are either not accessible or take a long time to reproduce, while labeled benchmark Arabic text corpora are not fully available online. Also, few machine learning techniques have so far been investigated on Arabic, and those studies used the usual preprocessing steps before classification. Such challenges are addressed in this thesis by developing a new labeled Arabic text corpus for extended applications of computational techniques.

The results show that proposing and implementing an algorithm that handles irregular words in Arabic improved the performance of all implemented root extraction techniques. The performance of this algorithm is evaluated in terms of accuracy improvement and execution time. Its efficiency is investigated with different document lengths and is found empirically to be linear in time for document lengths of less than about 8,000. Among the implemented root extraction methods, the rule-based technique improves the most when the irregular-case handling algorithm is included.
This thesis confirms that choosing roots or stems instead of words in document representations indeed improves single-label classification performance significantly for most of the classifiers used. However, extending such representations with their respective phrases yields no significant improvement in single-label text classification performance. Many classifiers, such as the ripple-down rule classifier, had not yet been tested for Arabic. The comparison of the classifiers' performance concludes that the Bayesian network classifier performs significantly best in terms of accuracy, training time, and root mean square error for all proposed and implemented representations.
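The abstract does not specify how the proposed weighting variant differs from standard term frequency-inverse document frequency, so the sketch below shows only the textbook baseline, tf(t, d) * log(N / df(t)), as a point of reference. The function name `tfidf` and the placeholder tokens are assumptions. The tokens in each document could equally be words, stems or roots, which is how the choice of representation enters the weighting.

```python
import math
from collections import Counter


def tfidf(docs):
    """Textbook TF-IDF over tokenized documents.

    docs: list of token lists (tokens may be words, stems or roots).
    Returns one {term: weight} dict per document.
    This is the standard formulation, not the thesis's variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


if __name__ == "__main__":
    # Placeholder tokens standing in for words, stems or roots.
    corpus = [["a", "b", "a"], ["b", "c"]]
    for doc_weights in tfidf(corpus):
        print(doc_weights)
```

A term that occurs in every document gets weight zero under this formulation, which is why representations that merge inflected forms into one stem or root can change which terms remain discriminative.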
