Conference: 2018 IEEE 2nd International Workshop on Arabic and Derived Script Analysis and Recognition

A Novel Term Weighting Scheme and an Approach for Classification of Agricultural Arabic Text Complaints


Abstract

This paper proposes a machine learning based approach for classifying farmers' complaints, written in Arabic text, by crop. The complaints are first preprocessed by stop word removal, automatic correction of words, handling of special cases, and stemming, so that only the content terms remain. The special cases handled are domain specific ones that may affect classification performance. A new term weighting scheme called Term Class Weight-Inverse Class Frequency (TCW-ICF) is then used to extract the most discriminating features with respect to each class. The extracted features are used to represent the preprocessed complaints as feature vectors for training a classifier. Finally, the trained classifier assigns an unlabeled complaint to one of the crop classes. In addition, a relatively large dataset of more than 5000 farmers' complaints written in Arabic script, covering eight different crops, has been created. The proposed approach has been validated through extensive experiments on this newly created dataset using a KNN classifier. It is argued that the proposed approach outperforms the baseline Vector Space Model (VSM). Further, the superiority of the proposed term weighting scheme in selecting the best set of discriminating features is demonstrated through a comparative analysis against four well-known feature selection techniques. The new term weighting scheme is applied to Arabic script as a case study, but it can be applied to text data in any language.
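
The pipeline described in the abstract (class-aware term weighting, per-class feature selection, feature vectors, KNN) can be illustrated with a minimal sketch. The abstract does not give the TCW-ICF formula, so the weight below is a generic stand-in (within-class term frequency times inverse class frequency) and should not be read as the authors' scheme. The toy complaints, the function names, the cosine similarity, and the values of `k` are all assumptions made for illustration. Preprocessing (stop word removal, correction, stemming) is assumed to have been done already.

```python
import math
from collections import Counter, defaultdict

# Hypothetical, already-preprocessed complaints with crop labels.
# These are placeholder tokens, not data from the paper's dataset.
TRAIN = [
    (["leaf", "yellow", "wheat", "rust"], "wheat"),
    (["wheat", "rust", "spot", "stem"], "wheat"),
    (["tomato", "fruit", "rot", "leaf"], "tomato"),
    (["tomato", "worm", "fruit", "hole"], "tomato"),
    (["cotton", "boll", "worm", "leaf"], "cotton"),
    (["cotton", "boll", "dry", "fall"], "cotton"),
]


def class_term_weights(train):
    """Weight each (class, term) pair: within-class frequency * inverse class frequency.

    Generic stand-in weighting, NOT the TCW-ICF formula from the paper.
    """
    tf = defaultdict(Counter)  # tf[label][term] = occurrences of term in that class
    for tokens, label in train:
        tf[label].update(tokens)
    n_classes = len(tf)
    class_freq = Counter()  # number of classes in which each term occurs
    for counts in tf.values():
        class_freq.update(counts.keys())
    weights = {}
    for label, counts in tf.items():
        total = sum(counts.values())
        weights[label] = {
            term: (count / total) * math.log(1 + n_classes / class_freq[term])
            for term, count in counts.items()
        }
    return weights


def select_features(weights, k):
    """Keep the k highest-weighted terms of every class as one sorted vocabulary."""
    selected = set()
    for term_weights in weights.values():
        # Ties are broken alphabetically so the selection is deterministic.
        ranked = sorted(term_weights, key=lambda t: (-term_weights[t], t))
        selected.update(ranked[:k])
    return sorted(selected)


def vectorize(tokens, vocabulary):
    """Represent a complaint as raw counts over the selected terms."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(tokens, train_vectors, vocabulary, k=3):
    """Majority vote among the k training complaints most similar to the query."""
    query = vectorize(tokens, vocabulary)
    ranked = sorted(train_vectors, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


weights = class_term_weights(TRAIN)
vocabulary = select_features(weights, k=2)
train_vectors = [(vectorize(tokens, vocabulary), label) for tokens, label in TRAIN]

print(vocabulary)
print(knn_predict(["wheat", "leaf", "rust", "brown"], train_vectors, vocabulary))
```

In this sketch a term such as `leaf`, which occurs in every class, receives a low weight and is not selected, while terms concentrated in a single class dominate the vocabulary. That is the general intuition behind class-based weighting; the paper's actual scheme and its experimental settings may differ.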
