Journal: Journal of computer sciences

Statistical Bayesian Learning for Automatic Arabic Text Categorization


Abstract

Problem statement: The rapid growth of online Arabic documents calls for automatic categorization using Text Categorization techniques that are commonly applied to English. The complex morphology of Arabic and its large vocabulary make these techniques difficult to apply and costly in time and effort. Approach: We investigated Bayesian learning models in order to enhance Arabic automatic text categorization (ATC). Results: Three classifiers based on Bayes' theorem were implemented: Simple Naive Bayes (NB), Multivariate Bernoulli Naive Bayes (MBNB) and Multinomial Naive Bayes (MNB). The TREC-2002 Light Stemmer was applied for Arabic stemming. For text representation, bag-of-words (BOW) and character-level 3-, 4- and 5-grams were used. To reduce the dimensionality of the feature space, we used several feature selection methods: Mutual Information (MI), CHI-Square statistic (CHI), Odds Ratio (OR) and GSS-coefficient (GSS). Conclusion: The MBNB classifier outperforms both the NB and MNB classifiers. The BOW representation leads to the best classification performance; nevertheless, character-level n-grams lead to satisfactory results with Bayesian learning for Arabic ATC.
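The pipeline named in the abstract (character-level n-grams, CHI-square feature selection, a multivariate Bernoulli Naive Bayes classifier) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the four training sentences are invented toy data, no stemming is applied (the TREC-2002 Light Stemmer is not reproduced), and scoring a feature by its maximum CHI value over the classes and using Laplace smoothing are assumptions made here for illustration. The names `char_ngrams`, `chi_square`, `select_features` and `BernoulliNB` are hypothetical helpers.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the set of character-level n-grams of text (whitespace collapsed)."""
    s = " ".join(text.split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def chi_square(docs, labels, feature, cls):
    """CHI statistic of the 2x2 table (feature present/absent) x (class/other)."""
    a = b = c = d = 0
    for feats, y in zip(docs, labels):
        if feature in feats:
            if y == cls:
                a += 1
            else:
                b += 1
        elif y == cls:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, k):
    """Keep the k features with the highest CHI score (max over classes: an assumption)."""
    vocab = set().union(*docs)
    classes = set(labels)
    score = {f: max(chi_square(docs, labels, f, c) for c in classes) for f in vocab}
    return set(sorted(vocab, key=lambda f: (-score[f], f))[:k])


class BernoulliNB:
    """Multivariate Bernoulli Naive Bayes over binary (present/absent) features."""

    def fit(self, docs, labels, vocab):
        self.vocab = sorted(vocab)
        self.classes = sorted(set(labels))
        n_c = Counter(labels)
        doc_freq = defaultdict(Counter)  # class -> feature -> number of docs containing it
        for feats, y in zip(docs, labels):
            doc_freq[y].update(feats & vocab)
        self.log_prior = {c: math.log(n_c[c] / len(labels)) for c in self.classes}
        # Laplace-smoothed P(feature present | class); smoothing is an assumption.
        self.p = {c: {f: (doc_freq[c][f] + 1) / (n_c[c] + 2) for f in self.vocab}
                  for c in self.classes}
        return self

    def predict(self, feats):
        def log_score(c):
            total = self.log_prior[c]
            for f in self.vocab:
                p = self.p[c][f]
                # Absent features also contribute, which distinguishes MBNB from MNB.
                total += math.log(p if f in feats else 1.0 - p)
            return total
        return max(self.classes, key=log_score)


# Invented toy corpus (not from the paper): two sports and two economy sentences.
train_texts = [
    "فاز الفريق في المباراة",
    "سجل اللاعب هدفا في المباراة",
    "ارتفعت أسعار النفط في السوق",
    "انخفضت أسعار الأسهم في السوق",
]
train_labels = ["sports", "sports", "economy", "economy"]

train_docs = [char_ngrams(t, 3) for t in train_texts]
selected = select_features(train_docs, train_labels, k=20)
model = BernoulliNB().fit(train_docs, train_labels, selected)

print(model.predict(char_ngrams("المباراة القادمة", 3)))
print(model.predict(char_ngrams("أسعار السوق", 3)))
```

A multinomial variant (MNB) would instead count how often each n-gram occurs and ignore absent features; swapping `char_ngrams` for a whitespace tokenizer would give the BOW representation.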
