Tibetan Text Classification Method Based on BiLSTM Model

Abstract

Text classification is a key technology in information retrieval and data mining: it reduces information clutter and helps locate relevant information. This paper proposes a Tibetan text representation that merges Word2vec with a TF-IDF weighting based on class frequency variance, and uses a BiLSTM network model on top of that representation to classify Tibetan text. First, the Tibetan classification corpus is pre-processed: the text is segmented into words, a basic stop word list is constructed, and word frequencies are calculated. Then each text is represented by merging Word2vec with the TF-IDF algorithm based on class frequency variance, which takes into account both the importance of words and their distribution across classes. Finally, the word vectors are passed to the classification model to train the Tibetan text classifier, and the trained classifier is used to classify unclassified Tibetan text. The experimental results show that the text representation combining Word2vec with TF-IDF based on class frequency variance effectively improves classification performance. The accuracy of the BiLSTM-based Tibetan text classifier reaches 89.03%, which is significantly better than RNN and LSTM.
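The abstract does not give the exact weighting formula, so the following is only a minimal sketch of one plausible reading: each token's embedding is scaled by its TF-IDF weight, which is in turn boosted by the variance of the token's relative frequency across classes. The function names, the `1 + variance` factor, the smoothed IDF form, the placeholder tokens, and the 2-dimensional toy vectors (standing in for trained Word2vec embeddings) are all assumptions made for illustration, not details taken from the paper.

```python
import math
from collections import Counter


def class_frequency_variance(term, docs_by_class):
    """Variance of a term's relative frequency across classes (assumed definition)."""
    freqs = []
    for docs in docs_by_class.values():
        total = sum(len(doc) for doc in docs)
        count = sum(doc.count(term) for doc in docs)
        freqs.append(count / total if total else 0.0)
    mean = sum(freqs) / len(freqs)
    return sum((f - mean) ** 2 for f in freqs) / len(freqs)


def smoothed_idf(term, all_docs):
    """Smoothed inverse document frequency; always positive."""
    df = sum(1 for doc in all_docs if term in doc)
    return math.log((1 + len(all_docs)) / (1 + df)) + 1.0


def weighted_vectors(doc, docs_by_class, embeddings):
    """Scale each token's embedding by TF * IDF * (1 + class frequency variance)."""
    all_docs = [d for docs in docs_by_class.values() for d in docs]
    tf = Counter(doc)
    sequence = []
    for token in doc:
        if token not in embeddings:
            continue  # skip out-of-vocabulary tokens
        weight = (tf[token] / len(doc)) * smoothed_idf(token, all_docs)
        # Assumed combination: terms spread unevenly across classes get a boost.
        weight *= 1.0 + class_frequency_variance(token, docs_by_class)
        sequence.append([weight * x for x in embeddings[token]])
    return sequence


# Toy, already-segmented corpus with placeholder tokens (not real Tibetan data).
docs_by_class = {
    "class_a": [["w1", "w2", "w0"], ["w2", "w0"]],
    "class_b": [["w3", "w0"], ["w4", "w3", "w0"]],
}
# Placeholder 2-dimensional vectors standing in for trained Word2vec embeddings.
embeddings = {"w0": [1.0, 0.0], "w2": [0.0, 1.0], "w3": [0.5, 0.5]}

sequence = weighted_vectors(["w2", "w0", "w9"], docs_by_class, embeddings)
```

In this toy corpus, `w0` occurs equally often in both classes, so its class frequency variance is zero and it receives no boost, while `w2` occurs only in `class_a` and is weighted up. In a pipeline like the one the abstract describes, a sequence of weighted vectors such as `sequence` would then be the input to the BiLSTM classifier.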
