Journal: 计算机技术与发展 (Computer Technology and Development)

A Patent Document Classification Method Based on a Word-Weighted LDA Model

     

Abstract

Traditional topic models, when used for text classification, mostly select statistically high-frequency words as features. In patent documents, however, most domain-specific terms are drowned out by high-frequency words, so topic models classify patent documents with low accuracy. To address this, we propose a supervised, word-weighted LDA topic model for patent document classification. Starting from the co-occurrence relationship between domain-specific terms and high-frequency words, the KeyGraph algorithm selects keywords with stronger feature-representation ability, and a mutual information function then computes the weight of each keyword to build a dictionary of domain-specific terms. On this basis, a supervised LDA model is built, word weighting is extended to the LDA model, and Gibbs sampling is used for parameter estimation. In classification experiments on patent documents, the model improves average classification accuracy by 4.62%, 3.74%, and 3.26% over the LDA model and its two variants, respectively. This indicates that the highly discriminative domain-specific terms selected by the model are more strongly associated with the topics, and that both classification efficiency and accuracy improve markedly.
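The keyword-weighting step can be illustrated with a minimal sketch. The abstract does not give the exact mutual information formula, so the code below assumes a document-level pointwise mutual information, PMI(w, c) = log(P(w, c) / (P(w) P(c))), and takes the maximum over classes as the keyword's weight. The function name `mutual_information_weights`, its parameters, and the toy data are hypothetical, and the keyword list stands in for the output of KeyGraph, which is not implemented here.

```python
import math
from collections import Counter


def mutual_information_weights(docs, labels, keywords):
    """Weight each keyword by its highest PMI with any class (an assumed scheme).

    docs: list of token lists; labels: one class label per document;
    keywords: candidate terms, e.g. produced by a keyword extractor.
    """
    n_docs = len(docs)
    class_count = Counter(labels)   # number of documents per class
    word_count = Counter()          # number of documents containing the word
    joint_count = Counter()         # documents of a class containing the word
    keyword_set = set(keywords)

    for tokens, label in zip(docs, labels):
        for word in set(tokens) & keyword_set:
            word_count[word] += 1
            joint_count[(word, label)] += 1

    weights = {}
    for word in keywords:
        if word_count[word] == 0:
            continue  # keyword never observed, so no weight is assigned
        best = 0.0    # negative associations are clipped to zero
        for label, n_c in class_count.items():
            n_wc = joint_count[(word, label)]
            if n_wc == 0:
                continue
            # PMI(w, c) = log(P(w, c) / (P(w) * P(c))) from document frequencies
            pmi = math.log((n_wc * n_docs) / (word_count[word] * n_c))
            best = max(best, pmi)
        weights[word] = best
    return weights


# Toy corpus (hypothetical): two classes, two documents each.
docs = [
    ["laser", "device"],
    ["laser", "method"],
    ["polymer", "device"],
    ["polymer", "method"],
]
labels = ["optics", "optics", "chemistry", "chemistry"]
weights = mutual_information_weights(docs, labels, ["laser", "polymer", "device"])
print(weights)
```

In this toy corpus, "laser" and "polymer" each appear in only one class and receive a weight of log 2, while "device" appears equally in both classes and receives 0. In a word-weighted LDA, weights of this kind would typically scale each word's contribution to the topic counts during Gibbs sampling; how the paper itself does this is not stated in the abstract.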
