Chinese word segmentation is fundamental to Chinese information processing. The HMM segmentation algorithm based on bigram (2-gram) statistics performs well and is widely used, but it tends to wrongly split words that contain common prepositions or adverbs. The improved algorithm borrows the idea of reverse directional maximum matching (RDM): when computing the weights of the rough segmentation set, it takes into account the favorable effect of word length and word order on correct segmentation. The algorithm first computes the weights of the directed edges in the bigram rough segmentation model, then revises those weights according to word length, and finally obtains the segmentation result with the shortest path method. Experimental results show that the improved algorithm effectively resolves the over-splitting problem, lowers the error rate, and performs better than the original algorithm.
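The three steps named in the abstract (bigram edge weights, a word-length revision, and a shortest-path search) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the revision formula, so dividing each edge cost by `len(word) ** alpha` is an assumption, as are the interpolation smoothing, the parameter names `alpha` and `lam`, and the toy counts in `UNIGRAM` and `BIGRAM`. The word-order (reverse matching) component is not modeled here.

```python
import math

# Hypothetical toy counts, invented for illustration only (not from the paper).
UNIGRAM = {"<s>": 100, "门": 10, "把": 50, "手": 20, "把手": 2}
BIGRAM = {("<s>", "门"): 5, ("门", "把"): 6, ("把", "手"): 20, ("门", "把手"): 1}


def bigram_cost(prev, word, unigram, bigram, lam=0.9):
    """Edge weight: -log of an interpolated estimate of P(word | prev)."""
    total = sum(unigram.values())
    p_uni = (unigram.get(word, 0) + 1) / (total + len(unigram))  # add-one unigram
    prev_count = unigram.get(prev, 0)
    p_bi = bigram.get((prev, word), 0) / prev_count if prev_count else 0.0
    return -math.log(lam * p_bi + (1 - lam) * p_uni)


def segment(sentence, unigram, bigram, alpha=1.0, max_len=4):
    """Shortest path over the word graph; alpha=0 disables the length revision."""
    n = len(sentence)
    if n == 0:
        return []
    # Candidate words: lexicon entries found in the sentence, plus every single
    # character so that a path always exists.
    spans = [(i, j)
             for i in range(n)
             for j in range(i + 1, min(n, i + max_len) + 1)
             if j == i + 1 or sentence[i:j] in unigram]
    best = {}  # span -> (cost of best path ending with this word, previous span)
    for i, j in spans:  # spans come in increasing start order, so predecessors are ready
        word = sentence[i:j]
        scale = len(word) ** alpha  # assumed revision: longer words get cheaper edges
        if i == 0:
            best[(i, j)] = (bigram_cost("<s>", word, unigram, bigram) / scale, None)
            continue
        candidates = [
            (cost + bigram_cost(sentence[k:e], word, unigram, bigram) / scale, (k, e))
            for (k, e), (cost, _) in best.items()
            if e == i
        ]
        best[(i, j)] = min(candidates)
    # Pick the cheapest path that reaches the end of the sentence, then backtrack.
    span = min((s for s in best if s[1] == n), key=lambda s: best[s][0])
    words = []
    while span is not None:
        words.append(sentence[span[0]:span[1]])
        span = best[span][1]
    return words[::-1]


print(segment("门把手", UNIGRAM, BIGRAM, alpha=0.0))  # plain bigram weights
print(segment("门把手", UNIGRAM, BIGRAM, alpha=1.0))  # length-revised weights
```

With these toy counts, the unrevised weights split 把手 ("handle") into the frequent preposition 把 plus 手, while the length-revised weights keep it as one word. Because the word graph is a DAG ordered by character position, a single left-to-right pass is enough to find the shortest path.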