Journal: Expert Systems with Applications

Automatic Extraction Of New Words Based On Google News Corpora For Supporting Lexicon-based Chinese Word Segmentation Systems

Abstract

Chinese word segmentation is an essential step in Chinese natural language processing because it underpins Chinese text mining and information retrieval. Currently, the lexicon-based Chinese word segmentation scheme is widely adopted; it can correctly segment sentences from Chinese texts into distinct words in real-world applications. However, the word identification ability of the lexicon-based scheme depends heavily on a well-prepared lexicon with enough lexical entries to cover all Chinese words. In particular, this scheme segments poorly on texts that change rapidly over time, such as newspaper articles and web documents, because such documents often contain many new words that a lexicon-based Chinese word segmentation system with a fixed lexicon cannot identify. Moreover, maintaining a lexicon manually is inefficient and time-consuming. Therefore, this study proposes a novel statistics-based scheme for extracting new words from categorized Google News corpora, retrieved automatically from the Google News site, to improve the word identification ability of lexicon-based Chinese word segmentation systems. Since news corpora contain almost all words used in daily life, extracting new words from news corpora and incrementally adding them to the lexicon of a lexicon-based Chinese word segmentation system helps to construct a professional lexicon automatically and to enhance word identification capability. Experimental results indicated that, compared with another proposed new word extraction scheme, the proposed scheme not only retrieves new words from the categorized Google News corpora more correctly but also obtains a larger number of new words.
Moreover, the proposed new word extraction scheme has been applied to automatically expand the lexicon of the Chinese word segmentation system ECScanner (A Chinese Lexicon Scanner with Lexicon Extension). ECScanner is currently published on the Web and provides Chinese word segmentation as a Web service. Experimental results also confirmed that ECScanner is superior to CKIP (Chinese Knowledge Information Processing) in identifying meaningful Chinese words.
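
The abstract does not specify the segmentation algorithm or the statistic used for extraction, so the following is only a minimal sketch of the general idea under stated assumptions. It uses forward maximum matching as a stand-in lexicon-based segmenter and a plain frequency threshold as a stand-in statistical test. The names fmm_segment, candidate_new_words, max_len, and min_count, as well as the toy lexicon and sentences, are hypothetical and do not come from the paper or from ECScanner.

```python
from collections import Counter


def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no lexicon entry matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def candidate_new_words(sentences, lexicon, min_count=2):
    """Collect runs of adjacent out-of-lexicon characters seen at least min_count times.

    Assumption: a raw frequency threshold stands in for the paper's statistic.
    """
    counts = Counter()
    for sentence in sentences:
        run = ""
        # The trailing empty token flushes a run that ends the sentence.
        for token in fmm_segment(sentence, lexicon) + [""]:
            if len(token) == 1 and token not in lexicon:
                run += token
            else:
                if len(run) >= 2:
                    counts[run] += 1
                run = ""
    return {word for word, count in counts.items() if count >= min_count}


# Toy data (hypothetical): the lexicon lacks the word "微博".
lexicon = {"我", "喜欢", "使用", "他", "也"}
corpus = ["我喜欢使用微博", "他也使用微博"]

before = fmm_segment(corpus[0], lexicon)        # unknown word splits into characters
new_words = candidate_new_words(corpus, lexicon)
after = fmm_segment(corpus[0], lexicon | new_words)  # extended lexicon keeps it whole
```

With a fixed lexicon the unknown word is broken into single characters. Once the extracted candidate is added to the lexicon, the same segmenter returns it as one token, which is the incremental lexicon extension the abstract describes.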