Journal: IEICE Transactions on Information and Systems

Creating Chinese-English Comparable Corpora

Abstract

Comparable corpora are valuable resources for many NLP applications, and mining information from them has been studied extensively in recent years. Because few large-scale comparable corpora are publicly available, this paper presents a bi-directional method based on cross-language information retrieval (CLIR) for creating comparable corpora from two independent news collections in different languages. The original Chinese and English document collections are each crawled from XinHuaNet and formatted in a consistent manner. For each document in either collection, the best query keywords are extracted to represent its essential content and are then translated into the language of the other collection. The translated queries are run against the other collection, whose language they now share, to retrieve candidate documents, and the candidates are aligned with the source document based on their publication dates and similarity scores. Results show that our approach significantly outperforms previous approaches to the construction of Chinese-English comparable corpora.
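
The pipeline described in the abstract (keyword extraction, query translation, retrieval, and alignment by date and similarity, run in both directions) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: TF-IDF stands in for the keyword extractor, a toy word-for-word lexicon stands in for query translation, term overlap stands in for the retrieval score, and keeping only the pairs found in both directions is just one plausible way to combine the two passes. The names `Doc`, `top_keywords`, `translate`, `similarity`, `retrieve`, and `align`, the thresholds, and the sample documents are all hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Doc:
    doc_id: str
    published: date
    tokens: tuple  # pre-tokenized text (tokenization is out of scope here)


def top_keywords(doc, collection, k=3):
    """Return the k highest TF-IDF terms of doc (an assumed stand-in for
    the paper's keyword extraction, which the abstract does not specify)."""
    n = len(collection)
    tf = Counter(doc.tokens)

    def tfidf(term):
        df = sum(1 for d in collection if term in d.tokens)
        return tf[term] * (math.log((1 + n) / (1 + df)) + 1.0)

    # Sort by descending score, then alphabetically for determinism.
    return sorted(tf, key=lambda t: (-tfidf(t), t))[:k]


def translate(keywords, lexicon):
    """Word-by-word lookup in a toy lexicon; unknown words are dropped."""
    return [lexicon[w] for w in keywords if w in lexicon]


def similarity(query, doc):
    """Fraction of distinct query terms that occur in the document."""
    terms = set(query)
    return len(terms & set(doc.tokens)) / len(terms) if terms else 0.0


def retrieve(sources, targets, lexicon, max_days=1, min_sim=0.5):
    """For each source document, pick the best-scoring target document
    published within max_days; return a set of (source_id, target_id)."""
    pairs = set()
    for src in sources:
        query = translate(top_keywords(src, sources), lexicon)
        best_id, best_sim = None, min_sim
        for tgt in targets:
            if abs((src.published - tgt.published).days) > max_days:
                continue  # publication dates too far apart
            sim = similarity(query, tgt)
            if sim >= best_sim:
                best_id, best_sim = tgt.doc_id, sim
        if best_id is not None:
            pairs.add((src.doc_id, best_id))
    return pairs


def align(zh_docs, en_docs, zh2en, en2zh):
    """Keep only pairs retrieved in both directions (an assumed rule)."""
    forward = retrieve(zh_docs, en_docs, zh2en)
    backward = {(zh, en) for en, zh in retrieve(en_docs, zh_docs, en2zh)}
    return forward & backward


# Toy data, invented for this sketch.
ZH_DOCS = [
    Doc("zh1", date(2010, 5, 1), ("世博会", "上海", "开幕", "上海")),
    Doc("zh2", date(2010, 5, 2), ("地震", "救援", "青海")),
]
EN_DOCS = [
    Doc("en1", date(2010, 5, 1), ("expo", "shanghai", "opens", "shanghai")),
    Doc("en2", date(2010, 5, 3), ("earthquake", "rescue", "qinghai")),
    Doc("en3", date(2010, 5, 1), ("stocks", "fall")),
]
ZH2EN = {"世博会": "expo", "上海": "shanghai", "开幕": "opens",
         "地震": "earthquake", "救援": "rescue", "青海": "qinghai"}
EN2ZH = {en: zh for zh, en in ZH2EN.items()}

pairs = align(ZH_DOCS, EN_DOCS, ZH2EN, EN2ZH)
print(sorted(pairs))
```

In this sketch the date window prunes candidates before any scoring happens, and the similarity threshold leaves a document unaligned when nothing in the other collection matches it, as with the hypothetical `en3`. A realistic system would replace the overlap score with a proper retrieval model and the lexicon with a translation resource.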
