Document Multiplicity Elimination and Corpora Management

Abstract

This paper deals with the creation, storage, and retrieval of corpora (large text collections). When building a new corpus, it is advantageous to include WWW sources that are easily accessible on the Internet. This is especially true for less widely used languages, such as Czech. However, such an approach leads to relatively high document multiplicity. The first part of the paper presents a method for eliminating document multiplicity. The second part deals with corpora management tools, considers their strengths, and outlines possible directions for the future development of these systems.

Bibliographic record

Source: World multiconference on systems, cybernetics and informatics | 1999 | 5 pages
Authors: Pavel Rychly; Pavel Smrz; Pavel Filipensky
Original format: PDF
Chinese Library Classification: Information resources and their management

Similar documents

1. Text-Mining, Structured Queries, and Knowledge Management on Web Document Corpora [J]. Hamid Mousavi, Maurizio Atzori, Shi Gao, et al. SIGMOD Record. 2014, No. 3
2. Finding the Minimum Document Length for Reliable Clustering of Multi-Document Natural Language Corpora [J]. Hermann Moisl. Journal of Quantitative Linguistics. 2011, No. 1
3. A scaleable document clustering approach for large document corpora [J]. Niall Rooney, David Patterson, Mykola Galushka, et al. Information Processing & Management. 2006, No. 5
4. Document Multiplicity Elimination and Corpora Management [C]. Pavel Rychly, Pavel Smrz, Pavel Filipensky. World multiconference on systems, cybernetics and informatics. 1999
5. Edits Based Categorization of Crowd Sourced Document Corpora with Application to Wikipedia [D]. Fang, Yue. 2018
6. FacetGist: Collective Extraction of Document Facets in Large Technical Corpora [O]. Tarique Siddiqui, Xiang Ren, Aditya Parameswaran, et al.
7. Text-Mining, Structured Queries, and Knowledge Management on Web Document Corpora [O]. Hamid Mousavi, Maurizio Atzori, Shi Gao, et al. 2015
