This paper deals with the creation, storage, and retrieval of corpora (large text collections). When building a new corpus, it is advantageous to include WWW sources that are easily accessible on the Internet. This is especially true for less widely used languages, such as Czech. However, this approach results in relatively high document multiplicity. The first part of the paper presents a method for eliminating document multiplicity. The second part deals with corpus management tools, considers their strengths, and outlines possible directions for the future development of these systems.