Initially proposed as a data interchange format, XML aims also at becoming a format for data storage and management. However, XML documents in their textual form are rather verbose and tend to predate disk space, due to the textual and repetitive nature of the XML tags and of several XML types. One solution to this space occupancy problem consists of compressing XML. The XMill project proposed an XML-specific compression method: it compresses the structure (XML tags) separately from the content (data nodes, leaves of the XML tree), which is squeezed into a set of semantically uniform containers: for example, one container stores the text values of all elements in the document, another container stores all honeNo> etc. Each container is again separately compressed, by using the best suited compression algorithm; thus, XMill makes maximal use of inherent structure commonalities among semantically similar items. However, an XMill-compressed document is opaque to a query processor: thus, one must fully decompress a full chunk of data before being able to query it. The XGrind project [9] pioneered the field of query processing on compressed XML documents. XGrind does not separate data from structure: an XGrind-compressed XML document is still an XML document, whose tags have been dictionary-encoded, and whose data nodes have been compressed using the Huffmann algorithm [6] and left at their place in the document. XGrind's query processor can be considered an extended SAX parser, which can handle exact-match and prefix-match queries on compressed values and partial-match and range queries on decompressed values. However, XGrind does not support several operations in the compressed domain such as non-equality selections, joins, aggregations, nested queries or (construct) operations. Such operations occur in many XML query scenarios, as illustrated by XML benchmarks (e.g., all but the first two of the 20 queries in XMark [8]).
展开▼