This paper proposes a novel approach to integrating heterogeneous XML DTDs. With this approach, an information agent can be easily extended to integrate heterogeneous XML-based content and perform federated searches. Using a tree grammar inference technique, the approach derives an integrated view and source descriptions of XML DTDs within an information integration framework. The derivation exploits naming and structural similarities among DTDs from similar domains. The complete approach consists of three main steps. (1) DTD clustering groups DTDs from similar domains into classes. (2) Schema learning takes the DTDs in a class as input and applies a tree grammar inference technique to generate a set of tree grammar rules. (3) Minimization optimizes the generated rules and transforms them into an integrated view as well as source descriptions. We have implemented the proposed approach in a system called DEEP and tested it in artificial and real domains. Experimental results show that DEEP can effectively and efficiently integrate radically different DTDs.
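To make step (1) concrete, the following is a minimal illustrative sketch of clustering DTDs by naming similarity alone. It is not the DEEP implementation: the Jaccard measure over declared element names, the greedy assignment rule, the 0.5 threshold, and the function names `element_names`, `jaccard`, and `cluster_dtds` are all assumptions made for illustration, and structural similarity is ignored entirely.

```python
import re

# Matches the element name in declarations such as <!ELEMENT book (title, author)>.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+([\w.:-]+)")


def element_names(dtd_text):
    """Return the lowercase set of element names declared in a DTD string."""
    return {name.lower() for name in ELEMENT_DECL.findall(dtd_text)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_dtds(dtds, threshold=0.5):
    """Greedily group DTDs whose element-name sets are similar.

    `dtds` maps a label to DTD text. Each DTD joins the first existing
    cluster containing a member with similarity >= threshold (an assumed
    rule, not the paper's); otherwise it starts a new cluster.
    """
    names = {label: element_names(text) for label, text in dtds.items()}
    clusters = []
    for label in dtds:
        for cluster in clusters:
            if any(jaccard(names[label], names[m]) >= threshold for m in cluster):
                cluster.append(label)
                break
        else:
            clusters.append([label])
    return clusters
```

Under these assumptions, two bookstore DTDs that share most element names would fall into one class, while a movie DTD that shares only `title` with them would form its own. A realistic clustering step would also need to handle synonyms and compare content models, which this sketch does not attempt.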