Document Server@UHasselt >
Research >
Research publications >

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/7741

Title: Inferring XML Schema Definitions from XML Data
Authors: BEX, Geert Jan
NEVEN, Frank
Issue Date: 2007
Publisher: VLDB Endowment
Citation: Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007. p. 998-1009.
Abstract: Although the presence of a schema enables many optimizations for operations on XML documents, recent studies have shown that many XML documents in practice either do not refer to a schema, or refer to a syntactically incorrect one. It is therefore of utmost importance to provide tools and techniques that can automatically generate schemas from sets of sample documents. While previous work in this area has mostly focused on the inference of Document Type Definitions (DTDs for short), we will consider the inference of XML Schema Definitions (XSDs for short) -- the increasingly popular schema formalism that is turning DTDs obsolete. In contrast to DTDs where the content model of an element depends only on the element's name, the content model in an XSD can also depend on the context in which the element is used. Hence, while the inference of DTDs basically reduces to the inference of regular expressions from sets of sample strings, the inference of XSDs also entails identifying from a corpus of sample documents the contexts in which elements bear different content models. Since a seminal result by Gold implies that no inference algorithm can learn the complete class of XSDs from positive examples only, we focus on a class of XSDs that captures most XSDs occurring in practice. For this class, we provide a theoretically complete algorithm that always infers the correct XSD when a sufficiently large corpus of XML documents is available. In addition, we present a variant of this algorithm that works well on real-world (and therefore incomplete) data sets.
URI: http://hdl.handle.net/1942/7741
Link to publication: http://www.vldb.org/conf/2007/papers/research/p998-bex.pdf
ISBN: 978-1-59593-649-3
Category: C1
Type: Proceedings Paper
Appears in Collections: Research publications

Files in This Item:

Description SizeFormat
Preprint273.56 kBAdobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.