
Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/1416

Title: Inference of Concise DTDs from XML Data
Authors: BEX, Geert Jan; NEVEN, Frank; Schwentick, T
Issue Date: 2006
Publisher: ACM Press
Citation: Dayal, Umeshwar & Whang, Kyu-Young & Lomet, David B. (Eds.) Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06), p. 115-126.
Abstract: We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML documents, a problem that essentially reduces to learning concise regular expressions from positive example strings. We identify two such classes of expressions: single occurrence regular expressions (SOREs) and chain regular expressions (CHAREs). Both classes capture the vast majority of the regular expressions occurring in practical DTDs and are succinct by definition. We present the algorithm iDTD (infer DTD), which learns SOREs from strings by first inferring an automaton using known techniques and then translating that automaton into a corresponding SORE, repairing the automaton where necessary when no equivalent SORE can be found. In the process, we introduce a novel technique for rewriting automata into regular expressions, which is of independent interest. We show that iDTD outperforms existing systems in accuracy, conciseness, and speed. In a scenario where only a very small amount of XML data is available, for instance when generated by Web service requests or by answers to queries, iDTD produces regular expressions that are too specific. Therefore, we introduce a novel learning algorithm, crx, that directly infers CHAREs (which form a subclass of SOREs) without going through an automaton representation. We show that crx performs very well within its target class on very small data sets. Finally, we discuss incremental computation, noise, numerical predicates, and the generation of XML Schemas.
URI: http://hdl.handle.net/1942/1416
Category: C2
Type: Proceedings Paper
Appears in Collections: Research publications
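
Two ideas from the abstract can be illustrated with a small sketch: the "single occurrence" property of a content model, and the step of inferring an automaton from positive example strings. This is not the authors' iDTD implementation. The function names are hypothetical, and the 2-gram construction is only assumed here as one example of a "known technique", since the abstract does not say which one is used.

```python
import re
from collections import Counter

# Hypothetical tokenizer: element names inside a DTD-style content model.
NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def is_single_occurrence(expr):
    """Return True if no element name occurs more than once in expr."""
    counts = Counter(NAME.findall(expr))
    return all(n == 1 for n in counts.values())


def two_gram_automaton(samples):
    """Summarize example strings as start symbols, end symbols, and edges.

    Each sample is a sequence of element names. An edge (a, b) records
    that b directly follows a in at least one sample. Empty samples are
    skipped in this sketch (an assumption made for brevity).
    """
    starts, ends, edges = set(), set(), set()
    for seq in samples:
        if not seq:
            continue
        starts.add(seq[0])
        ends.add(seq[-1])
        edges.update(zip(seq, seq[1:]))
    return starts, ends, edges


# Each name appears once, so this content model is single occurrence.
print(is_single_occurrence("title, author+, (year | date)?"))
# The name "a" appears twice, so this one is not.
print(is_single_occurrence("a, (a | b)*"))

starts, ends, edges = two_gram_automaton([["a", "b", "c"], ["a", "c"]])
print(sorted(starts), sorted(ends), sorted(edges))
```

Because each element name labels at most one state in such an automaton, it stays small even for many example strings. Turning it into an equivalent SORE, and repairing it when none exists, is the part the paper contributes and is not attempted here.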

Files in This Item:

Description: Postprint
Size: 635.38 kB
Format: Adobe PDF
