Document Server@UHasselt >
Research >
Research publications >

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/11342

Title: Inference of Concise Regular Expressions and DTDs
Authors: BEX, Geert Jan
NEVEN, Frank
Schwentick, Thomas
Issue Date: 2010
Abstract: We consider the problem of inferring a concise Document Type Definition (DTD) for a given set of XML-documents, a problem that basically reduces to learning concise regular expressions from positive examples strings. We identify two classes of concise regular expressions-the single occurrence regular expressions (SOREs) and the chain regular expressions (CHAREs)-that capture the far majority of expressions used in practical DTDs. For the inference of SOREs we present several algorithms that first infer an automaton for a given set of example strings and then translate that automaton to a corresponding SORE, possibly repairing the automaton when no equivalent SORE can be found. In the process, we introduce a novel automaton to regular expression rewrite technique which is of independent interest. When only a very small amount of XML data is available, however ( for instance when the data is generated by Web service requests or by answers to queries), these algorithms produce regular expressions that are too specific. Therefore, we introduce a novel learning algorithm CRX that directly infers CHAREs ( which form a subclass of SOREs) without going through an automaton representation. We show that CRX performs very well within its target class on very small datasets.
Notes: [Bex, Geert Jan; Neven, Frank] Hasselt Univ, Database & Theoret Comp Sci Res Grp, B-3590 Diepenbeek, Belgium. [Bex, Geert Jan; Neven, Frank] Transnatl Univ Limburg, B-3590 Diepenbeek, Belgium. [Schwentick, Thomas] TU Dortmund, Fak Informat, D-44227 Dortmund, Germany. [Vansummeren, Stijn] Univ Libre Bruxelles, Res Lab Web & Informat Technol WIT, B-1050 Brussels, Belgium. geertjan.bex@uhasselt.be; frank.neven@uhasselt.be; thomas.schwentick@udo.edu; stijn.vansummeren@ulb.ac.be
URI: http://hdl.handle.net/1942/11342
DOI: 10.1145/1735886.1735890
ISI #: 000277925600004
ISSN: 0362-5915
Category: A1
Type: Journal Contribution
Validation: ecoom, 2011
Appears in Collections: Research publications

Files in This Item:

There are no files associated with this item.

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.