www.uhasselt.be
DSpace

Document Server@UHasselt >
Research >
Research publications >

Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/8888

Title: Design and analysis of query languages for structured documents. A formal and logical approach
Authors: Neven, Frank
Advisors: Van den Bussche, Jan
Issue Date: 1999
Publisher: UHasselt Diepenbeek
Abstract: This dissertation introduces query languages for structured documents and investigates their expressiveness and optimization. It is natural to model structured documents, like for example XML documents, by labeled trees where the children of a node are ordered. We therefore revisit some of the established formal language theory formalisms for computations on trees but approach them from a query language perspective. We start with a study of the classical formalism of attribute grammars to query documents modeled as derivation trees of context-free grammars. By restricting the attributes to Booleans and relations, and the semantic rules to propositional logic and first-order logic formulas, respectively, we obtain powerful query languages well suited for expressing unary and relational queries. Further restrictions as well as generalizations lead to a complete picture of the expressiveness of such languages. Interestingly, we show that the above formalisms can be readily implemented on top of a deductive database by exhibiting a translation into datalog with negation. In the rest of the dissertation, we focus on formalisms expressing unary, also called selection, queries. These are important as, on the one hand, they constitute the most simple and common form of document querying, and, on the other hand, they form the basis of more general query languages transforming documents into other documents. Specifically, we introduce extensions of attribute grammars (extended AGs) to query documents modeled by extended context-free grammars which correspond more closely to the actual XML document type definitions (DTDs). A fundamental difference is that now derivation trees are no longer ranked, which means that nodes need not have a fixed maximal number of children. This seemingly innocent difference greatly complicates the definition of attribute grammars. We give full account of the expressiveness of the query languages based on this formalism and obtain the exact complexity of various optimization questions. Next, we abandon attribute grammars and turn to another well-studied computation model for trees: the tree automaton. In particular, we want to understand how such automata, on both ranked and unranked trees, can be used to express unary structured document queries. Concretely, we define a query automaton (QA) as a two-way deterministic finite automaton over trees that can select nodes depending on the state and the label at those nodes. We study the expressiveness of QAs and investigate the exact complexity of various optimization questions. Finally, we apply the techniques and results obtained in this work. We drastically improve the upper bound on the complexity of the equivalence test of Region Algebra expressions from iterated exponential to EXPTIME by essentially translating the latter into equivalent extended AGs. By employing the techniques used to obtain our expressiveness results, we establish the expressiveness as a pattern language of the actual XML transformation language XSLT. Further, we obtain that our languages are more expressive than most current query languages for structured documents and semi-structured data. As argued by Suciu [Suc98], dealing with the inherent order of children of nodes in documents is a major research issue in the design of query languages for semistructured data and XML. An important contribution of this work is that all proposed query languages can take this ordering into account.
URI: http://hdl.handle.net/1942/8888
Type: Theses and Dissertations
Appears in Collections: PhD theses
Research publications

Files in This Item:

Description SizeFormat
N/A14.7 MBAdobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.