Developer XML
Updating XQuery
By Jason Hunter
XQuery has some new possibilities, from atomization to trace to file structures.
In "X Is for XQuery" (Oracle Magazine, May/June 2003), I introduced XQuery, a technology under development by the World Wide Web Consortium (W3C) intended for querying and manipulating XML data
or anything that can have an XML facade, such as a relational database. That introductory article covered the XQuery draft specification published
in November 2002. In May 2003,
the W3C issued a new XQuery draft
specification, and this follow-up article explains the most interesting changes and new features added to the May draft specification, including: library modules; prolog variables; external functions; and new functions for debugging, error handling, and formatting.
Changes
The May draft added numerous new
features, but first I'll discuss the
changes made to existing features.
Some changes were cosmetic. For example, the document() input function (the function that returns a document at a given Uniform Resource Identifier [URI]) has the new shorter name doc(). Also, comments that once were written {-- comment --} have a new smiling syntax (: comment :). Yes, now every comment's a joke.
Some changes were more fundamental. As perhaps the most significant change, the distinct-values() function no longer returns nodes (nodes are XML constructs, such as elements, documents, comments, and text nodes); it returns only atomic values (such as integers and strings). The function still accepts both nodes and atomic values but returns only atomics. Any nodes that get passed in are "atomized" and treated as atomics for the comparison and then returned in atomic form.
The rules for atomization are complex, but here are the basics: An element defined by a schema as a Boolean will be atomized to a true/false Boolean value. An element defined as an integer will be atomized to an integer. An element not defined by a schema
will be atomized to the XPath string value of the node (the text nodes concatenated together recursively).
To demonstrate:
distinct-values((<item>apple</item>,
    <item>banana</item>, "grape"))
returns ("apple", "banana", "grape"), assuming <item> isn't declared in a schema. In the following example,
if we assume <status> is defined by
a schema as holding a Boolean, then the following:
distinct-values((<status>0</status>,
    <status>false</status>))
returns false() because that's the atomic value of both elements. Remember that XQuery has no Boolean literals; false() is the built-in function that returns the Boolean value false.
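For readers who want to experiment without an XQuery engine, the atomize-then-deduplicate behavior described above can be modeled in Python. This is only a rough sketch: the atomize() and distinct_values() helpers and the schema_types dictionary are hypothetical stand-ins (a real engine consults actual schema information), and keeping first occurrences in input order is an assumption made for readability.

```python
import xml.etree.ElementTree as ET

def atomize(item, schema_types=None):
    """Reduce a node to an atomic value; atomic values pass through unchanged."""
    if not isinstance(item, ET.Element):
        return item
    # String value of a node: all descendant text concatenated in document order.
    text = "".join(item.itertext())
    # schema_types is a hypothetical stand-in for real schema information.
    if (schema_types or {}).get(item.tag) == "boolean":
        # xs:boolean accepts the lexical forms true/1 and false/0.
        return text.strip() in ("true", "1")
    return text

def distinct_values(items, schema_types=None):
    """Atomize every item, then drop duplicates, keeping first occurrences."""
    result = []
    for value in (atomize(item, schema_types) for item in items):
        if value not in result:
            result.append(value)
    return result

fruit = [ET.fromstring("<item>apple</item>"),
         ET.fromstring("<item>banana</item>"),
         "grape"]
status = [ET.fromstring("<status>0</status>"),
          ET.fromstring("<status>false</status>")]

print(distinct_values(fruit))                          # ['apple', 'banana', 'grape']
print(distinct_values(status, {"status": "boolean"}))  # [False]
```

Note that the element wrappers are gone from both results, which is exactly the effect that forces the rewrapping shown in the queries that follow.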
Now you may be thinking, "Ah, but there's a distinct-nodes() function when I want to return nodes." Yes, but it removes duplicates based only on node identity (nodes that are the exact same node, akin to reference equality in Java). No built-in function removes nodes that are merely equivalent, such as two separate <Artist> elements with the same content, and that can complicate queries.
Looking back at my earlier "X Is
for XQuery" article, you'll find several examples impacted by this change.
The following query from the article returns a unique list of artist names, each wrapped with <Artist> tags:
distinct-values(document("itunes.xml")
/itunes/Tracks/Track/Artist)
The sample output looked like this:
<Artist>Marc Cohn</Artist>
<Artist>Pink Floyd</Artist>
Executed today, the same query would return the atomized values:
Marc Cohn
Pink Floyd
Because distinct-values() will now strip the <Artist> tags as part of the atomization process, you have to add
the tags after the distinct-values() call finishes, as follows:
let $artists :=
distinct-values(doc("itunes.xml")
/itunes/Tracks/Track/Artist)
for $a in $artists
return <Artist>{ $a }</Artist>
Not every situation is so easy to solve. Look at how Example 1.1.9.4 from the W3C Use Case document
has changed between November 2002 and May 2003. This example returns
a list of books written by each author. It uses distinct-values() and looks like this against the November 2002 specification:
<results>
  {
    for $a in distinct-values(
        document("http://www.bn.com/bib.xml")//author)
    return
      <result>
        { $a }
        {
          for $b in document("http://www.bn.com/bib.xml")/bib/book
          where some $ba in $b/author
                satisfies deep-equal($ba, $a)
          return $b/title
        }
      </result>
  }
</results>
You can't directly tell by the query, but each <author> element has a
<first> and <last> name component.
The distinct-values() call returned the list of <author> elements with unique names. For the May 2003 specification, the query now has to run distinct-values() on the first and last names separately, and the nested FLWOR expression loses its reference to $a as a single unique author:
<results>
  {
    let $a := doc("http://www.bn.com/bib.xml")//author
    for $last in distinct-values($a/last),
        $first in distinct-values($a[last=$last]/first)
    return
      <result>
        { $last, $first }
        {
          for $b in doc("http://www.bn.com/bib.xml")/bib/book
          where some $ba in $b/author
                satisfies ($ba/last = $last and
                           $ba/first = $first)
          return $b/title
        }
      </result>
  }
</results>
There's no better way to do it, short of writing a user-defined distinct-deep-equal(), which isn't performant in pure XQuery. (Note: FLWOR expressions, pronounced "flower," are the building blocks of XQuery. The name comes from the For, Let, Where, Order by, and Return keywords that make up the expression.)
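The idea behind the rewritten query, treating the (last, first) pair of atomic values as the author's identity instead of the <author> element itself, can be sketched in Python. The BIB sample data and the titles_by_author() helper are invented for illustration and are not the W3C use case data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data in the same shape as the use case document.
BIB = """
<bib>
  <book><title>First Book</title>
    <author><last>Stevens</last><first>W.</first></author></book>
  <book><title>Second Book</title>
    <author><last>Stevens</last><first>W.</first></author></book>
  <book><title>Third Book</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author></book>
</bib>
"""

def titles_by_author(bib_xml):
    """Map each distinct (last, first) name pair to the titles that author wrote."""
    root = ET.fromstring(bib_xml.strip())
    result = {}
    for book in root.findall("book"):
        for author in book.findall("author"):
            # The pair of atomized names stands in for the author's identity.
            key = (author.findtext("last"), author.findtext("first"))
            result.setdefault(key, []).append(book.findtext("title"))
    return result

for (last, first), titles in titles_by_author(BIB).items():
    print(last, first, titles)
```

Two separate <author> elements for Stevens collapse into one entry because their atomized names match, which is the comparison the XQuery version has to spell out by hand.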
New Functions
The May 2003 XQuery specification draft added three new functions that promise to be tremendously helpful. First:
trace($value as item()*,
$label as xs:string) as item()*
The trace() function allows printf-style debugging in the middle of a query. The function takes two parameters: a value to print (it can be a sequence of any number of items)
and a string label to be printed along with the value. The function returns the passed in $value for convenience. The location of the trace() output depends on your engine.
The function allows you to peer into the internal operations of a query. For example, the following query returns the document URIs for all documents in the XQuery engine sorted according to name. By adding a trace() call, I am able to view each URI as it's returned before the sorting:
define function uris() as xs:string* {
for $n in input()
return trace(
xs:string(document-uri($n)), "base:")
}
for $u in uris() order by $u return $u
The output might look like this:
2003-08-01 14:40:46 base: census.xml
2003-08-01 14:40:46 base: ipo.xml
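The pass-through behavior that makes trace() convenient can be modeled in a few lines of Python. This is only an analogy under stated assumptions: the trace() and uris() helpers here are hypothetical, the document names are passed in as a plain list instead of coming from input(), and writing to standard error is an arbitrary choice, since the real output location depends on the engine.

```python
import sys

def trace(value, label):
    """Log the value with its label, then return the value unchanged."""
    print(f"{label} {value}", file=sys.stderr)
    return value

def uris(documents):
    # Each name is traced as it is produced, before any sorting happens.
    return [trace(name, "base:") for name in documents]

sorted_uris = sorted(uris(["ipo.xml", "census.xml"]))
print(sorted_uris)  # ['census.xml', 'ipo.xml']
```

Because the helper returns its argument, it can be wrapped around any expression without changing the result of the surrounding computation.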
When using trace() and other such functions, remember that in XQuery everything is an expression. There are
no statements! To allow statement-like behavior, you can use the trick of putting commas between expressions to create
a sequence. Each expression then gets evaluated independently. Any expression returning an empty value gets ignored in the final resulting sequence. For example, here I've made two trace() calls without impacting the results:
trace((), "starting query"),
let $time := current-dateTime()
let $ignored := trace($time, "Got time")
return
<html>
<head></head>
<body>Current time is { $time }</body>
</html>
Note the comma after the first trace() call. This causes the query to return a sequence of two items, the first of which will be the empty results of the trace() call and thus ignored. It's a little-known fact, but top-level queries naturally return a sequence, and in this special case surrounding parentheses aren't required. The query "5, <foo/>"
is perfectly valid. Also, in this example you see that when writing a FLWOR expression, you can execute arbitrary code in the right-hand side of a let clause and just ignore the value.
The May draft also adds an error() function:
error($srcval as item()?)
This function lets the user report an error, akin to throwing an exception. The $srcval is typed as item()?, which means it can be an XML construct or atomic, and it's marked with the question mark to show that it's optional. Here are some sample usages:
error()
error("Missing source document")
error(<span>A <i>beautifully</i>
formatted error</span>)
The error() call unwinds the stack like an exception. Unfortunately, XQuery still has no try/catch ability.
So while you can throw an error, you can't recover from one.
The last interesting new function added in May has the odd name round-half-to-even(). It has two forms:
round-half-to-even($srcval as numeric?)
as numeric?
round-half-to-even($srcval as numeric?,
$precision as xs:integer) as numeric?
In the one-argument case it behaves like the round() function, except that when a number falls exactly halfway between two others it rounds the argument to the nearest even value. Number theorists will tell you this is a more statistically accurate rounding algorithm. To demonstrate:
round-half-to-even(1.5) = 2.0
round-half-to-even(2.5) = 2.0
round-half-to-even(2.51) = 3.0
The second argument case makes the function interesting. It's a precision level and allows the function to be used to format decimal values. For example:
round-half-to-even(3.567, 2) = 3.57
round-half-to-even(1113.567, -2) = 1100.0
round-half-to-even(1 div 9, 3) = 0.111
In case you're wondering, the numeric datatype used in the declaration is a shortcut for xs:decimal, xs:integer, xs:float, xs:double, and any types derived by restriction from these. It's used in the XQuery specification, but you can't use it in your own queries.
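Python's decimal module offers the same rounding mode, so the examples above can be checked outside an XQuery engine. The round_half_to_even() helper below is a hypothetical sketch, not the XQuery function itself; it assumes decimal inputs given as strings (or Decimal values) to avoid binary floating-point surprises.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_half_to_even(value, precision=0):
    """Round to `precision` decimal places, sending exact ties to the even digit."""
    number = Decimal(str(value))
    # scaleb(-precision) yields 10 ** -precision with the matching exponent,
    # for example 0.01 for precision 2 and 1E+2 for precision -2.
    quantum = Decimal(1).scaleb(-precision)
    return number.quantize(quantum, rounding=ROUND_HALF_EVEN)

print(round_half_to_even("1.5"))       # 2
print(round_half_to_even("2.5"))       # 2
print(round_half_to_even("2.51"))      # 3
print(round_half_to_even("3.567", 2))  # 3.57
print(round_half_to_even(Decimal(1) / Decimal(9), 3))  # 0.111
```

With a negative precision the result is numerically what you would expect (1113.567 at precision -2 equals 1100), although Python displays it in scientific notation.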
New Query File Structure
The most overarching change in the May 2003 draft involves the XQuery file structure. The draft introduces the notion of main modules and library modules, and with them adds the much-needed ability to build queries from reusable components. The draft also expands the query prolog to include numerous features, such as
an optional version statement, new declarations, new imports, external definition of variables, and definition of functions.
Prologs look like this (with strings sometimes written on two lines, due to margin constraints):
module "http://www.w3.org/2003/05/
xpath-functions"
default element namespace=
"http://www.w3.org/1999/xhtml"
declare namespace xs=
"http://www.w3.org/2001/XMLSchema"
import module
"http://www.w3.org/2003/05/
xpath-functions" at "logo.xq"
define function addLogo(
$root as node()) as node()* { }
(: etc :)
I can be more precise in the description by looking at the Backus Naur Form (BNF) borrowed from the May specification:
Module        ::= MainModule | LibraryModule
MainModule    ::= Prolog QueryBody
LibraryModule ::= ModuleDecl Prolog
ModuleDecl    ::= "module" StringLiteral
Prolog        ::= Version? (NamespaceDecl
                  | XMLSpaceDecl
                  | DefaultNamespaceDecl
                  | DefaultCollationDecl
                  | SchemaImport
                  | ModuleImport
                  | VarDefn
                  | ValidationDecl)*
                  FunctionDefn*
QueryBody     ::= Expr
It's important to be able to read BNF whenever you're trying to understand the structure of a language or file format. The meta-symbol ::= means "is defined as," | means "or," and the modifiers ?, +, and * mean 0 or 1, 1 or more, and 0 or more, respectively.
This BNF says each module consists of either a MainModule or a LibraryModule. A MainModule is a prolog followed by
a QueryBody, while a LibraryModule is
a ModuleDecl followed by a prolog but noticeably without a QueryBody. A ModuleDecl begins with the string "module" and then a StringLiteral, which in this case happens to be a URI. A prolog consists of an optional initial version followed by various declarations, imports, and definitions of any number in any order. After those come any number of function definitions. And finally, the QueryBody referenced earlier in the MainModule is defined as a single Expr expression. The definition of Expr and other nonterminals can be found elsewhere in the specification.
I show the BNF here because it's the most precise way to understand the new query file structure, and it's also extremely helpful to look at when something doesn't work as expected and you have to turn to the specification.
Working through the Prolog, the first thing I find is a Version. An XQuery module may declare its version number at the beginning of the prolog. The Version indicates the XQuery version against which the code was designed
to operate; a processor can throw an error when encountering an unknown version. The lack of a version statement implies "1.0". Unfortunately, versions aren't being utilized for the numerous prerelease drafts, so this feature provides no help today. The version declaration looks like this:
module "http://www.w3.org/2003/05/
xpath-functions"
xquery version "1.0"
(: etc :)
Notice how the BNF indicates that the version declaration must always come after the module declaration.
Namespaces
To understand the other XQuery components, it's important to first understand XML namespaces, including some details that most people ignore. XML namespaces have traditionally allowed elements with the same
local name (for example, <table>)
to have different semantic meanings. Namespaces are specified by URIs
such as http://www.w3.org/1999/xhtml for an HTML table and http://furnitureworld.com for a four-legged table. Although they look like HTTP URLs, namespaces are just opaque names. They have absolutely nothing to do with the HTTP protocol, and while there might be something at
that URL on the Web, there doesn't have to be! In fact, beginning namespaces with the prefix "http://" is just
a convention started by the W3C;
any prefix could be used.
Yes, this is confusing. Yes, it would have been easier to understand if instead they'd standardized on URIs such as ns:org.w3.1999.xhtml.
XML elements and attributes are always associated with a namespace, even if it's just the nonexistent namespace. A special xmlns attribute declares a namespace prefix alias that's used to place elements into namespaces. An
alias is in scope (available for use)
for the element declaring the xmlns attribute and all the attributes and elements held within that element. Within that scope any elements or attributes using this prefix are interpreted to be
in the namespace associated with that alias. For example, the following code has nodes in three namespaces:
<xhtml:table
xmlns:xhtml=
"http://www.w3.org/1999/xhtml"
xmlns:xlink=
"http://www.w3.org/1999/xlink">
<xhtml:tr width="200" xlink:href="#x">
<xhtml:td/>
</xhtml:tr>
</xhtml:table>
The table element declares two namespace aliases, xhtml and xlink, associated with two different, standard URIs. The <table>, <tr>, and <td> elements
are placed in the namespace http://www.w3.org/1999/xhtml, while the href attribute is placed in the namespace http://www.w3.org/1999/xlink. The width attribute is in the nonexistent namespace mentioned earlier (that's the third namespace). Remember that the important part of the namespace is the URI, not the prefix alias, and elements are considered to be in the same namespace if their URIs match, even if their prefix aliases might not.
A "default namespace" is the namespace for elements with no prefix. It's assigned by a special xmlns attribute that doesn't include a prefix. The default namespace is in scope and applies for that element and its content and affects all nonprefixed elements within that scope. Despite common sense, a default namespace does not affect nonprefixed attributes. Take this example:
<table
xmlns="http://www.w3.org/1999/xhtml"
xmlns:xlink=
"http://www.w3.org/1999/xlink">
<tr width="200" xlink:href="#x">
<td/>
</tr>
</table>
This XML is semantically identical to the XML shown previously. The namespace formerly aliased to "xhtml" is now the default namespace and applies to all nonprefixed elements within the <table> element. Thus the <table>, <tr>, and <td> elements are still in the namespace http://www.w3.org/1999/xhtml. The
href attribute is still in the namespace http://www.w3.org/1999/xlink. And because nonprefixed attributes don't go into the default namespace, the width attribute remains in no namespace.
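The claim that the two documents are semantically identical can be checked with any namespace-aware parser. The sketch below uses Python's standard xml.etree.ElementTree, which reports names in expanded {URI}local form; the expanded_names() helper is hypothetical and exists only for this comparison.

```python
import xml.etree.ElementTree as ET

PREFIXED = """
<xhtml:table xmlns:xhtml="http://www.w3.org/1999/xhtml"
             xmlns:xlink="http://www.w3.org/1999/xlink">
  <xhtml:tr width="200" xlink:href="#x"><xhtml:td/></xhtml:tr>
</xhtml:table>
"""

DEFAULTED = """
<table xmlns="http://www.w3.org/1999/xhtml"
       xmlns:xlink="http://www.w3.org/1999/xlink">
  <tr width="200" xlink:href="#x"><td/></tr>
</table>
"""

def expanded_names(xml_text):
    """List every element and attribute name in {namespace-URI}local form."""
    root = ET.fromstring(xml_text.strip())
    names = []
    for elem in root.iter():
        names.append(elem.tag)
        names.extend(sorted(elem.keys()))
    return names

# The prefixes differ, but the expanded names are the same.
print(expanded_names(PREFIXED) == expanded_names(DEFAULTED))  # True
for name in expanded_names(DEFAULTED):
    print(name)
```

In the printed list the width attribute appears with no {URI} part at all, while href carries the xlink URI, matching the three-namespace breakdown described above.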
The lack of a namespace can be represented by the empty string URI. This allows explicit placement into the nonexistent namespace. For example:
<table xmlns="http://www.w3.org/1999/xhtml">
<tr> <td/> </tr>
<data xmlns=""> <subdata/> </data>
</table>
Here the <table>, <tr>, and <td> elements reside in the default namespace, which is http://www.w3.org/1999/xhtml, while the data and subdata elements reside in the nonexistent namespace.
In XQuery, you can use normal namespace declarations and rules when writing XML constructs:
let $bug :=
<x:bug xmlns:x="http://www.bug.com/ns">
<x:desc>Order entry fails</x:desc>
</x:bug>
return $bug
Or you can declare a namespace alias in the prolog:
declare namespace x =
"http://www.bug.com/ns"
let $bug :=
<x:bug>
<x:desc>Order entry fails</x:desc>
</x:bug>
return $bug
You can also declare a default element namespace in the prolog:
default element namespace =
"http://www.bug.com/ns"
declare namespace xhtml =
"http://www.w3.org/1999/xhtml"
let $bug :=
<bug>
<desc>Order entry fails</desc>
</bug>
let $report :=
<xhtml:span>{$bug}</xhtml:span>
return $report
When using default element namespaces, either declared with the xmlns attribute or with a declaration, you have to be careful. Path expressions are affected by the namespaces in scope at their location. There's no differentiation in XQuery between input and output namespaces. To understand why this matters, try to determine if this query will work:
let $bug :=
<bug xmlns="http://www.bug.com/ns">
<desc>Order entry fails</desc>
<cause>{ input()/bugdb
//item[@id="123"] }
</cause>
</bug>
return $bug
What namespace must the <bugdb> and <item> elements be in for it to work? How about the id attribute? The answer is that because the path expression /bugdb//item[@id="123"] occurs within the scope of the default element namespace http://www.bug.com/ns, the nonprefixed path expression components are executed within that namespace. This may or may not be correct. For this query to work, bugdb and item must be in the http://www.bug.com/ns namespace. The id attribute isn't affected by the default element namespace because it's an attribute. The earlier query is the same as this one:
let $bug :=
<x:bug xmlns:x="http://www.bug.com/ns">
<x:desc>Order entry fails</x:desc>
<x:cause>{ input()/x:bugdb
//x:item[@id="123"] }
</x:cause>
</x:bug>
return $bug
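A similar effect can be reproduced in Python 3.8 or later, where xml.etree.ElementTree lets you pass a default namespace (under the empty-string prefix) that is applied to unprefixed element names in a path but not to attribute names. This is an analogy rather than XQuery semantics, and the two tiny sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical bug databases: one in no namespace, one in the bug namespace.
BUGDB = ET.fromstring('<bugdb><item id="123"/></bugdb>')
NS_BUGDB = ET.fromstring(
    '<bugdb xmlns="http://www.bug.com/ns"><item id="123"/></bugdb>'
)

DEFAULT_NS = {"": "http://www.bug.com/ns"}
PATH = ".//item[@id='123']"

# With no default namespace in effect, the path finds the no-namespace item.
no_ns_plain = BUGDB.find(PATH)

# With the default namespace in effect, "item" means {http://www.bug.com/ns}item,
# so the no-namespace document no longer matches.
no_ns_default = BUGDB.find(PATH, DEFAULT_NS)

# The namespaced document matches; @id is unaffected because it is an attribute.
ns_default = NS_BUGDB.find(PATH, DEFAULT_NS)

print(no_ns_plain is not None, no_ns_default is None, ns_default.tag)
```

The same path text selects different elements depending on the default namespace in scope, which is the trap the query above falls into.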
This problem happens frequently when generating XHTML content. The Microsoft Internet Explorer browser doesn't understand prefixed XHTML, but using nonprefixed XHTML with the default element namespace affects all queries appearing inside the XHTML content. One way to avoid this problem is to write functions outside the default element scope, as shown below:
define function get-body() as element() {
doc("test.xml")/body
}
<html xmlns="http://www.w3.org/1999/xhtml">
<head/>
{ get-body() }
</html>
The other is to explicitly declare a prefix for no namespace, like this:
declare namespace e = ""
<html xmlns="http://www.w3.org/1999/xhtml">
<head/>
{ doc("test.xml")/e:body }
</html>
Function Namespaces
XQuery predefines five namespace prefixes:
xml    http://www.w3.org/XML/1998/namespace
xs     http://www.w3.org/2001/XMLSchema
xsi    http://www.w3.org/2001/XMLSchema-instance
fn     http://www.w3.org/2003/05/xpath-functions
xdt    http://www.w3.org/2003/05/xpath-datatypes
XML (xml), XML Schema (xs), and XML Schema Instance (xsi) are XML namespace and XML Schema standards. The XQuery May draft introduces the last two: XPath function (fn) and XPath datatype (xdt).
The default function namespace is http://www.w3.org/2003/05/xpath-functions, the namespace bound to the predefined "fn" prefix. You can change the default, but then you must use the built-in "fn" prefix to qualify any built-in function calls. For example:
default function namespace =
"http://example.com/functions"
fn:string-length("foo")
A module import loads the functions and variables (and only the functions and variables) from an external library module into a main module. When doing a module import, you specify a target namespace and it needs to match what's declared in the module. It's akin to importing a Java package where the import statement path must match the package statement path. During the import you also specify a location from which to load the module, given as a URI. Any function or variable name collision throws an error. Here you see the common.xq library module file and a following main.xq main module file:
(: common.xq file :)
module "http://www.w3.org/2003/05/
xpath-functions"
define function uris() as xs:string* {
for $n in input()
return xs:string(document-uri($n))
}
(: no body allowed in library modules :)
(: main.xq file :)
import module
"http://www.w3.org/2003/05/
xpath-functions" at "common.xq"
uris()
The common.xq file (a standard file extension for XQuery files has yet to emerge) holds a definition for the uris() function shown earlier. With this declaration it places its functions in the http://www.w3.org/2003/05/xpath-functions namespace. The main.xq file imports the module using the same namespace. Because the uris() function resides in the default function namespace, main.xq can call uris() without a prefix.
Writing functions in the default function namespace is convenient and means you don't have to prefix your calls. However, high-profile modules should use an alternate namespace to avoid collisions. My advice is to treat the "fn" namespace like the Java default package: it's useful for prototyping but not for finished library modules. To load a module with an alternate prefix, do the following:
import module namespace x="ns://foo" at "common.xq"
x:uris()
Of course, common.xq would have to declare the same namespace. Both sides must agree for the import to succeed, just as with Java packages.
Remember, on these imports the "common.xq" URI doesn't refer to a file path, only to an opaque name that the server somehow maps to the common.xq content. The specification doesn't impose too many rules on mapping URIs to resources because queries may not always reside in files.
Functions can have "external" declarations also. An external declaration indicates that the function must be provided by the system. If it's not provided, it's an error. This feature opens the door for externally written support code in a language other than XQuery, perhaps Java. By declaring the external function in the query, the engine's static analysis can ensure that the function exists externally and with the right signature before the query begins. External functions use the keyword "external" in place of the function body. For example:
define function sort($elts as element()*)
as element()* external
sort((<c/>, <a/>, <b/>))
This external function sorts a sequence of elements. The implementation of the sorting routine may be in another language or provided otherwise by the engine. The exact semantics for passing state to and from an external environment have yet to be clarified. Once this sees wide adoption, it'll be interesting to discover whether it's more common for Java code to make use of XQuery calls or for XQuery code to make use of Java calls.
Prolog Variables
You can also declare and define variables in the prolog. The variable type is optional and inferred if not provided. The value is surrounded by curly braces. For example:
define variable $x as xs:integer {7}
define variable $y {7.5}
(: infer xs:decimal here; a literal such as 7.5e0 would be xs:double :)
Declaring variables in the prolog doesn't make much sense for a single query, but it comes in handy with imported modules. Recall that an imported module exposes both function and variable definitions to the main module. This lets prolog variables be used as global constants:
module "http://x-query.com/math"
define variable $PI as xs:decimal
{ 3.1415926535897932384626433 }
A variable also can be marked "external" to indicate that its value will come from the external environment. This opens the door for stored procedure parameter passing and alternate input sources. For example:
define variable $input
as item()* external
define variable $quantity
as xs:integer external
$input/item[quantity >= $quantity]
Conclusion
The XQuery May 2003 draft introduced several important new features, from simple new utility functions like trace() and error()
to overarching changes like library modules, function namespaces, prolog variables, and external functions and variables. Historically it has taken most vendors some time to upgrade their XQuery engines to support each new version, so be aware that the code samples shown here won't necessarily work on every engine right away.
To help you track what queries
work where, Mike Clark and I have
created an XQuery test harness called BumbleBee. It's like JUnit for XQuery. It contains a set of standard test queries and allows you to write your own. It's great for fans of test-driven development. Write sample input
and expected output; then write the query. With each upgrade, test your query again to make sure it still
works. You'll find BumbleBee at xquery.com/bumblebee.
Jason Hunter (jasonhunter@servlets.com) is a consultant, author of Java Servlet Programming and coauthor of Java Enterprise Best Practices (both from O'Reilly & Associates), and a publisher of Servlets.com.
XQJ
Not introduced in the May 2003 XQuery draft but announced at about the same time was a proposal from Oracle and IBM to create a common API for XQuery/Java interaction. As JDBC is for SQL, so this API will be for XQuery. The proposal was made to Sun's Java Community Process (JCP) and accepted as JSR-225, titled "XQuery API for Java (XQJ)." The API is likely to live in the javax.xml.xquery package.
Among the JSR's stated goals:
- A stylistic similarity with JDBC and Java API for XML Processing (JAXP)
- A connection-oriented interface with transactional support (interesting because XQuery 1.0 will have no standard update mechanism)
- A connectionless interface for single-shot queries
- The ability to create an XQJ connection from a JDBC connection for engines where that makes sense
- The ability to compile queries for repeated execution
- Support for parameterized queries and discovery/binding of input parameters
- Support for processing results with JAXP and Streaming API for XML (StAX)
- The ability to handle any legal result including a general sequence
- The ability to serialize query results
This JSR should prove extremely useful for Java and J2EE programmers. Right now vendors have to create custom APIs for interacting with XQuery engines, and only the best engines even recognize that the results can be any sequence of items, not single XML documents. Using this JSR should enforce good behavior and additionally provide for easy back-end pluggability.