Wednesday, February 18, 2009

MarkUtils-XML: NamespaceContextMap, PrettyPrint, Date Format

Adding to my collection of MarkUtils, this post introduces MarkUtils-XML. It is available on ziesemer.java.net under the GPL license, complete with source code, a compiled .jar, generated JavaDocs, and a suite of JUnit tests. Download the com.ziesemer.utils.xml-*.zip distribution from here.

NamespaceContextMap

I think that XML Namespaces are a great solution for avoiding naming collisions. I also think that XPath is a very useful tool for pulling data out of XML documents. Unfortunately, using XPath with XML Namespaces involves a little bit of extra work, especially in the current version of Java.

The most common issue I see other developers run into when first working with this combination is finding that their XPath isn't returning any results. This is because, unless otherwise specified, an XPath expression only matches nodes declared without a namespace. XML nodes declared with namespaces can be referenced using namespace prefixes, where each prefix is assigned to a specific namespace URI. It should be noted that prefixes only function as placeholders for the namespace URIs. Even though an XML document may have one prefix assigned to a given namespace, it cannot be assumed that the prefix will remain unchanged. Many times, these prefixes are generated automatically and/or as needed for each namespace used in an XML document. Two XML documents should be considered equal if the only difference between them is the prefixes used for a common namespace. For example, XSLT uses a namespace URI of "http://www.w3.org/1999/XSL/Transform". It is commonly prefixed as either "xs:" or "xsl:", though other prefixes are also used and valid. As such, any application should explicitly map any desired namespaces to a local prefix that can be used to reference XML nodes declared with a namespace.
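To make the point about prefixes concrete, here is a minimal sketch that uses only the standard JAXP classes. The class name, the helper methods, and the two sample documents are made up for illustration. It shows that two documents differing only in prefix have the same element name, that an unprefixed XPath step finds nothing in a namespaced document, and that matching on local-name() is a namespace-blind workaround.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class UnprefixedXPathDemo {

  // Same namespace URI, two different prefixes (illustrative sample data).
  static final String WITH_XSL_PREFIX =
      "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'/>";
  static final String WITH_T_PREFIX =
      "<t:stylesheet version='1.0' xmlns:t='http://www.w3.org/1999/XSL/Transform'/>";

  /** Parses XML with namespace awareness switched on (it is off by default). */
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      // Checked exceptions are wrapped to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }

  /** Counts the nodes matched by an expression, with no NamespaceContext set. */
  static int count(Document doc, String expression) {
    try {
      Double n = (Double) XPathFactory.newInstance().newXPath()
          .evaluate("count(" + expression + ")", doc, XPathConstants.NUMBER);
      return n.intValue();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /** Compares root elements by namespace URI and local name, ignoring the prefix. */
  static boolean sameRootName(Document a, Document b) {
    return a.getDocumentElement().getNamespaceURI()
            .equals(b.getDocumentElement().getNamespaceURI())
        && a.getDocumentElement().getLocalName()
            .equals(b.getDocumentElement().getLocalName());
  }

  public static void main(String[] args) {
    Document a = parse(WITH_XSL_PREFIX);
    Document b = parse(WITH_T_PREFIX);
    System.out.println(sameRootName(a, b));                         // true
    System.out.println(count(a, "/stylesheet"));                    // 0
    System.out.println(count(a, "/*[local-name()='stylesheet']"));  // 1
  }
}
```

The local-name() form ignores namespaces altogether, so it can match elements from unrelated vocabularies that happen to share a name. Mapping a local prefix to the namespace URI avoids that.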

In Java, the XPath class accepts a NamespaceContext instance for resolving namespace prefixes to namespace URIs, and vice-versa. Unfortunately, Java does not currently provide an implementation of the NamespaceContext interface, as reported in Sun's bug 6376058. It is relatively easy to write a simple implementation, which can optionally be included as either an inner-class or an anonymous inner-class. However, this can quickly become quite repetitive, especially when needing to support multiple namespace mappings in the same context.
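For comparison, this is roughly what a hand-written implementation looks like as an anonymous inner class. It is a simplified sketch, not code from MarkUtils-XML: the class and method names are hypothetical, it supports a single "xslt" mapping, and it skips the reverse lookups for the reserved "xml" and "xmlns" namespaces that a fully conforming implementation would handle.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class HandWrittenContextDemo {

  static final String XSLT_NS = "http://www.w3.org/1999/XSL/Transform";

  /** A context with exactly one application mapping: "xslt" -> XSLT_NS. */
  static NamespaceContext xsltOnly() {
    return new NamespaceContext() {
      @Override
      public String getNamespaceURI(String prefix) {
        if (prefix == null) {
          throw new IllegalArgumentException("prefix must not be null");
        }
        if ("xslt".equals(prefix)) {
          return XSLT_NS;
        }
        if (XMLConstants.XML_NS_PREFIX.equals(prefix)) {
          return XMLConstants.XML_NS_URI;
        }
        if (XMLConstants.XMLNS_ATTRIBUTE.equals(prefix)) {
          return XMLConstants.XMLNS_ATTRIBUTE_NS_URI;
        }
        return XMLConstants.NULL_NS_URI; // unbound prefix
      }

      @Override
      public String getPrefix(String namespaceURI) {
        if (namespaceURI == null) {
          throw new IllegalArgumentException("namespaceURI must not be null");
        }
        // Simplification: reserved xml/xmlns URIs are not reverse-mapped here.
        return XSLT_NS.equals(namespaceURI) ? "xslt" : null;
      }

      @Override
      public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return prefix == null
            ? Collections.<String>emptyIterator()
            : Collections.singleton(prefix).iterator();
      }
    };
  }

  /** Evaluates an expression against the XML using the context above. */
  static String evaluate(String xml, String expression) {
    try {
      DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
      dbf.setNamespaceAware(true);
      Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      xpath.setNamespaceContext(xsltOnly());
      return xpath.evaluate(expression, doc);
    } catch (Exception e) {
      // Checked exceptions are wrapped to keep the sketch short.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    // The document uses the "xsl" prefix; the expression uses the local "xslt" prefix.
    String xml = "<xsl:stylesheet version='1.0' xmlns:xsl='" + XSLT_NS + "'/>";
    System.out.println(evaluate(xml, "/xslt:stylesheet/@version")); // 1.0
  }
}
```

Every additional namespace means another branch in each of the three methods, which is where the repetition comes from.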

My solution is the NamespaceContextMap class. It implements both NamespaceContext and Map<String, String>, making it very easy to configure and use. It accepts prefix/URI pairs as well as QName instances. Lookups are first resolved against the instance's configured list of mappings, then against the default mappings as defined in NamespaceContext and XMLConstants. It also follows all the guidelines listed in the interface's Javadoc.

Here is a basic usage example:

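import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
// NamespaceContextMap is provided by the MarkUtils-XML jar.

// Map local prefixes to the namespace URIs used in the XPath expressions.
NamespaceContextMap ncm = new NamespaceContextMap();
ncm.put("xslt", "http://www.w3.org/1999/XSL/Transform");
ncm.put("xhtml", "http://www.w3.org/1999/xhtml");

XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
xpath.setNamespaceContext(ncm);

// Do XPath operations here...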

Two maps are maintained internally so that lookups are fast in either direction - one map is keyed by prefix, the other by URI. In the latter, the prefixes for each URI are stored in a Set with a backing List, which guarantees that multiple prefixes are supported per URI, and that the getPrefixes(String) method returns them in the order that they were added (FIFO) - essentially, an ordered map.
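The two-map idea can be sketched as follows. This is not the NamespaceContextMap source: the class name and methods are hypothetical, and a LinkedHashSet stands in for the Set-with-backing-List described above, since it gives the same uniqueness and insertion-order guarantees.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TwoWayPrefixIndex {

  // Forward index: each prefix maps to exactly one URI.
  private final Map<String, String> uriByPrefix = new HashMap<>();
  // Reverse index: each URI maps to its prefixes, in insertion order.
  private final Map<String, Set<String>> prefixesByUri = new HashMap<>();

  /** Maps prefix to uri, replacing any earlier mapping for that prefix. */
  void put(String prefix, String uri) {
    String oldUri = uriByPrefix.put(prefix, uri);
    if (oldUri != null && !oldUri.equals(uri)) {
      // Keep the reverse index consistent when a prefix is re-pointed.
      Set<String> oldPrefixes = prefixesByUri.get(oldUri);
      oldPrefixes.remove(prefix);
      if (oldPrefixes.isEmpty()) {
        prefixesByUri.remove(oldUri);
      }
    }
    prefixesByUri.computeIfAbsent(uri, k -> new LinkedHashSet<>()).add(prefix);
  }

  /** Returns the URI for a prefix, or null if the prefix is not mapped. */
  String getNamespaceURI(String prefix) {
    return uriByPrefix.get(prefix);
  }

  /** Returns a copy of the prefixes bound to a URI, oldest first. */
  List<String> getPrefixes(String uri) {
    Set<String> prefixes = prefixesByUri.get(uri);
    return prefixes == null ? new ArrayList<>() : new ArrayList<>(prefixes);
  }

  public static void main(String[] args) {
    TwoWayPrefixIndex index = new TwoWayPrefixIndex();
    index.put("xsl", "http://www.w3.org/1999/XSL/Transform");
    index.put("xslt", "http://www.w3.org/1999/XSL/Transform");
    System.out.println(index.getPrefixes("http://www.w3.org/1999/XSL/Transform")); // [xsl, xslt]
  }
}
```

The cost of the second map is that every write has to update both indexes, which is why the consistency rules matter.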

In the implementation, I struggled with finding a solid way of enforcing consistency and constraints. In particular, the entrySet(), keySet(), and values() methods of the Map interface make it very difficult (but not impossible) to intercept add/remove operations, something that I previously posted about in Java Collections Listeners. For now, these methods return unmodifiable collections.
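One way to get that effect is sketched below, under the assumption that all writes should go through put and remove. The class is hypothetical and is not how NamespaceContextMap is actually written. It extends AbstractMap, whose keySet() and values() views are derived from entrySet(), so handing out a read-only entry set makes all three views read-only at once.

```java
import java.util.AbstractMap;
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class GuardedPrefixMap extends AbstractMap<String, String> {

  private final Map<String, String> delegate = new LinkedHashMap<>();

  /** The single place where additions can be validated or intercepted. */
  @Override
  public String put(String prefix, String uri) {
    if (prefix == null || uri == null) {
      throw new IllegalArgumentException("prefix and uri are required");
    }
    return delegate.put(prefix, uri);
  }

  /** The single place where removals can be intercepted. */
  @Override
  public String remove(Object prefix) {
    return delegate.remove(prefix);
  }

  /** Read-only view; unmodifiableMap also blocks Map.Entry.setValue. */
  @Override
  public Set<Map.Entry<String, String>> entrySet() {
    return Collections.unmodifiableMap(delegate).entrySet();
  }

  /** Helper for the demo: true if the view refuses to remove a present element. */
  static boolean rejectsRemoval(Collection<?> view, Object element) {
    try {
      view.remove(element);
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    GuardedPrefixMap map = new GuardedPrefixMap();
    map.put("xhtml", "http://www.w3.org/1999/xhtml");
    System.out.println(rejectsRemoval(map.keySet(), "xhtml")); // true
    System.out.println(map.remove("xhtml") != null);           // true
  }
}
```

In this sketch clear() is also unsupported, because AbstractMap implements it through the entry set. Lookups go through the entry set as well, so a real implementation would likely override get and containsKey too.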

XML PrettyPrint

While XML may commonly be sent as a single line or without indentation for compactness and efficiency, it is usually most easily viewed with increased indentation at each level, commonly referred to as "pretty printing". This styling is presented by default in most web browsers, including both Mozilla Firefox and Microsoft Internet Explorer, as well as many IDEs and text editors. However, performing this formatting in an automated fashion within Java doesn't seem to be a feature that is readily available, stable, or easy to use. See the "Java 1.5 doesn't want to indent XML output" forum thread for some related discussion, including a copy of my solution.

My solution is an XSLT that reformats the XML with indentation, accounting for existing whitespace, and without any necessary references to "xml.apache.org". It also accepts configurable XSLT parameters for the indentation and newline character sequences. A PrettyPrint class is provided that handles loading the XSLT as a class resource, and returns a reusable, thread-safe Templates instance. For some details on this, including notes on how to chain it into an existing transformation or serialization for increased performance, see my previous post: XML and XSLT Tips and Tricks for Java.
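The general pattern of compiling a stylesheet once into a shared Templates instance and creating a Transformer per use can be sketched as below. The inline stylesheet is a deliberately simplified stand-in, not PrettyPrint.xslt: it ignores mixed content and drops comments and processing instructions. The class name, the "indent" and "newline" parameter names, and the inline string (used here instead of a class resource so the example is self-contained) are all assumptions for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class SharedTemplatesDemo {

  // Simplified stand-in stylesheet: one element per line, indented by depth.
  private static final String STYLESHEET =
      "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
    + "<xsl:output method='xml' omit-xml-declaration='yes' indent='no'/>"
    + "<xsl:strip-space elements='*'/>"
    + "<xsl:param name='indent' select=\"'  '\"/>"
    + "<xsl:param name='newline' select=\"'&#10;'\"/>"
    + "<xsl:template match='*'>"
    +   "<xsl:param name='pad' select=\"''\"/>"
    +   "<xsl:if test='parent::*'><xsl:value-of select='concat($newline, $pad)'/></xsl:if>"
    +   "<xsl:copy><xsl:copy-of select='@*'/>"
    +     "<xsl:apply-templates>"
    +       "<xsl:with-param name='pad' select='concat($pad, $indent)'/>"
    +     "</xsl:apply-templates>"
    +     "<xsl:if test='*'><xsl:value-of select='concat($newline, $pad)'/></xsl:if>"
    +   "</xsl:copy>"
    + "</xsl:template>"
    + "</xsl:stylesheet>";

  // Compiled once; Templates is safe to share between threads.
  private static final Templates TEMPLATES = compile();

  private static Templates compile() {
    try {
      return TransformerFactory.newInstance()
          .newTemplates(new StreamSource(new StringReader(STYLESHEET)));
    } catch (TransformerException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Re-indents the XML; a null indent keeps the stylesheet default. */
  static String prettyPrint(String xml, String indent) {
    try {
      // Transformer instances are not thread-safe, so create one per call.
      Transformer transformer = TEMPLATES.newTransformer();
      if (indent != null) {
        transformer.setParameter("indent", indent);
      }
      StringWriter out = new StringWriter();
      transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
      return out.toString();
    } catch (TransformerException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(prettyPrint("<a><b>x</b><c><d/></c></a>", null));
  }
}
```

Creating a Transformer from an already-compiled Templates is cheap compared to parsing the stylesheet again, which is the main reason to cache the Templates rather than the Transformer.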

As noted in my "Tips and Tricks" post, please be sure to upgrade to the latest version of Apache Xalan, 2.7.1 or newer. Otherwise, there is a particular issue where generated comments tend to disappear. This isn't an issue specific to my transformation, and can be reproduced even with an identity transformation. See the comment at the beginning of PrettyPrint.xslt for details.

XmlDateFormat

A frequent task I encounter is generating valid XML schema dateTime-formatted values. This format is a profile of the ISO 8601 standard, and is further detailed in RFC 3339. Unfortunately, Java doesn't currently provide a standard DateFormat that matches this specification. Included in my package is an XmlDateFormat class with a getDateFormat() method that returns a properly-configured DateFormat. As with most Format instances, the returned DateFormat instance is not guaranteed to be thread-safe and should not be re-used across threads.
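As an illustration of the kind of configuration involved, and of one way to deal with the thread-safety caveat, here is a minimal sketch. It is not the XmlDateFormat source: the class name is hypothetical, and the choices of a fixed UTC time zone, millisecond precision, and a literal "Z" suffix are assumptions. One DateFormat is kept per thread via ThreadLocal.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

final class XmlDateTimeSketch {

  /** Builds a new formatter producing UTC values such as 2009-02-18T12:00:00.000Z. */
  static DateFormat newUtcDateTimeFormat() {
    // The literal 'Z' is only correct because the time zone is pinned to UTC below.
    SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT);
    format.setTimeZone(TimeZone.getTimeZone("UTC"));
    return format;
  }

  // SimpleDateFormat is not thread-safe, so each thread gets its own instance.
  private static final ThreadLocal<DateFormat> PER_THREAD =
      ThreadLocal.withInitial(XmlDateTimeSketch::newUtcDateTimeFormat);

  /** Formats a date using the calling thread's own formatter. */
  static String format(Date date) {
    return PER_THREAD.get().format(date);
  }

  public static void main(String[] args) {
    System.out.println(format(new Date(0L))); // 1970-01-01T00:00:00.000Z
  }
}
```

Creating a new formatter for each call is a simpler alternative when formatting is infrequent.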
