Sunday, January 11, 2009

XML and XSLT Tips and Tricks for Java

1. Get the latest versions

Starting with Java 1.4, the Java runtime has included a default XML parser and transformer implementation as part of the Java API for XML Processing (JAXP).

However, the included versions aren't up-to-date - not even to the latest versions available when each Java version was released.

As of this writing, the latest Apache Xerces2-J version is 2.9.1 (2.11.0 as of November 2010), and the latest Apache Xalan-J version is 2.7.1. I strongly recommend using the latest versions, as the versions built into Java are both somewhat limited and buggy. Xalan's FAQ page gives some strict notes about using a newer version under Java 1.4, due to the Endorsed Standards Override Mechanism. However, following these instructions makes it rather difficult, if not impossible, to use an updated library for a particular application on a shared JRE. Fortunately, these steps appear to be no longer required starting with Java 1.5 / 5.0. In these later versions, Sun has repackaged the Apache libraries into rt.jar as com.sun.org.apache.*, and the runtime properly loads any desired implementation based on the "META-INF/services/javax.xml.*" files found on the classpath. Implementations that ship these files, including Apache Xerces2-J and Xalan-J, will automatically be used by default if included on the classpath.
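A quick way to confirm which implementations are actually being picked up is to print the concrete factory classes. The following is only a minimal sketch; the class name JaxpImplementationReport is a placeholder of mine and not part of any library.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

class JaxpImplementationReport {

  /** Returns one line per JAXP factory, naming the concrete class that was loaded. */
  static String describe() {
    StringBuilder sb = new StringBuilder();
    sb.append("DocumentBuilderFactory: ")
      .append(DocumentBuilderFactory.newInstance().getClass().getName()).append('\n');
    sb.append("SAXParserFactory: ")
      .append(SAXParserFactory.newInstance().getClass().getName()).append('\n');
    sb.append("TransformerFactory: ")
      .append(TransformerFactory.newInstance().getClass().getName()).append('\n');
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.print(describe());
  }
}
```

With only the JRE present, the names should fall under com.sun.org.apache.*; with the Apache jars on the classpath, they should fall under org.apache.* instead.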

2. Use Templates to reuse Transformations

Most of the JAXP interfaces are not thread-safe, including the factories and the instances obtained from them. That is, neither a DocumentBuilderFactory nor a DocumentBuilder instance should be stored statically, or in any other way where it could be accessed by multiple threads.

The same applies to a Transformer. While it can be used repeatedly within a given thread, it is not thread-safe for use across multiple threads.

The solution is to use a Templates object, which can be thought of as a compiled form of a stylesheet. Per the JavaDoc, "Templates must be threadsafe for a given instance over multiple threads running concurrently, and may be used multiple times in a given session." Additionally, use of Templates for repeated transformations will probably provide a performance improvement, as the transformation source (usually an XSLT) doesn't need to be re-read, re-parsed, and re-compiled.

Here is some simple, typical code for performing a transformation without a Templates object:

TransformerFactory tf = TransformerFactory.newInstance();
StreamSource myStylesheetSrc = new StreamSource(
  getClass().getResourceAsStream("MyStylesheet.xslt"));
Transformer t = tf.newTransformer(myStylesheetSrc);
t.transform(new StreamSource(System.in), new StreamResult(System.out));

Here is the improved code, which makes use of a reusable Templates object:

TransformerFactory tf = TransformerFactory.newInstance();
if(!tf.getFeature(SAXTransformerFactory.FEATURE)){
  throw new RuntimeException(
    "Did not find a SAX-compatible TransformerFactory.");
}
SAXTransformerFactory stf = (SAXTransformerFactory)tf;
StreamSource myStylesheetSrc = new StreamSource(
  getClass().getResourceAsStream("MyStylesheet.xslt"));
Templates templates = stf.newTemplates(myStylesheetSrc);

// templates can now be stored and re-used from practically anywhere.

Transformer t = templates.newTransformer();
t.transform(new StreamSource(System.in), new StreamResult(System.out));
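To make the threading rule concrete, here is a minimal sketch in which one Templates instance is shared by a thread pool while each task obtains its own Transformer. The inline stylesheet, the class name SharedTemplatesSketch, and the pool size are all assumptions made for the example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SharedTemplatesSketch {

  // Hypothetical stylesheet, kept inline so the example is self-contained.
  static final String XSLT =
      "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
    + "<xsl:output method='text'/>"
    + "<xsl:template match='/greeting'>Hello, <xsl:value-of select='@name'/>!</xsl:template>"
    + "</xsl:stylesheet>";

  /** Compiles the stylesheet once; the result may be shared between threads. */
  static Templates compile() throws Exception {
    return TransformerFactory.newInstance()
      .newTemplates(new StreamSource(new StringReader(XSLT)));
  }

  /** Creates a fresh Transformer per call; a Transformer is never shared. */
  static String transform(Templates templates, String xml) throws Exception {
    Transformer t = templates.newTransformer();
    StringWriter out = new StringWriter();
    t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
    return out.toString();
  }

  /** Runs the given number of transformations on a small thread pool. */
  static List<String> runConcurrently(int tasks) throws Exception {
    final Templates templates = compile();
    ExecutorService pool = Executors.newFixedThreadPool(4);
    try {
      List<Future<String>> futures = new ArrayList<>();
      for (int i = 0; i < tasks; i++) {
        final String xml = "<greeting name='task-" + i + "'/>";
        futures.add(pool.submit(() -> transform(templates, xml)));
      }
      List<String> results = new ArrayList<>();
      for (Future<String> f : futures) {
        results.add(f.get());
      }
      return results;
    } finally {
      pool.shutdown();
    }
  }
}
```

In a real application the Templates object would more likely live in a static field or a cache keyed by stylesheet name; the point is only that newTransformer() is called per use.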

3. Chaining Transformations

When multiple, successive transformations are required to the same XML document, be sure to avoid unnecessary parsing operations. I frequently run into code that transforms a String to another String, then transforms that String to yet another String. Not only is this slow, but it can consume a significant amount of memory as well, especially if the intermediate Strings aren't allowed to be garbage collected.

Most transformations are based on a series of SAX events. A SAX parser will typically parse an InputStream or another InputSource into SAX events, which can then be fed to a Transformer. Rather than having the Transformer output to a File, String, or another such Result, a SAXResult can be used instead. A SAXResult accepts a ContentHandler, which can pass these SAX events directly to another Transformer, etc.

Here is one approach, and the one I usually prefer as it provides more flexibility for various input and output sources. It also makes it fairly easy to create a transformation chain dynamically and with a variable number of transformations.

SAXTransformerFactory stf = (SAXTransformerFactory)TransformerFactory.newInstance();

// These templates objects could be reused and obtained from elsewhere.
Templates templates1 = stf.newTemplates(new StreamSource(
  getClass().getResourceAsStream("MyStylesheet1.xslt")));
Templates templates2 = stf.newTemplates(new StreamSource(
  getClass().getResourceAsStream("MyStylesheet2.xslt")));

TransformerHandler th1 = stf.newTransformerHandler(templates1);
TransformerHandler th2 = stf.newTransformerHandler(templates2);

th1.setResult(new SAXResult(th2));
th2.setResult(new StreamResult(System.out));

Transformer t = stf.newTransformer();
t.transform(new StreamSource(System.in), new SAXResult(th1));

// th1 feeds th2, which in turn feeds System.out.
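The same wiring can be built in a loop when the number of stylesheets is not known until runtime. The following is a sketch under my own assumptions: the helper name TransformationChainSketch, the renaming stylesheets, and the sample input are all invented for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TransformationChainSketch {

  /** Applies each stylesheet in list order, passing SAX events between them. */
  static void transform(SAXTransformerFactory stf, List<Templates> chain,
      Source in, Result out) throws Exception {
    Result next = out;
    // Wire the handlers back to front, so that each one feeds its successor.
    for (int i = chain.size() - 1; i >= 0; i--) {
      TransformerHandler th = stf.newTransformerHandler(chain.get(i));
      th.setResult(next);
      next = new SAXResult(th);
    }
    // An identity Transformer pushes the parsed input into the first handler.
    stf.newTransformer().transform(in, next);
  }

  // Hypothetical stylesheet that renames one element and keeps its content.
  static String renameStylesheet(String from, String to) {
    return "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='" + from + "'><" + to + "><xsl:apply-templates/></"
      + to + "></xsl:template>"
      + "</xsl:stylesheet>";
  }

  /** Chains a to b and b to c renames over a one-element document. */
  static String demo() throws Exception {
    SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance();
    List<Templates> chain = new ArrayList<>();
    chain.add(stf.newTemplates(new StreamSource(new StringReader(renameStylesheet("a", "b")))));
    chain.add(stf.newTemplates(new StreamSource(new StringReader(renameStylesheet("b", "c")))));
    StringWriter out = new StringWriter();
    transform(stf, chain, new StreamSource(new StringReader("<a>text</a>")),
      new StreamResult(out));
    return out.toString();
  }
}
```

Building the chain from the last stylesheet backwards means every handler already has its Result set before any SAX events arrive.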

Here is another approach, which makes use of XMLFilters. This approach is also documented in Sun's J2EE 1.4 Tutorial.

SAXTransformerFactory stf = (SAXTransformerFactory)TransformerFactory.newInstance();

// These templates objects could be reused and obtained from elsewhere.
Templates templates1 = stf.newTemplates(new StreamSource(
  getClass().getResourceAsStream("MyStylesheet1.xslt")));
Templates templates2 = stf.newTemplates(new StreamSource(
  getClass().getResourceAsStream("MyStylesheet2.xslt")));

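SAXParserFactory spf = SAXParserFactory.newInstance();
// XSLT requires a namespace-aware parser, which is not the factory default.
spf.setNamespaceAware(true);
SAXParser parser = spf.newSAXParser();
XMLReader reader = parser.getXMLReader();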

XMLFilter filter1 = stf.newXMLFilter(templates1);
XMLFilter filter2 = stf.newXMLFilter(templates2);

filter1.setParent(reader);
filter2.setParent(filter1);

Transformer t = stf.newTransformer();
t.transform(
  new SAXSource(filter2, new InputSource(System.in)),
  new StreamResult(System.out));

Note how in this latter approach, the filter is applied at the source instead of the result.

4. Input Validation

Prior to Java 1.5 / 5.0, the only way to control validation through the JAXP API was to set custom attributes. This is described quite well in Sun's J2EE 1.4 Tutorial in "Validating with XML Schema", so I won't repeat it all here. However, do pay attention to the end of the page, which explains that schemas can be loaded from several different sources, including InputStreams and other InputSources - not just local Files or URLs, which many developers seem to overlook.
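For reference, the attribute-based approach looks roughly like the sketch below, here with the schema supplied as an InputStream rather than a file or URL. The two property names are the ones defined by JAXP 1.2; the inline schema and the class name JaxpAttributeValidationSketch are assumptions for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

class JaxpAttributeValidationSketch {

  static final String SCHEMA_LANGUAGE =
    "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
  static final String SCHEMA_SOURCE =
    "http://java.sun.com/xml/jaxp/properties/schemaSource";
  static final String W3C_XML_SCHEMA = "http://www.w3.org/2001/XMLSchema";

  // Hypothetical schema: a single <count> element holding an integer.
  static final String XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
    + "<xs:element name='count' type='xs:int'/>"
    + "</xs:schema>";

  /** Parses and validates the given XML, throwing on any validation error. */
  static Document parse(String xml) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    dbf.setValidating(true);
    // The schema language must be set before the schema source.
    dbf.setAttribute(SCHEMA_LANGUAGE, W3C_XML_SCHEMA);
    InputStream schema = new ByteArrayInputStream(XSD.getBytes(StandardCharsets.UTF_8));
    dbf.setAttribute(SCHEMA_SOURCE, schema);

    DocumentBuilder db = dbf.newDocumentBuilder();
    // Without an ErrorHandler, validation errors are only reported, not thrown.
    db.setErrorHandler(new ErrorHandler() {
      public void warning(SAXParseException e) { /* ignored in this sketch */ }
      public void error(SAXParseException e) throws SAXParseException { throw e; }
      public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
    });
    return db.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
  }
}
```

Because the schema stream is consumed when it is read, this sketch creates a new factory and stream for every parse, which is part of why a reusable compiled schema is attractive.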

Starting with Java 1.5 / 5.0, the function of setValidating on DocumentBuilderFactory seems to have changed slightly. It now essentially controls only DTD validation, not modern schema validation such as W3C XML Schema or RELAX NG. Instead, a setSchema method is available, which accepts a compiled Schema object. Like the Templates object above, this is one of the few JAXP classes that is thread-safe and is meant for reuse.

One advantage with the DocumentBuilderFactory's setSchema method is that a document can be checked not only for well-formedness and for validity against a schema, but also for validity against a particular, pre-defined schema. Additionally, by default, the parsing process will follow URLs out to the Internet to resolve schemas, etc. Passing in a Schema object built from locally-kept files can improve performance, and eliminate the need for accessing the Internet. However, if there are additional references to be resolved, further attempts may still be made. These can be intercepted by registering an EntityResolver to the DocumentBuilder.
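A minimal sketch of this setup follows: the Schema is compiled once from a locally held definition, and an EntityResolver rejects any further external reference. The inline schema, the class name CompiledSchemaSketch, and the choice to throw rather than substitute a local copy are my own assumptions.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.w3c.dom.Document;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class CompiledSchemaSketch {

  // Hypothetical schema: an <order> element with a required integer id.
  static final String XSD =
      "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
    + "<xs:element name='order'><xs:complexType>"
    + "<xs:attribute name='id' type='xs:int' use='required'/>"
    + "</xs:complexType></xs:element>"
    + "</xs:schema>";

  // Compiled once; a Schema is thread-safe and can be shared.
  static final Schema SCHEMA = compile();

  static Schema compile() {
    try {
      SchemaFactory sf = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
      return sf.newSchema(new StreamSource(new StringReader(XSD)));
    } catch (SAXException e) {
      throw new IllegalStateException("Schema failed to compile", e);
    }
  }

  /** Parses the XML against the pre-defined schema, throwing on any error. */
  static Document parse(String xml) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    dbf.setSchema(SCHEMA);

    DocumentBuilder db = dbf.newDocumentBuilder();
    db.setErrorHandler(new ErrorHandler() {
      public void warning(SAXParseException e) { /* ignored in this sketch */ }
      public void error(SAXParseException e) throws SAXParseException { throw e; }
      public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
    });
    // Intercept any remaining external references instead of fetching them.
    db.setEntityResolver((publicId, systemId) -> {
      throw new SAXException("Unexpected external reference: " + systemId);
    });
    return db.parse(new InputSource(new StringReader(xml)));
  }
}
```

Throwing from the resolver is the strict option; returning an InputSource over a local copy would be the lenient alternative.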

For ensuring that a particular DTD is used, use the extended EntityResolver2. I've found that if the DOCTYPE is missing, getExternalSubset is called. To use a particular DOCTYPE by default, this method could call and return the result from resolveEntity, after passing in the desired publicId and/or systemId. If the XML to be validated already includes a DOCTYPE, then resolveEntity will be called directly. This can be written to either throw an exception or silently return the desired entity when an unexpected entity is received.
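The resolver described above might be sketched as follows, using the strict variant that throws on an unexpected entity. The public and system identifiers, the one-line DTD, and the class name PinnedDtdResolver are all hypothetical; DefaultHandler2 is used only because it already implements EntityResolver2.

```java
import java.io.IOException;
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.ext.DefaultHandler2;

class PinnedDtdResolver extends DefaultHandler2 {

  // Hypothetical identifiers and DTD content, for illustration only.
  static final String PUBLIC_ID = "-//Example//DTD Note 1.0//EN";
  static final String SYSTEM_ID = "http://example.com/dtd/note.dtd";
  static final String DTD = "<!ELEMENT note (#PCDATA)>";

  /** Called when the document has no DOCTYPE: supply the desired one. */
  @Override
  public InputSource getExternalSubset(String name, String baseURI)
      throws SAXException, IOException {
    return resolveEntity(name, PUBLIC_ID, baseURI, SYSTEM_ID);
  }

  /** Called for a declared DOCTYPE: accept only the expected DTD. */
  @Override
  public InputSource resolveEntity(String name, String publicId, String baseURI,
      String systemId) throws SAXException, IOException {
    if (!SYSTEM_ID.equals(systemId)) {
      throw new SAXException("Unexpected entity: " + systemId);
    }
    // Serve the local copy, so nothing is fetched from the network.
    InputSource source = new InputSource(new StringReader(DTD));
    source.setPublicId(PUBLIC_ID);
    source.setSystemId(SYSTEM_ID);
    return source;
  }
}
```

An instance would be registered through DocumentBuilder.setEntityResolver(…) or XMLReader.setEntityResolver(…). Whether getExternalSubset is actually invoked depends on the parser honoring the http://xml.org/sax/features/use-entity-resolver2 feature.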

5. XML Creation using XSLT

XSLT is a well-known method for transforming XML, but it can also be used for XML generation. The easiest way is to use XSLT as a transformation, similar to the above methods, but with an empty input source. This is additionally noted in the Transformer.transform(…) JavaDoc.

Using XSLT for XML generation works particularly well when the XML is rather static, or when the XSLT can be used as a template. The Transformer's setParameter(…) can be used to pass parameters into the transformation, which can then be used as variables. To avoid possible naming collisions, especially when using larger or 3rd-party XSLTs, I strongly recommend using namespace prefixes.

Below is a sample XSLT with namespaced parameters, then populated by Java code. It generates a valid XHTML document, with the document title and a message in the body passed-in as parameters:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://namespaces.ziesemer.com/example"
    exclude-result-prefixes="z">
    
  <xsl:output
    method="html"
    doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  
  <xsl:param name="z:title"/>
  <xsl:param name="z:message"/>
  
  <xsl:template match="/">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
      <head>
        <title><xsl:value-of select="$z:title"/></title>
      </head>
      <body>
        <h1><xsl:value-of select="$z:title"/></h1>
        <p><xsl:value-of select="$z:message"/></p>
      </body>
    </html>
  </xsl:template>
  
</xsl:stylesheet>

final String NAMESPACE_PREFIX = "{http://namespaces.ziesemer.com/example}";

SAXTransformerFactory stf = (SAXTransformerFactory)TransformerFactory.newInstance();
Templates templates = stf.newTemplates(new StreamSource(
  getClass().getResourceAsStream("XHTMLMessage.xslt")));

// templates can now be stored and re-used from practically anywhere.

Transformer t = templates.newTransformer();
t.setParameter(NAMESPACE_PREFIX + "title",
  "Example Title");
t.setParameter(NAMESPACE_PREFIX + "message",
  "Example Message");

t.transform(new DOMSource(), new StreamResult(System.out));

This approach has a number of advantages. It is fairly easy to see what is happening, and it is easy to make changes or additions to the output. It guarantees well-formed markup in the output, as the stylesheet itself must be well-formed XML and an exception will be thrown if it is not. It also performs quite well.

6. XSLT Inheritance

Just as common functionality can be factored out of Java classes into shared parent classes, XSLT can also make similar use of inheritance by using Stylesheet Imports. Here is an example split into a parent and child:

XHTMLTemplate.xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://namespaces.ziesemer.com/example"
    exclude-result-prefixes="z">
    
  <xsl:output
    method="html"
    doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  
  <xsl:param name="z:title"/>
  
  <xsl:template match="/">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
      <head>
        <title><xsl:value-of select="$z:title"/></title>
      </head>
      <body>
        <h1><xsl:value-of select="$z:title"/></h1>
        <xsl:call-template name="z:Message"/>
      </body>
    </html>
  </xsl:template>
  
  <!-- This should be overridden by child stylesheets. -->
  <xsl:template name="z:Message"/>
  
</xsl:stylesheet>

XHTMLMessage.xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://namespaces.ziesemer.com/example"
    exclude-result-prefixes="z">
  
  <xsl:import href="XHTMLTemplate.xslt"/>
  
  <xsl:param name="z:message"/>
  
  <xsl:template name="z:Message">
    <p xmlns="http://www.w3.org/1999/xhtml">
      <xsl:value-of select="$z:message"/>
    </p>
  </xsl:template>
  
</xsl:stylesheet>

Depending upon where the dependent files are located, a custom URIResolver will probably be needed to properly resolve the resources. In my example, I'm reading from the Java classpath. Other possibilities could include the local file system, or HTTP URLs. Only the necessary changes to the above Java code are shown below:

URIResolver resolver = new URIResolver(){
  @Override
  public Source resolve(String href, String base) throws TransformerException{
    return new StreamSource(getClass().getResourceAsStream(href));
  }};

SAXTransformerFactory stf = (SAXTransformerFactory)TransformerFactory.newInstance();
stf.setURIResolver(resolver);
Templates templates = stf.newTemplates(resolver.resolve("XHTMLMessage.xslt", null));

7. XSLT Extensions

Using parameters is a start, but the limitations are quickly visible. However, when combined with extension mechanisms, XSLT generation should be able to solve almost any requirement. Reading http://xml.apache.org/xalan-j/extensions.html is an excellent starting point. When properly used, extensions can feed into the transformation process and keep the memory footprint to a minimum.

Following is an example that uses an XSLT extension to output a variable number of messages. Additionally, this method allows for the properties to be calculated dynamically on each iteration, rather than pre-processing and storing the formatted messages - which can save memory.

XHTMLMessage.xslt:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://namespaces.ziesemer.com/example"
    xmlns:zMessageGenExt="com.ziesemer.example.MessageGenerator"
    xmlns:zMessageExt="com.ziesemer.example.IMessage"
    extension-element-prefixes="zMessageGenExt zMessageExt"
    exclude-result-prefixes="z zMessageGenExt zMessageExt">
    
  <xsl:output
    method="html"
    doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  
  <xsl:param name="z:title"/>
  <xsl:param name="z:ext"/>
  
  <xsl:template match="/">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
      <head>
        <title><xsl:value-of select="$z:title"/></title>
      </head>
      <body>
        <h1><xsl:value-of select="$z:title"/></h1>
        <xsl:call-template name="z:Messages"/>
      </body>
    </html>
  </xsl:template>
  
  <xsl:template name="z:Messages">
    <xsl:variable name="z:message" select="zMessageGenExt:getNextMessage($z:ext)"/>
    <xsl:if test="string($z:message)">
      <p xmlns="http://www.w3.org/1999/xhtml">
        <b>
          <xsl:value-of select="zMessageExt:getTitle($z:message)"/>
        </b><xsl:text>: </xsl:text>
        <xsl:value-of select="zMessageExt:getDescription($z:message)"/>
      </p>
      <xsl:call-template name="z:Messages"/>
    </xsl:if>
  </xsl:template>
  
</xsl:stylesheet>

XHTMLExample.java:

t.setParameter(NAMESPACE_PREFIX + "ext",
  new MessageGenerator());

IMessage.java:

package com.ziesemer.example;

public interface IMessage{
  String getTitle();
  String getDescription();
}

MessageGenerator.java:

package com.ziesemer.example;

public class MessageGenerator{
  
  protected int index = 0;
  protected int size = 5;
  
  public IMessage getNextMessage(){
    if(index < size){
      IMessage msg = new TestMessage();
      index++;
      return msg;
    }
    return null;
  }
  
  protected class TestMessage implements IMessage{
    @Override
    public String getTitle(){
      return String.format("This is title %d.", index);
    }
    
    @Override
    public String getDescription(){
      return String.format("This is description %d.", index);
    }
  }
}

Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<title>Example Title</title>
</head>
<body>
<h1>Example Title</h1>
<p>
<b>This is title 1.</b>: This is description 1.</p>
<p>
<b>This is title 2.</b>: This is description 2.</p>
<p>
<b>This is title 3.</b>: This is description 3.</p>
<p>
<b>This is title 4.</b>: This is description 4.</p>
<p>
<b>This is title 5.</b>: This is description 5.</p>
</body>
</html>

XSLT is more of a functional language than a procedural one, and the demonstrated use of recursion is really the only way to implement a loop. Unfortunately, this can lead to stack overflow errors within the Java implementation. This can be mitigated by increasing the stack size. While this can be set globally for the JVM, it is certainly not the best option. There is a Thread constructor that allows for the thread's stack size to be specified, but it is platform-dependent, and is still only a mitigation. There is a good article on IBM developerWorks, "Use recursion effectively in XSL" (Jared Jackson, 2002-10-01), that specifically addresses this stack overflow issue with XSL recursion. Unfortunately, the examples provided require a pre-known list size, and/or only calculate within the loop rather than producing output.
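For completeness, the Thread-based mitigation can be sketched as below. The constructor used is the standard four-argument one, whose stackSize is only a hint that a JVM may round or ignore. The class name LargeStackRunner, the thread name, and the recursive depth method standing in for a deeply recursive transformation are assumptions.

```java
class LargeStackRunner {

  /** Runs the task on a new thread that requests the given stack size. */
  static void runWithStackSize(Runnable task, long stackSizeBytes)
      throws InterruptedException {
    final Throwable[] failure = new Throwable[1];
    // A null ThreadGroup means the new thread joins the caller's group.
    Thread t = new Thread(null, task, "xslt-large-stack", stackSizeBytes);
    // Capture failures such as StackOverflowError for the calling thread.
    t.setUncaughtExceptionHandler((thread, e) -> failure[0] = e);
    t.start();
    t.join();
    if (failure[0] != null) {
      throw new IllegalStateException("Task failed", failure[0]);
    }
  }

  // Stand-in for a deeply recursive transformation.
  static int depth(int remaining) {
    return remaining == 0 ? 0 : 1 + depth(remaining - 1);
  }

  public static void main(String[] args) throws InterruptedException {
    final int[] result = new int[1];
    runWithStackSize(() -> result[0] = depth(5000), 64L * 1024 * 1024);
    System.out.println("Reached depth " + result[0]);
  }
}
```

In practice the Runnable would wrap the call to Transformer.transform(…), with the result and any exception handed back to the caller in the same way.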

Here are some modifications to my above XSLT so that it recurses down a binary tree, splitting into 2 children at each level, which makes the maximum necessary depth log2(n). (A traditional divide-and-conquer algorithm.) However, because the loop is of an unknown size, the minimum required depth cannot be calculated in advance. In my approach, a pre-defined depth of a "sufficient size" is used. I chose 32, as 2^32 = 4,294,967,296, which would be the range of Java's int if it were unsigned. (As ints are signed in Java, the maximum value of an int is 2,147,483,647.) Also, my trials have shown that the default stack size supports over 1,000 recursions before overflowing, so 32 should be a more than safe value. The tree is filled depth-first, and then continues to grow in width. Some additional work is done to test whether each call still resulted in output. If not, an xsl:if prevents further recursion; otherwise the entire tree would still be built and traversed. At 2 billion+ potential nodes in the tree, completing the recursion would require an unacceptable amount of time.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:z="http://namespaces.ziesemer.com/example"
    xmlns:zMessageGenExt="com.ziesemer.example.MessageGenerator"
    xmlns:zMessageExt="com.ziesemer.example.IMessage"
    extension-element-prefixes="zMessageGenExt zMessageExt"
    exclude-result-prefixes="z zMessageGenExt zMessageExt">
    
  <xsl:output
    method="html"
    doctype-public="-//W3C//DTD XHTML 1.0 Strict//EN"
    doctype-system="http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"/>
  
  <xsl:param name="z:title"/>
  <xsl:param name="z:ext"/>
  
  <xsl:template match="/">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
      <head>
        <title><xsl:value-of select="$z:title"/></title>
      </head>
      <body>
        <h1><xsl:value-of select="$z:title"/></h1>
        <xsl:call-template name="z:MessagesRecursive">
          <xsl:with-param name="z:depth" select="0"/>
        </xsl:call-template>
      </body>
    </html>
  </xsl:template>
  
  <xsl:template name="z:MessagesRecursive">
    <xsl:param name="z:depth"/>
    <!-- Check the depth before fetching, so that a message is never
      taken from the generator without being written out. -->
    <xsl:if test="$z:depth &lt; 32">
      <xsl:variable name="x">
        <xsl:call-template name="z:Messages"/>
      </xsl:variable>
      <xsl:if test="string($x)">
        <xsl:copy-of select="$x"/>
        <xsl:call-template name="z:MessagesRecursive">
          <xsl:with-param name="z:depth" select="$z:depth + 1"/>
        </xsl:call-template>
        <xsl:call-template name="z:MessagesRecursive">
          <xsl:with-param name="z:depth" select="$z:depth + 1"/>
        </xsl:call-template>
      </xsl:if>
    </xsl:if>
  </xsl:template>
  
  <xsl:template name="z:Messages">
    <xsl:variable name="z:message" select="zMessageGenExt:getNextMessage($z:ext)"/>
    <xsl:if test="string($z:message)">
      <p xmlns="http://www.w3.org/1999/xhtml">
        <b>
          <xsl:value-of select="zMessageExt:getTitle($z:message)"/>
        </b><xsl:text>: </xsl:text>
        <xsl:value-of select="zMessageExt:getDescription($z:message)"/>
      </p>
    </xsl:if>
  </xsl:template>
  
</xsl:stylesheet>
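The shape of this recursion may be easier to follow in plain Java. The sketch below is only a simulation of a depth-bounded binary recursion over an iterator, not the stylesheet itself; it checks the depth limit before pulling the next item, and the class name BoundedRecursionSketch and its fields are invented for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class BoundedRecursionSketch {

  private final Iterator<String> source;
  private final int depthLimit;

  final List<String> emitted = new ArrayList<>();
  int deepestLevel = -1;

  BoundedRecursionSketch(Iterator<String> source, int depthLimit) {
    this.source = source;
    this.depthLimit = depthLimit;
  }

  /** Emits one item, then recurses into two children one level deeper. */
  void recurse(int depth) {
    // Stop at the depth limit, or as soon as the source is exhausted.
    if (depth >= depthLimit || !source.hasNext()) {
      return;
    }
    emitted.add(source.next());
    deepestLevel = Math.max(deepestLevel, depth);
    recurse(depth + 1);
    recurse(depth + 1);
  }

  /** Runs the recursion over itemCount generated messages. */
  static BoundedRecursionSketch run(int itemCount, int depthLimit) {
    List<String> items = new ArrayList<>();
    for (int i = 1; i <= itemCount; i++) {
      items.add("message " + i);
    }
    BoundedRecursionSketch sketch = new BoundedRecursionSketch(items.iterator(), depthLimit);
    sketch.recurse(0);
    return sketch;
  }
}
```

With a limit of 32, a run over 1,000 items emits all of them in order while never recursing past level 31. A limit of 4 caps the output at 1 + 2 + 4 + 8 = 15 items, which shows how the limit bounds the capacity of the tree.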

8. Beware of classloaders

One frustrating issue I recently dealt with was related to multiple classloaders, where extension classes and methods were being reported as "not found", as mentioned in the Xalan-J FAQ. Fortunately, the Xalan implementation appears to handle multiple classloaders quite robustly by using the context ClassLoader. In the environment where I was working, the Xalan classes were in a parent classloader of the one holding the extension classes; however, this had never proved to be a problem for me previously. The actual error in my particular case was that the servlet engine was old and buggy, and was not setting the context ClassLoader on new threads. I worked around this by calling setContextClassLoader(…) in my servlet's service(…) method, before calling super.service(…).
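Outside of a servlet, the same workaround can be sketched as a small helper that sets the context ClassLoader for the duration of a task and then restores the previous one. This is a servlet-free illustration; the class and method names are placeholders, and restoring the previous loader is my own addition.

```java
class ContextClassLoaderSketch {

  /** Runs the task with the given context ClassLoader, then restores the old one. */
  static void runWith(ClassLoader loader, Runnable task) {
    Thread current = Thread.currentThread();
    ClassLoader previous = current.getContextClassLoader();
    current.setContextClassLoader(loader);
    try {
      task.run();
    } finally {
      // Restore, so pooled threads are not left with a stale loader.
      current.setContextClassLoader(previous);
    }
  }
}
```

In a servlet, the loader passed in would typically be the one that loaded the servlet class itself, for example getClass().getClassLoader().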

9. Using DocumentFragments

This is really the first use I found of DocumentFragment where it seemed appropriate. Even when using the XSLT approach, there may be instances where it is necessary or easier to build and include a section of XML from within Java rather than XSLT. If used excessively, fragments are counter-productive to the advantages of using XSLT. Understand that while XSLT can stream the content in a pipelined fashion as it is produced, each DocumentFragment must be completely built and returned before it can be streamed, which will increase memory requirements with the size of the fragments.

As documented on the Xalan Extensions page, DocumentFragments are a valid return type from an extension. They are also far easier to produce than the other Node-Set types. Here is the best method I found to make use of this functionality:

Java extension method:

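// Imports needed: org.w3c.dom.Document, DocumentFragment, Element, Node
public DocumentFragment fill(Node n){
  // The context node is usually the Document itself, but may be any node within it.
  Document doc = (n instanceof Document) ? (Document)n : n.getOwnerDocument();
  DocumentFragment df = doc.createDocumentFragment();
  
  // Append any number of children and/or sub-children...
  Element e = doc.createElement("Example");
  df.appendChild(e);
  
  return df;
}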

XSLT:

<xsl:copy-of select="extensionPrefix:fill($instanceVariable, .)"/>

10. XSLT vs. JAXB and JiBX, Castor, etc.

While several people I know are big fans of XML data binding frameworks, I try my best to avoid them. For almost any use that I've seen of these frameworks, I would contend that XSLT and/or one of the XML generation techniques I previously described would be a better fit. In general, these frameworks introduce additional complexities and dependencies, along with usually artificial limitations. I've seen several performance comparisons and presentations between such frameworks, but none that dare to include XSLT and the other direct approaches.

5 comments:

Anonymous said...

You may also want to look at vtd-xml, the latest and most advanced XML processing API

http://vtd-xml.sf.net

Anonymous said...

Mark, I'm not sure why you chain 3 transforms in your example: Transformer t = stf.newTransformer(); simply instantiates an "identity transform." You then chain that one to your first Transform that you created from a template, but all you really need to do is get the first one by calling th1.getTransformer().

Mark A. Ziesemer said...

Anonymous - these are simple examples for demonstration purposes. Each transformer may contain additional properties, e.g. output properties and error listeners, which may not be proper to be used on the final transformation if they had been set for the stylesheet-specific transformations.

SlideGuitarist said...

Sorry about the earlier anonymous posting! I've done this with Apache Xalan and Saxon-B; no problem. If I use the default XSLT processor (i.e. the version of Xalan that Sun wrecked), the first method of chaining XSLTs, at least, fails. Most distressing...

Yannick said...

Hi, I read your tip about chaining transformations. What I am trying to achieve is slightly different.

I am trying to use the result of a xsl transform (from XMLSource1 and StyleSheet1) as a stylesheet for another transform (from XMLSource2) and then output the result

How could I achieve this?
Cheers,