Wednesday, June 27, 2007

XML Generation in Java

I can't begin this post any better than by referring to an article I found by Henri Yandell, "Generating XML via Java". An excerpt:

Most of the attention in the XML world focuses on parsing XML and walking an XML structure. The W3C provides the DOM and SAX specifications to parse data, Sun provides the Java XML Pack, and Apache has Xerces and Xalan. However, very little attention is paid to the techniques for XML output. Projects are looking into turning JavaBeans and Swing components into XML, but most of the time, developers simply want to output a data structure in a custom-formatted way.

Henri goes on to describe different ways of generating XML in Java, using a StringBuffer, DOM, SAX, or his own concept of an "XmlWriter". While he includes a number of code snippets, unfortunately, none are really complete or take advantage of the available Java APIs.

The "StringBuffer" method:

I think Henri explained this one quite well, and warned that it should be avoided. It blurs the desired division between Java code and the XML, and requires additional work to properly escape quotes and other special characters - all things that come "for free" with existing frameworks.
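To make that "additional work" concrete, here is a minimal sketch of the escaping a hand-rolled approach has to do itself. The class and method names (ManualEscapeSketch, escape) are hypothetical and not taken from Henri's article, and StringBuilder is used in place of StringBuffer. It covers only the five predefined XML entities; it does not reject characters that are illegal in XML, nor does it deal with output encoding.

```java
/** Hypothetical helper, shown only to illustrate the manual work involved. */
class ManualEscapeSketch {

  /** Replaces the five characters that have predefined XML entities. */
  static String escape(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      switch (c) {
        case '&': sb.append("&amp;"); break;
        case '<': sb.append("&lt;"); break;
        case '>': sb.append("&gt;"); break;
        case '"': sb.append("&quot;"); break;
        case '\'': sb.append("&apos;"); break;
        default: sb.append(c);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Every value concatenated into the markup must pass through escape().
    String value = "Tom & \"Jerry\" <cat>";
    StringBuilder xml = new StringBuilder();
    xml.append("<Root Name=\"").append(escape(value)).append("\"/>");
    System.out.println(xml);
  }
}
```

Forgetting a single call to escape() is enough to produce broken output, which is the main argument for letting a framework handle it.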

The Java API for XML Processing (JAXP):

Before using any of the following methods, review the JAXP API. Besides defining the various processing interfaces, it is also a pluggable architecture that allows for different implementations and libraries to be used without creating additional dependencies.

A number of implementations are listed on Sun's JAXP FAQ page. Note that you get a default implementation for free with Java 5 and Java 6, and although not listed, also Java 1.4. (Versions of Apache Xalan and Xerces are used by default.) Java 1.3 and 1.2 are also supported by including these libraries. (The Xerces xml-apis.jar includes a javax.xml package with the JAXP interfaces.)
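One quick way to see which implementation JAXP has actually resolved on a given system is to ask each factory for its concrete class name. The following is only a sketch; the class name JaxpImplementations is made up for this example, and the names printed depend entirely on the Java version and on any libraries on the classpath.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;

/** Hypothetical utility that reports the JAXP implementations in use. */
class JaxpImplementations {

  /** Returns the concrete class names JAXP resolved for each factory. */
  static String[] describe() {
    return new String[] {
      DocumentBuilderFactory.newInstance().getClass().getName(),
      SAXParserFactory.newInstance().getClass().getName(),
      TransformerFactory.newInstance().getClass().getName()
    };
  }

  public static void main(String[] args) {
    for (String name : describe()) {
      System.out.println(name);
    }
  }
}
```

Because the lookup is pluggable, setting a system property such as javax.xml.parsers.DocumentBuilderFactory to another factory class name should change what is reported, without any change to the calling code.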

Two popular products not listed are dom4j and JDOM. (Sun's JAXP 1.1 tutorial provides some interesting comparisons, though it is slightly dated (2005). Also note that both dom4j and JDOM are only DOM implementations, and rely on other libraries such as Apache Xerces for parsing existing XML into a DOM structure.) Between these two, dom4j is my personal favorite. For one reason, JDOM, while popular, does not currently implement the JAXP interfaces. While both make it rather easy to convert between their own version of an XML Document (org.dom4j.* or org.jdom.*) and the W3C standard interfaces (org.w3c.dom.*), dom4j provides a configuration option and package that natively implements both dom4j's document interfaces and W3C's (org.dom4j.dom.*). dom4j can even be used as a DOM implementation under JAXP. (That said, I still prefer using the independent JAXP calls, which return the Apache implementations by default.)

The DOM (Document Object Model) method:

Using DOM has the advantage of dealing with an object- and tree-based structure, which maps well onto the tree-based layout of XML. As an in-memory model, the document can be created in segments, re-ordered, and edited at any point before being serialized. With the help of a Transformer and associated classes, the output is guaranteed to be well-formed XML, including properly-escaped character sequences.

One reason not to use DOM is when memory usage is a concern - especially if creating a large XML document, as the entire document is stored in memory until it can eventually be garbage collected.

Here is some sample code that shows how to create some sample XML using JAXP and DOM:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** @author Mark A. Ziesemer */
public class DOMSample{

  public static void main(String[] args) throws Exception{
    // Create a new document.
    Document doc = DocumentBuilderFactory.newInstance()
      .newDocumentBuilder().newDocument();

    // Create and add a root element and an attribute.
    Element root = doc.createElement("Root");
    doc.appendChild(root);
    root.setAttribute("Name", "Value");
    Element child = doc.createElement("Child");
    root.appendChild(child);

    // Output the document.
    TransformerFactory.newInstance().newTransformer().transform(
      new DOMSource(doc), new StreamResult(System.out));
  }
}
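As a small check of the escaping behavior described earlier, the following sketch puts special characters into an attribute and into element text, then serializes to a String instead of System.out. The class name DOMEscapingSketch and the sample values are invented for this example.

```java
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical example: the serializer, not the caller, does the escaping. */
class DOMEscapingSketch {

  static String build() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
      Element root = doc.createElement("Root");
      doc.appendChild(root);

      // Raw values are stored as-is in the tree; no manual escaping.
      root.setAttribute("Name", "a < b & \"c\"");
      root.setTextContent("1 < 2 & 3"); // DOM Level 3 (Java 5 and later)

      StringWriter out = new StringWriter();
      TransformerFactory.newInstance().newTransformer().transform(
        new DOMSource(doc), new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      // Wrapped for brevity; real code should handle specific exceptions.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(build());
  }
}
```

The element text should come out as 1 &lt; 2 &amp; 3, with the attribute value escaped in a similar way.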

The SAX (Simple API for XML) method:

Using SAX is a great way to overcome the memory requirements of DOM. Unfortunately, this has to be one of the lesser-known methods. This article on JavaZOOM is really the only search result I found that even makes an attempt at it. In fact, an IBM developerWorks article from 2003 basically calls it impossible:

The first option, SAX, is really a non-option. I've included it in the list because most developers getting started with XML hear about SAX and how quick it is for XML processing. While SAX is traditionally considered the fastest and slimmest API for XML, it does not have the ability to output XML (or anything else, for that matter). In fact, if you examine the SAX package (org.xml.sax), you won't find a single output method. It is designed from the ground up to read XML, rather than write it.

Henri also mentioned SAX as a generation method, though he relied upon a yet-to-be-written "XMLPrettyPrinter" class, which I assume would implement org.xml.sax.ContentHandler. Fortunately, such an implementation does exist, courtesy of SAXTransformerFactory. Its newTransformerHandler() method returns a TransformerHandler, whose ContentHandler methods can then be called to generate XML. (The default JAXP implementation of TransformerFactory can be cast to SAXTransformerFactory. Support can be checked first by calling getFeature(SAXTransformerFactory.FEATURE) and throwing a descriptive runtime exception if it returns false, though simply letting the cast fail with a ClassCastException may sometimes be acceptable.)

Here is some sample code that shows how to create the same sample XML as above, using JAXP and SAX:

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.helpers.AttributesImpl;

/** @author Mark A. Ziesemer */
public class SAXXmlWriter{

  public static void main(String[] args) throws Exception{
    TransformerFactory tf = TransformerFactory.newInstance();
    if(!tf.getFeature(SAXTransformerFactory.FEATURE)){
      throw new RuntimeException(
        "Did not find a SAX-compatible TransformerFactory.");
    }
    SAXTransformerFactory stf = (SAXTransformerFactory)tf;
    TransformerHandler th = stf.newTransformerHandler();
    th.setResult(new StreamResult(System.out));

    th.startDocument();

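    // The attribute type is a string such as "CDATA"; the namespace URI and
    // local name stay empty because namespaces are not used here.
    AttributesImpl fieldAttrs = new AttributesImpl();
    fieldAttrs.addAttribute("", "", "Name", "CDATA", "Value");

    th.startElement("", "", "Root", fieldAttrs);
    // The ContentHandler contract expects an empty Attributes object,
    // rather than null, for an element without attributes.
    th.startElement("", "", "Child", new AttributesImpl());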
    th.endElement("", "", "Child");
    th.endElement("", "", "Root");
    th.endDocument();
  }
}

Unfortunately, note that it is very possible to create XML that is not well-formed using this method - by forgetting to end an element, or ending the wrong element, for example. This is one advantage that a DOM-based method has over SAX.

Also note that the empty strings ("") passed into the above methods are where the namespace URI and local name would normally go for a namespace-aware document, a feature omitted here for simplicity.
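For comparison, here is a rough sketch of what the namespace-aware variant might look like. The namespace URI (urn:example:sample), the "ex" prefix, and the class name SAXNamespaceSketch are all made up for this example. It assumes the default TransformerFactory supports SAXTransformerFactory and that its serializer turns startPrefixMapping() calls into xmlns declarations, which I would expect of the bundled implementation but which is worth verifying with others.

```java
import java.io.StringWriter;

import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.helpers.AttributesImpl;

/** Hypothetical example of namespace-aware SAX output. */
class SAXNamespaceSketch {

  /** Made-up namespace URI, for illustration only. */
  static final String NS = "urn:example:sample";

  static String build() {
    try {
      SAXTransformerFactory stf =
        (SAXTransformerFactory)TransformerFactory.newInstance();
      TransformerHandler th = stf.newTransformerHandler();
      StringWriter out = new StringWriter();
      th.setResult(new StreamResult(out));

      th.startDocument();
      // Declare the prefix before the element that brings it into scope.
      th.startPrefixMapping("ex", NS);

      // The attribute is left unqualified: empty URI, local name "Name".
      AttributesImpl attrs = new AttributesImpl();
      attrs.addAttribute("", "Name", "Name", "CDATA", "Value");

      // Arguments are: namespace URI, local name, qualified name, attributes.
      th.startElement(NS, "Root", "ex:Root", attrs);
      th.startElement(NS, "Child", "ex:Child", new AttributesImpl());
      th.endElement(NS, "Child", "ex:Child");
      th.endElement(NS, "Root", "ex:Root");
      th.endPrefixMapping("ex");
      th.endDocument();
      return out.toString();
    } catch (Exception e) {
      // Wrapped for brevity; real code should handle specific exceptions.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(build());
  }
}
```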

The StAX (Streaming API for XML) method:

StAX is a relatively new method, and isn't built into Java until Java 6. It can be seen as a middle ground between DOM and SAX, providing both ease-of-use and high performance in terms of CPU and memory requirements.

For users not yet on Java 6, the APIs and a list of implementations are available at https://stax-utils.dev.java.net/. I've successfully used Sun's implementation (SJSXP) under Java 5, but haven't yet looked into earlier versions (Java 1.4/1.3). I've noticed that none of the implementations appear to contain any information regarding Java version requirements.

Here is some sample code that generates the same sample XML yet again, this time using StAX:

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

/** @author Mark A. Ziesemer */
public class StAXSample{

  public static void main(String[] args) throws Exception{
    XMLStreamWriter xsw = XMLOutputFactory.newInstance()
      .createXMLStreamWriter(System.out);
    xsw.writeStartDocument();
    xsw.writeStartElement("Root");
    xsw.writeAttribute("Name", "Value");
    xsw.writeEmptyElement("Child");
    xsw.writeEndElement();
    xsw.writeEndDocument();
    xsw.close();
  }
}
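XMLStreamWriter also has methods for the other kinds of markup, not just elements and attributes. The sketch below is not part of the sample above; the class name StAXMarkupSketch, the processing instruction target, and the text values are invented. It writes escaped text, a CDATA section, a comment, and a processing instruction to a StringWriter.

```java
import java.io.StringWriter;

import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

/** Hypothetical example of additional XMLStreamWriter calls. */
class StAXMarkupSketch {

  static String build() {
    try {
      StringWriter out = new StringWriter();
      XMLStreamWriter xsw = XMLOutputFactory.newInstance()
        .createXMLStreamWriter(out);
      xsw.writeStartDocument();
      xsw.writeProcessingInstruction("sample-target", "key=\"value\"");
      xsw.writeStartElement("Root");
      xsw.writeComment(" a comment ");

      // Text content is escaped by the writer.
      xsw.writeStartElement("Child");
      xsw.writeCharacters("1 < 2 & 3");
      xsw.writeEndElement();

      // CDATA content is written through without escaping.
      xsw.writeStartElement("Child");
      xsw.writeCData("<raw> & unescaped");
      xsw.writeEndElement();

      xsw.writeEndElement(); // closes Root
      xsw.writeEndDocument();
      xsw.close();
      return out.toString();
    } catch (XMLStreamException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(build());
  }
}
```

Note that writeEndElement() takes no name, so unlike SAX there is no way to end the "wrong" element, though forgetting to end one is still possible.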

The XSLT method:

See my other post, XML and XSLT Tips and Tricks for Java.

Implementation notes:

All of the sample code here was written "quick and simple" to demonstrate how each method can be used to generate XML using the different Java APIs. It should be easy to copy & paste into an IDE or just pass to javac to test and experiment with. If any of this code were to actually be used, it would probably belong outside of the main method, be placed outside of the default (empty) Java package, and require proper error checking rather than declaring that the method (here, main) "throws Exception".

Additionally, JAXP makes good use of the factory method pattern. If the operations are to be repeated, such as creating multiple XML documents, the created instances should probably be saved and reused. These factories can then be configured once and used multiple times with various options that affect the created instances, such as XML namespace support. If being used from multiple threads, synchronization needs to be considered, as many of these classes are not guaranteed to be thread safe. Check the associated Javadocs for details.
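One possible way to arrange this is sketched below: a single factory configured once, with a DocumentBuilder kept per thread, since neither is documented as thread safe. The class name SharedXmlFactories and its newDocument() helper are hypothetical, and whether per-thread caching is worthwhile depends on the application.

```java
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;

/** Hypothetical holder that configures a factory once and reuses it. */
class SharedXmlFactories {

  /** Created and configured a single time. */
  private static final DocumentBuilderFactory DBF = createFactory();

  /** One DocumentBuilder per thread; builders are not shared. */
  private static final ThreadLocal<DocumentBuilder> BUILDER =
    new ThreadLocal<DocumentBuilder>() {
      @Override
      protected DocumentBuilder initialValue() {
        // Guard the shared factory while a builder is being created.
        synchronized (DBF) {
          try {
            return DBF.newDocumentBuilder();
          } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
          }
        }
      }
    };

  private static DocumentBuilderFactory createFactory() {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true); // example of a one-time option
    return dbf;
  }

  /** Returns a new, empty document from the current thread's builder. */
  static Document newDocument() {
    return BUILDER.get().newDocument();
  }

  public static void main(String[] args) {
    Document first = newDocument();
    Document second = newDocument();
    System.out.println(first != second); // separate documents, same builder
  }
}
```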

Alternative methods:

This is by no means a comprehensive list of APIs and existing utilities that can be used to generate XML. One of the more popular methods involves mapping back and forth to Java classes, sometimes generated automatically, and sometimes using Java annotations. JAXB, the Java Architecture for XML Binding, is the standard Java API for this method. Castor and JiBX are two other popular products. Another IBM developerWorks article, "XML and Java technologies: Data binding" (Part 2: Performance) by Dennis Sosnoski, reviews both of these and a number of other related products, but gives very favorable reviews to JiBX, then continues to cover it in detail: Part 3: JiBX architecture and Part 4: JiBX Usage.

Sample XML:

For reference, here is the sample, simple XML that is produced by all of the above methods. It contains a root element, attribute, and child element. Anything more complex is pretty much more of the same, though all three methods discussed (DOM, SAX, and StAX) also support generating CDATA sections, processing instructions, and other XML markup. (Note that it has been "pretty-printed" here, which was not done by any of the above code. It could be, but that's another post...)

<?xml version="1.0" encoding="UTF-8"?>
<Root Name="Value">
  <Child/>
</Root>
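As a brief aside on the pretty-printing mentioned above, one common approach with the DOM method is to ask the Transformer to indent its output. This is only a sketch, with an invented class name (PrettyPrintSketch). OutputKeys.INDENT is standard JAXP, but the indent-amount property is specific to the Apache-derived implementations and is included here on the assumption that one of those is in use. Exact whitespace varies between Java versions.

```java
import java.io.StringWriter;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical example of indented output from the DOM method. */
class PrettyPrintSketch {

  static String build() {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().newDocument();
      Element root = doc.createElement("Root");
      doc.appendChild(root);
      root.setAttribute("Name", "Value");
      root.appendChild(doc.createElement("Child"));

      Transformer t = TransformerFactory.newInstance().newTransformer();
      t.setOutputProperty(OutputKeys.INDENT, "yes");
      // Implementation-specific property (Apache Xalan and derivatives).
      t.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "2");

      StringWriter out = new StringWriter();
      t.transform(new DOMSource(doc), new StreamResult(out));
      return out.toString();
    } catch (Exception e) {
      // Wrapped for brevity; real code should handle specific exceptions.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(build());
  }
}
```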

1 comment:

Arun N Kumar said...

Nice article in overall.
But can you rate the approaches in terms of performance so that people can also choose the right one for the right perf requirements? That would be helpful !

Thanks!
Arun