Thursday, February 8, 2007

Java: Iterating over XML DOM Nodes

OK, this should be basic stuff for anyone with a Computer Science degree, but especially after running across some less-than-desirable code attempting to the same, I just "felt the need" to make a post out of this...

(Skip ahead to "The Real Issue" or right to "The Result".)

First, some background:

I'm going to assume you're already familiar with the basic "for" loop in Java:

for(int i=0; i<10; i++){
  System.out.println("i: " + i);
}

Well, this works great if you don't need to operate on or with each item within a collection, or if you DO need access to a "counter" variable.

A better approach when needing to operate on or with each item within a collection is to use an Iterator:

for(Iterator iter = list.iterator(); iter.hasNext();){
  Object item = iter.next();
  System.out.println(item);
}

Please note how the Iterator is declared within the loop. A frequently-seen alternative which I really dislike looks like this:

Iterator iter = list.iterator();
while(iter.hasNext()){
  Object item = iter.next();
  System.out.println(item);
}

This unnecessarily places the Iterator variable outside the scope of the loop where it is not needed, "polluting" the parent scope. It also allows the variable to remain after completion, where it could be accidentally referenced and used, potentially causing an issue.

Java 1.5 provides an even quicker approach using an enhanced "for" loop:

for(Object item : list){
  System.out.println(item);
}

(See "The for Statement" in The Java Tutorials, and the official specifics on these language features in the Java Language Specification, both from Sun Microsystems.)

Now, the real issue: How to effectively and efficiently iterate over a collection of XML DOM Nodes?

Every Node object has a getChildNodes() method which returns a NodeList. (Other methods also return NodeLists, such as various XPath operations.) Unfortunately, neither Node nor NodeList implement Iterable, provide an iterator() method, or otherwise extend from or implement any other Java feature that would make this an easier task. (I do have to give credit to Sun for specifically following W3C's "Document Object Model (DOM) Level 3 Core Specification" - following standards is usually a good thing!)

NodeList provides two methods: int getLength() and Node item(int index).

Here is one way to iterate over a node's child nodes:

NodeList nodeList = node.getChildNodes();
for(int i=0; i<nodeList.getLength(); i++){
  Node childNode = nodeList.item(i);
  // Do something with childNode...
}

However, this unnecessarily declares the NodeList outside the loop, just as I disliked with the Iterator in the example above.

At one point, I even worked on a "NodeListAdapater" adapter class that implemented Iterable, which would have allowed for the following:

for(Node childNode :
    new NodeListAdapter(node.getChildNodes())){
  // Do something with childNode...
}

However, either of the above methods have some additional issues:

  1. Accessing each child node by item(int) is not the most efficient method of accesssing the child nodes, at least not under Java's default implementation using Apache Xerces. (Check out the source code if you want to see for yourself.) Again, there are some real good reasons for this, but I'll leave the explanation for later or for someone else to explain...
  2. Adding or removing nodes while looping will cause undesirable results, e.g. either missed or duplicated results. This isn't even an issue with synchronization and thread safety, but if the nodes are modified by the loop itself. Most of Java's Iterator's will throw a ConcurrentModificationException in such a case, something that NodeList does not account for.

What does this leave us? The result / favored solution!

Think about a "Linked List". After all, this is how the DOM structure is typically implemented! (Actually, the Xerces implementation is based on a doubly-linked list.)

Take note of a few other methods available from Node: Node getFirstChild(), Node getNextSibling(), and the opposite/related methods.

Here's the 1st trial:

for(Node childNode=node.getFirstChild();
    childNode!=null; childNode=childNode.getNextSibling()){
  // Do something with childNode...
}

This will get the node's first child, and then continually set "childNode" to the next child until there are no more, at which point "childNode" becomes null, and the loop exits.

By the way, this loop runs very fast! (This is good! :-)

Unfortunately, there's still a slight concurrent modification issue to deal with. Actually, it's possibly better than before - if nodes are added or removed before the current position, they won't be noticed; if after, they'll still be encountered. However, the node can get moved, or deleted! In this case, the call to getNextSibling() may become null and/or may no longer be what is desired.

The solution is to calculate and remember the next node before proceeding with the operation. Here's the 2nd and final "solution" (for the moment, at least!):

for(Node childNode = node.getFirstChild();
    childNode!=null;){
  Node nextChild = childNode.getNextSibling();
  // Do something with childNode,
  //   including move or delete...
  childNode = nextChild;
}

Happy coding!

12 comments:

Yoni Samlan said...

i'm pretty sure that where you wrote
Node nextChild = node.getNextSibling();
you meant:
Node nextChild = childNode.getNextSibling();

since node.getNextSibling() would lead to an infinite loop. Might cause some headaches for beginning DOM programmers who stumble on this page. :)

Mark A. Ziesemer said...

Thanks, Yoni! This is now fixed in the posting.

(Unfortunately, I end up retyping just about everything in here, rather than copying and pasting, due to some poor formatting issues with Blogger. What you mentioned is exactly how I've been using it, and how it should have been copied here...)

Anonymous said...

Very compact and efficient. Thank you!

Anonymous said...

Most programmers don't have a CS degree :) Thanks for the article.

Anonymous said...

Hi, great info Mark. Any tips on how to make a system traverse the entire XML structure, it would be great to know how to achieve this using breadth-first and depth-first iteration techniques.

Best regards

RipX

Anonymous said...

That's a pretty nice article, but it only applies to iterating over nodes that have the same parent / that are siblings. The real issue is how to iterate efficiently over any NodeList, say, like one returned by Document.getElementsByTagName(...)

I would really be interested to know if there are better solutions out there.

Anonymous said...

Very nice!

Anonymous said...

Sweet :) !

Sam said...

Funny. I have tested the original loop and the method you recommend here. But I actually found the method you recommend takes longer time than the original one when I tried to flatten an XML file (both small files and big files) by using loops to go through all the nodes.

Anonymous said...

this code snippet does not show you how to "effectively" go through all the nodes. it only goes over the immediate childNodes of the root.

Anonymous said...

Thank you, very helpful :)

dojo effects said...

This is very helpful. We need this code so many times and this blog post is an excellent reference for the same. By the way , I had blogged on internal working of iterators in java and wanted to share the same after going through this code for iterating dom nodes.