Wednesday, February 18, 2009

MarkUtils-XML: NamespaceContextMap, PrettyPrint, Date Format

Adding to my collection of MarkUtils, this post introduces MarkUtils-XML. It is available on ziesemer.java.net under the GPL license, complete with source code, a compiled .jar, generated JavaDocs, and a suite of JUnit tests. Download the com.ziesemer.utils.xml-*.zip distribution from here.

NamespaceContextMap

I think that XML Namespaces are a great solution for avoiding naming collisions. I also think that XPath is a very useful tool for pulling data out of XML documents. Unfortunately, using XPath with XML Namespaces involves a little bit of extra work, especially in the current version of Java.

The most common issue I see other developers run into when first working with this combination is finding that their XPath isn't returning any results. This is because an unprefixed name in an XPath expression only matches nodes that are in no namespace. XML nodes declared with namespaces can be referenced using namespace prefixes, where each prefix is assigned to a specific namespace URI. Note that prefixes only function as placeholders for the namespace URIs. Even though an XML document may have one prefix assigned to a given namespace, it cannot be assumed that the prefix will remain unchanged. Many times, these prefixes are generated automatically and/or as needed for each namespace used in an XML document. Two XML documents should be considered equal if the only difference between them is the prefixes used for a common namespace. For example, XSLT uses a namespace URI of "http://www.w3.org/1999/XSL/Transform". It is commonly bound to either the "xs:" or "xsl:" prefix, though other prefixes are also used and valid. As such, any application should explicitly map each desired namespace to a local prefix of its own choosing, and use that prefix to reference XML nodes declared with the namespace.

In Java, the XPath class accepts a NamespaceContext instance for resolving namespace prefixes to namespace URIs, and vice-versa. Unfortunately, Java does not currently provide an implementation of the NamespaceContext interface, as reported in Sun's bug 6376058. It is relatively easy to write a simple implementation, which can optionally be included as either an inner-class or an anonymous inner-class. However, this can quickly become quite repetitive, especially when needing to support multiple namespace mappings in the same context.
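For reference, a hand-written implementation of the kind described above might look like the following. This is only a minimal sketch: the class name and the singleMapping and evaluate helpers are made up for illustration, and it supports exactly one prefix/URI pair. It also skips the predefined "xml" and "xmlns" bindings that the interface's Javadoc asks implementations to handle.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SimpleNamespaceContextExample {

    /** Hypothetical helper: a context that knows exactly one prefix/URI pair. */
    static NamespaceContext singleMapping(final String prefix, final String uri) {
        return new NamespaceContext() {
            public String getNamespaceURI(String p) {
                if (p == null) {
                    throw new IllegalArgumentException("prefix must not be null");
                }
                // Unknown prefixes resolve to the "no namespace" URI (an empty string).
                return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
            }

            public String getPrefix(String u) {
                if (u == null) {
                    throw new IllegalArgumentException("namespace URI must not be null");
                }
                return uri.equals(u) ? prefix : null;
            }

            public Iterator<String> getPrefixes(String u) {
                if (u == null) {
                    throw new IllegalArgumentException("namespace URI must not be null");
                }
                return uri.equals(u)
                    ? Collections.singleton(prefix).iterator()
                    : Collections.<String>emptyList().iterator();
            }
        };
    }

    /** Hypothetical helper: parses the XML namespace-aware and evaluates the expression as a string. */
    static String evaluate(String xml, String expression, NamespaceContext context) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // required, or the namespace URIs are lost during parsing
        Document doc = dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(context);
        return xpath.evaluate(expression, doc);
    }
}
```

Note that the prefix passed to singleMapping does not need to match the prefix used in the document. Only the namespace URI has to agree, which is the point made above about prefixes being placeholders.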

My solution is the NamespaceContextMap class. It implements both NamespaceContext and Map<String, String>, making it very easy to configure and use. It accepts prefix/URI pairs as well as QName instances. Lookups are first resolved against the instance's configured list of mappings, then against the default mappings as defined in NamespaceContext and XMLConstants. It also follows all the guidelines listed in the interface's Javadoc.

Here is some basic, example usage:

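import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
// NamespaceContextMap is provided by the MarkUtils-XML distribution.

// Map local prefixes to the namespace URIs the XPath expressions will reference.
NamespaceContextMap ncm = new NamespaceContextMap();
ncm.put("xslt", "http://www.w3.org/1999/XSL/Transform");
ncm.put("xhtml", "http://www.w3.org/1999/xhtml");

// Register the mappings with a new XPath instance.
XPathFactory xpf = XPathFactory.newInstance();
XPath xpath = xpf.newXPath();
xpath.setNamespaceContext(ncm);

// Do XPath operations here...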

Two maps are internally maintained so that lookups are fast in either direction - one map is keyed by prefix, the other by URI. The values in the latter are stored in a Set with a backing List, which guarantees that multiple prefixes are supported per URI, and that the getPrefixes(String) method returns them in the order that they were added (FIFO) - essentially, an ordered set of prefixes per URI.
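The two-map idea can be sketched roughly as follows. This is not the NamespaceContextMap source: the class and field names are invented for illustration, and a LinkedHashSet stands in for the Set-with-backing-List described above, since it gives the same uniqueness and insertion-order guarantees.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class BidirectionalPrefixMap {
    // Forward lookup: each prefix maps to exactly one URI.
    private final Map<String, String> uriByPrefix = new HashMap<String, String>();
    // Reverse lookup: each URI maps to its prefixes, kept in insertion order.
    private final Map<String, Set<String>> prefixesByUri = new HashMap<String, Set<String>>();

    void put(String prefix, String uri) {
        String oldUri = uriByPrefix.put(prefix, uri);
        if (oldUri != null && !oldUri.equals(uri)) {
            // The prefix was re-bound, so drop it from the old URI's set to keep both maps consistent.
            Set<String> oldPrefixes = prefixesByUri.get(oldUri);
            oldPrefixes.remove(prefix);
            if (oldPrefixes.isEmpty()) {
                prefixesByUri.remove(oldUri);
            }
        }
        Set<String> prefixes = prefixesByUri.get(uri);
        if (prefixes == null) {
            prefixes = new LinkedHashSet<String>();
            prefixesByUri.put(uri, prefixes);
        }
        prefixes.add(prefix);
    }

    String getNamespaceURI(String prefix) {
        return uriByPrefix.get(prefix);
    }

    /** Returns the first prefix registered for the URI, or null if there is none. */
    String getPrefix(String uri) {
        Set<String> prefixes = prefixesByUri.get(uri);
        return prefixes == null ? null : prefixes.iterator().next();
    }

    /** Returns a read-only snapshot of all prefixes for the URI, in the order they were added. */
    List<String> getPrefixes(String uri) {
        Set<String> prefixes = prefixesByUri.get(uri);
        return prefixes == null
            ? Collections.<String>emptyList()
            : Collections.unmodifiableList(new ArrayList<String>(prefixes));
    }
}
```

Keeping both maps in sync inside a single put method is the easy part. The difficulty described next comes from the other ways a Map can be modified.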

In the implementation, I struggled with finding a solid way of enforcing consistency and constraints. In particular, the entrySet(), keySet(), and values() methods of the Map interface make it very difficult (but not impossible) to intercept add/remove operations, something that I previously posted about in Java Collections Listeners. For now, these methods return unmodifiable collections.

XML PrettyPrint

While XML may commonly be sent as a single line or without indentation for compactness and efficiency, it is usually most easily read with increased indentation at each level, commonly referred to as "pretty printing". This styling is presented by default in most web browsers, including both Mozilla Firefox and Microsoft Internet Explorer, as well as many IDEs and text editors. However, performing this formatting in an automated fashion within Java doesn't seem to be a feature that is readily available, stable, or easy to use. See the "Java 1.5 doesn't want to indent XML output" forum thread for some related discussion, including a copy of my solution.

My solution is an XSLT that reformats the XML with indentation, accounting for existing whitespace, and without any necessary references to "xml.apache.org". It also accepts configurable XSLT parameters for the indentation and newline character sequences. A PrettyPrint class is provided that handles loading the XSLT as a class resource, and returns a reusable, thread-safe Templates instance. For some details on this, including notes on how to chain it into an existing transformation or serialization for increased performance, see my previous post: XML and XSLT Tips and Tricks for Java.
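The general pattern of compiling a stylesheet once into a thread-safe Templates instance, and then creating a cheap Transformer per use, looks something like the sketch below. Note the assumptions: the stylesheet here is a plain identity transform used as a stand-in (it does not indent anything and is not PrettyPrint.xslt), it is compiled from an inline string rather than loaded as a class resource, and the class and method names are made up.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TemplatesReuseExample {
    // Stand-in stylesheet: an identity transform, NOT the PrettyPrint.xslt from the distribution.
    private static final String IDENTITY_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='@*|node()'>"
        + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Templates is safe to share across threads, so compile the stylesheet only once.
    private static final Templates TEMPLATES = compile();

    private static Templates compile() {
        try {
            return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(IDENTITY_XSLT)));
        } catch (TransformerConfigurationException e) {
            throw new IllegalStateException("Stylesheet failed to compile", e);
        }
    }

    /** Runs the shared stylesheet over the given XML and returns the result as a string. */
    static String transform(String xml) throws TransformerException {
        // Transformer instances are NOT thread-safe, so create a new one for each call.
        Transformer transformer = TEMPLATES.newTransformer();
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

With the real stylesheet, the indentation and newline sequences would be supplied through Transformer.setParameter before calling transform. The parameter names are defined in PrettyPrint.xslt and are not reproduced here.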

As noted in my "Tips and Tricks" post, please be sure to upgrade to the latest version of Apache Xalan, 2.7.1 or newer. Otherwise, there is a particular issue where generated comments tend to disappear. This isn't an issue specific to my transformation, and can be reproduced even with an identity transformation. See the comment at the beginning of PrettyPrint.xslt for details.

XmlDateFormat

A frequent task I encounter is generating valid XML Schema dateTime-formatted values. This format is a profile of the ISO 8601 standard, and is further detailed in RFC 3339. Unfortunately, Java doesn't currently provide a standard DateFormat that matches this specification. Included in my package is an XmlDateFormat class with a getDateFormat() method that returns a properly-configured DateFormat. As with most Format instances, the returned DateFormat instance is not guaranteed to be thread-safe and should not be shared across threads.
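To illustrate the idea, here is a rough sketch of one way to configure such a format using only the JDK. It is not the XmlDateFormat source: the class name, the choice to always emit UTC with a literal "Z" suffix, the millisecond precision, and the ThreadLocal wrapper are all assumptions made for this example.

```java
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class XmlDateTimeSketch {

    /** Creates a new formatter; callers must not share the returned instance across threads. */
    static DateFormat newUtcDateTimeFormat() {
        // The 'T' and 'Z' are quoted literals; 'Z' is only correct because the zone is forced to UTC below.
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        format.setLenient(false);
        return format;
    }

    // One formatter per thread avoids both sharing and repeated construction.
    private static final ThreadLocal<DateFormat> PER_THREAD = new ThreadLocal<DateFormat>() {
        @Override
        protected DateFormat initialValue() {
            return newUtcDateTimeFormat();
        }
    };

    /** Formats the date as an xs:dateTime string in UTC. */
    static String format(Date date) {
        return PER_THREAD.get().format(date);
    }

    public static void main(String[] args) {
        System.out.println(format(new Date(0L))); // prints 1970-01-01T00:00:00.000Z
    }
}
```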

Monday, February 2, 2009

See all newspaper comments at once with Greasemonkey

Background

Like many web users today, I get much of my local news from the online versions of the local newspapers. One particular feature the online editions offer over the print editions is the inclusion of user comments / responses to the stories.

At least around my areas of interest in Central Wisconsin and the Fox Valley, most of the local papers are owned by the Gannett Company. This includes the Appleton Post-Crescent, the Wausau Daily Herald, and others. Much of Gannett's online presence is currently provided by Pluck Social Media's SiteLife product, particularly SiteLife Comments. Pluck even hosts a special customer profile detailing their work with Gannett Corporation. While the sites I'm working with here all happen to be owned by Gannett, it's quite possible that this will also apply to other Pluck-based sites as well.

Unfortunately, Pluck's current implementation leaves some things to be desired. I recall reading many of the negative comments left as the transition was made from the old version of the sites to the current version, which is when I believe Pluck became involved. Fortunately, they have improved some things since, and definitely seem to be faring better than the current fiasco at Dell's online community after their so-called upgrade. However, the most annoying issue I have while reading Gannett's local news articles is that only 5 user comments are visible at a time. There is a "Full Page View" option available at the bottom, but this only increases the visible comments-per-page to 10. While all these comments are loaded in an AJAX-type fashion using JSON data, clicking to retrieve the next page still results in reloading the entire page. Even on a broadband connection, each page change requires 5+ seconds. This makes trying to read through all the comments on a popular story very frustrating, especially when there are sometimes 50 or more responses. While many of these comments are informative or insightful, having to click through and reload 5 or more pages is certainly not making the best use of web technology.

Technical Challenges

As I had done with Resizing the Blogger Edit Box, my first thought was to attempt to improve things with a Bookmarklet. Unfortunately, the task proved to be too complex, partially due to the same origin policy blocking the necessary cross-domain data. In particular, while the article page is served from a "www." host, the JSON data containing the comments is obtained from a "sitelife." host. While the current pages seem to work around this restriction through some iframe tricks, attempting to reuse that functionality would be a hack at best. Instead, I turned to a Greasemonkey-based solution. Greasemonkey provides a non-domain-restricted GM_xmlhttpRequest API method that provides access to Mozilla's chrome-privileged XMLHttpRequest object.

The pages I had to work with were far from pleasant or easy to work with. Each page typically includes about 20 JavaScript files, and some of the code is quite obfuscated. One of the main files, "GDSRScripts.js", is about 86 KB. The core of the Yahoo! UI Library (YUI) (yahoo-dom-event.js in 2.6.0) is not even half that, at only 31 KB. I also see no effort made to respect the JavaScript global namespace or to follow other best practices.

The Solution

I've completed a Greasemonkey script that I've posted at userscripts.org: All Pluck Comments. Once installed and configured for one or more of the Pluck-based Gannett news sites, it will update any loaded news article by showing all available comments on the same, single page. If all the existing comments already fit within one page and the current 5-post limit, the script will exit and do nothing. Unfortunately, much of the previous waiting time doesn't seem to be in the JavaScript, but with the server responding to the JSON requests - a performance issue that can't be resolved client-side. While those requests are made, the script will show the loading status above the existing comments. Once all comment "pages" have been downloaded, the comments section is repopulated with the complete list. Additionally, changing the sort order between "Newest first" and "Oldest first" now performs instantly, without requiring additional remote requests.

Due to the number of possible supported sites, only one default URL pattern is configured in the "included pages" within the Greasemonkey script. Other desired, supported sites will need to be manually added. (This would be easier if Greasemonkey supported regular expressions for the patterns, as I requested in ticket #216.) There are two types of URLs I've observed that should be supported. The first looks like "http://<hostname>/article/<date>/<siteId>/<articleId>/". The other looks like "http://<hostname>/apps/pbcs.dll/article?AID=/<date>/<siteId>/<articleId>". The best non-regular expression pattern I can suggest to match both these patterns is "http://<hostname>/*article*", where "<hostname>" needs to be replaced with the literal host and domain name to be supported.

The only current limitation is that the per-comment controls (Recommend, New post, Reply to this Post, Report Abuse, etc.) are not regenerated. This is because it would be very difficult, if even possible, to make all of Pluck's existing JavaScript work with these enhancements. In order to use these controls, click on the "Full Page View" link that is left below the list of comments. This will bring back the limit of 10 comments per page, but the Greasemonkey script will exit without making any changes, leaving these controls intact. I seldom use these controls, so this issue isn't that important to me. However, if there is a stated interest, I may look into resolving this for a future version. Alternatively, feel free to write and submit a patch!

Technical Details

The script first waits for the existing comments to load, at which point it determines the article ID, the total number of comments available, and other information necessary for requesting the additional "pages" of comments. If it times out waiting, or determines that it is on the "Full Page View", it simply exits and does nothing. Otherwise, it makes a series of asynchronous requests to retrieve all the available comments. The responses are unnecessarily URL-encoded, and are decoded by the script using unescape(). The responses also contain an unnecessary <script> section at the beginning, which is searched for and removed. The JSON text is then "safely" evaluated to a JavaScript object using the regular expression provided in section 6 of RFC 4627. Once all responses are received, the existing comments HTML is cleared, and new comments are built and populated from the JSON data using the HTML DOM.
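The "safe" check from section 6 of RFC 4627 is language-neutral, so here it is ported to Java to match the other examples on this blog. The actual script is JavaScript and is not reproduced here, and the class and method names below are made up. The idea is to strip out all string literals first, then reject the text if anything remains that could not appear in JSON outside of a string. The RFC's JavaScript version only calls eval() once this check passes.

```java
import java.util.regex.Pattern;

class Rfc4627Check {
    // A double-quoted string literal, allowing backslash escapes inside it.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("\"(\\\\.|[^\"\\\\])*\"");

    // Any character that may NOT appear in JSON text outside of a string literal.
    // The allowed letters cover "true", "false", "null", and number exponents.
    private static final Pattern UNSAFE_CHAR =
        Pattern.compile("[^,:{}\\[\\]0-9.\\-+Eaeflnr-u \\n\\r\\t]");

    /** Returns true if the text contains only characters that are plausible in JSON. */
    static boolean looksLikeJson(String text) {
        String withoutStrings = STRING_LITERAL.matcher(text).replaceAll("");
        return !UNSAFE_CHAR.matcher(withoutStrings).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeJson("{\"count\": 12, \"ok\": true}")); // prints true
        System.out.println(looksLikeJson("alert(1)"));                      // prints false
    }
}
```

This is a character-level filter rather than a full parser, so it only guards against executable content. Malformed but harmless text, such as unbalanced brackets, still passes the check and would fail later during evaluation.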

Some of the tools I used during this process were Firebug, JSView, and Notepad++. Some of the JavaScript practices I used include closures and other JavaScript topics I've written about.