Sunday, March 23, 2008

Scraping Sun's bug database with NekoHTML

Sun has a rather comprehensive bug database at http://bugs.sun.com/, which is probably best known for tracking Java bugs. Unfortunately, at least with the interface and options presented to the general public (such as myself), some aspects are rather lacking. Beyond being able to vote for a bug (only up to 3) and create a single "watch list", it is a far cry from Bugzilla.

The feature I've been missing the most, however, is an easy way to keep track of bugs of interest and, by extension, to share bug lists and searches. Bugzilla makes this incredibly easy by offering a number of export options, including XML, CSV, and RSS feeds. As of Bugzilla 3.0, saved searches can also be shared, as a result of Bugzilla Bug 69000. This is demonstrated in Bugzilla's public test environment, http://landfill.bugzilla.org/bugzilla-3.0-branch/.

As a side note, while both https://bugs.eclipse.org/bugs/ and https://bugzilla.mozilla.org/ appear to be running version 3.0+ of Bugzilla, neither currently seems to have these shared queries working or enabled. (I've opened bug 223594 on Eclipse to hopefully address this.)

Java HTML Parsers

Unfortunately, as I previously mentioned, Sun isn't using Bugzilla, and there are no export options of any type. As far as I can tell, this leaves parsing the HTML as the only option.

If the HTML returned by Sun on their bug pages were valid XHTML, this would be an easier task: just write a few XPaths to find the necessary fields, format the data as desired, and the task is complete. Unfortunately, the pages are returned simply as regular HTML, not the XML-compatible variety. Unlike Blogger, Sun's pages are at least returned with a valid identifying type that actually matches the page, in this case a doctype of "-//W3C//DTD HTML 4.01 Transitional//EN".

There are many ways to parse HTML. While I know a lot of people like Perl, I prefer to stick with the Java approach. In any language, there are also many existing tools and frameworks to help with the task, so starting from scratch would probably be a waste of time. The non-XML-based version of HTML can be rather messy, with unbalanced tags, missing escape sequences, and other issues that will quickly lead to headaches - and these tools will help "clean" the input.

I recently found a blog posting about a similar goal: "Showdown - Java HTML Parsing Comparison" (Ben McCann, 2008-02-02, lumidant.com). He demonstrates four Java libraries for parsing HTML: NekoHTML, TagSoup, jTidy, and HTMLCleaner. (Unlike the referenced blog, I've linked these utilities to their official sites for your convenience.) The referenced post favored HTMLCleaner, as it was supposedly the only tool to successfully extract 10/10 documents. I've had better luck with NekoHTML. I'd be curious to see the rest of the code that was used, as I suspect that case sensitivity and a few other issues may have played into the results. At least with NekoHTML, these can easily be normalized with a configuration option or two.

NekoHTML

In the past, I used HttpUnit as such a tool. It actually uses NekoHTML as the parser, then provides an HTML-specific API for navigating and querying the "cleaned" document, as well as for performing actions, which serves HttpUnit's real purpose as a testing framework.

Of the four parsers listed above, NekoHTML definitely seems to be the most comprehensive and, as of recently, actively maintained. After a period without any releases between June 2005 and December 2007, it was relaunched on SourceForge. Beware of version 1.9.6.2, however, as there seems to be a rather severe regression in the handling of single quotes, which I reported in issue 1922810.

For starting with NekoHTML, I recommend the following configuration:

import org.cyberneko.html.parsers.DOMParser;
// …
DOMParser domParser = new DOMParser();
domParser.setFeature("http://cyberneko.org/html/features/insert-namespaces", true);
domParser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

See http://nekohtml.sourceforge.net/settings.html for the details of these options. The insert-namespaces feature applies the XHTML namespace, http://www.w3.org/1999/xhtml, to all HTML content, which distinguishes HTML content from any other content defined in an alternate namespace within the document. The names/elems property instructs NekoHTML to convert all tag names to lower case, which matches the XHTML specification (as opposed to the upper case conventionally used for HTML). Setting the insert-namespaces feature seems to be a prerequisite for the names/elems property.

The only down-side to enabling XML namespaces is slightly complicating the use of XPath. In order to properly query elements in XML namespaces with XPath, a javax.xml.namespace.NamespaceContext implementation needs to be registered to the XPath using XPath.setNamespaceContext(…), which maps prefixes to namespaces, etc. (See NamespaceContextMap in MarkUtils-XML.)
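To illustrate the point about namespaces, here is a minimal sketch using only the JDK's XML APIs. The class names (XhtmlNamespaceDemo, XhtmlNamespaceContext) and the tiny sample page are my own placeholders for illustration, not part of NekoHTML or MarkUtils-XML, and the context handles only the single "html" prefix rather than the full NamespaceContext contract.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XhtmlNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Simplified context: maps only the "html" prefix to the XHTML namespace. */
    static final class XhtmlNamespaceContext implements NamespaceContext {
        @Override
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI) ? "html" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI)
                    ? Collections.singleton("html").iterator()
                    : Collections.<String>emptyIterator();
        }
    }

    /** Parses namespace-aware XHTML text and returns the string value of the expression. */
    static String evaluate(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // otherwise the DOM carries no namespaces
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new XhtmlNamespaceContext());
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html xmlns='" + XHTML_NS + "'><body><p>hello</p></body></html>";
        System.out.println(evaluate(page, "//html:p"));      // prefixed name matches
        System.out.println(evaluate(page, "//p").isEmpty()); // un-prefixed name finds nothing
    }
}
```

The second query shows the complication: once elements live in the XHTML namespace, an un-prefixed name such as //p no longer selects them, so every step of an expression needs the prefix.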

Parsing Sun's bug pages

For extracting the fields from one of Sun's bug pages, I formulated two XPath expressions:

//html:table[preceding::html:a[@name='skip2content']]//html:table/html:tbody

This expression finds the body of the table after an anchor that separates the desired content from the rest of the page headers, navigation, etc.

html:tr[html:td//text()=$pageLabel]/html:td[position()=2]

This expression finds the value of a desired field on the page, e.g. "Bug ID:", "Synopsis", or "Category", as currently represented by $pageLabel above. Evaluated relative to the table body found by the first expression, it selects the <tr/> containing a <td/> whose text matches the desired label, then returns the second <td/> of that row, which contains the desired value.

Use an XPathVariableResolver to handle the variable in the second expression. If multiple bug pages are to be processed, these expressions should probably be compiled to XPathExpressions for repeated use.
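Putting the two expressions together, a rough sketch might look like the following. It uses only the JDK, so the input is a hand-written XHTML string standing in for a page already cleaned by NekoHTML. The BugFieldExtractor class, its method names, and the sample markup (including the 0000000 bug ID) are hypothetical and only mimic the structure the expressions expect, not Sun's real pages.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class BugFieldExtractor {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for a cleaned bug page (not real Sun markup).
    static final String SAMPLE =
            "<html xmlns='" + XHTML_NS + "'><body><a name='skip2content'></a>"
            + "<table><tr><td><table><tbody>"
            + "<tr><td>Bug ID:</td><td>0000000</td></tr>"
            + "<tr><td>Synopsis</td><td>Example synopsis</td></tr>"
            + "</tbody></table></td></tr></table></body></html>";

    // Variable values by local name, updated before each evaluation (not thread-safe).
    private final Map<String, Object> variables = new HashMap<>();
    private final XPathExpression fieldTable;
    private final XPathExpression fieldValue;

    BugFieldExtractor() {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }

            @Override
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }

            @Override
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                        ? Collections.singleton("html").iterator()
                        : Collections.<String>emptyIterator();
            }
        });
        // The resolver in effect at compile time is the one the expressions use.
        xpath.setXPathVariableResolver(name -> variables.get(name.getLocalPart()));
        try {
            fieldTable = xpath.compile(
                    "//html:table[preceding::html:a[@name='skip2content']]//html:table/html:tbody");
            fieldValue = xpath.compile(
                    "html:tr[html:td//text()=$pageLabel]/html:td[position()=2]");
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the value next to the given label, or "" if the label is not found. */
    String extract(Document page, String label) {
        try {
            Node tbody = (Node) fieldTable.evaluate(page, XPathConstants.NODE);
            if (tbody == null) {
                return "";
            }
            variables.put("pageLabel", label);
            return fieldValue.evaluate(tbody).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses XHTML text into a namespace-aware DOM. */
    static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        BugFieldExtractor extractor = new BugFieldExtractor();
        Document page = parse(SAMPLE);
        System.out.println(extractor.extract(page, "Bug ID:"));
        System.out.println(extractor.extract(page, "Synopsis"));
    }
}
```

Both expressions are compiled once in the constructor and reused for every label and page, with only the value behind $pageLabel changing between evaluations. With NekoHTML in the picture, the Document would come from the configured DOMParser instead of the parse helper here.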

Here is my export of the bugs I'm currently "watching" / am interested in:

| Bug ID | Synopsis | Category | Reported Against | Release Fixed | State | Priority | Submit Date |
|---|---|---|---|---|---|---|---|
| 4079882 | Request for JTristateCheckbox implementation | java:classes_swing | 1.3.1 , 1.4.1 , 1.1fcs | | In progress, request for enhancement | 4-Low | 1997-09-17 |
| 4187336 | ServletResponse.setContentLength(Long) | javax_servlet:api | 1.1fcs | | Closed, will not be fixed | 4-Low | 1998-11-05 |
| 4526561 | File system change notification events should be supported | java:classes_io | merlin-beta2 | | In progress, request for enhancement | 4-Low | 2001-11-13 |
| 4652184 | please compile j2sdk rt.jar with -g (all options) | java:build | 1.4.2 , 1.4.2_04 , merlin-rc1 , tiger-beta , tiger-beta2 | mustang(b28) | Closed, fixed | 4-Low | 2002-03-13 |
| 4782054 | Allow for comments in the MANIFEST.MF file | java:jar | 1.4.1 | | In progress, request for enhancement | 4-Low | 2002-11-20 |
| 4787931 | System property "user.home" does not correspond to "USERPROFILE" (win) | java:classes_lang | 1.3 , 1.4.1 , 1.4.2 | | In progress, bug | 3-Medium | 2002-12-03 |
| 4838318 | (str) Substitute CharSequence for String arguments wherever possible | java:classes_lang | 1.4.1 , 1.4.2 | | In progress, request for enhancement | 4-Low | 2003-03-27 |
| 4880234 | ServiceUI needs a printDialog method wtih a Component parameter | java:classes_2d | 1.4.1 | | In progress, request for enhancement | 4-Low | 2003-06-18 |
| 4983159 | Typedef (alias) | java:specification | tiger-beta | | In progress, request for enhancement | 4-Low | 2004-01-24 |
| 5018574 | Unable to set focus to another component in JOptionPane | java:classes_swing | tiger | | In progress, bug | 3-Medium | 2004-03-23 |
| 5043696 | StringReader should be allow a String{Buffer,Builder} to be the backing store | java:classes_io | 1.4.2 | | In progress, request for enhancement | 4-Low | 2004-05-07 |
| 5096679 | PIT:PrintDialog is not positioned properly on multi-mon, when coords are invalid | java:classes_2d | mustang | | In progress, bug | 4-Low | 2004-09-03 |
| 5109347 | PrinterJob.printDialog() does not support multi-mon, always displayed on primary | java:classes_2d | 1.4 | | In progress, bug | 4-Low | 2004-09-30 |
| 6192554 | Need generic factory interface. | java:classes_util | | | In progress, request for enhancement | 4-Low | 2004-11-09 |
| 6212751 | DOC: ServiceUI.printDialog() need to enhance the description for X,Y coordinates | java:classes_2d | 1.4 | | In progress, bug | 4-Low | 2004-12-27 |
| 6214380 | Quality setting is disabled and always set to Normal in Print Dialog | java:classes_2d | | | In progress, request for enhancement | 4-Low | 2005-01-05 |
| 6215174 | Can't force layout of non-showing component | java:classes_awt | 5.0 | | In progress, request for enhancement | 4-Low | 2005-01-07 |
| 6312085 | The for/in statement should support Iterators | java:specification | tiger-beta | | In progress, request for enhancement | 4-Low | 2005-08-17 |
| 6325564 | (str) Provide CharSequenceReader with sub-sequence capability | java:classes_lang | | | In progress, request for enhancement | 4-Low | 2005-09-19 |
| 6358852 | Add methods on concurrent data structures that interrupt blocked threads | java:classes_util_concurrent | | | In progress, request for enhancement | 4-Low | 2005-12-05 |
| 6400189 | raw types and inference | java:compiler | | | In progress, bug | 4-Low | 2006-03-17 |
| 6476646 | (str) Make AbstractStringBuilder class public | java:classes_lang | | | In progress, request for enhancement | 5-Very Low | 2006-09-29 |

If I can find the time, a complete sample code download may also follow.
