Mark A. Ziesemer: Scraping Sun's bug database with NekoHTML

Sun has a rather comprehensive bug database at http://bugs.sun.com/, which is probably best known for tracking Java bugs. Unfortunately, at least with the interface and options presented to the general public (such as myself), some aspects are rather lacking. Beyond being able to vote for bugs (only up to 3) and create a single "watch list", it appears to be a far cry from Bugzilla.

The feature I've been missing the most, however, is an easy way to keep track of bugs of interest and, by extension, to share bug lists and searches. Bugzilla makes this incredibly easy by offering a number of export options, including XML, CSV, and RSS feeds. As of Bugzilla 3.0, saved searches can also be shared, a result of Bugzilla Bug 69000, as demonstrated in Bugzilla's public test environment, http://landfill.bugzilla.org/bugzilla-3.0-branch/.

As a side note, while both https://bugs.eclipse.org/bugs/ and https://bugzilla.mozilla.org/ appear to be running version 3.0+ of Bugzilla, neither currently seems to have these shared queries working/enabled. (I've opened bug 223594 on Eclipse to hopefully address this.)

Java HTML Parsers

Unfortunately, as I previously mentioned, Sun isn't using Bugzilla, and there are no export options of any type. As far as I can tell, this leaves parsing the HTML as the only option.

If the HTML returned by Sun on their bug pages were valid XHTML, this would be an easier task: just write a few XPaths to find the necessary fields, format the data as desired, and the task is complete. Unfortunately, the pages are returned simply as regular HTML, not the XML-compatible variety. Unlike Blogger, Sun's pages are at least returned with a valid identifying type that actually matches the page, in this case a doctype of "-//W3C//DTD HTML 4.01 Transitional//EN".

There are many ways to parse HTML. While I know a lot of people like Perl, I prefer to stick with the Java approach. In any language, there are also many existing tools and frameworks to help with the task, so starting from scratch would probably be a waste of time. The non-XML-based version of HTML can be rather messy, with unbalanced tags, missing escape sequences, and other issues that will quickly lead to headaches - and these tools will help "clean" the input.
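To illustrate why a "cleaning" parser is needed, here is a minimal sketch that uses only the JDK's strict XML parser. The class and method names (StrictParseCheck, isWellFormedXml) are hypothetical, made up for this illustration. It simply reports whether a snippet is well-formed XML; ordinary HTML with unclosed tags or unescaped ampersands is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: checks whether markup survives a strict XML parse. */
final class StrictParseCheck {

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input as not well-formed.
            return false;
        } catch (Exception e) {
            // Configuration or I/O problems are unexpected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedXml("<p>one<br/>two</p>")); // XML-style empty tag
        System.out.println(isWellFormedXml("<p>one<br>two</p>"));  // unbalanced <br>
        System.out.println(isWellFormedXml("<p>a & b</p>"));       // unescaped ampersand
    }
}
```

Only the first snippet parses; the other two are exactly the kind of input that the HTML-cleaning libraries discussed next are meant to repair.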

I recently found a blog posting about a similar goal: "Showdown - Java HTML Parsing Comparison" (Ben McCann, 2008-02-02, lumidant.com). He demonstrates four Java libraries for parsing HTML: NekoHTML, TagSoup, jTidy, and HTMLCleaner. (Unlike the referenced blog, I've linked these utilities to their official sites for your convenience.) The referenced post favored HTMLCleaner, as it was supposedly the only tool to successfully extract 10/10 documents. I've had better luck with NekoHTML. I'd be curious to see the rest of the code that was used, as I suspect that case sensitivity and a few other issues may have played into the results. At least with NekoHTML, these can easily be normalized with a configuration option or two.

NekoHTML

In the past, I utilized HttpUnit as such a tool. It actually uses NekoHTML as the parser, then provides an HTML-specific API for navigating/querying the "cleaned" document - as well as performing actions, for HttpUnit's real purpose as a testing framework.

Of the four parsers listed above, NekoHTML definitely seems to be the most comprehensive and, as of recently, actively maintained. After a period without any releases between June 2005 and December 2007, it was relaunched on SourceForge. Beware of version 1.9.6.2, however, as there seems to be a rather severe regression bug with the handling of single quotes, as I reported in issue 1922810.

For starting with NekoHTML, I recommend the following configuration:

import org.cyberneko.html.parsers.DOMParser;
// …
DOMParser domParser = new DOMParser();
// Place all HTML elements in the XHTML namespace.
domParser.setFeature("http://cyberneko.org/html/features/insert-namespaces", true);
// Normalize element names to lower case, as in XHTML.
domParser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

See http://nekohtml.sourceforge.net/settings.html for the details of these options. The insert-namespaces feature basically applies the XHTML namespace, http://www.w3.org/1999/xhtml, to all HTML content, allowing for distinction between HTML content and other possible content defined in an alternate namespace within the document. The names/elems property instructs NekoHTML to convert all tag names to lower case, which matches the XHTML specification (compared to upper-case for HTML). Setting the insert-namespaces feature seems to be a prerequisite for the names/elems property.

The only downside to enabling XML namespaces is that it slightly complicates the use of XPath. In order to properly query elements in XML namespaces with XPath, a javax.xml.namespace.NamespaceContext implementation, which maps prefixes to namespaces, needs to be registered with the XPath using XPath.setNamespaceContext(…). (See NamespaceContextMap in MarkUtils-XML.)
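As a rough sketch of what such a registration can look like, the following uses only the JDK. XhtmlNamespaceContext and NamespaceXPathDemo are hypothetical names for this illustration and are not the NamespaceContextMap class mentioned above. To stay self-contained, the sample input is a hand-written, already well-formed XHTML string rather than actual NekoHTML output, and the context handles only the single "html" prefix.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical, simplified context: maps only the prefix "html" to XHTML. */
final class XhtmlNamespaceContext implements NamespaceContext {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    @Override
    public String getNamespaceURI(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
    }

    @Override
    public String getPrefix(String namespaceURI) {
        return XHTML_NS.equals(namespaceURI) ? "html" : null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return prefix == null
            ? Collections.<String>emptyIterator()
            : Collections.singletonList(prefix).iterator();
    }
}

final class NamespaceXPathDemo {

    /** Evaluates an XPath against namespace-aware XHTML and returns the string result. */
    static String evaluate(String xhtml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespaced XPath queries
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new XhtmlNamespaceContext());
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "<html xmlns='http://www.w3.org/1999/xhtml'>"
            + "<head><title>Sample</title></head><body/></html>";
        // The prefixed query matches; the unprefixed one yields an empty string.
        System.out.println(evaluate(sample, "/html:html/html:head/html:title"));
        System.out.println(evaluate(sample, "/html/head/title").isEmpty());
    }
}
```

The prefix used in the expressions ("html" here) is arbitrary; only the namespace URI it maps to has to match the one on the parsed elements.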

Parsing Sun's bug pages

For extracting the fields from one of Sun's bug pages, I formulated two XPath expressions:

//html:table[preceding::html:a[@name='skip2content']]//html:table/html:tbody

This expression finds the body of the table after an anchor that separates the desired content from the rest of the page headers, navigation, etc.

html:tr[html:td//text()=$pageLabel]/html:td[position()=2]

This expression finds the value of a desired field on the page, e.g. "Bug ID:", "Synopsis", or "Category", as currently represented by $pageLabel above. It selects the row containing a <td/> node whose text matches the desired label, then returns the second <td/> node in that row, which contains the desired value.

Use an XPathVariableResolver to handle the variable in the second expression. If multiple bug pages are to be processed, these expressions should probably be compiled to XPathExpressions for repeated use.
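A minimal sketch of how the two expressions, a variable resolver, and compiled XPathExpressions might fit together follows, using only the JDK. BugFieldExtractor, HtmlPrefixContext, and SAMPLE_PAGE are hypothetical names, and the sample markup is a hand-written stand-in shaped to match the expressions, not a copy of one of Sun's pages. In practice the Document would come from the NekoHTML DOMParser configured earlier.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical helper: compiles both expressions once for reuse. Not thread-safe. */
final class BugFieldExtractor {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Hand-written stand-in for a cleaned bug page; the real markup is an assumption. */
    static final String SAMPLE_PAGE =
        "<html xmlns='" + XHTML_NS + "'><body><a name='skip2content'/>"
        + "<table><tbody><tr><td><table><tbody>"
        + "<tr><td>Bug ID:</td><td>4079882</td></tr>"
        + "<tr><td>State</td><td>In progress, request for enhancement</td></tr>"
        + "</tbody></table></td></tr></tbody></table></body></html>";

    /** Maps only the "html" prefix, which is all the two expressions need. */
    private static final class HtmlPrefixContext implements NamespaceContext {
        @Override public String getNamespaceURI(String prefix) {
            return "html".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }
        @Override public String getPrefix(String uri) {
            return XHTML_NS.equals(uri) ? "html" : null;
        }
        @Override public Iterator<String> getPrefixes(String uri) {
            return XHTML_NS.equals(uri)
                ? Collections.singletonList("html").iterator()
                : Collections.<String>emptyIterator();
        }
    }

    private final Map<String, Object> variables = new HashMap<>();
    private final XPathExpression tableBody;
    private final XPathExpression fieldValue;

    BugFieldExtractor() throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new HtmlPrefixContext());
        // Variables are looked up by local name each time an expression is evaluated.
        xpath.setXPathVariableResolver(name -> variables.get(name.getLocalPart()));
        tableBody = xpath.compile(
            "//html:table[preceding::html:a[@name='skip2content']]//html:table/html:tbody");
        fieldValue = xpath.compile(
            "html:tr[html:td//text()=$pageLabel]/html:td[position()=2]");
    }

    /** Returns the first matching table body, or null if the page has none. */
    Node findTableBody(Document page) throws XPathExpressionException {
        return (Node) tableBody.evaluate(page, XPathConstants.NODE);
    }

    /** Returns the value cell's text for a label, or "" if the label is absent. */
    String field(Node tbody, String label) throws XPathExpressionException {
        variables.put("pageLabel", label);
        return fieldValue.evaluate(tbody).trim();
    }

    /** Convenience wrapper: parses namespace-aware XHTML and extracts one field. */
    static String extract(String xhtml, String label) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document page = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            BugFieldExtractor extractor = new BugFieldExtractor();
            return extractor.field(extractor.findTableBody(page), label);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extract(SAMPLE_PAGE, "Bug ID:"));
        System.out.println(extract(SAMPLE_PAGE, "State"));
    }
}
```

When processing many pages, one extractor instance can be kept and reused so that both expressions are compiled only once. Because the label is stored in a shared map, a single instance should not be used from several threads at the same time.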

Here is my export of the bugs I'm currently "watching" / am interested in:

Bug ID	Synopsis	Category	Reported Against	Release Fixed	State	Priority	Submit Date
4079882	Request for JTristateCheckbox implementation	java:classes_swing	1.3.1 , 1.4.1 , 1.1fcs		In progress, request for enhancement	4-Low	1997-09-17
4187336	ServletResponse.setContentLength(Long)	javax_servlet:api	1.1fcs		Closed, will not be fixed	4-Low	1998-11-05
4526561	File system change notification events should be supported	java:classes_io	merlin-beta2		In progress, request for enhancement	4-Low	2001-11-13
4652184	please compile j2sdk rt.jar with -g (all options)	java:build	1.4.2 , 1.4.2_04 , merlin-rc1 , tiger-beta , tiger-beta2	mustang(b28)	Closed, fixed	4-Low	2002-03-13
4782054	Allow for comments in the MANIFEST.MF file	java:jar	1.4.1		In progress, request for enhancement	4-Low	2002-11-20
4787931	System property "user.home" does not correspond to "USERPROFILE" (win)	java:classes_lang	1.3 , 1.4.1 , 1.4.2		In progress, bug	3-Medium	2002-12-03
4838318	(str) Substitute CharSequence for String arguments wherever possible	java:classes_lang	1.4.1 , 1.4.2		In progress, request for enhancement	4-Low	2003-03-27
4880234	ServiceUI needs a printDialog method wtih a Component parameter	java:classes_2d	1.4.1		In progress, request for enhancement	4-Low	2003-06-18
4983159	Typedef (alias)	java:specification	tiger-beta		In progress, request for enhancement	4-Low	2004-01-24
5018574	Unable to set focus to another component in JOptionPane	java:classes_swing	tiger		In progress, bug	3-Medium	2004-03-23
5043696	StringReader should be allow a String{Buffer,Builder} to be the backing store	java:classes_io	1.4.2		In progress, request for enhancement	4-Low	2004-05-07
5096679	PIT:PrintDialog is not positioned properly on multi-mon, when coords are invalid	java:classes_2d	mustang		In progress, bug	4-Low	2004-09-03
5109347	PrinterJob.printDialog() does not support multi-mon, always displayed on primary	java:classes_2d	1.4		In progress, bug	4-Low	2004-09-30
6192554	Need generic factory interface.	java:classes_util			In progress, request for enhancement	4-Low	2004-11-09
6212751	DOC: ServiceUI.printDialog() need to enhance the description for X,Y coordinates	java:classes_2d	1.4		In progress, bug	4-Low	2004-12-27
6214380	Quality setting is disabled and always set to Normal in Print Dialog	java:classes_2d			In progress, request for enhancement	4-Low	2005-01-05
6215174	Can't force layout of non-showing component	java:classes_awt	5.0		In progress, request for enhancement	4-Low	2005-01-07
6312085	The for/in statement should support Iterators	java:specification	tiger-beta		In progress, request for enhancement	4-Low	2005-08-17
6325564	(str) Provide CharSequenceReader with sub-sequence capability	java:classes_lang			In progress, request for enhancement	4-Low	2005-09-19
6358852	Add methods on concurrent data structures that interrupt blocked threads	java:classes_util_concurrent			In progress, request for enhancement	4-Low	2005-12-05
6400189	raw types and inference	java:compiler			In progress, bug	4-Low	2006-03-17
6476646	(str) Make AbstractStringBuilder class public	java:classes_lang			In progress, request for enhancement	5-Very Low	2006-09-29

If I can find the time, a complete sample code download may also follow.

Mark A. Ziesemer

Sunday, March 23, 2008
