Sun has a rather comprehensive bug database at http://bugs.sun.com/, which is probably most notably used to track Java bugs. Unfortunately, at least with the interface and options presented to the general public (such as myself), some aspects are rather lacking. Beyond being able to vote for a bug (only up to 3), and creating a single "watch list", it appears to be a far cry from Bugzilla.
The feature I've been missing the most, however, is an easy way to keep track of bugs of interest and, by extension, to share bug lists and searches. Bugzilla makes this incredibly easy by offering a number of export options, including XML, CSV, and RSS feeds. As of Bugzilla 3.0, saved searches can also be shared, as a result of Bugzilla bug 69000, and this is demonstrated in Bugzilla's public test environment, http://landfill.bugzilla.org/bugzilla-3.0-branch/.
As a side note, while both https://bugs.eclipse.org/bugs/ and https://bugzilla.mozilla.org/ appear to be running version 3.0+ of Bugzilla, neither currently seems to have these shared queries working/enabled. (I've opened bug 223594 on Eclipse to hopefully address this.)
Java HTML Parsers
Unfortunately, as I previously mentioned, Sun isn't using Bugzilla, and there are no export options of any type. As far as I can tell, this leaves parsing the HTML as the only option.
If the HTML returned by Sun on their bug pages was valid XHTML, this would be an easier task.
Just write a few XPaths to find the necessary fields, format the data as desired, and another task complete.
Unfortunately, the pages are returned simply as regular HTML, not the XML-compatible variety.
Unlike Blogger, Sun's pages are at least returned with a valid identifying type that actually matches the page, in this case a doctype of "-//W3C//DTD HTML 4.01 Transitional//EN".
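To illustrate how little work the well-formed case would be, here is a minimal sketch using only the JDK's built-in XML and XPath APIs. The sample markup, the class name `WellFormedPageSketch`, and the `field` helper are assumptions made up for this illustration. They are not Sun's actual page structure.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class WellFormedPageSketch {

    // Hypothetical, well-formed stand-in for a bug page; not Sun's real markup.
    static final String SAMPLE =
        "<html><body><table>"
        + "<tr><td>Bug ID:</td><td>1234567</td></tr>"
        + "<tr><td>Synopsis</td><td>Example synopsis</td></tr>"
        + "</table></body></html>";

    // Returns the text of the cell that follows the cell holding the given label.
    static String field(String xml, String label) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // A label containing a single quote would break this expression;
            // that case is ignored to keep the sketch short.
            String expression = "//td[normalize-space(.)='" + label
                + "']/following-sibling::td[1]";
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not query the sample page", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(field(SAMPLE, "Bug ID:"));  // prints "1234567"
        System.out.println(field(SAMPLE, "Synopsis")); // prints "Example synopsis"
    }
}
```

With real-world HTML, the `parse` call is the step that fails, which is why a cleaning parser is needed first.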
There are many ways to parse HTML. While I know a lot of people like Perl, I prefer to stick with the Java approach. In any language, there are also many existing tools and frameworks to help with the task, so starting from scratch would probably be a waste of time. The non-XML-based version of HTML can be rather messy, with unbalanced tags, missing escape sequences, and other issues that will quickly lead to headaches - and these tools will help "clean" the input.
I recently found a blog posting about a similar goal: "Showdown - Java HTML Parsing Comparison" (Ben McCann, 2008-02-02, lumidant.com). He demonstrates 4 Java libraries for parsing HTML, including NekoHTML, TagSoup, jTidy, and HTMLCleaner. (Unlike the referenced blog, I've linked these utilities to their official sites for your convenience.) The referenced post favored HTMLCleaner, as it was supposedly the only tool to successfully extract 10/10 documents. I've had better luck with NekoHTML. I'd be curious to see the rest of the code that was used, as I suspect that case sensitivity and a few other issues may have played into the results, and at least with NekoHTML, can easily be normalized with a configuration option or two.
In the past, I utilized HttpUnit as such a tool. It actually uses NekoHTML as the parser, then provides an HTML-specific API for navigating/querying the "cleaned" document - as well as performing actions, for HttpUnit's real purpose as a testing framework.
Of the four parsers listed above, NekoHTML definitely seems to be the most comprehensive and, as of recently, actively maintained. After a period without any releases between June 2005 and December 2007, it was relaunched on SourceForge. Beware of version 18.104.22.168, however, as there seems to be a rather severe regression bug with the handling of single quotes, as I reported in issue 1922810.
For starting with NekoHTML, I recommend the following configuration:
    import org.cyberneko.html.parsers.DOMParser;
    // …
    DOMParser domParser = new DOMParser();
    domParser.setFeature("http://cyberneko.org/html/features/insert-namespaces", true);
    domParser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
See http://nekohtml.sourceforge.net/settings.html for the details of these options.
The insert-namespaces feature applies the XHTML namespace, http://www.w3.org/1999/xhtml, to all HTML content, allowing for distinction between HTML content and other possible content defined in an alternate namespace within the document.
The names/elems property instructs NekoHTML to convert all tag names to lower case, which matches the XHTML specification (compared to upper-case for HTML).
insert-namespaces feature seems to be a pre-requisite for the
The only down-side to enabling XML namespaces is slightly complicating the use of XPath.
In order to properly query elements in XML namespaces with XPath, a javax.xml.namespace.NamespaceContext implementation, which maps prefixes to namespaces, needs to be registered with the XPath using XPath.setNamespaceContext(…).
NamespaceContextMap in MarkUtils-XML.)
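For readers who just want to see the moving parts, the following is a minimal sketch of such a NamespaceContext, written against the JDK only. It is not the NamespaceContextMap mentioned above. The prefix "h", the class name `XhtmlNamespaceSketch`, and the sample document are assumptions for illustration. The sample is parsed with the JDK's namespace-aware parser as a stand-in for NekoHTML output with insert-namespaces enabled.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XhtmlNamespaceSketch {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample standing in for a namespaced, cleaned-up HTML page.
    static final String SAMPLE =
        "<html xmlns=\"" + XHTML_NS + "\"><body><p>Hello</p></body></html>";

    // Minimal context that only knows the arbitrarily chosen prefix "h".
    static final NamespaceContext XHTML_CONTEXT = new NamespaceContext() {
        @Override
        public String getNamespaceURI(String prefix) {
            if (prefix == null) {
                throw new IllegalArgumentException("prefix must not be null");
            }
            return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            return XHTML_NS.equals(namespaceURI) ? "h" : null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
        }
    };

    // Parses the XML namespace-aware and evaluates the expression as a string.
    static String evaluate(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(XHTML_CONTEXT);
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate(SAMPLE, "//h:p")); // prints "Hello"
        // An unprefixed name only matches elements in no namespace,
        // so this prints an empty line.
        System.out.println(evaluate(SAMPLE, "//p"));
    }
}
```

The second query shows the complication in practice: once elements live in the XHTML namespace, every element step in an expression needs the prefix.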
Parsing Sun's bug pages
For extracting the fields from one of Sun's bug pages, I formulated two XPath expressions:
This expression finds the body of the table after an anchor that separates the desired content from the rest of the page headers, navigation, etc.
This expression finds the value of a desired field on the page, e.g. "Bug ID:", "Synopsis", or "Category", as currently represented by $pageLabel above. The expression finds the first <td/> node matching the desired label, then returns the following <td/> node that contains the desired value.
XPathVariableResolver to handle the variable in the 2nd expression.
If multiple bug pages are to be processed, these expressions should probably be compiled to XPathExpressions for repeated use.
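As a rough sketch of how a variable resolver and a compiled expression fit together, consider the following JDK-only example. The expression, the sample markup, and the class name `BugFieldSketch` are assumptions for illustration and are not the expressions described above. Namespaces are left out to keep it short. The sketch relies on the JDK's default XPath implementation consulting the resolver each time the expression is evaluated, and it is not thread-safe.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import javax.xml.xpath.XPathVariableResolver;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class BugFieldSketch {

    // Hypothetical expression: the cell following the cell that holds the label.
    static final String FIELD_XPATH =
        "//td[normalize-space(.)=$pageLabel]/following-sibling::td[1]";

    // Hypothetical, well-formed stand-in for an already cleaned bug page.
    static final String SAMPLE =
        "<html><body><table>"
        + "<tr><td>Bug ID:</td><td>1234567</td></tr>"
        + "<tr><td>Category</td><td>java:classes_lang</td></tr>"
        + "</table></body></html>";

    private final Map<String, Object> variables = new HashMap<>();
    private final XPathExpression fieldExpression = compile(variables);

    // Registers the resolver first, then compiles the expression once.
    private static XPathExpression compile(final Map<String, Object> variables) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setXPathVariableResolver(new XPathVariableResolver() {
            @Override
            public Object resolveVariable(QName name) {
                return variables.get(name.getLocalPart());
            }
        });
        try {
            return xpath.compile(FIELD_XPATH);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Invalid expression", e);
        }
    }

    // Reuses the compiled expression; only the variable value changes per call.
    String field(Document page, String label) {
        variables.put("pageLabel", label);
        try {
            return fieldExpression.evaluate(page);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("Could not read field " + label, e);
        }
    }

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample", e);
        }
    }

    public static void main(String[] args) {
        BugFieldSketch sketch = new BugFieldSketch();
        Document page = parse(SAMPLE);
        System.out.println(sketch.field(page, "Bug ID:"));  // prints "1234567"
        System.out.println(sketch.field(page, "Category")); // prints "java:classes_lang"
    }
}
```

Passing the label as a variable also avoids building the expression by string concatenation, so labels containing quotes need no escaping.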
Here is my export of the bugs I'm currently "watching" / am interested in:
If I can find the time, a complete sample code download may also follow.