Sun has a rather comprehensive bug database at http://bugs.sun.com/, which is probably most notably used to track Java bugs. Unfortunately, at least with the interface and options presented to the general public (such as myself), some aspects are rather lacking. Beyond being able to vote for a bug (only up to 3), and creating a single "watch list", it appears to be a far cry from Bugzilla.
The feature I've been missing the most, however, is an easy way to keep track of bugs of interest, and by extension, being able to share bug lists / searches. Bugzilla makes this incredibly easy by offering a number of export options, including XML, CSV, and RSS feeds. As of Bugzilla 3.0, saved searches can also be shared. This was implemented under Bugzilla Bug 69000 and is demonstrated in Bugzilla's public test environment, http://landfill.bugzilla.org/bugzilla-3.0-branch/.
As a side note, while both https://bugs.eclipse.org/bugs/ and https://bugzilla.mozilla.org/ appear to be running version 3.0+ of Bugzilla, neither currently seems to have these shared queries working/enabled. (I've opened bug 223594 on Eclipse to hopefully address this.)
Java HTML Parsers
Unfortunately, as I previously mentioned, Sun isn't using Bugzilla, and there are no export options of any type. As far as I can tell, this leaves parsing the HTML as the only option.
If the HTML returned by Sun on their bug pages was valid XHTML, this would be an easier task.
Just write a few XPaths to find the necessary fields, format the data as desired, and another task complete.
Unfortunately, the pages are returned simply as regular HTML, not the XML-compatible variety.
Unlike Blogger, Sun's pages are at least returned with a valid identifying type that actually matches the page, in this case a doctype of "-//W3C//DTD HTML 4.01 Transitional//EN".
There are many ways to parse HTML. While I know a lot of people like Perl, I prefer to stick with the Java approach. In any language, there are also many existing tools and frameworks to help with the task, so starting from scratch would probably be a waste of time. The non-XML-based version of HTML can be rather messy, with unbalanced tags, missing escape sequences, and other issues that will quickly lead to headaches - and these tools will help "clean" the input.
I recently found a blog posting about a similar goal: "Showdown - Java HTML Parsing Comparison" (Ben McCann, 2008-02-02, lumidant.com). He demonstrates 4 Java libraries for parsing HTML, including NekoHTML, TagSoup, jTidy, and HTMLCleaner. (Unlike the referenced blog, I've linked these utilities to their official sites for your convenience.) The referenced post favored HTMLCleaner, as it was supposedly the only tool to successfully extract 10/10 documents. I've had better luck with NekoHTML. I'd be curious to see the rest of the code that was used, as I suspect that case sensitivity and a few other issues may have played into the results, and at least with NekoHTML, can easily be normalized with a configuration option or two.
NekoHTML
In the past, I utilized HttpUnit as such a tool. It actually uses NekoHTML as the parser, then provides an HTML-specific API for navigating/querying the "cleaned" document, as well as for performing actions, which serves HttpUnit's real purpose as a testing framework.
Of the four parsers listed above, NekoHTML definitely seems to be the most comprehensive and, as of recently, the most actively maintained. After a period without any releases between June 2005 and December 2007, it was relaunched on SourceForge. Beware of version 1.9.6.2, however, as there seems to be a rather severe regression bug with the handling of single quotes, as I reported in issue 1922810.
To get started with NekoHTML, I recommend the following configuration:

    import org.cyberneko.html.parsers.DOMParser;
    // …
    DOMParser domParser = new DOMParser();
    domParser.setFeature("http://cyberneko.org/html/features/insert-namespaces", true);
    domParser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");

See http://nekohtml.sourceforge.net/settings.html for the details of these options.
The insert-namespaces feature applies the XHTML namespace, http://www.w3.org/1999/xhtml, to all HTML content, allowing for distinction between HTML content and other possible content defined in an alternate namespace within the document.
The names/elems property instructs NekoHTML to convert all tag names to lower case, which matches the XHTML specification (compared to upper-case for HTML).
Setting the insert-namespaces feature seems to be a prerequisite for the names/elems property.
The only downside to enabling XML namespaces is that it slightly complicates the use of XPath.
In order to properly query elements in XML namespaces with XPath, a javax.xml.namespace.NamespaceContext implementation, which maps prefixes to namespaces, needs to be registered with the XPath using XPath.setNamespaceContext(…).
(See NamespaceContextMap in MarkUtils-XML.)
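For readers without MarkUtils-XML at hand, here is a minimal sketch of what such a NamespaceContext could look like. The class name XhtmlNamespaceContext and the newXPath() helper are my own hypothetical names, not part of any library, and only the single "html" prefix is bound. A fuller implementation would also handle the reserved "xml" and "xmlns" prefixes that the interface contract describes.

```java
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;

/** Hypothetical helper (not part of MarkUtils-XML): binds one prefix to the XHTML namespace. */
final class XhtmlNamespaceContext implements NamespaceContext {
    static final String PREFIX = "html"; // assumed prefix for XPath expressions
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    @Override
    public String getNamespaceURI(String prefix) {
        if (prefix == null) {
            throw new IllegalArgumentException("prefix must not be null");
        }
        // Unbound prefixes map to the empty string, as the interface contract requires.
        return PREFIX.equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
    }

    @Override
    public String getPrefix(String namespaceURI) {
        if (namespaceURI == null) {
            throw new IllegalArgumentException("namespaceURI must not be null");
        }
        return XHTML_NS.equals(namespaceURI) ? PREFIX : null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
        String prefix = getPrefix(namespaceURI);
        return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singleton(prefix).iterator();
    }

    /** Creates an XPath on which "html:" resolves to the XHTML namespace. */
    static XPath newXPath() {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new XhtmlNamespaceContext());
        return xpath;
    }
}
```

With this registered, an expression such as //html:table matches table elements in the XHTML namespace, whereas a plain //table would match nothing in a namespaced document.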
Parsing Sun's bug pages
For extracting the fields from one of Sun's bug pages, I formulated two XPath expressions:
//html:table[preceding::html:a[@name='skip2content']]//html:table/html:tbody
This expression finds the body of the table after an anchor that separates the desired content from the rest of the page headers, navigation, etc.
html:tr[html:td//text()=$pageLabel]/html:td[position()=2]
This expression finds the value of a desired field on the page, e.g. "Bug ID:", "Synopsis", or "Category", as currently represented by $pageLabel above. It selects the row containing a <td/> node whose text matches the desired label, then returns the second <td/> node of that row, which contains the desired value.
Use an XPathVariableResolver to handle the variable in the 2nd expression.
If multiple bug pages are to be processed, these expressions should probably be compiled to XPathExpressions for repeated use.
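As a rough sketch of how these pieces could fit together, the following compiles both expressions once and resolves $pageLabel through an XPathVariableResolver. Everything named here is an assumption for illustration: the BugFieldExtractor class, its methods, and especially SAMPLE_PAGE, which is a made-up, heavily simplified stand-in for a cleaned bug page (the real page layout is more involved). The sample is parsed with the JDK's own XML parser so that the example runs without NekoHTML.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical extractor; class, method, and sample-page names are illustrative only. */
final class BugFieldExtractor {
    private static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Made-up, simplified stand-in for a cleaned bug page (an assumption, not Sun's markup). */
    static final String SAMPLE_PAGE =
        "<html xmlns='" + XHTML_NS + "'><body><a name='skip2content'></a>"
        + "<table><tbody><tr><td><table><tbody>"
        + "<tr><td>Bug ID:</td><td>4079882</td></tr>"
        + "<tr><td>Category</td><td>java:classes_swing</td></tr>"
        + "</tbody></table></td></tr></tbody></table></body></html>";

    private final XPathExpression fieldTable;
    private final XPathExpression fieldValue;
    private String pageLabel; // value handed out for $pageLabel

    BugFieldExtractor() {
        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new NamespaceContext() {
            public String getNamespaceURI(String prefix) {
                return "html".equals(prefix) ? XHTML_NS : "";
            }
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "html" : null;
            }
            public Iterator<String> getPrefixes(String uri) {
                String p = getPrefix(uri);
                return p == null
                        ? Collections.<String>emptyIterator()
                        : Collections.singleton(p).iterator();
            }
        });
        // Register the resolver before compiling; it is consulted on each evaluation.
        xpath.setXPathVariableResolver(name ->
            "pageLabel".equals(name.getLocalPart()) ? pageLabel : null);
        try {
            fieldTable = xpath.compile(
                "//html:table[preceding::html:a[@name='skip2content']]//html:table/html:tbody");
            fieldValue = xpath.compile(
                "html:tr[html:td//text()=$pageLabel]/html:td[position()=2]");
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns the trimmed text of the second cell in the row matching the label, or null. */
    synchronized String field(Document page, String label) {
        try {
            Node table = (Node) fieldTable.evaluate(page, XPathConstants.NODE);
            if (table == null) {
                return null;
            }
            pageLabel = label; // picked up by the resolver during the next evaluation
            Node cell = (Node) fieldValue.evaluate(table, XPathConstants.NODE);
            return cell == null ? null : cell.getTextContent().trim();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses well-formed XHTML; stands in for the document a tolerant HTML parser would produce. */
    static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the html: prefix to match
            return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        BugFieldExtractor extractor = new BugFieldExtractor();
        Document page = parse(SAMPLE_PAGE);
        System.out.println(extractor.field(page, "Bug ID:"));
        System.out.println(extractor.field(page, "Category"));
    }
}
```

Keeping the label in a field makes the extractor stateful, hence the synchronized method. With NekoHTML, the document returned by the configured DOMParser would presumably take the place of parse(…) here.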
Here is my export of the bugs I'm currently "watching" / am interested in:
Bug ID | Synopsis | Category | Reported Against | Release Fixed | State | Priority | Submit Date |
---|---|---|---|---|---|---|---|
4079882 | Request for JTristateCheckbox implementation | java:classes_swing | 1.3.1 , 1.4.1 , 1.1fcs | | In progress, request for enhancement | 4-Low | 1997-09-17 |
4187336 | ServletResponse.setContentLength(Long) | javax_servlet:api | 1.1fcs | | Closed, will not be fixed | 4-Low | 1998-11-05 |
4526561 | File system change notification events should be supported | java:classes_io | merlin-beta2 | | In progress, request for enhancement | 4-Low | 2001-11-13 |
4652184 | please compile j2sdk rt.jar with -g (all options) | java:build | 1.4.2 , 1.4.2_04 , merlin-rc1 , tiger-beta , tiger-beta2 | mustang(b28) | Closed, fixed | 4-Low | 2002-03-13 |
4782054 | Allow for comments in the MANIFEST.MF file | java:jar | 1.4.1 | | In progress, request for enhancement | 4-Low | 2002-11-20 |
4787931 | System property "user.home" does not correspond to "USERPROFILE" (win) | java:classes_lang | 1.3 , 1.4.1 , 1.4.2 | | In progress, bug | 3-Medium | 2002-12-03 |
4838318 | (str) Substitute CharSequence for String arguments wherever possible | java:classes_lang | 1.4.1 , 1.4.2 | | In progress, request for enhancement | 4-Low | 2003-03-27 |
4880234 | ServiceUI needs a printDialog method wtih a Component parameter | java:classes_2d | 1.4.1 | | In progress, request for enhancement | 4-Low | 2003-06-18 |
4983159 | Typedef (alias) | java:specification | tiger-beta | | In progress, request for enhancement | 4-Low | 2004-01-24 |
5018574 | Unable to set focus to another component in JOptionPane | java:classes_swing | tiger | | In progress, bug | 3-Medium | 2004-03-23 |
5043696 | StringReader should be allow a String{Buffer,Builder} to be the backing store | java:classes_io | 1.4.2 | | In progress, request for enhancement | 4-Low | 2004-05-07 |
5096679 | PIT:PrintDialog is not positioned properly on multi-mon, when coords are invalid | java:classes_2d | mustang | | In progress, bug | 4-Low | 2004-09-03 |
5109347 | PrinterJob.printDialog() does not support multi-mon, always displayed on primary | java:classes_2d | 1.4 | | In progress, bug | 4-Low | 2004-09-30 |
6192554 | Need generic factory interface. | java:classes_util | | | In progress, request for enhancement | 4-Low | 2004-11-09 |
6212751 | DOC: ServiceUI.printDialog() need to enhance the description for X,Y coordinates | java:classes_2d | 1.4 | | In progress, bug | 4-Low | 2004-12-27 |
6214380 | Quality setting is disabled and always set to Normal in Print Dialog | java:classes_2d | | | In progress, request for enhancement | 4-Low | 2005-01-05 |
6215174 | Can't force layout of non-showing component | java:classes_awt | 5.0 | | In progress, request for enhancement | 4-Low | 2005-01-07 |
6312085 | The for/in statement should support Iterators | java:specification | tiger-beta | | In progress, request for enhancement | 4-Low | 2005-08-17 |
6325564 | (str) Provide CharSequenceReader with sub-sequence capability | java:classes_lang | | | In progress, request for enhancement | 4-Low | 2005-09-19 |
6358852 | Add methods on concurrent data structures that interrupt blocked threads | java:classes_util_concurrent | | | In progress, request for enhancement | 4-Low | 2005-12-05 |
6400189 | raw types and inference | java:compiler | | | In progress, bug | 4-Low | 2006-03-17 |
6476646 | (str) Make AbstractStringBuilder class public | java:classes_lang | | | In progress, request for enhancement | 5-Very Low | 2006-09-29 |
If I can find the time, a complete sample code download may also follow.