Monday, February 2, 2009

See all newspaper comments at once with Greasemonkey

Background

Like many web users today, I get much of my local news from the online versions of the local newspapers. One particular feature the online editions offer over the print editions is the inclusion of user comments / responses to the stories.

At least around my areas of interest in Central Wisconsin and the Fox Valley, most of the local papers are owned by the Gannett Company. This includes the Appleton Post-Crescent, the Wausau Daily Herald, and others. Much of Gannett's online presence is currently provided by Pluck Social Media's SiteLife product, particularly SiteLife Comments. Pluck even hosts a special customer profile detailing their work with Gannett Corporation. While the sites I'm working with here all happen to be owned by Gannett, it's quite possible that this will apply to other Pluck-based sites as well.

Unfortunately, Pluck's current implementation leaves some things to be desired. I recall reading many of the negative comments left as the transition was made from the old version of the sites to the current version, which is when I believe Pluck became involved. Fortunately, they have improved some things since, and definitely seem to be faring better than the current fiasco at Dell's online community after their so-called upgrade.

However, the most annoying issue I have while reading Gannett's local news articles is that only 5 user comments are visible at a time. There is a "Full Page View" option available at the bottom, but this only increases the visible comments-per-page to 10. While all these comments are loaded in an AJAX-type fashion using JSON data, clicking to retrieve the next page still reloads the entire page. Even on a broadband connection, each page change takes 5+ seconds. This makes reading through all the comments on a popular story very frustrating, especially when there are sometimes 50 or more responses. While many of these comments are informative or insightful, having to click through and reload 5 or more pages is certainly not making the best use of web technology.

Technical Challenges

As I had done with Resizing the Blogger Edit Box, my first thought was to attempt to improve things with a Bookmarklet. Unfortunately, the task proved to be too complex, partially due to the same origin policy blocking the necessary cross-domain data. In particular, while the article page is served from a "www." host, the JSON data containing the comments is obtained from a "sitelife." host. While the current pages seem to work around this restriction through some iframe tricks, attempting to reuse that functionality would be a hack at best. Instead, I turned to a Greasemonkey-based solution: Greasemonkey's GM_xmlhttpRequest API method is not restricted to the page's domain, as it is backed by Mozilla's chrome-privileged XMLHttpRequest object.
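As a rough illustration of the approach (not the script's actual code), a cross-domain comment request might look like the following. The sitelifeUrl() helper and the "comments.json" path are hypothetical, based only on the "www." versus "sitelife." host difference noted above:

```javascript
// Hypothetical helper: derive a "sitelife." data URL from the article's
// "www." URL, then append a (purely illustrative) resource path.
function sitelifeUrl(articleHref, path) {
  return articleHref
    .replace(/^(https?:\/\/)www\./, '$1sitelife.') // swap the host prefix
    .replace(/\/[^\/]*$/, '/') + path;             // drop the last URL segment
}

// GM_xmlhttpRequest is only defined when running under Greasemonkey;
// unlike a page-level XMLHttpRequest, it may request any domain.
if (typeof GM_xmlhttpRequest === 'function') {
  GM_xmlhttpRequest({
    method: 'GET',
    url: sitelifeUrl(location.href, 'comments.json'), // illustrative path
    onload: function (response) {
      // response.responseText holds the raw cross-domain payload.
      console.log(response.responseText.length);
    }
  });
}
```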

The pages I had to work with were far from ideal. Each page typically includes about 20 JavaScript files, and some of the code is quite obfuscated. One of the main files, "GDSRScripts.js", is about 86 KB. The core of the Yahoo! UI Library (YUI) (yahoo-dom-event.js in 2.6.0) is not even half that, at only 31 KB. I also see no effort made to respect the JavaScript global namespace, or to follow other best practices.

The Solution

I've completed a Greasemonkey script that I've posted at userscripts.org: All Pluck Comments. Once installed and configured for one or more of the Pluck-based Gannett news sites, it will update any loaded news article by showing all available comments on a single page. If all the existing comments already fit on one page under the current 5-post limit, the script will exit and do nothing. Unfortunately, much of the previous waiting time doesn't seem to be in the JavaScript, but in the server responding to the JSON requests - a performance issue that can't be resolved client-side. While those requests are made, the script will show the loading status above the existing comments. Once all comment "pages" have been downloaded, the comments section is repopulated with the complete list. Additionally, changing the sort order between "Newest first" and "Oldest first" now performs instantly, without requiring additional remote requests.
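The instant re-sorting is possible because the full comment list is already cached in memory, so switching the order is just an array sort followed by a DOM rebuild. A minimal sketch, assuming a simplified comment shape (not Pluck's actual JSON format):

```javascript
// Re-sort the locally cached comments without any further server
// requests. Each comment is assumed to carry a parseable date string.
function sortComments(comments, newestFirst) {
  return comments.slice().sort(function (a, b) {
    // Positive diff means a is newer than b.
    var diff = new Date(a.date) - new Date(b.date);
    return newestFirst ? -diff : diff;
  });
}
```

After sorting, the script repopulates the comments section from this array, so toggling the order never touches the network.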

Due to the number of possible supported sites, only one default URL pattern is configured in the "included pages" within the Greasemonkey script. Other supported sites will need to be added manually. (This would be easier if Greasemonkey supported regular expressions for the patterns, as I requested in ticket #216.) There are two types of URLs I've observed that should be supported. The first looks like "http://<hostname>/article/<date>/<siteId>/<articleId>/". The other looks like "http://<hostname>/apps/pbcs.dll/article?AID=/<date>/<siteId>/<articleId>". The best non-regular-expression pattern I can suggest to match both is "http://<hostname>/*article*", where "<hostname>" needs to be replaced with the literal host and domain name to be supported.
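For example, a configured copy's metadata block might carry one @include line per supported site using that wildcard pattern. The hostnames below are illustrative guesses for the papers mentioned earlier, not a tested list:

```javascript
// ==UserScript==
// @name        All Pluck Comments
// @namespace   http://example.com/           // placeholder namespace
// @description Show all Pluck comments on one page
// @include     http://www.postcrescent.com/*article*
// @include     http://www.wausaudailyherald.com/*article*
// ==/UserScript==
```

The single "*article*" wildcard covers both of the URL forms above, at the cost of occasionally matching non-article pages, where the script simply finds no comments and exits.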

The only current limitation is that the per-comment controls (Recommend, New post, Reply to this Post, Report Abuse, etc.) are not regenerated. This is because it would be very difficult, if even possible, to make all of Pluck's existing JavaScript work with these enhancements. To use these controls, click on the "Full Page View" link that is left below the list of comments. This will bring back the limit of 10 comments per page, but the Greasemonkey script will exit without making any changes, leaving these controls intact. I seldom use these controls, so this issue isn't that important to me. However, if there is enough interest, I may look into resolving it in a future version. Alternatively, feel free to write and submit a patch!

Technical Details

The script first waits for the existing comments to load, at which point it determines the article ID, the total number of comments available, and other information necessary for requesting the additional "pages" of comments. If it times out waiting, or determines that it is on the "Full Page View", it simply exits and does nothing. Otherwise, it makes a series of asynchronous requests to retrieve all the available comments. The responses are unnecessarily URL-encoded, and are decoded by the script using unescape(). The responses also contain an unnecessary <script> section at the beginning, which is searched for and removed. The JSON text is then "safely" evaluated to a JavaScript object using the regular expression provided in section 6 of RFC 4627. Once all responses are received, the existing comments HTML is cleared, and new comments are built and populated from the JSON data using the HTML DOM.
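The decoding, <script>-stripping, and RFC 4627 safety check described above can be sketched as follows. The function name and sample payloads are illustrative, not Pluck's actual response format; the regular expression inside is the one given in section 6 of RFC 4627:

```javascript
// Sketch of the response-handling steps: decode, strip the leading
// <script> section, then "safely" eval the remaining JSON text.
function parseCommentsResponse(raw) {
  // 1. The responses are unnecessarily URL-encoded; decode them first.
  var text = unescape(raw);
  // 2. Remove the unneeded <script>...</script> section at the start.
  text = text.replace(/^\s*<script[\s\S]*?<\/script>\s*/i, '');
  // 3. RFC 4627, section 6: after blanking out string literals, the text
  //    may contain only JSON punctuation, numbers, and literal keywords.
  //    Only then is it handed to eval(); otherwise false is returned.
  return !(/[^,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]/.test(
      text.replace(/"(\\.|[^"\\])*"/g, ''))) &&
    eval('(' + text + ')');
}
```

Once each page's response parses to an object this way, the script accumulates the comment data and rebuilds the comments section via the DOM.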

The tools I used during this process included Firebug, JSView, and Notepad++. The script itself makes use of closures and other JavaScript techniques I've written about.
