Wednesday, May 27, 2009

PuTTY and SSH port forwarding corruption

PuTTY seems to have some serious issues with SSH port forwarding. http://www.chiark.greenend.org.uk/~sgtatham/putty/wishlist/portfwd-corrupt.html seems to document exactly the same issues I've been having.

I was doing a simple read of XML data in Java from a remote HTTP server, using an SSH tunnel created by PuTTY. Commonly but sporadically, the HttpURLConnection appeared to close prematurely, resulting in only partially-received content. Switching between "dynamic" and "static" (Local) tunnel destinations did not have any effect.

I've also been having issues with PuTTY crashing on unclean closes.

The latest development snapshot (2009-05-27:r8577) does not appear to help. The "portfwd-corrupt" bug was filed in mid-2003, and there hasn't been a new release in over 2 years. (0.60 was released on 2007-04-29.) All of the PuTTY forks and variations I tried have the same issue. (See PuTTY on Wikipedia.) TeraTerm (Wikipedia) does not suffer the same issue, but does not appear to support dynamic tunneling using SOCKS.

My working solution: Use OpenSSH (Wikipedia), which also works on Windows through the use of Cygwin (Wikipedia). Just be aware that this is a command-line solution. While there are probably various front-ends available, I'd be surprised if there were one that didn't limit the available options that ssh has to offer. Ironically, I am using PuTTYcyg to run ssh, as it is much better than the Windows command prompt. By entering ssh with the desired arguments as the command instead of the default "-" for the login shell, it also saves an extra bash process from running.

Monday, May 25, 2009

MarkUtils-PacProxySelector for Java

Many computer networks make use of proxy servers for web and Internet connectivity. This is especially true for business and other organization networks, where their use is required by security and other policies. Because of how they are typically deployed, proxy servers are often thought of as "restricting access"; more properly, they should be thought of as a means of "providing access". Even outside of typical corporate environments, proxy servers can be invaluable for testing and debugging, as well as serving as a type of VPN between private networks.

Most web browsers and other networked applications support directing traffic through one or more proxy servers. The typical configuration dialog looks like this, as shown from Mozilla Firefox:

Mozilla Firefox Connection Settings

Of the "manual" options, "HTTP" and "SSL" (TLS) are the most basic and common, followed by "FTP". "Gopher" is rarely used anymore - it has already been dropped by Microsoft Internet Explorer, and may be dropped in Mozilla Firefox 4.0. "SOCKS" is arguably the most powerful, supporting any of the above protocols in addition to any other TCP- or UDP-based protocol. (If SOCKS is configured and supported, none of the other protocols need to be configured.)

Unfortunately, directing an entire network's traffic through a single proxy server can quickly create a bottleneck and a single point of failure. This is especially true when all LAN traffic is also sent through the proxy. (In ideal network traffic patterns, LAN traffic should exceed Internet traffic many times over.) Some of the performance and availability concerns can be addressed through DNS or other load balancing. Another common attempt is to configure the "no proxy for" / exception list. Unfortunately, there are some severe limitations to this design. First, the list must be kept up-to-date as the network configuration changes. (There are various tools for this.) More significantly, there are many desired configurations that cannot be accounted for. For example, what if a list is needed for the servers that should be sent to a proxy server, rather than skipping the proxy server (reverse logic)? Or what if traffic must be split among multiple proxy servers, depending upon the destination or other parameters?

The solution to all of the above and other similar concerns is through the use of the last option shown above, and probably the most overlooked: "Automatic proxy configuration URL". This option is also known as proxy auto-config, or PAC, and was introduced into Netscape Navigator as early as 1996. A PAC file only needs to contain an implementation of a JavaScript function, FindProxyForURL(url, host). From here, the full power of JavaScript can be used, including regular expressions, associative arrays, and closures, as well as a number of predefined helper functions specific to PAC. Within the PAC function, various load balancing and black- or white-listing tasks can be performed, optionally by maintaining internal state. A list of multiple proxies may also be returned for attempts by the client.

The PAC file is loaded from a URL (including local file:// URLs), where it can be centrally maintained and updated. The PAC file may be cached by the web browser or other client, but should respect the cache settings sent in the HTTP headers if retrieved through HTTP. Alternatively, a chrome:// URL can even be used, allowing for the PAC file to be maintained within a Firefox extension, and updated through Firefox's standard auto-update process for extensions.

Java support

Java supports many of the same proxy options described above, mostly through the use of system properties. For full details, see the tech note at java.sun.com, Java Networking and Proxies. These settings affect any communications made through URLConnection, Socket, and possibly other network-related classes. Previously, the proxy configuration options were limited to the "manual" options listed above, with separate options for HTTP, HTTPS, FTP, and SOCKS. However, Java 1.5/5.0 introduced the Proxy and ProxySelector classes. A default ProxySelector can be configured for the current JVM by calling ProxySelector.setDefault(ProxySelector).
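To illustrate, here is a minimal sketch of both approaches using only standard JDK classes. (The proxy host and port are made up for the example.)

```java
import java.io.IOException;
import java.net.*;
import java.util.Collections;
import java.util.List;

public class ProxyConfigDemo {
  public static void main(String[] args) {
    // "Manual"-style configuration via system properties
    // (proxy.example.com:8080 is a made-up address):
    System.setProperty("http.proxyHost", "proxy.example.com");
    System.setProperty("http.proxyPort", "8080");

    // Programmatic configuration via a custom ProxySelector (Java 1.5+):
    ProxySelector.setDefault(new ProxySelector() {
      @Override
      public List<Proxy> select(URI uri) {
        if ("http".equals(uri.getScheme())) {
          // createUnresolved avoids a DNS lookup of the example host.
          return Collections.singletonList(new Proxy(Proxy.Type.HTTP,
            InetSocketAddress.createUnresolved("proxy.example.com", 8080)));
        }
        return Collections.singletonList(Proxy.NO_PROXY);
      }
      @Override
      public void connectFailed(URI uri, SocketAddress sa, IOException ioe) {
        // A real implementation could record failures here to influence
        // future select(...) calls.
      }
    });

    Proxy p = ProxySelector.getDefault().select(URI.create("http://example.com/")).get(0);
    System.out.println(p.type()); // HTTP
  }
}
```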

Unfortunately, Java does not currently provide any visible support for proxy auto-config (PAC) files. However, the ProxySelector's List<Proxy> select(URI uri) method looks and works very similarly to the PAC's FindProxyForURL(url, host) function. The most notable difference is that it is strongly-typed to standard Java classes. As part of my MarkUtils collection, I created MarkUtils-PacProxySelector to provide a ProxySelector implementation that works with PAC files.

Since the PAC files are based on JavaScript, the ability to evaluate JavaScript is required. Fortunately, this is easily done through Java, especially with the introduction of the Java Scripting API in Java 1.6/6.0 (JSR-223). Java 1.6 bundles an internal version of the Mozilla Rhino implementation of JavaScript for Java, based on 1.6R2. Unfortunately, Java doesn't expose all the features of JavaScript or Rhino directly through the scripting API, some of which are required to implement the PAC functionality in a compatible fashion. This includes defining top-level bindings in the JavaScript environment to Java functions, which is directly supported in Rhino by adding a binding to a FunctionObject - a class to which there is no publicly visible match in the JDK. While it is probably possible to hack a work-around to this, my current implementation utilizes Rhino directly. Besides taking advantage of the improvements in the latest version of Rhino (currently 1.7R2), this allows the utility to be easily used with both Java 1.5/5.0 and 1.6/6.0. (However, note that JSR-223 is unofficially supported under Java 1.5/5.0 as well by downloading and including the .jar's from the reference implementation.) Using Rhino directly also avoids some potential security issues, which I reported in Sun Bug 6782031 and Mozilla Bug 468385.

As commented in the pom.xml file, Mozilla Rhino is currently not available through the central repository, a Mozilla repository, or any other "official" repository. I've added a dependency to it as "org.mozilla.javascript : rhino : 1.7R2". For this to work properly, Rhino will need to be downloaded and installed into a local repository as named above.

In addition to the standard PAC methods, PacProxySelector supports an added function called "connectFailed" to take advantage of the connectFailed(URI, SocketAddress, IOException) functionality on ProxySelector. The JavaScript method is called with the same arguments as on ProxySelector, just with the .toString() representations of each of the three parameters. The PAC file could then store this information within internal state to possibly affect future calls to FindProxyForURL.

For the most flexibility, the constructor to PacProxySelector accepts a Reader, which should read from a PAC file. There is also a public static configureFromProperties() method that returns a ProxySelector, assuming that the path to a PAC file is stored as either a Java system or environment property named "proxy.autoConfig", similar to the other network properties. After obtaining an instance from either the constructor or the method, it should be passed to ProxySelector.setDefault(ProxySelector), unless otherwise used directly. Alternatively, a setDefaultFromProperties() convenience method is provided to do this in one call.

I wrote this with plugging into other Java applications in mind. Ideally, the JDK would provide a system property that accepts the classname for the default ProxySelector, or some other method for setting the default outside of a function call within the code, but this is currently not the case. However, all that has to be done is finding a way to execute one of the above configuration options from the desired Java application before network access is attempted. I've successfully written a plugin for Oracle SQL Developer that does exactly this. The same is also possible for Eclipse, though it requires patching of some of the plugins due to the current infrastructure. (See Eclipse bug 257443.) Alternatively, PacProxySelector provides a main method that calls setDefaultFromProperties() before chaining execution to another program's main method. See the included Javadoc for details.

Download

com.ziesemer.utils.pacProxySelector is available on ziesemer.java.net under the GPL license, complete with source code, a compiled .jar, generated JavaDocs, and a suite of JUnit tests. Download the com.ziesemer.utils.pacProxySelector-*.zip distribution from here. Please report any bugs or feature requests on the java.net Issue Tracker.

Saturday, May 23, 2009

Handling XML Encodings with MarkUtils

Especially in today's focus on higher-level languages, lower-level details are often overlooked. However, character encodings are one such detail that must be remembered, particularly when working with interfaces such as web services, which may be communicating with a large variety of different platforms and languages - both programming and written.

US-ASCII is probably one of the most significant character encodings, but not the earliest. ASCII is the predecessor to the ISO/IEC 646 standard, and is a subset of the Unicode standard, particularly UTF-8. (See also: English in computing.) US-ASCII by itself is unable to represent more than its defined 95 printable characters - 62 of these being [A-Za-z0-9], with the remainder used for punctuation and other miscellaneous symbols. This limited character set makes it impossible to accurately represent other Latin-based languages such as Spanish, French, and German. "Extended ASCII" encodings such as ISO/IEC 8859-1 provide complete or near-complete coverage of these and other alphabets, but still fail to account for other characters. Using a character encoding that provides full support for the Universal Character Set (UCS), such as UTF-8, is the recommended solution to this and other related issues.

Many character encodings share the same byte representations for the basic English character set [A-Za-z] and [0-9]. These include US-ASCII, ISO-8859-1, UTF-8, and others. For example, in these character encodings, the following mappings are always true:

Character  Decimal  Hexadecimal
'0'        48       0x30
'9'        57       0x39
'A'        65       0x41
'Z'        90       0x5A
'a'        97       0x61
'z'        122      0x7A

Other character encodings, such as EBCDIC, are completely different.
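This byte-level compatibility (and the EBCDIC difference) is easy to verify with a quick sketch; note that the IBM037 charset is not guaranteed to be available on every JVM:

```java
import java.nio.charset.Charset;

public class EncodingBytesDemo {
  public static void main(String[] args) {
    String sample = "Az09";
    for (String name : new String[]{"US-ASCII", "ISO-8859-1", "UTF-8"}) {
      byte[] bytes = sample.getBytes(Charset.forName(name));
      // 'A'=0x41, 'z'=0x7A, '0'=0x30, '9'=0x39 in all three encodings.
      System.out.printf("%-10s: %02X %02X %02X %02X%n", name,
        bytes[0], bytes[1], bytes[2], bytes[3]);
    }
    // EBCDIC (IBM037) maps the same characters completely differently,
    // e.g. 'A' is 0xC1 rather than 0x41.
    if (Charset.isSupported("IBM037")) {
      byte[] ebcdic = sample.getBytes(Charset.forName("IBM037"));
      System.out.printf("%-10s: %02X%n", "IBM037", ebcdic[0] & 0xFF);
    }
  }
}
```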

Even when receiving a byte stream from one of the "compatible" encodings, the correct encoding still needs to be determined, as each encoding is handled slightly differently. Certain byte sequences that are allowed in US-ASCII variants may be invalid in UTF-8. UTF-16 and UTF-32 are easily converted to and from UTF-8, but may be sent with different byte orderings - either big-endian or little-endian.
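The byte-order mark (BOM), when present, resolves part of this. As a rough sketch of BOM-based detection only - a hypothetical helper, not the Xerces implementation, and real detection must also consider the XML declaration and external metadata - the logic might look like:

```java
public class BomSniffer {
  // Best-effort charset name from a leading byte-order mark, or null if
  // no BOM is present. Note the UTF-32LE check must come before the
  // UTF-16LE check, since the UTF-16LE BOM is a prefix of the UTF-32LE BOM.
  public static String fromBom(byte[] b) {
    if (b.length >= 4 && b[0] == 0 && b[1] == 0
        && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) return "UTF-32BE";
    if (b.length >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE
        && b[2] == 0 && b[3] == 0) return "UTF-32LE";
    if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
        && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) return "UTF-8";
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) return "UTF-16BE";
    if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) return "UTF-16LE";
    return null;
  }
}
```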

Encodings in XML

In the W3C's recommendation for XML (essentially the XML standard), the use of character encodings is specifically addressed in section 4.3.3. However, since the encoding is defined with the XML, the encoding declaration itself is also encoded. This is also addressed in the appendix, Autodetection of Character Encodings (Non-Normative):

The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use—which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases.

Essentially, the character encoding in use may be determined by a combination of the Byte-order mark (if present), and the value of the "encoding" attribute in the XML declaration of the prolog. However, there may also be external encoding information available from higher-level protocols that must be factored into the determination. This includes the MIME Content-Type sent over HTTP, as defined in RFC 3023.

In Apache Xerces2-J, this is mostly handled within the implementation, particularly by org.apache.xerces.impl.XMLEntityManager.createReader(…). Similar code is also in org.apache.xerces.xinclude.XIncludeTextReader, which also accounts for at least most of the RFC 3023 rules, but only if the input source is a "system Id" which results in the use of a URLConnection.

Addition to MarkUtils-XML

I had a need to meet these same requirements on an input stream to determine the encoding, but before handing off the processing directly to Xerces or another XML processor. I also wanted a flexible solution that would follow the RFC 3023 recommendation, but would be able to work with a variety of sources - including a URLConnection or a HttpServletRequest. I found no existing and public API that met these requirements, so I made my own - and am making it available for public use as part of MarkUtils-XML.

Below is an outline of the public API. While it may first appear to be a bit lengthy, it is designed to be flexible and usable in high-performance environments. None of the Javadoc documentation is included here for brevity. The complete source code, along with a compiled .jar, generated Javadocs, and a comprehensive suite of JUnit tests are available in the com.ziesemer.utils.xml-*.zip distribution on ziesemer.java.net, with XmlEncoding available starting with version 2009-05-20. The tests include all 16 usable MIME Content-Type examples from http://tools.ietf.org/html/rfc3023#section-8.

public class XmlEncoding{
  
  public static final String US_ASCII_NAME = "US-ASCII";
  public static final Charset US_ASCII;
  public static final String UTF_8_NAME = "UTF-8";
  public static final Charset UTF_8;
  public static final String UTF_16BE_NAME = "UTF-16BE";
  public static final Charset UTF_16BE;
  public static final String UTF_16LE_NAME = "UTF-16LE";
  public static final Charset UTF_16LE;
  
  // Below are not guaranteed to be in both Java 1.5/5.0 and 1.6/6.0.
  public static final String UTF_32BE_NAME = "UTF-32BE";
  public static final String UTF_32LE_NAME = "UTF-32LE";
  public static final String IBM037_NAME = "IBM037";
  
  public static InputStream createBufferedStream(InputStream is) throws IOException{…}
  
  public static String determineFromBOM(InputStream is) throws IOException{…}
  
  public static String guessFromDeclaration(InputStream is) throws IOException{…}
  
  public static String determineFromDeclaration(InputStream is, Charset guessed) throws IOException{…}
  public static String determineFromDeclaration(InputStream is, Charset guessed, int readSize) throws IOException{…}
  
  public static String calculate(InputStream is) throws IOException{…}
  public static String calculate(InputStream is, String contentTypeMime, String contentTypeEncoding) throws IOException{…}
  public static String calculate(InputStream is, String contentType) throws IOException{…}
  
  public static InputSource createInputSource(InputStream is) throws IOException{…}
  public static InputSource createInputSource(InputStream is, Charset charset) throws IOException{…}
  public static InputSource createInputSource(InputStream is, String contentTypeMime, String contentTypeEncoding) throws IOException{…}
  public static InputSource createInputSource(InputStream is, String contentType) throws IOException{…}
  
  public static String getContentTypeMime(String contentType){…}
  
  public static String getContentTypeEncoding(String contentType){…}
  
  public static boolean isContentTypeTextXml(String mime){…}
  
  public static boolean isContentTypeApplicationXml(String mime){…}
}

In most instances, only one of the createInputSource methods will be necessary. For example, from a HttpServletRequest:

HttpServletRequest req = …;
InputSource iSource = XmlEncoding.createInputSource(
  req.getInputStream(), req.getContentType());

All other non-createInputSource(…) methods in this class that accept an InputStream require that mark(int) and reset() are supported. This allows for bytes to be read and "unread", allowing for read portions to be re-read by the XML processor as appropriate. For example, determineFromBOM(InputStream) will consume the BOM bytes if it can be successfully determined, while the guessFromDeclaration(…) and determineFromDeclaration(…) methods will always reset the InputStream to its original position. createBufferedStream(InputStream) can be used to appropriately wrap an InputStream if required. See the included Javadocs for complete details.
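The mark/reset contract can be illustrated with a small sketch - a hypothetical helper, not part of MarkUtils-XML - that "peeks" at the leading bytes of a stream without consuming them:

```java
import java.io.*;
import java.util.Arrays;

public class MarkResetDemo {
  // Reads up to n leading bytes from the stream, then "unreads" them,
  // leaving the stream positioned at its start for the next consumer.
  public static byte[] peek(InputStream is, int n) throws IOException {
    if (!is.markSupported()) {
      throw new IOException("mark/reset not supported; wrap in a BufferedInputStream");
    }
    is.mark(n);
    byte[] buf = new byte[n];
    int read = is.read(buf, 0, n);
    is.reset(); // Rewind so an XML parser still sees the full stream.
    return read < 0 ? new byte[0] : Arrays.copyOf(buf, read);
  }

  public static void main(String[] args) throws IOException {
    InputStream is = new BufferedInputStream(
      new ByteArrayInputStream("<?xml version=\"1.0\"?><a/>".getBytes("US-ASCII")));
    byte[] head = peek(is, 5);
    System.out.println(new String(head, "US-ASCII")); // <?xml
    // The stream still delivers everything from the beginning:
    System.out.println(is.read() == '<'); // true
  }
}
```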

Friday, May 22, 2009

Xalan-J Serialization Performance hindered by Flushing

Following the "Chaining Transformations" approach I described in XML and XSLT Tips and Tricks for Java, I had developed a very performance-aware system centered around XML processing. By "pipe-lining" the various steps, less memory is required, and the execution time is reduced. The example in my previous post only used a sample final destination of System.out. Unfortunately, an issue quickly appeared once a similar approach was used in a real-world situation, where the output was a higher-latency destination. The approach was and still is correct, but a work-around is currently necessary to avoid a bug in the Apache Xalan/Serializer implementation that would otherwise cause a severe performance penalty.

As discussed between 2001 and 2003 in the XALANJ-78 bug report, there was some discussion around when flush() is called on the result. The overall consensus was that it was and should only be called from endDocument(). This would mean only one flush operation per document, which would seem acceptable.

However, I found that flush() is being called much more often, at least using versions of Xalan-J between 2.6.0 (used in Java 1.5/5.0 - 1.6/6.0) and the latest 2.7.1. It seems that any call to TransformerIdentityImpl.startPrefixMapping(…) calls ContentHandler.startPrefixMapping(…), with no overloaded methods in the public API. This is implemented by ToStream.startPrefixMapping(String prefix, String uri). This then calls the non-API method ToStream.startPrefixMapping(String prefix, String uri, boolean shouldFlush), with "shouldFlush" always true. This in itself seems to be correct, in that "shouldFlush" affects other logic beyond just flushing the output stream. However, this always calls flushPending(), which then flushes the actual output stream.

The result? The output stream or writer may be flushed as much as once per XML element written. I reported this in XALANJ-2500, along with an example that demonstrates 100 XML elements being written, and flush() being called as many times. In this particular case, using namespaced XML elements is required. However, where I first ran into this was with an XSL that utilized XML namespaces for parameter names, but the generated document was completely within the default namespace.

Assume that the output destination has a latency of even just 50ms. Writing just the small sample document of 100 elements will take 5 seconds under the given circumstances! In some related scenarios, wrapping the OutputStream or Writer in a BufferedOutputStream or BufferedWriter can improve performance by allowing the caller to write without causing a call to the underlying system for each write. Unfortunately, each call to flush() on the buffered implementations simply causes the buffer to be flushed to the underlying output, and the output to be flushed as well.

The only solution I'm aware of at the moment is the one I mentioned in the bug report: Use a subclass of BufferedOutputStream or BufferedWriter, with flush() overridden to do essentially nothing. (See my NoFlushBufferedOutputStream and NoFlushBufferedWriter classes in MarkUtils-IO for an implementation.)
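A minimal sketch of the idea (the actual MarkUtils-IO classes may differ in details) - the key points are ignoring flush() requests, while still flushing the buffer exactly once on close():

```java
import java.io.*;

// A BufferedOutputStream that ignores the serializer's excessive flush()
// calls. The buffered data is still written out when the stream is closed.
public class NoFlushBufferedOutputStream extends BufferedOutputStream {
  public NoFlushBufferedOutputStream(OutputStream out) {
    super(out);
  }
  @Override
  public void flush() {
    // Intentionally ignore flush requests from the caller.
  }
  @Override
  public void close() throws IOException {
    super.flush(); // Flush the buffer once, then close the underlying stream.
    super.close();
  }
}
```

Overriding close() is important: the inherited close() would call the (now no-op) flush(), silently discarding any buffered data.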

Tuesday, May 19, 2009

Outlook HTML vs. Printing

Today I was reminded of why I haven't missed using Microsoft Outlook. I printed an email, but found that all the standard headers were missing - From, Sent, To, and Subject. Additionally, trying to troubleshoot using Print Preview also gave surprising results:

Microsoft Office Outlook - Print Preview is not available for HTML formatted items.

Microsoft's KB 222320 ("Some print options not available with HTML messages") does little more than acknowledge this as a problem.

Converting from HTML to Rich Text format is not an option. HTML is much more standards-compliant, even with Outlook's quirks. More importantly, HTML is much better supported by other email clients, and results in a smaller message size than Rich Text format.

I didn't find a "great" solution for this. The best work-around I found is to not have Outlook handle the printing of messages at all, even though it involves some extra steps. Simply save the message(s) using File, Save As, then select "HTML" for "Save As Type". Open the file in your favorite web browser, then preview and/or print from there.

I experienced this on Outlook 2003 with the latest service packs. The KB article states the issue also exists with 2002 (XP) and 2000. I don't have Office 2007 available to test, but I'd be curious of the results.

Wednesday, May 13, 2009

Dynamically Configuring Logging at Runtime

I've been using SLF4J - the Simple Logging Facade for Java, and Logback - a native and the preferred implementation, for about 3 years. Logback is an excellent replacement for the popular log4j project, whose development at Apache has mostly stalled. Both SLF4J and Logback were designed and are maintained by Ceki Gülcü, the founder of log4j. This combination of SLF4J and Logback is currently used in many significant projects, ironically including many Apache projects, as well as Hibernate, Jetty, and others. See also on Wikipedia: SLF4J and Log4j.

Like log4j and similar logging frameworks, Logback provides several powerful options for configuration. This includes an XML configuration file that supports variable substitution, nested variables and property files, default values, and file inclusion, among other features. However, what can be done when these options aren't enough?

In particular, Logback resolves its variables from Java's system properties. These must be set when Java starts, or at another point before SLF4J (Logback) is accessed for the first time and automatically configures itself. (Logback can always be reconfigured, but this isn't exactly clean and can cause other issues.) I have seen practices where all calls to the logging framework are expected to go through a proxy class that would first perform the configuration, but this is prone to error. Even if none of multiple developers accidentally call SLF4J or Logback without going through the proxy, there may be 3rd-party libraries that wouldn't even know of the proxy. Additionally, variables by themselves may not be able to provide the dynamic configuration options desired at runtime.

Even Java's built-in java.util.logging, introduced with Java 1.4, reads a java.util.logging.config.class system property that can be used to configure the logging within code. However, this again relies on the system property being set before logging is accessed. Ideally, it would have a default value and class name that could be read from the classpath. Logback does provide a StatusManager, but it is primarily meant only for receiving configuration status updates. While it could feasibly be used as a hook for further configuration, this would be an ugly hack at best, and is not the intended use of its StatusListeners.

I was adding SLF4J and Logback into a large, multi-tiered environment, with multiple application server nodes, and multiple JVMs per node. All JVMs operate from the same NAS mount, so even without considering the multiple JVMs per node, providing a separate configuration file per JVM is not practical - not to mention the maintenance burden of multiple files it would introduce. Logback's FileAppender actually supports having multiple JVMs write to the same log files through the use of the prudent configuration property - but not without approximately tripling the cost of writing logging events. In this performance-critical environment, this increased cost is not acceptable - so I needed a way to quickly, easily, and reliably store separate logging files per node and JVM.

The easiest way I found to implement this was by using a sub-classed FileAppender. Here is an example of my configuration:

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
    <rollingPolicy class="com.ziesemer.example.MAZTimeBasedRollingPolicy">
      <FileNamePattern>maz-%d-%jvm.log.gz</FileNamePattern>
      <MaxHistory>60</MaxHistory>
    </rollingPolicy>
    <layout class="ch.qos.logback.classic.PatternLayout">
      <!-- http://logback.qos.ch/manual/layouts.html#PatternLayout -->
      <Pattern>%d [%t] %-5p %c %X - %m%n</Pattern>
    </layout>
  </appender>
  <root>
    <level value="WARN"/>
    <appender-ref ref="FILE"/>
  </root>
</configuration>

And my sub-class:

package com.ziesemer.example;

import java.io.File;
import java.net.InetAddress;

import ch.qos.logback.core.rolling.TimeBasedRollingPolicy;

public class MAZTimeBasedRollingPolicy<E> extends TimeBasedRollingPolicy<E>{
  @Override
  public void setFileNamePattern(String fnp){
    try{
      fnp = fnp.replace("%jvm", System.getProperty("PROCESS_ID"));
      
      String serverName = InetAddress.getLocalHost().getHostName();
      File f = new File("/<logPath>/" + serverName);
      // Logback will create any necessary parent paths.
      super.setFileNamePattern(f.getAbsolutePath() + "/" + fnp);
    }catch(Exception ex){
      throw new RuntimeException(ex);
    }
  }
}

This assumes that the basic functionality of the time-based rolling policy and rolling file appender is still desired. Otherwise, the normal FileAppender could have been sub-classed and used instead. This also assumes that the parent process launching the Java process sets a Java system property called "PROCESS_ID". Otherwise, Igor Minar lists some options for obtaining the process ID through Java, though none are ideal: How a Java Application Can Discover its Process ID (PID) (blog.igorminar.com, 2007-03-03). Failing that, a random number or other information could potentially be used to differentiate by JVM.
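One such best-effort approach uses the JMX runtime bean. To be clear, the "pid@hostname" format is a JVM implementation detail, not guaranteed by any specification - which is exactly why passing the PID in from the launching script is preferable:

```java
import java.lang.management.ManagementFactory;

public class JvmId {
  // Best-effort process ID: on common Sun/Oracle JVMs the runtime name
  // looks like "12345@hostname", but this format is NOT guaranteed.
  public static String processId() {
    String name = ManagementFactory.getRuntimeMXBean().getName();
    int at = name.indexOf('@');
    return at > 0 ? name.substring(0, at) : name;
  }

  public static void main(String[] args) {
    // Only a fallback for when the launcher did not set -DPROCESS_ID=...:
    if (System.getProperty("PROCESS_ID") == null) {
      System.setProperty("PROCESS_ID", processId());
    }
    System.out.println("PROCESS_ID=" + System.getProperty("PROCESS_ID"));
  }
}
```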

Note that this implementation doesn't completely control the file name pattern - even though it could. Instead, it is using and extending the file name pattern configured in the XML. It intercepts the configured pattern through the setFileNamePattern(String) method, then calls super with the extended version. Besides adding child directories for each server, it adds support for a new, custom "%jvm" variable - in addition to the "%d" already supported by TimeBasedRollingPolicy.

Improving URLEncoder/URLDecoder Performance in Java

Please note that MarkUtils-Codec is intended as a complete replacement, and this "urlCodec" library is now in archival status.

I had a need to do some Percent-encoding (a.k.a. "URL encoding") in Java with high-performance requirements. Java provides a default implementation of this functionality in java.net.URLEncoder and java.net.URLDecoder. Unfortunately, it is not the best performing, due to both how the API was written as well as details within the implementation. A number of performance-related bugs have been filed on sun.com in relation to URLEncoder.

There is an alternative: org.apache.commons.codec.net.URLCodec from Apache Commons Codec. (Commons Codec also provides a useful implementation for Base64 encoding.) Unfortunately, Commons' URLCodec suffers from some of the same issues as Java's URLEncoder/URLDecoder.

The current sources for each are available online at jdk-jrl-sources.dev.java.net for the JDK (requires registration) and svn.apache.org for Commons. Here are some things I see that could be improved upon, especially considering the features readily-available in Java 1.5/5.0 and above:

Recommendations for the JDK:

  • Use of the synchronized StringBuffer instead of the faster StringBuilder. Since these are local method variables, and will never be accessed simultaneously by multiple threads, there is no need for the synchronization overhead. (Java 1.6/6.0's "escape analysis" attempts to skip the synchronization where it is not needed, but it doesn't always work.)
    • The same applies to the CharArrayWriter instance used. While none of CharArrayWriter's methods are marked as synchronized, its write(…) methods all make use of synchronization blocks - really the same thing.

I have reported the above observations to Sun in bug 6837325.

Recommendations for both the JDK and Commons:

  • When constructing any of the "buffer" classes, e.g. ByteArrayOutputStream, CharArrayWriter, StringBuilder, or StringBuffer, estimate and pass in an initial capacity. The JDK's URLEncoder currently does this for its StringBuffer, but should do this for its CharArrayWriter instance as well. Commons' URLCodec should do this for its ByteArrayOutputStream instance. If the classes' default buffer sizes are too small, they may have to resize by copying into new, larger buffers - which isn't exactly a "cheap" operation. If the classes' default buffer sizes are too large, memory may be unnecessarily wasted.
  • Both implementations are dependent on Charsets, but only accept them as their String name. Charset provides a simple and small cache for name lookups - storing only the last 2 Charsets used. This should not be relied upon, and both should accept Charset instances for other interoperability reasons as well.
  • Both implementations only handle fixed-size inputs and outputs. The JDK's URLEncoder only works with String instances. Commons' URLCodec is likewise based on Strings, but also works with byte[] arrays. This is a design-level constraint that essentially prevents efficient processing of larger or variable-length inputs. Instead, the "stream-supporting" interfaces such as CharSequence, Appendable, and java.nio's Buffer implementations of ByteBuffer and CharBuffer should be supported.

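A minimal sketch of the first two recommendations combined - accepting a Charset instance rather than its String name, and pre-sizing the output buffer. The newEncodeBuffer helper and its ~1.5x size heuristic are both invented for illustration, and are not part of any existing API:

```java
import java.nio.charset.Charset;

public class CapacityDemo{
  // Hypothetical helper: takes a Charset instance directly (no name
  // lookup needed), and pre-sizes the StringBuilder. The worst case
  // is 3 output chars per input char ("%XX"); 1.5x is a middle-ground
  // guess for input that is mostly "safe" characters.
  public static StringBuilder newEncodeBuffer(CharSequence in, Charset cs){
    int estimate = in.length() + (in.length() >> 1);
    return new StringBuilder(estimate);
  }

  public static void main(String[] args){
    StringBuilder sb = newEncodeBuffer("Hello World", Charset.forName("UTF-8"));
    // 11 input chars -> estimated capacity of 11 + 5 = 16
    System.out.println(sb.capacity());
  }
}
```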
Recommended replacement URLCodec API:

public class URLCodec{
  public static CharSequence encode(CharSequence in) throws IOException{…}
  public static void encode(CharSequence in, Appendable out) throws IOException{…}
  public static void encode(CharSequence in, Charset charset, Appendable out) throws IOException{…}
  public static void encode(ByteBuffer in, Appendable out) throws IOException{…}
  
  public static CharSequence decode(CharSequence in) throws IOException{…}
  public static void decode(CharSequence in, Appendable out) throws IOException{…}
  public static void decode(CharSequence in, Charset charset, Appendable out) throws IOException{…}
  public static byte[] decodeToBytes(CharSequence in) throws IOException{…}
  public static void decode(CharSequence in, OutputStream out) throws IOException{…}
}

public class URLEncoder implements Appendable, Flushable, Closeable{
  public URLEncoder(Appendable out){…}
  public URLEncoder(Appendable out, int bufferSize){…}
  public URLEncoder(Appendable out, int bufferSize, Charset charset){…}
  
  public Appendable append(CharSequence in) throws IOException{…}
  public Appendable append(char c) throws IOException{…}
  public Appendable append(CharSequence csq, int start, int end) throws IOException{…}
  public void close() throws IOException{…}
}

public class URLEncoderOutputStream extends OutputStream{
  public URLEncoderOutputStream(Appendable out){…}
  public void write(int b) throws IOException{…}
  public void write(byte[] b, int off, int len) throws IOException{…}
}

public class URLDecoder implements Appendable, Flushable, Closeable{
  public URLDecoder(Appendable out){…}
  public URLDecoder(Appendable out, int bufferSize){…}
  public URLDecoder(Appendable out, int bufferSize, Charset charset){…}
  
  public Appendable append(CharSequence in) throws IOException{…}
  public Appendable append(char c) throws IOException{…}
  public Appendable append(CharSequence csq, int start, int end) throws IOException{…}
  public void close() throws IOException{…}
}

The "byte[] decodeToBytes(CharSequence in)" and "void decode(CharSequence in, OutputStream out)" methods reflect that percent-encoding can represent any series of bytes - not just character representations.

Unfortunately, Java does not yet have an Appendable-equivalent interface for bytes (rather than chars), so there is no common interface for OutputStream and ByteBuffer. URLEncoderOutputStream is provided, since an OutputStream can be continually appended to without limit. Ideally, these byte methods would also be visible on URLEncoder, but that can't be done without spending the class's one option for inheritance: OutputStream is an abstract class rather than an interface, and Java does not support multiple inheritance. (This is also visible in the implementation of "decodeUrl(CharSequence in, Charset charset, Appendable out)".)
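The byte-oriented point can be sketched with a minimal, hypothetical encoder over a raw byte[] (following RFC 3986's unreserved character set; this is an illustration, not the actual urlCodec implementation):

```java
public class PercentBytes{
  private static final char[] HEX = "0123456789ABCDEF".toCharArray();

  // Minimal sketch: percent-encode an arbitrary byte[] - not just a
  // character representation. Unreserved characters (per RFC 3986)
  // pass through; every other byte becomes %XX.
  public static String encode(byte[] in){
    // Worst case: every byte expands to 3 chars.
    StringBuilder sb = new StringBuilder(in.length * 3);
    for(byte b : in){
      int c = b & 0xFF;
      if((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
          || (c >= '0' && c <= '9') || c == '-' || c == '.'
          || c == '_' || c == '~'){
        sb.append((char)c);
      }else{
        sb.append('%').append(HEX[c >> 4]).append(HEX[c & 0xF]);
      }
    }
    return sb.toString();
  }

  public static void main(String[] args){
    // 0x00 and 0xFF have no character meaning, but encode fine as bytes.
    System.out.println(encode(new byte[]{0x00, 'a', (byte)0xFF}));
  }
}
```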

Performance

Improving performance was the original goal, so here are some quick measurements. (While not perfect, some steps were taken to avoid the typical pitfalls of a naive microbenchmark.) I took a random sequence of 50 characters. Using 10,000,000 iterations per implementation and operation, I encoded the raw characters and decoded the encoded characters:

Implementation                 Encode (1.6/6.0)  Encode (1.5/5.0)  Decode (1.6/6.0)  Decode (1.5/5.0)
JDK URLEncoder/URLDecoder      8,817 ms          27,607 ms         5,980 ms          23,719 ms
Apache Commons URLCodec        9,470 ms          37,735 ms         9,323 ms          32,625 ms
com.ziesemer.utils.urlCodec    2,505 ms          14,934 ms         3,875 ms          11,889 ms

The 1.6/6.0 JDK was "Java(TM) SE Runtime Environment (build 1.6.0_13-b03), Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)". The 1.5/5.0 JDK was "Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_18-b02), Java HotSpot(TM) Client VM (build 1.5.0_18-b02, mixed mode, sharing)" (32-bit). All tests were run under Windows Vista 64-bit. I also tested the equivalent 32-bit version of the 1.6/6.0 JDK, and the results all fell somewhere in-between the previous results. The 32-bit version did default to the client VM rather than the server VM; forcing it to the server VM made up most, but not all, of the difference.

Note that com.ziesemer.utils.urlCodec is over 3x as fast as the JDK URLEncoder, and over 1.5x as fast as the JDK URLDecoder. (The JDK's URLDecoder was faster than the URLEncoder, so there wasn't as much room for improvement.)
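For reference, a stripped-down sketch of this kind of timing loop, using the JDK's java.net.URLEncoder. This is not the exact harness used for the numbers above, and is deliberately not rigorous: real measurements should add a warm-up phase and repeated runs, so that HotSpot has compiled the code under test before timing starts.

```java
public class EncodeBench{
  // Illustrative timing loop only - not the harness behind the table above.
  public static void main(String[] args) throws Exception{
    String raw = "a sample input string with spaces & symbols, 50 ch";
    int iterations = 100000; // scaled down from the 10,000,000 above
    int sink = 0; // keep a live result so the JIT can't drop the work
    long start = System.nanoTime();
    for(int i = 0; i < iterations; i++){
      sink += java.net.URLEncoder.encode(raw, "UTF-8").length();
    }
    long elapsed = System.nanoTime() - start;
    System.out.println((elapsed / 1000000L) + " ms (sink=" + sink + ")");
  }
}
```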

Download

Please note that MarkUtils-Codec is intended as a complete replacement, and this "urlCodec" library is now in archival status.

Adding to my collection of MarkUtils, com.ziesemer.utils.urlCodec is available on ziesemer.java.net under the GPL license, complete with source code, a compiled .jar, generated JavaDocs, and a suite of JUnit tests. Download the com.ziesemer.utils.urlCodec-*.zip distribution from here. Please report any bugs or feature requests on the java.net Issue Tracker.

Saturday, May 2, 2009

HP 0xC0000005 Print Driver Errors, Works Font Cache, and more HP issues

I had a call to help my relative install something to open MS Word files. I thought this would be easy - install the free OpenOffice.org. (His computer came with a 60 day trial of MS Office 2003, which had since expired.) Unfortunately, most of the files he needed to open were actually *.wps files for Microsoft Works. Works was failing to open due to an error, and the current "vanilla" version of OpenOffice.org doesn't support the MS Works file format.

Microsoft Works© Font Cache has encountered a problem and needs to close.  We are sorry for the inconvenience.

Clicking for the details reports the following:

AppName: wkgdcach.exe
AppVer: 8.4.623.0
ModName: hpz3r5mu.dll
ModVer: 61.73.241.0
Offset: 000a0490

Several searches suggested that this could be solved by forcing Windows' font cache to rebuild. Most included removing the C:\Windows\ttfCache file (not folder), which doesn't even appear to exist under current versions of Windows (XP or newer). There is a C:\windows\system32\FNTCACHE.dat file, but renaming this didn't help. What I found is that this was caused by the printing system. Temporarily stopping the "Print Spooler" Windows service (Spooler) prevented the error and allowed Works to start. After restarting the Spooler service, I found that removing the HP printer drivers also provided a work-around.

I completely removed all the HP printer drivers, then reinstalled. The Works issue with wkgdcach.exe came back, as well as the following error when trying to open the printer's property pages by right-clicking the printer in the "Printers and Faxes" folder and clicking Properties:

C:\WINDOWS\Explorer.EXE  Function address 0x500a0490 caused a protection fault. (exception code 0xc0000005)  Some or all property page(s) may not be displayed.

This was after installing the latest 10.0.0 (06-2008) version of the "HP Deskjet Full Feature Software and Drivers" (100_215_DJ_AIO_03_F4200_Full_NonNet_enu.exe) from HP's web site for the HP Deskjet F4240. Installing the following available updates didn't help either:

  • "Critical Update to Enhance Reliability of Network and USB Connectivity and Improve System Responsiveness While Printing", 04-2009, 2.0, slp_dd_hathi_110_017.exe
  • "Critical Update to Correct a PC to Printer Communication Issue", 03-2009, 1.0, ConvergedIO_HPCOM_V3.exe

Removing and installing the 11.0.0 (06-200) version of the "HP Deskjet Basic Driver" didn't help either, nor did re-installing the above updates. There were other issues with this machine, so I went ahead with a back-up and reinstall operation. Unfortunately, there was no stand-alone Windows XP installation disc with the system, just the factory restore partition - so I used that instead. (This is also a Hewlett-Packard / Compaq computer system.) It would be possible to "customize" an XP install disc that would work with the OEM key on the system, but I didn't have the time for this - though it probably would have saved me time in the long run. After re-installing, I ended up with exactly the same issues - again.

HP Online Support Issues:

After already wasting too many hours on these issues, I thought I'd try my luck with HP support. I prefer using online chat over phone support for a number of reasons. Unfortunately, HP seems to be having serious issues with that as well. On multiple computers, regardless of the web browser used, I could not get their chat program to start. Using Mozilla Firefox redirected to the following server-side error page from Microsoft ASP.NET:

  We're sorry, but an unexpected error has occurred.
The error has been logged and will be examined for further review.

2009-05-02 21:14:05Z Error on 15.201.8.161: Unhandled exception caught: Error Message: Input string was not in a correct format.
 Error Occurred On: /ChatEntry/Chat.aspx
 ExceptionType: System.FormatException
 Stack Trace:
   at System.Number.StringToNumber(String str, NumberStyles options, NumberBuffer& number, NumberFormatInfo info, Boolean parseDecimal)
   at System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info)
   at System.Convert.ToInt32(String value)
   at ChatEntry.Chat.IsSupportedBrowser() in C:\ASTRO\Chat\Integration\inetpub\wwwroot\chat\Chat.aspx.vb:line 346
   at ChatEntry.Chat.Page_Load(Object sender, EventArgs e) in C:\ASTRO\Chat\Integration\inetpub\wwwroot\chat\Chat.aspx.vb:line 163
   at System.Web.UI.Control.OnLoad(EventArgs e)
   at System.Web.UI.Control.LoadRecursive()
   at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)

(If they're going to use ASP.NET, they could at least do themselves a favor and use C#.NET instead of VB.NET.) Using Microsoft Internet Explorer 8 at least brought up a chat dialog window, but failed with client-side scripting errors:

Webpage error details

User Agent: Mozilla/4.0 (compatible; MSIE 8.0; …)
Timestamp: Sat, 2 May 2009 21:22:27 UTC


Message: Object expected
Line: 609
Char: 2
Code: 0
URI: https://wimpro.cce.hp.com/system/LiveCustomerServlet.egain

I gave up and called the 800 technical support number. When I was connected to a representative, I explained that their online chat system appeared to be broken, and his answer was that it must be because it is the weekend (!).

The solution:

The HP representative said he recognized this error, and suggested trying to create a new user profile. I feel a bit disappointed for not trying this myself first, as it is a usual part of my troubleshooting routine. Sure enough, this solved the issue - for both the printer driver and MS Works. In this case, even the system reinstall didn't help, as HP's default user profile included in the recovery partition was already "corrupt".

When I had previously searched HP's online knowledge base, there were no results for "0xC0000005" - so I strongly suggested that this be included for future searches. I was also surprised that a general online search for 0xC0000005 errors returned a number of results, but no real solutions and nothing related to the user profile.

It should be possible to correct this issue in the current user profile without starting a new, "clean" user profile. However, I haven't found any obvious fixes - such as something to reset under "Hewlett-Packard" in the registry or the user profile folder. Hopefully, as with my previous post on HP driver issues, this post will become helpful to many other users.