Wednesday, May 13, 2009

Improving URLEncoder/URLDecoder Performance in Java

Please note that MarkUtils-Codec is intended as a complete replacement, and this "urlCodec" library is now in archival status.

I had a need to do some Percent-encoding (a.k.a. "URL encoding") in Java with high-performance requirements. Java provides a default implementation of this functionality in java.net.URLEncoder and java.net.URLDecoder. Unfortunately, it is not the best performing, due to both how the API was written as well as details within the implementation. A number of performance-related bugs have been filed on sun.com in relation to URLEncoder.

There is an alternative: org.apache.commons.codec.net.URLCodec from Apache Commons Codec. (Commons Codec also provides a useful implementation for Base64 encoding.) Unfortunately, Commons' URLCodec suffers some of the same issues as Java's URLEncoder/URLDecoder.

The current sources for each are available online at jdk-jrl-sources.dev.java.net for the JDK (requires registration) and svn.apache.org for Commons. Here are some things I see that could be improved upon, especially considering the features readily-available in Java 1.5/5.0 and above:

Recommendations for the JDK:

  • Use of the synchronized StringBuffer instead of the faster StringBuilder. Since these are local method variables, and will never be accessed simultaneously by multiple threads, there is no need for the synchronization overhead. (Java 1.6/6.0's "escape analysis" attempts to skip the synchronization where it is not needed, but it doesn't always work.)
    • The same applies to the CharArrayWriter instance used. While none of CharArrayWriter's methods are marked as synchronized, its write(…) methods all make use of synchronization blocks - really the same thing.

I have reported the above observations to Sun in bug 6837325.

Recommendations for both the JDK and Commons:

  • When constructing any of the "buffer" classes, e.g. ByteArrayOutputStream, CharArrayWriter, StringBuilder, or StringBuffer, estimate and pass-in an estimated capacity. The JDK's URLEncoder currently does this for its StringBuffer, but should do this for its CharArrayWriter instance as well. Common's URLCodec should do this for its ByteArrayOutputStream instance. If the classes' default buffer sizes are too small, they may have to resize by copying into new, larger buffers - which isn't exactly a "cheap" operation. If the classes' default buffer sizes are too large, memory may be unnecessarily wasted.
  • Both implementations are dependent on Charsets, but only accept them as their String name. Charset provides a simple and small cache for name lookups - storing only the last 2 Charsets used. This should not be relied upon, and both should accept Charset instances for other interoperability reasons as well.
  • Both implementations only handle fixed-size inputs and outputs. The JDK's URLEncoder only works with String instances. Commons' URLCodec is also based on Strings, but also works with byte[] arrays. This is a design-level constraint that essentially prevents efficient processing of larger or variable-length inputs. Instead, the "stream-supporting" interfaces such as CharSequence, Appendable, and java.nio's Buffer implementations of ByteBuffer and CharBuffer should be supported.

Recommended replacement URLCodec API:

public class URLCodec{
  public static CharSequence encode(CharSequence in) throws IOException{…}
  public static void encode(CharSequence in, Appendable out) throws IOException{…}
  public static void encode(CharSequence in, Charset charset, Appendable out) throws IOException{…}
  public static void encode(ByteBuffer in, Appendable out) throws IOException{…}
  
  public static CharSequence decode(CharSequence in) throws IOException{…}
  public static void decode(CharSequence in, Appendable out) throws IOException{…}
  public static void decode(CharSequence in, Charset charset, Appendable out) throws IOException{…}
  public static byte[] decodeToBytes(CharSequence in) throws IOException{…}
  public static void decode(CharSequence in, OutputStream out) throws IOException{…}
}

public class URLEncoder implements Appendable, Flushable, Closeable{
  public URLEncoder(Appendable out){…}
  public URLEncoder(Appendable out, int bufferSize){…}
  public URLEncoder(Appendable out, int bufferSize, Charset charset){…}
  
  public Appendable append(CharSequence in) throws IOException{…}
  public Appendable append(char c) throws IOException{…}
  public Appendable append(CharSequence csq, int start, int end) throws IOException{…}
  public void close() throws IOException{…}
}

public class URLEncoderOutputStream extends OutputStream{
  public URLEncoderOutputStream(Appendable out){…}
  public void write(int b) throws IOException{…}
  public void write(byte[] b, int off, int len) throws IOException{…}
}

public class URLDecoder implements Appendable, Flushable, Closeable{
  public URLDecoder(Appendable out){…}
  public URLDecoder(Appendable out, int bufferSize){…}
  public URLDecoder(Appendable out, int bufferSize, Charset charset){…}
  
  public Appendable append(CharSequence in) throws IOException{…}
  public Appendable append(char c) throws IOException{…}
  public Appendable append(CharSequence csq, int start, int end) throws IOException{…}
  public void close() throws IOException{…}
}

The "byte[] decodeToBytes(CharSequence in)" and "void decode(CharSequence in, OutputStream out)" methods reflect that Percent-encoding can be used to encode any series of bytes - not just character representations.

Unfortunately, Java does not yet have an Appendable-equivalent interface for bytes (rather than chars). As such, there is no common interface for OutputStream and ByteBuffer. URLEncoderOutputStream is provided, since OutputStream can be continually appended to without limit. Ideally, these byte methods would also be visible on URLEncoder, but can't be done without spending the class's one option for inheritance, as OutputStream is an abstract class rather than an interface - and Java does not support multiple inheritance. (This also is visible in the implementation of "decodeUrl(CharSequence in, Charset charset, Appendable out)").

Performance

Improving performance was the original goal, so here are some quick measurements. (While not perfect, some steps were taken to avoid the typical flaws of running a flawed microbenchmark). I took a random sequence of 50 characters. Using 10,000,000 iterations per implementation and operation, I encoded the raw characters and decoded the encoded characters:

Implementation Encode Decode
JDK Version: 1.6/6.0 1.5/5.0 1.6/6.0 1.5/5.0
JDK URLEncoder/URLDecoder 8,817 ms 27,607 5,980 ms 23,719
Apache Commons URLCodec 9,470 ms 37,735 9,323 ms 32,625
com.ziesemer.utils.urlCodec 2,505 ms 14,934 3,875 ms 11,889

The 1.6/6.0 JDK was "Java(TM) SE Runtime Environment (build 1.6.0_13-b03), Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)". The 1.5/5.0 JDK was "Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_18-b02), Java HotSpot(TM) Client VM (build 1.5.0_18-b02, mixed mode, sharing)" (32-bit). All tests were run under Windows Vista 64-bit. I also tested the equivalent 32-bit version of the 1.6/6.0 JDK, and the results were all somewhere in-between the previous results. The 32-bit version did default to the client VM rather than the server VM. Forcing it to the server VM did make-up most, but not all of the differences.

Note that com.ziesemer.utils.urlCodec is over 3x as fast as the JDK URLEncoder, and over 1.5x as fast as the JDK URLDecoder. (The JDK's URLDecoder was faster than the URLEncoder, so there wasn't as much room for improvement.)

Download

Please note that MarkUtils-Codec is intended as a complete replacement, and this "urlCodec" library is now in archival status.

Adding to my collection of MarkUtils, com.ziesemer.utils.urlCodec is available on ziesemer.java.net under the GPL license, complete with source code, a compiled .jar, generated JavaDocs, and a suite of JUnit tests. Download the com.ziesemer.utils.urlCodec-*.zip distribution from here. Please report any bugs or feature requests on the java.net Issue Tracker.

7 comments:

nordlig ulv said...

Good api. Have you report as a feature request to bugs.sun.com? Would like to see it in jdk)

Mark A. Ziesemer said...

"nordlig ulv": As noted, I did file bug # 6837325 around this with Sun. Additionally, please note that MarkUtils-Codec is a complete replacement for this library. Hopefully you will agree that it is even a better API.

nordlig ulv said...

Thank you for the answer. In my project i need base64 and url encodings so i will use your solution because of good implementation.
But my main point is that i am kinda idealist and would like to see more robust, revisited apis in jdk. Like Properties not being extended from Hashtable, better logger api and etc.
I know there are MarkUtils, google guava, slf4j but it'd be nice they were bundled with jdk. And unfortunately their inclusion into jdk is unlikely, because of api compatibly.

Anonymous said...

Why GPL license and not Apache one ?

Mark A. Ziesemer said...

Anonymous - please present a compelling argument for your desire, and I might be open to dual-licensing it. (To make this an official request, please open a ticket at https://java.net/jira/browse/ZIESEMER for this.)

AlexR said...

Hi Mark,

I saw additional references in web that note performance benefits of your URLDecoder and would like to use it in project I am working on. Unfortunately ziesemer.dev.java.net is not available. Is there other way to get your code?

Mark A. Ziesemer said...

Alex - as I think you've already seen, java.net had moved their URLs a while ago, such that ziesemer.java.net is now the correct address - not ziesemer.dev.java.net. This was already correct at /2009/09/markutils-codec-base64-url-bytechar.html - but I will correct it here as well. Do please note, though - the general MarkUtils-Codec project is intended as a more general, complete replacement.