Please note that MarkUtils-Codec is intended as a complete replacement, and this "urlCodec" library is now in archival status.
I had a need to do some Percent-encoding (a.k.a. "URL encoding") in Java with high-performance requirements.
Java provides a default implementation of this functionality in java.net.URLEncoder
and java.net.URLDecoder
.
Unfortunately, it is not the best performing, due to both how the API was written as well as details within the implementation.
A number of performance-related bugs have been filed on sun.com in relation to URLEncoder
.
There is an alternative: org.apache.commons.codec.net.URLCodec
from Apache Commons Codec.
(Commons Codec also provides a useful implementation for Base64 encoding.)
Unfortunately, Commons' URLCodec
suffers some of the same issues as Java's URLEncoder
/URLDecoder
.
The current sources for each are available online at jdk-jrl-sources.dev.java.net for the JDK (requires registration) and svn.apache.org for Commons. Here are some things I see that could be improved upon, especially considering the features readily-available in Java 1.5/5.0 and above:
Recommendations for the JDK:
-
Use of the synchronized
StringBuffer
instead of the fasterStringBuilder
. Since these are local method variables, and will never be accessed simultaneously by multiple threads, there is no need for the synchronization overhead. (Java 1.6/6.0's "escape analysis" attempts to skip the synchronization where it is not needed, but it doesn't always work.)-
The same applies to the
CharArrayWriter
instance used. While none ofCharArrayWriter
's methods are marked as synchronized, itswrite(…)
methods all make use of synchronization blocks - really the same thing.
-
The same applies to the
I have reported the above observations to Sun in bug 6837325.
Recommendations for both the JDK and Commons:
-
When constructing any of the "buffer" classes, e.g.
ByteArrayOutputStream
,CharArrayWriter
,StringBuilder
, orStringBuffer
, estimate and pass-in an estimated capacity. The JDK'sURLEncoder
currently does this for itsStringBuffer
, but should do this for itsCharArrayWriter
instance as well. Common'sURLCodec
should do this for itsByteArrayOutputStream
instance. If the classes' default buffer sizes are too small, they may have to resize by copying into new, larger buffers - which isn't exactly a "cheap" operation. If the classes' default buffer sizes are too large, memory may be unnecessarily wasted. -
Both implementations are dependent on
Charset
s, but only accept them as their String name.Charset
provides a simple and small cache for name lookups - storing only the last 2Charset
s used. This should not be relied upon, and both should acceptCharset
instances for other interoperability reasons as well. -
Both implementations only handle fixed-size inputs and outputs.
The JDK's
URLEncoder
only works withString
instances. Commons'URLCodec
is also based onString
s, but also works withbyte[]
arrays. This is a design-level constraint that essentially prevents efficient processing of larger or variable-length inputs. Instead, the "stream-supporting" interfaces such asCharSequence
,Appendable
, andjava.nio
'sBuffer
implementations ofByteBuffer
andCharBuffer
should be supported.
Recommended replacement URLCodec API:
public class URLCodec{ public static CharSequence encode(CharSequence in) throws IOException{…} public static void encode(CharSequence in, Appendable out) throws IOException{…} public static void encode(CharSequence in, Charset charset, Appendable out) throws IOException{…} public static void encode(ByteBuffer in, Appendable out) throws IOException{…} public static CharSequence decode(CharSequence in) throws IOException{…} public static void decode(CharSequence in, Appendable out) throws IOException{…} public static void decode(CharSequence in, Charset charset, Appendable out) throws IOException{…} public static byte[] decodeToBytes(CharSequence in) throws IOException{…} public static void decode(CharSequence in, OutputStream out) throws IOException{…} } public class URLEncoder implements Appendable, Flushable, Closeable{ public URLEncoder(Appendable out){…} public URLEncoder(Appendable out, int bufferSize){…} public URLEncoder(Appendable out, int bufferSize, Charset charset){…} public Appendable append(CharSequence in) throws IOException{…} public Appendable append(char c) throws IOException{…} public Appendable append(CharSequence csq, int start, int end) throws IOException{…} public void close() throws IOException{…} } public class URLEncoderOutputStream extends OutputStream{ public URLEncoderOutputStream(Appendable out){…} public void write(int b) throws IOException{…} public void write(byte[] b, int off, int len) throws IOException{…} } public class URLDecoder implements Appendable, Flushable, Closeable{ public URLDecoder(Appendable out){…} public URLDecoder(Appendable out, int bufferSize){…} public URLDecoder(Appendable out, int bufferSize, Charset charset){…} public Appendable append(CharSequence in) throws IOException{…} public Appendable append(char c) throws IOException{…} public Appendable append(CharSequence csq, int start, int end) throws IOException{…} public void close() throws IOException{…} }
The "byte[] decodeToBytes(CharSequence in)
" and "void decode(CharSequence in, OutputStream out)
" methods reflect that Percent-encoding can be used to encode any series of bytes - not just character representations.
Unfortunately, Java does not yet have an Appendable
-equivalent interface for bytes (rather than chars).
As such, there is no common interface for OutputStream
and ByteBuffer
.
URLEncoderOutputStream
is provided, since OutputStream
can be continually appended to without limit.
Ideally, these byte methods would also be visible on URLEncoder
, but can't be done without spending the class's one option for inheritance, as OutputStream
is an abstract class rather than an interface - and Java does not support multiple inheritance.
(This also is visible in the implementation of "decodeUrl(CharSequence in, Charset charset, Appendable out)
").
Performance
Improving performance was the original goal, so here are some quick measurements. (While not perfect, some steps were taken to avoid the typical flaws of running a flawed microbenchmark). I took a random sequence of 50 characters. Using 10,000,000 iterations per implementation and operation, I encoded the raw characters and decoded the encoded characters:
Implementation | Encode | Decode | ||
---|---|---|---|---|
JDK Version: | 1.6/6.0 | 1.5/5.0 | 1.6/6.0 | 1.5/5.0 |
JDK URLEncoder/URLDecoder | 8,817 ms | 27,607 | 5,980 ms | 23,719 |
Apache Commons URLCodec | 9,470 ms | 37,735 | 9,323 ms | 32,625 |
com.ziesemer.utils.urlCodec | 2,505 ms | 14,934 | 3,875 ms | 11,889 |
The 1.6/6.0 JDK was "Java(TM) SE Runtime Environment (build 1.6.0_13-b03), Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)". The 1.5/5.0 JDK was "Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_18-b02), Java HotSpot(TM) Client VM (build 1.5.0_18-b02, mixed mode, sharing)" (32-bit). All tests were run under Windows Vista 64-bit. I also tested the equivalent 32-bit version of the 1.6/6.0 JDK, and the results were all somewhere in-between the previous results. The 32-bit version did default to the client VM rather than the server VM. Forcing it to the server VM did make-up most, but not all of the differences.
Note that com.ziesemer.utils.urlCodec is over 3x as fast as the JDK URLEncoder, and over 1.5x as fast as the JDK URLDecoder. (The JDK's URLDecoder was faster than the URLEncoder, so there wasn't as much room for improvement.)
Download
Please note that MarkUtils-Codec is intended as a complete replacement, and this "urlCodec" library is now in archival status.
Adding to my collection of MarkUtils, com.ziesemer.utils.urlCodec
is available on ziesemer.java.net under the GPL license, complete with source code, a compiled .jar, generated JavaDocs, and a suite of JUnit tests.
Download the com.ziesemer.utils.urlCodec-*.zip
distribution from here.
Please report any bugs or feature requests on the java.net Issue Tracker.
7 comments:
Good api. Have you report as a feature request to bugs.sun.com? Would like to see it in jdk)
"nordlig ulv": As noted, I did file bug # 6837325 around this with Sun. Additionally, please note that MarkUtils-Codec is a complete replacement for this library. Hopefully you will agree that it is even a better API.
Thank you for the answer. In my project i need base64 and url encodings so i will use your solution because of good implementation.
But my main point is that i am kinda idealist and would like to see more robust, revisited apis in jdk. Like Properties not being extended from Hashtable, better logger api and etc.
I know there are MarkUtils, google guava, slf4j but it'd be nice they were bundled with jdk. And unfortunately their inclusion into jdk is unlikely, because of api compatibly.
Why GPL license and not Apache one ?
Anonymous - please present a compelling argument for your desire, and I might be open to dual-licensing it. (To make this an official request, please open a ticket at https://java.net/jira/browse/ZIESEMER for this.)
Hi Mark,
I saw additional references in web that note performance benefits of your URLDecoder and would like to use it in project I am working on. Unfortunately ziesemer.dev.java.net is not available. Is there other way to get your code?
Alex - as I think you've already seen, java.net had moved their URLs a while ago, such that ziesemer.java.net is now the correct address - not ziesemer.dev.java.net. This was already correct at /2009/09/markutils-codec-base64-url-bytechar.html - but I will correct it here as well. Do please note, though - the general MarkUtils-Codec project is intended as a more general, complete replacement.
Post a Comment