Saturday, May 23, 2009

Handling XML Encodings with MarkUtils

With today's focus on higher-level languages, lower-level details are often overlooked. Character encodings are one such detail that cannot be ignored, particularly when working with interfaces such as web services, which may communicate with a wide variety of platforms and languages - both programming and written.

US-ASCII is probably one of the most significant character encodings, though not the earliest. ASCII is the predecessor to the ISO/IEC 646 standard, and its character set is a subset of Unicode; UTF-8 in particular encodes every ASCII character as the same single byte. (See also: English in computing.) US-ASCII by itself cannot represent more than its 95 defined printable characters - 62 of these being [A-Za-z0-9], with the remainder used for punctuation and other miscellaneous symbols. This limited character set makes it impossible to accurately represent other languages written in the Latin alphabet, such as Spanish, French, and German. "Extended ASCII" encodings such as ISO/IEC 8859-1 provide complete or near-complete coverage of these and other alphabets, but still cannot represent characters outside their single-byte range. Using a character encoding that fully supports the Universal Character Set (UCS), such as UTF-8, is the recommended solution to this and related issues.

Many character encodings share the same byte representations for the basic English character set [A-Za-z] and [0-9]. These include US-ASCII, ISO-8859-1, UTF-8, and others. For example, in these character encodings, the following mappings are always true:

Character  Decimal  Hexadecimal
0          48       0x30
9          57       0x39
A          65       0x41
Z          90       0x5A
a          97       0x61
z          122      0x7A

Other character encodings, such as EBCDIC, are completely different.
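
The shared mappings above can be checked directly in Java. The following is only a minimal sketch: the class and method names are made up for illustration, and the EBCDIC comparison assumes the optional IBM037 charset is installed in the JRE, which is why it is guarded.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class SharedMappingsDemo {

  /** Returns true if every listed charset encodes the text to identical bytes. */
  static boolean sameBytes(String text, Charset... charsets) {
    byte[] first = text.getBytes(charsets[0]);
    for (Charset cs : charsets) {
      if (!Arrays.equals(first, text.getBytes(cs))) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    Charset ascii = Charset.forName("US-ASCII");
    Charset latin1 = Charset.forName("ISO-8859-1");
    Charset utf8 = Charset.forName("UTF-8");

    // The basic characters from the table encode identically in all three.
    System.out.println(sameBytes("09AZaz", ascii, latin1, utf8)); // true

    // IBM037 (an EBCDIC code page) is an optional charset, so check before using it.
    if (Charset.isSupported("IBM037")) {
      byte[] ebcdic = "A".getBytes(Charset.forName("IBM037"));
      System.out.printf("A in IBM037: 0x%02X%n", ebcdic[0] & 0xFF); // 0xC1, not 0x41
    }
  }
}
```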

Even when receiving a byte stream from one of the "compatible" encodings, the correct encoding still needs to be determined, as each encoding is handled slightly differently. Certain byte sequences that may be allowed in US-ASCII variants may be invalid in UTF-8. UTF-16 and UTF-32 are easily converted to and from UTF-8, but may be sent with different byte orderings - either big-endian or little-endian.
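
As a rough illustration of both points, the sketch below decodes the same bytes strictly under two encodings, then prints one character in both UTF-16 byte orders. The class and method names are hypothetical; only standard java.nio.charset calls are used.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecodeDemo {

  /** Returns true only if the bytes decode without error in the given charset. */
  static boolean isValid(byte[] bytes, Charset charset) {
    try {
      charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // 0xE9 is an accented "e" in ISO-8859-1, but a lone 0xE9 is an
    // incomplete multi-byte sequence in UTF-8.
    byte[] latin1Bytes = {'c', 'a', 'f', (byte) 0xE9};
    System.out.println(isValid(latin1Bytes, Charset.forName("ISO-8859-1"))); // true
    System.out.println(isValid(latin1Bytes, Charset.forName("UTF-8")));      // false

    // The same character in the two UTF-16 byte orders.
    byte[] be = "A".getBytes(Charset.forName("UTF-16BE")); // 0x00 0x41
    byte[] le = "A".getBytes(Charset.forName("UTF-16LE")); // 0x41 0x00
    System.out.printf("%02X %02X / %02X %02X%n", be[0], be[1], le[0], le[1]);
  }
}
```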

Encodings in XML

In the W3C's recommendation for XML (essentially the XML standard), the use of character encodings is specifically addressed in section 4.3.3. However, since the encoding is declared within the XML document itself, the encoding declaration is also encoded, and must be read before the encoding is known. This is addressed in the appendix, Autodetection of Character Encodings (Non-Normative):

The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use—which is what the internal label is trying to indicate. In the general case, this is a hopeless situation. It is not entirely hopeless in XML, however, because XML limits the general case in two ways: each implementation is assumed to support only a finite set of character encodings, and the XML encoding declaration is restricted in position and content in order to make it feasible to autodetect the character encoding in use in each entity in normal cases.

Essentially, the character encoding in use may be determined by a combination of the byte order mark (BOM), if present, and the value of the "encoding" attribute in the XML declaration of the prolog. However, there may also be external encoding information available from higher-level protocols that must be factored into the determination. This includes the MIME Content-Type sent over HTTP, as defined in RFC 3023.
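
To make the byte order mark step concrete, here is a minimal sketch that maps the well-known Unicode BOM byte sequences to encoding names. It is not the MarkUtils-XML or Xerces implementation; the class and method names are hypothetical, and it works on an already-read byte prefix rather than a stream.

```java
final class BomSniffer {

  /**
   * Returns the encoding name implied by a leading byte order mark,
   * or null if the prefix does not start with a recognized BOM.
   */
  static String fromBom(byte[] prefix) {
    // Check the 4-byte UTF-32 marks first: the UTF-32LE mark begins
    // with the same two bytes as the UTF-16LE mark.
    if (startsWith(prefix, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
    if (startsWith(prefix, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
    if (startsWith(prefix, 0xEF, 0xBB, 0xBF)) return "UTF-8";
    if (startsWith(prefix, 0xFE, 0xFF)) return "UTF-16BE";
    if (startsWith(prefix, 0xFF, 0xFE)) return "UTF-16LE";
    return null;
  }

  private static boolean startsWith(byte[] data, int... expected) {
    if (data.length < expected.length) return false;
    for (int i = 0; i < expected.length; i++) {
      if ((data[i] & 0xFF) != expected[i]) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] utf8Doc = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', '?', 'x', 'm', 'l'};
    System.out.println(fromBom(utf8Doc));                     // UTF-8
    System.out.println(fromBom(new byte[] {'<', '?', 'x'}));  // null (no BOM)
  }
}
```

When no BOM is found, the encoding declaration and any external information still have to be consulted, as described above.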

In Apache Xerces2-J, this is mostly handled within the implementation, particularly by org.apache.xerces.impl.XMLEntityManager.createReader(…). Similar code is also in org.apache.xerces.xinclude.XIncludeTextReader, which also accounts for at least most of the RFC 3023 rules, but only if the input source is a "system Id" which results in the use of a URLConnection.
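
For reference, the core RFC 3023 precedence for the two basic media types can be sketched as follows. This is a simplified illustration with hypothetical names, not the Xerces logic: it covers only text/xml and application/xml, and assumes the caller has already split the Content-Type header and autodetected an encoding from the document itself.

```java
final class Rfc3023Sketch {

  /**
   * @param mime         media type without parameters, e.g. "text/xml"
   * @param charsetParam value of the charset parameter, or null if absent
   * @param detected     encoding autodetected from the BOM / XML declaration
   * @return the encoding that should be used to read the entity
   */
  static String effectiveEncoding(String mime, String charsetParam, String detected) {
    if (charsetParam != null) {
      // An explicit charset parameter is authoritative for both media types.
      return charsetParam;
    }
    if ("text/xml".equalsIgnoreCase(mime)) {
      // text/xml without a charset parameter defaults to us-ascii,
      // regardless of what the document declares.
      return "US-ASCII";
    }
    // application/xml without a charset parameter: fall back to the
    // XML specification's own detection rules.
    return detected;
  }

  public static void main(String[] args) {
    System.out.println(effectiveEncoding("text/xml", null, "UTF-8"));           // US-ASCII
    System.out.println(effectiveEncoding("application/xml", null, "UTF-8"));    // UTF-8
    System.out.println(effectiveEncoding("text/xml", "iso-8859-1", "UTF-8"));   // iso-8859-1
  }
}
```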

Addition to MarkUtils-XML

I needed to meet these same requirements on an input stream to determine the encoding, but before handing off the processing directly to Xerces or another XML processor. I also wanted a flexible solution that would follow the RFC 3023 recommendation, but would be able to work with a variety of sources - including a URLConnection or an HttpServletRequest. I found no existing public API that met these requirements, so I made my own - and am making it available for public use as part of MarkUtils-XML.

Below is an outline of the public API. While it may at first appear a bit lengthy, it is designed to be flexible and usable in high-performance environments. For brevity, the Javadoc documentation is omitted here. The complete source code, along with a compiled .jar, generated Javadocs, and a comprehensive suite of JUnit tests, is available in the com.ziesemer.utils.xml-*.zip distribution on ziesemer.java.net, with XmlEncoding available starting with version 2009-05-20. The tests include all 16 usable MIME Content-Type examples from http://tools.ietf.org/html/rfc3023#section-8.

public class XmlEncoding{
  
  public static final String US_ASCII_NAME = "US-ASCII";
  public static final Charset US_ASCII;
  public static final String UTF_8_NAME = "UTF-8";
  public static final Charset UTF_8;
  public static final String UTF_16BE_NAME = "UTF-16BE";
  public static final Charset UTF_16BE;
  public static final String UTF_16LE_NAME = "UTF-16LE";
  public static final Charset UTF_16LE;
  
  // Below are not guaranteed to be in both Java 1.5/5.0 and 1.6/6.0.
  public static final String UTF_32BE_NAME = "UTF-32BE";
  public static final String UTF_32LE_NAME = "UTF-32LE";
  public static final String IBM037_NAME = "IBM037";
  
  public static InputStream createBufferedStream(InputStream is) throws IOException{…}
  
  public static String determineFromBOM(InputStream is) throws IOException{…}
  
  public static String guessFromDeclaration(InputStream is) throws IOException{…}
  
  public static String determineFromDeclaration(InputStream is, Charset guessed) throws IOException{…}
  public static String determineFromDeclaration(InputStream is, Charset guessed, int readSize) throws IOException{…}
  
  public static String calculate(InputStream is) throws IOException{…}
  public static String calculate(InputStream is, String contentTypeMime, String contentTypeEncoding) throws IOException{…}
  public static String calculate(InputStream is, String contentType) throws IOException{…}
  
  public static InputSource createInputSource(InputStream is) throws IOException{…}
  public static InputSource createInputSource(InputStream is, Charset charset) throws IOException{…}
  public static InputSource createInputSource(InputStream is, String contentTypeMime, String contentTypeEncoding) throws IOException{…}
  public static InputSource createInputSource(InputStream is, String contentType) throws IOException{…}
  
  public static String getContentTypeMime(String contentType){…}
  
  public static String getContentTypeEncoding(String contentType){…}
  
  public static boolean isContentTypeTextXml(String mime){…}
  
  public static boolean isContentTypeApplicationXml(String mime){…}
}

In most instances, only one of the createInputSource methods will be necessary. For example, from an HttpServletRequest:

HttpServletRequest req = …;
InputSource iSource = XmlEncoding.createInputSource(
  req.getInputStream(), req.getContentType());

All methods in this class that accept an InputStream, other than createInputSource(…), require that mark(int) and reset() are supported. This allows bytes to be read and then "unread", so that the portions already read can be re-read by the XML processor as appropriate. For example, determineFromBOM(InputStream) will consume the BOM bytes if the encoding can be successfully determined, while the guessFromDeclaration(…) and determineFromDeclaration(…) methods will always reset the InputStream to its original position. createBufferedStream(InputStream) can be used to appropriately wrap an InputStream if required. See the included Javadocs for complete details.
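
The mark(int)/reset() pattern itself can be sketched with the standard java.io classes. The helper below is hypothetical and is not how XmlEncoding is implemented; it simply peeks at the first bytes of a stream and rewinds so that a later consumer sees the stream from the beginning.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

final class PeekDemo {

  /** Reads up to n bytes from a mark-supporting stream, then rewinds it. */
  static byte[] peek(InputStream in, int n) throws IOException {
    if (!in.markSupported()) {
      throw new IllegalArgumentException(
        "mark/reset not supported; wrap the stream in a BufferedInputStream");
    }
    in.mark(n);
    byte[] buf = new byte[n];
    int total = 0;
    try {
      int read;
      while (total < n && (read = in.read(buf, total, n - total)) != -1) {
        total += read;
      }
    } finally {
      in.reset(); // Rewind to the marked position, even if reading failed.
    }
    return Arrays.copyOf(buf, total);
  }

  public static void main(String[] args) throws IOException {
    byte[] xml = "<?xml version=\"1.0\"?><a/>".getBytes("US-ASCII");
    // ByteArrayInputStream already supports mark/reset; the wrapper shows
    // what would be needed for a stream that does not.
    InputStream in = new BufferedInputStream(new ByteArrayInputStream(xml));
    byte[] head = peek(in, 4);
    System.out.println(new String(head, "US-ASCII")); // <?xm
    System.out.println((char) in.read());             // < (the stream was rewound)
  }
}
```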
