Character Encoding and MSXML

 

Character encodings provide a map between a series of numbers and the characters people expect to see when they enter text into computers. The capital letter "A", for example, is represented by the decimal number 65 (41 in hexadecimal) in a variety of character encodings, including the ASCII text familiar to many Western programmers and Windows Code Page-1252, the default encoding used by most Microsoft® Windows® Western systems.
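
As a small illustration of this mapping, the following Python sketch (Python is used here only for demonstration and is not part of MSXML) encodes the letter "A" in two encodings and inspects the resulting byte. The variable names are illustrative assumptions.

```python
# Encode the capital letter "A" and look at the numeric value of the byte.
letter = "A"
ascii_bytes = letter.encode("ascii")     # 7-bit ASCII
cp1252_bytes = letter.encode("cp1252")   # Windows Code Page 1252

print(ascii_bytes[0], hex(ascii_bytes[0]))  # decimal and hexadecimal value
print(cp1252_bytes[0])                      # same value in CP1252
```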

Character encodings are not fonts. Fonts supply the graphic representations (glyphs) for the characters that an encoding identifies. Microsoft Word, for example, includes a version of Arial (Arial Unicode MS) with tens of thousands of characters.

All XML processors are required to understand two transformations of the Unicode character encoding, UTF-8 and UTF-16. Microsoft XML Core Services (MSXML) supports more encodings, but all text in XML documents is treated internally as the Unicode UCS-2 character encoding.
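
The sketch below suggests what this requirement means in practice. It uses Python's standard xml.etree.ElementTree parser as a stand-in for an XML processor (an assumption made for illustration; it is not MSXML). The same document is serialized as UTF-8 and as UTF-16, the bytes differ, and the parser returns the same text for both.

```python
import xml.etree.ElementTree as ET

# One document template, serialized in the two encodings every XML
# processor must understand.
template = '<?xml version="1.0" encoding="{enc}"?><word>café</word>'
utf8_doc = template.format(enc="UTF-8").encode("utf-8")
utf16_doc = template.format(enc="UTF-16").encode("utf-16")  # includes a byte order mark

# The byte sequences differ in content and length...
print(len(utf8_doc), len(utf16_doc))

# ...but the parsed text is the same string of Unicode characters.
utf8_text = ET.fromstring(utf8_doc).text
utf16_text = ET.fromstring(utf16_doc).text
print(utf8_text == utf16_text)
```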

Even platforms that represent the same set of Western characters can use different bytes for the same character, so the same byte value can display as different characters, as shown in the following table.

Byte    Windows (CP1252)    Macintosh (MacRoman)
140     Œ                   å
229     å                   Â
231     ç                   Á
232     è                   Ë
233     é                   È
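
The rows of the table can be reproduced with a short Python sketch that decodes each byte value under both encodings. The codec names "cp1252" and "mac_roman" are Python's names for these encodings, and the `rows` list is an illustrative assumption.

```python
# Decode the same single byte under two different character encodings.
byte_values = (140, 229, 231, 232, 233)

rows = []
for value in byte_values:
    raw = bytes([value])
    rows.append((value, raw.decode("cp1252"), raw.decode("mac_roman")))

for value, windows_char, mac_char in rows:
    print(value, windows_char, mac_char)
```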

Parsers can read documents written in ISO-8859-1, Big-5, or Shift-JIS, but the processing rules treat everything as Unicode. MSXML and other XML parsers perform the conversion while loading XML documents.
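
A minimal sketch of this load-time conversion, again using Python's xml.etree.ElementTree as a stand-in parser rather than MSXML: the document is stored as ISO-8859-1 bytes, and the parser hands back Unicode text.

```python
import xml.etree.ElementTree as ET

# The document on disk or on the wire: ISO-8859-1 bytes, with a declaration.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>café</name>'
).encode("iso-8859-1")

# "é" is the single byte 0xE9 in ISO-8859-1.
print(b"\xe9" in latin1_doc)

# After parsing, the application sees a Unicode string, not the original bytes.
root = ET.fromstring(latin1_doc)
print(root.text)
```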

There are some limitations to auto-detecting character encodings. For example, 7-bit ASCII text is valid UTF-8, but UTF-8 is more than 7-bit ASCII text, and 8-bit text in other encodings is generally not valid UTF-8. For reliable processing, XML documents that use character encodings other than UTF-8 or UTF-16 must include an encoding declaration in the XML declaration. This makes it possible for a parser to read the characters correctly or report errors when it cannot process an encoding.
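
The following sketch illustrates both points under the same assumption as before (Python's standard parser standing in for an XML processor). ASCII bytes decode cleanly as UTF-8, ISO-8859-1 bytes do not, and a document that omits the encoding declaration is rejected when its bytes are not valid UTF-8.

```python
import xml.etree.ElementTree as ET

# Pure ASCII bytes are also valid UTF-8.
ascii_bytes = "cafe".encode("ascii")
ascii_as_utf8 = ascii_bytes.decode("utf-8")

# ISO-8859-1 bytes with an accented letter are not valid UTF-8.
latin1_bytes = "café".encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# Without an encoding declaration the parser assumes UTF-8 and reports an error.
undeclared_doc = b"<name>caf\xe9</name>"
try:
    ET.fromstring(undeclared_doc)
    parse_failed = False
except ET.ParseError:
    parse_failed = True

print(ascii_as_utf8, latin1_is_valid_utf8, parse_failed)
```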

Because the XML declaration is written in basic ASCII text, parsers can read its contents even if the document is in a very different encoding. The encoding declaration significantly increases the likelihood that documents in encodings other than UTF-8 and UTF-16 will be interpreted correctly.
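
To suggest how a declaration can be read before the document encoding is known, the sketch below scans the first bytes of a document for the encoding name. The function `sniff_declared_encoding` is hypothetical and is not how MSXML is implemented; it also assumes an ASCII-compatible encoding, so it does not cover UTF-16 or EBCDIC documents, which real parsers detect by other means.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the encoding name. Works only when the declaration bytes are ASCII.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)

def sniff_declared_encoding(data):
    """Return the declared encoding name, or None if none is declared."""
    match = _DECL.match(data[:200])
    return match.group(1).decode("ascii") if match else None

# A Shift_JIS document: the declaration is plain ASCII, the content is not.
sjis_doc = '<?xml version="1.0" encoding="Shift_JIS"?><t>日本語</t>'.encode("shift_jis")
declared = sniff_declared_encoding(sjis_doc)
decoded = sjis_doc.decode(declared)
print(declared)
```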

Some transports, such as HTTP and e-mail protocols, also carry information about character encodings. Microsoft Internet Explorer uses that information in document processing, but it isn't available if, for example, you load an XML document from a local hard drive or a file server.
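
For HTTP and e-mail, this information typically travels in the charset parameter of the Content-Type header. The sketch below reads that parameter with Python's standard email package; the helper name `charset_from_content_type` and the sample header values are assumptions made for illustration.

```python
from email.message import Message

def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()

# A response that states its encoding, and one that does not
# (comparable to a file loaded from a local drive with no headers at all).
with_charset = charset_from_content_type("text/xml; charset=ISO-8859-1")
without_charset = charset_from_content_type("text/xml")
print(with_charset, without_charset)
```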

XML Data Islands and Character Encodings

XML data islands inside HTML documents receive different encoding handling, depending on whether the content is stored directly within the <xml> element or referenced by a SRC attribute.

XML data islands that contain the XML directly in the HTML document use the encoding of the surrounding HTML document. Because the entire document is assumed to use the same encoding, this approach significantly simplifies parsing.

XML data islands that reference the content through a SRC attribute are less constrained. The XML documents they reference can have an XML declaration containing a character encoding, and MSXML will use that character encoding. The contents of the data island will be presented as Unicode, but the parser will handle the conversion automatically.

Other Resources

For more information, search for "How to Encode XML Data" on msdn.microsoft.com.

See Also

Enforcing Character Encoding with DOM