Encode and Decode XML Element and Attribute Names

Names sometimes contain characters that are invalid in XML names, such as spaces or half-width Katakana. When such names must be mapped to XML names without reference to a schema, the invalid characters are translated into an escaped numeric entity encoding.

Only two non-alphabetic characters, the colon (:) and the underscore (_), are allowed at the beginning of an XML name. Because the colon is already reserved for namespaces, the XmlConvert class uses the underscore as the escape character.

The escape rules at time of encoding are:

  • Any UCS-2 character that is not a valid XML name character according to the World Wide Web Consortium (W3C) Extensible Markup Language (XML) 1.0 (fourth edition) recommendation is escaped as _xHHHH_, where HHHH stands for the four-digit hexadecimal UCS-2 code for the character in the most significant bit first order. For example, the name Order Details is encoded as Order_x0020_Details.

  • The XML name encoding does not support UTF-16 surrogates. Instead, the characters of the UCS-4 additions in the Unicode range U+00010000 to U+0010FFFF require the full 4-byte encoding. They are encoded as _xHHHHHHHH_, where HHHHHHHH stands for the eight-digit hexadecimal UCS-4 encoding of the character. If the underlying system stores such a character as two UCS-2 values, where the first value x is the high surrogate code in the range U+D800 to U+DBFF and the second value y is the low surrogate code in the range U+DC00 to U+DFFF, the two values are translated to the UCS-4 code by:

          (x - 0xD800) * 0x400 + (y - 0xDC00) + 0x10000

    For example, U+D800 U+DC00 is mapped to _x00010000_, and U+DBFF U+DFFF is mapped to _x0010FFFF_. If two consecutive UCS-2 values do not form a valid UTF-16 surrogate pair, they are encoded according to the first rule.

  • The underscore character does not need to be escaped unless it is followed by a character sequence that together with the underscore can be misinterpreted as an escape sequence when decoding the name. For example, Order_Details is not encoded, but Order_x0020_ is encoded as Order_x005f_x0020_.

  • Short forms are not allowed. For example, _x20_ and __ are not generated.
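
The escape rules above can be sketched in C#. This is a minimal illustration, not the XmlConvert implementation: the class EscapeRulesSketch and the helpers ToUcs4 and ToEscape are hypothetical names introduced here to work through the surrogate formula, while XmlConvert.EncodeName, XmlConvert.DecodeName, and char.ConvertToUtf32 are existing framework methods.

```csharp
using System;
using System.Xml;

public static class EscapeRulesSketch
{
    // Hypothetical helper (not part of XmlConvert): combines a UTF-16
    // surrogate pair into a UCS-4 code using the formula from the rules.
    public static int ToUcs4(char high, char low)
    {
        return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
    }

    // Hypothetical helper: formats a UCS-4 code in the 8-digit escape form.
    public static string ToEscape(int codePoint)
    {
        return "_x" + codePoint.ToString("X8") + "_";
    }

    public static void Main()
    {
        // Boundary cases named in the rules.
        Console.WriteLine(ToEscape(ToUcs4('\uD800', '\uDC00'))); // _x00010000_
        Console.WriteLine(ToEscape(ToUcs4('\uDBFF', '\uDFFF'))); // _x0010FFFF_

        // Cross-check the hand-written formula against the framework.
        Console.WriteLine(ToUcs4('\uD800', '\uDC00')
                          == char.ConvertToUtf32('\uD800', '\uDC00')); // True

        // A space is not a valid name character, so it is escaped.
        Console.WriteLine(XmlConvert.EncodeName("Order Details")); // Order_x0020_Details

        // A plain underscore is left alone.
        Console.WriteLine(XmlConvert.EncodeName("Order_Details")); // Order_Details

        // An underscore that starts a look-alike escape sequence is itself
        // escaped, so that decoding restores the original name.
        string lookAlike = XmlConvert.EncodeName("Order_x0020_");
        Console.WriteLine(lookAlike);
        Console.WriteLine(XmlConvert.DecodeName(lookAlike) == "Order_x0020_");
    }
}
```

The helpers only reproduce the arithmetic. In practice the encoding methods of XmlConvert apply all of these rules for you.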

The following table displays the methods in the XmlConvert class, along with the descriptions of how each method performs encoding and decoding.

| Method | Description |
| --- | --- |
| EncodeName | Takes in the name to be encoded and returns the encoded name. Any invalid character is replaced by an escape string. EncodeName allows colons in any position, which means that the name may still be invalid according to the W3C Namespaces in XML Recommendation. |
| EncodeNmToken | Takes in the name to be encoded and returns the encoded name. |
| EncodeLocalName | Takes in the name to be encoded and returns the encoded name. This method is the same as EncodeName except that it also encodes the colon character, guaranteeing that the result can be used as the LocalName part of a namespace-qualified name. |
| DecodeName | Reverses the transformation for all the encoding methods. |
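
As a rough illustration of how these methods differ, the following sketch encodes a name containing a colon with both EncodeName and EncodeLocalName, then decodes an encoded name again. The class name EncodeMethodsSketch and the sample value "ns:item" are assumptions made for this example. Only the XmlConvert calls come from the framework.

```csharp
using System;
using System.Xml;

public static class EncodeMethodsSketch
{
    public static void Main()
    {
        string qualified = "ns:item"; // illustrative value, not from the original text

        // EncodeName allows colons in any position, so the name is unchanged.
        Console.WriteLine(XmlConvert.EncodeName(qualified)); // ns:item

        // EncodeLocalName also escapes the colon, so the result contains none.
        string local = XmlConvert.EncodeLocalName(qualified);
        Console.WriteLine(local);
        Console.WriteLine(local.Contains(":")); // False

        // DecodeName reverses the transformation of either method.
        Console.WriteLine(XmlConvert.DecodeName(local) == qualified); // True
        Console.WriteLine(XmlConvert.DecodeName("Order_x0020_Details")); // Order Details
    }
}
```

EncodeLocalName is the safer choice when the result will be combined with a separately chosen prefix, because a stray colon in the input cannot then be mistaken for a namespace separator.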

Encoding Sample

The following encoding example shows where the XmlConvert class changes UTF-16 surrogates to UCS-4 encoding:

' Visual Basic
Option Explicit
Option Strict

Imports System
Imports System.Xml
Imports Microsoft.VisualBasic

Class CMain
   Public Shared Sub Main()
      ' U+D800 U+DC00 is the UTF-16 surrogate pair for U+00010000.
      Dim surrogateChar As String = ChrW(&HD800) & ChrW(&HDC00)
      Console.WriteLine("EncodeLocalName: " & _
                        XmlConvert.EncodeLocalName(surrogateChar))
      Console.WriteLine("EncodeName: " & _
                        XmlConvert.EncodeName(surrogateChar))
      Console.WriteLine("EncodeNmToken: " & _
                        XmlConvert.EncodeNmToken(surrogateChar))
   End Sub
End Class

// C#
using System;
using System.Xml;

class CMain
{
    public static void Main(string[] args)
    {
        // U+D800 U+DC00 is the UTF-16 surrogate pair for U+00010000.
        string surrogateChar = "\uD800\uDC00";
        Console.WriteLine("EncodeLocalName: " +
                          XmlConvert.EncodeLocalName(surrogateChar));
        Console.WriteLine("EncodeName: " +
                          XmlConvert.EncodeName(surrogateChar));
        Console.WriteLine("EncodeNmToken: " +
                          XmlConvert.EncodeNmToken(surrogateChar));
    }
}

Output

EncodeLocalName: _x00010000_
EncodeName: _x00010000_
EncodeNmToken: _x00010000_
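
Going the other way, the escaped form can be passed to DecodeName to recover the original surrogate pair. The sketch below is an assumption-labeled illustration: the class DecodeRoundTripSketch and its RoundTrips helper are hypothetical names, while XmlConvert.EncodeName, XmlConvert.DecodeName, and char.ConvertToUtf32 are existing framework methods.

```csharp
using System;
using System.Xml;

public static class DecodeRoundTripSketch
{
    // Hypothetical helper: encodes a name, decodes it again, and reports
    // whether the original string was recovered.
    public static bool RoundTrips(string original)
    {
        string encoded = XmlConvert.EncodeName(original);
        return XmlConvert.DecodeName(encoded) == original;
    }

    public static void Main()
    {
        string surrogateChar = "\uD800\uDC00";

        // Decode the escaped form produced by the encoding sample.
        string decoded = XmlConvert.DecodeName(XmlConvert.EncodeName(surrogateChar));

        // The decoded string should again hold one surrogate pair (two
        // UTF-16 code units) representing the code point U+10000.
        Console.WriteLine(decoded.Length);
        Console.WriteLine(char.ConvertToUtf32(decoded, 0).ToString("X8"));

        Console.WriteLine(RoundTrips(surrogateChar));
        Console.WriteLine(RoundTrips("Order Details"));
    }
}
```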

See Also

Reference

XmlConvert

Concepts

Character Encoding of XML Names and Conversion of XML Data Types

Conversion of XML Data Types