Using Unicode Encoding

Applications that target the common language runtime use encoding to map character representations from the native character scheme (Unicode) to other schemes. Applications use decoding to map characters from non-native schemes (non-Unicode) to the native scheme. The System.Text namespace provides a number of classes that allow your applications to encode and decode characters. An introduction to these classes is provided in Character Encoding in the .NET Framework.

Unicode Transformation Formats

The Unicode Standard assigns a code point (a number) to each character in every supported script. A Unicode Transformation Format (UTF) is a format used to encode that code point. The Unicode Standard version 3.2 uses the UTFs and other encodings defined in the following table. For all the encodings, the internal .NET Framework strings are native UTF-16 strings. For more information, see the Unicode Standard at the Unicode home page.

  • Unicode UTF-32 encoding
    Represents Unicode characters as sequences of 32-bit integers. The application can use the UTF32Encoding class to convert characters to and from UTF-32 encoding.

    UTF-32 is used when applications need to avoid the surrogate behavior of UTF-16 and can afford the larger encoded size, because every code point takes four bytes. Note that a single "glyph" rendered on a display can still be encoded with more than one UTF-32 character, for example a base character followed by a combining mark. The supplementary characters that require surrogates in UTF-16 are currently much rarer than the Unicode BMP characters.

  • Unicode UTF-16 encoding
    Represents Unicode characters as sequences of 16-bit integers. Your application can use the UnicodeEncoding class to convert characters to and from UTF-16 encoding.

    UTF-16 is often used natively, as in the Microsoft .NET char type, the Windows WCHAR type, and other common types. Most common Unicode code points take only one UTF-16 code unit (2 bytes). Unicode supplementary characters, U+10000 and greater, require two UTF-16 surrogate code units (4 bytes).

  • Unicode UTF-8 encoding
    Represents Unicode characters as sequences of 8-bit bytes. Your application can use the UTF8Encoding class to convert characters to and from UTF-8 encoding.

    UTF-8 allows encoding using 8-bit data sizes and works well with many existing operating systems. For the ASCII range of characters, UTF-8 is identical to ASCII encoding and allows a broader set of characters. For CJK scripts, however, UTF-8 can require three bytes for each character, potentially causing larger data sizes than UTF-16. Note that sometimes the amount of ASCII data, such as HTML tags, justifies the increase in size for the CJK range.

  • Unicode UTF-7 encoding
    Represents Unicode characters as sequences of 7-bit ASCII characters. Your application can use the UTF7Encoding class to convert characters to and from UTF-7 encoding. Non-ASCII Unicode characters are represented by an escape sequence of ASCII characters.

    UTF-7 supports certain protocols, most often e-mail and newsgroup protocols. However, UTF-7 is not particularly secure or robust. In some situations, changing one bit can radically alter the interpretation of an entire UTF-7 string. In other situations, different UTF-7 strings can encode the same text. For sequences that include non-ASCII characters, UTF-7 is much less space-efficient than UTF-8, and encoding/decoding is slower. Consequently, your applications should generally prefer UTF-8 to UTF-7.

  • ASCII encoding
    Encodes the Latin alphabet as single 7-bit ASCII characters. Because this encoding only supports character values from U+0000 through U+007F, it is inadequate in most cases for internationalized applications. Your application can use the ASCIIEncoding class to convert characters to and from ASCII encoding. For examples of using this class in code, see Encoding Base Types.

  • ANSI/ISO encodings
    Used for non-Unicode encoding. The Encoding class provides support for a wide range of ANSI/ISO encodings.
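
The size trade-offs described in the list above can be checked directly. The following is a minimal sketch, not part of the original samples: the class name UtfSizeComparison and the helper ByteCount are hypothetical, and the three sample characters are chosen only for illustration. It uses the GetByteCount method of the Encoding class, which does not count a byte order mark.

```csharp
using System;
using System.Text;

// Hypothetical helper class, introduced only for this illustration.
public static class UtfSizeComparison
{
    // Returns the number of bytes needed to store text in the given encoding.
    public static int ByteCount(Encoding encoding, string text)
    {
        return encoding.GetByteCount(text);
    }

    public static void Main()
    {
        string[] samples =
        {
            "A",            // U+0041, ASCII range
            "\u3042",       // U+3042, a Hiragana letter in the BMP
            "\U0001F600"    // U+1F600, a supplementary character
        };

        foreach (string s in samples)
        {
            // s.Length counts UTF-16 code units, not code points.
            Console.WriteLine("chars={0} UTF-8={1} UTF-16={2} UTF-32={3}",
                s.Length,
                ByteCount(Encoding.UTF8, s),
                ByteCount(Encoding.Unicode, s),
                ByteCount(Encoding.UTF32, s));
        }
    }
}
```

The ASCII character takes 1, 2, and 4 bytes in UTF-8, UTF-16, and UTF-32 respectively. The Hiragana letter takes 3, 2, and 4 bytes. The supplementary character takes 4 bytes in all three encodings, and its string length is 2 because it is stored as a surrogate pair.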

Passing Binary Data in Strings

Random collections of numbers, either bytes or characters, do not make a valid string or valid Unicode. Your application cannot decode an arbitrary byte array as Unicode text, encode the text again, and expect to get the original bytes back. Certain characters and code point sequences are illegal in Unicode 5.0 and do not convert with any of the Unicode encodings. If your application must pass binary data in a string format, it should use Base64 or another format designed for that purpose.
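
The difference can be seen by round-tripping the same bytes two ways. The following is a minimal sketch under stated assumptions: the class name BinaryInStrings, its two helper methods, and the sample byte values are hypothetical and chosen only for illustration. It relies on the default behavior of Encoding.UTF8, which replaces invalid byte sequences instead of throwing.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper class, introduced only for this illustration.
public static class BinaryInStrings
{
    // Decodes the bytes as UTF-8 text, then encodes the text back to bytes.
    public static byte[] RoundTripUtf8(byte[] data)
    {
        string text = Encoding.UTF8.GetString(data);
        return Encoding.UTF8.GetBytes(text);
    }

    // Converts the bytes to a Base64 string, then back to bytes.
    public static byte[] RoundTripBase64(byte[] data)
    {
        string text = Convert.ToBase64String(data);
        return Convert.FromBase64String(text);
    }

    public static void Main()
    {
        // 0xFF and the sequence 0xC0 0x80 are not valid UTF-8.
        byte[] data = { 0x00, 0xFF, 0xC0, 0x80, 0x41 };

        // False: the invalid bytes were replaced during decoding.
        Console.WriteLine(data.SequenceEqual(RoundTripUtf8(data)));

        // True: Base64 is designed to carry arbitrary bytes.
        Console.WriteLine(data.SequenceEqual(RoundTripBase64(data)));
    }
}
```

The UTF-8 round trip silently changes the data, which is why binary content should go through Convert.ToBase64String and Convert.FromBase64String, or a comparable format, rather than through a text encoding.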

Using the Encoding Class

Your application can use the GetEncoding method to return an encoding object for a specified encoding. The application can use the GetBytes method to convert a Unicode string to its byte representation in a specified encoding.

The following code example uses the GetEncoding method to create a target encoding object for a specified code page. The GetBytes method is called on the target encoding object to convert a Unicode string to its byte representation in the target encoding. The byte representations of the strings in the specified code pages are displayed.

Imports System
Imports System.IO
Imports System.Globalization
Imports System.Text

Public Class Encoding_UnicodeToCP
   Public Shared Sub Main()
      ' Converts ASCII characters to bytes.
      ' Displays the string's byte representation in the 
      ' specified code page.
      ' Code page 1252 represents Latin characters.
      PrintCPBytes("Hello, World!", 1252)
      ' Code page 932 represents Japanese characters.
      PrintCPBytes("Hello, World!", 932)
      
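      ' Converts Japanese characters to bytes.
      ' Visual Basic string literals do not support \u escapes,
      ' so the characters are built with ChrW.
      Dim japanese As String = ChrW(&H307B) & "," & ChrW(&H308B) & "," & _
         ChrW(&H305A) & "," & ChrW(&H3042) & "," & ChrW(&H306D)
      PrintCPBytes(japanese, 1252)
      PrintCPBytes(japanese, 932)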
   End Sub

   Public Shared Sub PrintCPBytes(str As String, codePage As Integer)
      Dim targetEncoding As Encoding
      Dim encodedChars() As Byte      
      
      ' Gets the encoding for the specified code page.
      targetEncoding = Encoding.GetEncoding(codePage)
      
      ' Gets the byte representation of the specified string.
      encodedChars = targetEncoding.GetBytes(str)
      
      ' Prints the bytes.
      Console.WriteLine("Byte representation of '{0}' in CP '{1}':", _
         str, codePage)
      Dim i As Integer
      For i = 0 To encodedChars.Length - 1
         Console.WriteLine("Byte {0}: {1}", i, encodedChars(i))
      Next i
   End Sub
End Class

using System;
using System.IO;
using System.Globalization;
using System.Text;

public class Encoding_UnicodeToCP
{
   public static void Main()
   {
      // Converts ASCII characters to bytes.
      // Displays the string's byte representation in the 
      // specified code page.
      // Code page 1252 represents Latin characters.
      PrintCPBytes("Hello, World!",1252);
      // Code page 932 represents Japanese characters.
      PrintCPBytes("Hello, World!",932);

      // Converts Japanese characters to bytes.
      PrintCPBytes("\u307b,\u308b,\u305a,\u3042,\u306d",1252);
      PrintCPBytes("\u307b,\u308b,\u305a,\u3042,\u306d",932);
   }

   public static void PrintCPBytes(string str, int codePage)
   {
      Encoding targetEncoding;
      byte[] encodedChars;

      // Gets the encoding for the specified code page.
      targetEncoding = Encoding.GetEncoding(codePage);

      // Gets the byte representation of the specified string.
      encodedChars = targetEncoding.GetBytes(str);

      // Prints the bytes.
      Console.WriteLine("Byte representation of '{0}' in Code Page '{1}':",
         str, codePage);
      for (int i = 0; i < encodedChars.Length; i++)
         Console.WriteLine("Byte {0}: {1}", i, encodedChars[i]);
   }
}

Note

If you use this code in a console application, the specified Unicode text elements might not be displayed correctly. The support for Unicode characters in the console environment varies depending on the version of the Windows operating system that is running.
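
When a target code page cannot represent a character, as with the Japanese text and code page 1252 in the preceding example, GetBytes by default substitutes another character, typically a question mark, rather than reporting an error. The following is a minimal sketch of one way to detect that loss. The class name LossyEncodingCheck and the method CanEncode are hypothetical. The sketch uses the GetEncoding overload that accepts fallback objects, and it uses the encoding names "iso-8859-1" and "utf-8" on the assumption that they are available without registering additional code page providers.

```csharp
using System;
using System.Text;

// Hypothetical helper class, introduced only for this illustration.
public static class LossyEncodingCheck
{
    // Returns true if every character in text can be represented
    // in the encoding identified by encodingName.
    public static bool CanEncode(string text, string encodingName)
    {
        // Request an encoding that throws instead of substituting characters.
        Encoding strict = Encoding.GetEncoding(
            encodingName,
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            strict.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            return false;
        }
    }

    public static void Main()
    {
        Console.WriteLine(CanEncode("Hello, World!", "iso-8859-1"));  // True
        Console.WriteLine(CanEncode("\u307b,\u308b", "iso-8859-1"));  // False
        Console.WriteLine(CanEncode("\u307b,\u308b", "utf-8"));       // True
    }
}
```

A check like this can run before data is written, so that the application can choose a Unicode encoding when the requested code page would lose characters.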

You can use these methods in an ASP.NET application to determine the encoding to use for response characters. The application should set the value of the ContentEncoding property to the value returned by the appropriate method. The following code example illustrates how to set HttpResponse.ContentEncoding.

' Explicitly sets ContentEncoding to UTF-8.
Response.ContentEncoding = Encoding.UTF8

' Sets ContentEncoding using the name of an encoding.
Response.ContentEncoding = Encoding.GetEncoding(name)

' Sets ContentEncoding using a code page number.
Response.ContentEncoding = Encoding.GetEncoding(codepageNumber)

// Explicitly sets the encoding to UTF-8.
Response.ContentEncoding = Encoding.UTF8;

// Sets ContentEncoding using the name of an encoding.
Response.ContentEncoding = Encoding.GetEncoding(name);

// Sets ContentEncoding using a code page number.
Response.ContentEncoding = Encoding.GetEncoding(codepageNumber);

For most ASP.NET applications, you should match the HttpResponse.ContentEncoding property to the HttpRequest.ContentEncoding property to display text in the encoding that the user expects.

For more information about using encodings in ASP.NET, see the Multiple Encodings Sample in the Common Tasks QuickStart and the Setting Culture and Encoding Sample in the ASP.NET QuickStart.

See Also

Concepts

Character Encoding in the .NET Framework

Unicode in the .NET Framework