Custom Case Mappings and Sorting Rules

Article
10/17/2014

Case mappings, alphabetical order, and conventions for sequencing items vary from culture to culture. You should be aware of these variations and understand that they can cause the results of string operations to vary depending on culture.

The unique case-mapping rules for the Turkish alphabet illustrate how uppercase and lowercase mappings differ from language to language even when they use most of the same letters. In most Latin alphabets, the character i (Unicode 0069) is the lowercase version of the character I (Unicode 0049). However, the Turkish alphabet has two versions of the character I: one with a dot and one without a dot. In Turkish, the character I (Unicode 0049) is considered the uppercase version of a different character ı (Unicode 0131). The character i (Unicode 0069) is considered the lowercase version of yet another character İ (Unicode 0130). As a result, a case-insensitive string comparison of the characters i (Unicode 0069) and I (Unicode 0049) that succeeds for most cultures fails for the culture "tr-TR" (Turkish in Turkey).
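As a minimal sketch of the mappings described above, the following C# program uppercases i and lowercases I under two cultures and prints the resulting code points. The class and method names (TurkishCasingSketch, UpperI, LowerI) are illustrative and not part of the original sample, and the sketch assumes the runtime has culture data for "tr-TR" available. If the rules described above apply, "tr-TR" should report 0130 and 0131 where "en-US" reports 0049 and 0069.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class for illustration; not part of the original article.
public static class TurkishCasingSketch
{
    // Uppercases the letter i (Unicode 0069) using the casing rules of the named culture.
    public static string UpperI(string cultureName)
    {
        return "i".ToUpper(new CultureInfo(cultureName));
    }

    // Lowercases the letter I (Unicode 0049) using the casing rules of the named culture.
    public static string LowerI(string cultureName)
    {
        return "I".ToLower(new CultureInfo(cultureName));
    }

    public static void Main()
    {
        foreach (string name in new[] { "en-US", "tr-TR" })
        {
            string upper = UpperI(name);
            string lower = LowerI(name);

            // Print code points rather than glyphs so the result does not
            // depend on the console's encoding.
            Console.WriteLine("{0}: upper(i) = {1:X4}, lower(I) = {2:X4}",
                name, (int)upper[0], (int)lower[0]);
        }
    }
}
```

Passing the culture explicitly to ToUpper and ToLower keeps the sketch independent of the current thread's culture, which makes the per-culture difference easier to see in isolation.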

The following code example demonstrates how the result of a case-insensitive String.Compare operation performed on the strings "FILE" and "file" differs depending on culture. If the Thread.CurrentThread.CurrentCulture property is set to "en-US" (English in the United States), String.Compare returns 0 and the example prints True. If CurrentCulture is set to "tr-TR" (Turkish in Turkey), String.Compare returns a nonzero value and the example prints False.

[Visual Basic]
Imports System
Imports System.Globalization
Imports System.Threading

Public Class TurkishISample
    Public Shared Sub Main()
        ' Set the CurrentCulture property to English in the U.S.
        Thread.CurrentThread.CurrentCulture = New CultureInfo("en-US")
        Console.WriteLine("Culture = {0}", _
            Thread.CurrentThread.CurrentCulture.DisplayName)
        Console.WriteLine("(file == FILE) = {0}", String.Compare("file", _
            "FILE", True) = 0)
        
        ' Set the CurrentCulture property to Turkish in Turkey.
        Thread.CurrentThread.CurrentCulture = New CultureInfo("tr-TR")
        Console.WriteLine("Culture = {0}", _
            Thread.CurrentThread.CurrentCulture.DisplayName)
        Console.WriteLine("(file == FILE) = {0}", String.Compare("file", _
            "FILE", True) = 0)
    End Sub
End Class
[C#]
using System;
using System.Globalization;
using System.Threading;

public class TurkishISample
{
    public static void Main()
    {
        // Set the CurrentCulture property to English in the U.S.
        Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
        Console.WriteLine("Culture = {0}",
            Thread.CurrentThread.CurrentCulture.DisplayName);
        // String.Compare returns 0 when the strings are considered equal.
        Console.WriteLine("(file == FILE) = {0}",
            string.Compare("file", "FILE", true) == 0);

        // Set the CurrentCulture property to Turkish in Turkey.
        Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
        Console.WriteLine("Culture = {0}",
            Thread.CurrentThread.CurrentCulture.DisplayName);
        Console.WriteLine("(file == FILE) = {0}",
            string.Compare("file", "FILE", true) == 0);
    }
}

The following output to the console illustrates how the results vary by culture, because the case-insensitive comparison of i and I evaluates to true for the "en-US" culture and false for the "tr-TR" culture.

Culture = English (United States)
(file == FILE) = True
Culture = Turkish (Turkey)
(file == FILE) = False

Note: The culture "az-AZ-Latn" (Azeri (Latin) in Azerbaijan), named "az-Latn-AZ" in later versions of the .NET Framework, also uses this case-mapping rule.

Additional Custom Case Mappings and Sorting Rules

In addition to the unique case mappings used in the Turkish and Azeri alphabets, there are other custom case mappings and sorting rules that you should be aware of when performing string operations. Within the ASCII range (Unicode 0000 through Unicode 007F), the alphabets of nine cultures contain two-letter pairs for which a case-insensitive comparison, such as String.Compare, does not evaluate to equal when the case is mixed. These cultures are "hr-HR" (Croatian in Croatia), "cs-CZ" (Czech in the Czech Republic), "sk-SK" (Slovak in Slovakia), "da-DK" (Danish in Denmark), "nb-NO" (Norwegian (Bokmål) in Norway), "nn-NO" (Norwegian (Nynorsk) in Norway), "hu-HU" (Hungarian in Hungary), "vi-VN" (Vietnamese in Vietnam), and "es-ES" (Spanish in Spain) using the traditional sort order. For example, in the Danish language, the two-letter pairs aA and AA are not considered equal in a case-insensitive comparison. In the Vietnamese alphabet, the two-letter pairs nG and NG are not considered equal in a case-insensitive comparison. Although you should be aware that these rules exist, in practice it is unusual for a culture-sensitive comparison of these pairs to create problems, because the pairs are uncommon in fixed strings or identifiers.
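The two-letter pairs mentioned above can be probed with a short C# sketch like the one below. The class and method names (DigraphCompareSketch, EqualIgnoreCase) are hypothetical, and the sketch assumes the runtime provides culture data for "da-DK" and "vi-VN". Because the result for those cultures depends on the collation data of the .NET version and platform in use, the program prints what it finds rather than assuming an outcome.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class for illustration; not part of the original article.
public static class DigraphCompareSketch
{
    // Returns true if a and b compare as equal under the named culture
    // when case differences are ignored.
    public static bool EqualIgnoreCase(string a, string b, string cultureName)
    {
        CultureInfo culture = new CultureInfo(cultureName);
        return string.Compare(a, b, culture, CompareOptions.IgnoreCase) == 0;
    }

    public static void Main()
    {
        string[] cultureNames = { "en-US", "da-DK", "vi-VN" };

        foreach (string name in cultureNames)
        {
            // Compare the mixed-case pairs from the text under each culture.
            Console.WriteLine("{0}: (aA == AA) = {1}, (nG == NG) = {2}",
                name,
                EqualIgnoreCase("aA", "AA", name),
                EqualIgnoreCase("nG", "NG", name));
        }
    }
}
```

The "en-US" row serves as a baseline: both pairs are expected to compare as equal there, so any False in the other rows points to a culture-specific rule.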

The alphabets of six cultures within the ASCII range have standard casing rules but different sorting rules. These cultures are "et-EE" (Estonian in Estonia), "fi-FI" (Finnish in Finland), "hu-HU" (Hungarian in Hungary) using the technical sort order, "lt-LT" (Lithuanian in Lithuania), "sv-FI" (Swedish in Finland), and "sv-SE" (Swedish in Sweden). For example, in the Swedish alphabet, the letter w sorts as if it were the letter v. In application code, sorting operations tend to be used less frequently than equality comparisons and are therefore less likely to create problems.
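To see how sort order can vary by culture, a sketch along the following lines sorts the same words with two culture-specific comparers. The names CultureSortSketch and SortFor, and the sample words, are assumptions made for illustration. Whether w actually interleaves with v under "sv-SE" depends on the collation data that the runtime ships with, so the sketch prints the resulting order instead of asserting one.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class for illustration; not part of the original article.
public static class CultureSortSketch
{
    // Returns a copy of words sorted with the case-sensitive comparison
    // rules of the named culture. The input array is left unchanged.
    public static string[] SortFor(string cultureName, string[] words)
    {
        string[] copy = (string[])words.Clone();
        StringComparer comparer =
            StringComparer.Create(new CultureInfo(cultureName), false);
        Array.Sort(copy, comparer);
        return copy;
    }

    public static void Main()
    {
        // Sample words chosen so that treating w like v would change the order.
        string[] words = { "wb", "vc", "va" };

        foreach (string name in new[] { "en-US", "sv-SE" })
        {
            Console.WriteLine("{0}: {1}", name,
                string.Join(", ", SortFor(name, words)));
        }
    }
}
```

Under "en-US", every word starting with v sorts before the word starting with w. If the runtime applies the Swedish rule described above, "wb" would instead land between "va" and "vc".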

An additional 35 cultures have custom case mappings and sorting rules outside of the ASCII range. These rules are generally confined to the alphabets used by those specific cultures. Therefore, the likelihood of them causing problems is low.

For details about the custom case mappings and sorting rules that apply to specific cultures, see The Unicode Standard at www.unicode.org.
