Writing Culture-Safe Managed Code

 

Anthony Moore
Development Lead, .NET Framework CLR

May 2004

Applies to:

   Microsoft® Visual Studio® .NET

Summary: For developers who write applications that are to run on multiple locales, it is important to consider the issues raised in this article so as to ensure their code functions correctly, even in different languages. (8 printed pages)

Contents

Locale vs. Culture?
Overview
Why Does It Work This Way?
The Turkish Example
Other Countries
Incorrect Code Example
Directly Affected APIs
Alternative Techniques
What About Native Code?
Conclusion

The Microsoft® .NET Framework is designed to allow developers to create applications that work in a variety of locales. This is most visible in methods like ToString and Parse, which work differently depending on which country and language you have chosen in Regional Options in control panel.

There is a much more subtle difference in behavior, however, that can occur in different locales, of which every developer should be aware, even if they are not deliberately writing an application to be shipped to different countries. Basic string operations like sorting and lower-casing can also differ in behavior, and this can cause applications to fail when particular Regional Options are selected.

Even if you don't care about whether your application works in different cultures, it is still important to be aware of these issues for security reasons.

Locale vs. Culture?

The .NET Framework uses the terminology "culture" to represent what might have traditionally been called the "locale". The .NET Framework has two concepts of the active culture.

Culture, which is indicated by the Thread.CurrentCulture property, corresponds by default to the selection in Regional Options in the control panel. This affects how numbers, dates, and times are formatted, and is also what determines which sorting and casing rules to use. This is the property we are concerned with here.

UICulture, which is indicated by the Thread.CurrentUICulture property, corresponds by default to the language of the operating system, or the selected language on a multi-language version of Microsoft® Windows®. This affects which resources get loaded, so it determines which strings and pictures the user sees. This is not the property we are concerned with in this case.

Overview

The most commonly used routines that have this variation in behavior, depending on the culture, are String.Compare, String.ToUpper and String.ToLower. There are also some other important routines that call these, and which pass on the default behavior.

Whether or not you want these operations to vary depending on the culture depends a lot on what you are doing. If you are displaying a sorted list of localized strings in a list box, then you probably do want a culturally-aware comparison. If you are comparing a string to see if it is a recognized XML tag, then you do not want any variation. Most routines that have culture-dependent behavior will give you the option of passing in the culture to use directly; passing in CultureInfo.InvariantCulture is the way to eliminate the variation.

Not all string-related methods have this variation. For example, String.Equals and String.CompareOrdinal do not have any dependency on the culture.
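As a minimal sketch of that point, the program below runs String.Equals and String.CompareOrdinal under two thread cultures. The class name, helper names, and the choice of "en-US" and "tr-TR" as sample cultures are assumptions made for illustration; the expectation is simply that both lines of output show the same results.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class OrdinalDemo
{
    // String.Equals compares character values and does not consult the thread culture.
    public static bool SameCharacters(string left, string right)
    {
        return string.Equals(left, right);
    }

    // Reduces the ordinal (character value) comparison to -1, 0, or 1.
    public static int OrdinalSign(string left, string right)
    {
        return Math.Sign(string.CompareOrdinal(left, right));
    }

    public static void Main()
    {
        // Sample cultures chosen for illustration; they require culture data to be installed.
        string[] cultureNames = { "en-US", "tr-TR" };
        foreach (string name in cultureNames)
        {
            Thread.CurrentThread.CurrentCulture = new CultureInfo(name);
            Console.WriteLine("{0}: Equals = {1}, CompareOrdinal sign = {2}",
                name, SameCharacters("file", "FILE"), OrdinalSign("file", "FILE"));
        }
    }
}
```

Because these routines look only at character values, "file" and "FILE" are reported as different under every culture; they are suitable when an exact match is wanted, not a case-insensitive one.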

A class of problems arises when culturally sensitive routines are used in places where they should work independently of the culture. In general, if you are working with file names, persistence formats, or symbolic information that will not be shown to the end user, you do not want the behavior to vary. The variation is generally desirable only if the text is entered by or displayed to an end user, and is normally localized or in the user's language. Because culturally dependent behavior is the default, it is very easy to write code that will fail when run in cultures with special sorting and casing rules.

A much smaller class of problems arises when you are using invariant behavior when you should be using culturally dependent behavior. This is less common, because many operations are culturally dependent by default.

Many developers have a vague notion that sorting can vary from one culture to another. It is much less commonly known that casing rules can vary; the majority of problems in code, therefore, will be related to casing, not sorting.

Why Does It Work This Way?

Given the number of problems that can arise here, the question may be asked why this default behavior was chosen. Depending on what sort of application you are writing, you will find this default behavior either a blessing or a curse.

If you are a developer writing an application for users of a single culture, this behavior will be of great benefit, particularly for a non-English application. Without your having to think about it, string comparisons will do the logical thing for your users. It was with this scenario in mind that the default behavior was chosen.

If you are writing an application that needs to run in multiple cultures, this behavior can be a headache, and it means that care must be taken when using any of these methods.

The Turkish Example

For most Latin alphabets, the letter i (Unicode 0069) is the lowercase version of I (Unicode 0049). The Turkish alphabet, however, has two versions of the letter I, one with a dot and one without. In Turkish, the character I (Unicode 0049) is considered the uppercase version of a different character ı (Unicode 0131), and i (Unicode 0069) is considered the lowercase version of yet another character İ (Unicode 0130).

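Confused? In summary, under Turkish casing rules the lowercase i (0069) pairs with the uppercase İ (0130), and the lowercase ı (0131) pairs with the uppercase I (0049). The result is that in this culture, case-insensitive string comparisons that would normally succeed will fail.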

Here is some code that demonstrates what this means:

Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
Console.WriteLine("Culture = {0}", Thread.CurrentThread.CurrentCulture.DisplayName);
Console.WriteLine("(file == FILE) = {0}", (string.Compare("file", "FILE", true) == 0));

Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
Console.WriteLine("Culture = {0}", Thread.CurrentThread.CurrentCulture.DisplayName);
Console.WriteLine("(file == FILE) = {0}", (string.Compare("file", "FILE", true) == 0));

Because the two cultures case the letter I differently, the result of the comparison changes when the thread culture is changed. This is the output:

Culture = English (United States)
(file == FILE) = True
Culture = Turkish (Turkey)
(file == FILE) = False
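The same difference can be observed directly in the casing routines. The sketch below is an illustration, not part of the original sample: the class and the ToCodePoints helper are hypothetical names, and the output is printed as hexadecimal code points so that it is readable on consoles that cannot display Turkish letters.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class TurkishCasingDemo
{
    // Formats each character as a four-digit hexadecimal code point.
    public static string ToCodePoints(string text)
    {
        StringBuilder builder = new StringBuilder();
        foreach (char c in text)
        {
            if (builder.Length > 0)
            {
                builder.Append(' ');
            }
            builder.Append(((int)c).ToString("X4", CultureInfo.InvariantCulture));
        }
        return builder.ToString();
    }

    public static void Main()
    {
        // Assumes Turkish culture data is available on the machine.
        CultureInfo turkish = new CultureInfo("tr-TR");
        CultureInfo invariant = CultureInfo.InvariantCulture;

        Console.WriteLine("invariant upper: {0}", ToCodePoints("file".ToUpper(invariant)));
        Console.WriteLine("Turkish upper:   {0}", ToCodePoints("file".ToUpper(turkish)));
        Console.WriteLine("invariant lower: {0}", ToCodePoints("FILE".ToLower(invariant)));
        Console.WriteLine("Turkish lower:   {0}", ToCodePoints("FILE".ToLower(turkish)));
    }
}
```

With Turkish culture data present, the second character should be 0049 for the invariant uppercase and 0130 for the Turkish uppercase, and 0069 versus 0131 for the lowercase forms. On a system configured without culture data, the Turkish lines may match the invariant ones or the CultureInfo constructor may fail.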

Other Countries

The Azeri culture also uses the Turkish I rule. Turkish and Azeri are the only cultures that have a single-character casing difference.

A further nine cultures (Croatian, Czech, Slovak, Danish, Norwegian (Bokmål), Norwegian (Nynorsk), Hungarian, Vietnamese and Spanish Traditional Sort) have two-letter pairs that do not compare as equal in mixed case. For example, in Danish 'aA' is not considered equal to 'AA' in a case-insensitive comparison. Another example is Vietnamese, where 'nG' does not compare as equal to 'NG'. In practice it is unusual to run into a case where an inappropriate culture-sensitive comparison of these pairs creates problems, because they are uncommon in fixed strings or identifiers.

A further 5 cultures (Finnish, Swedish, Technical Hungarian, Lithuanian and Estonian) have regular casing, but different sorting rules within the ASCII range. For example, in Swedish, 'w' sorts as if it were 'v'. Because sorting is much less common than equality comparison, this is even less likely to create problems.

Another 35 cultures have custom sort ordering or casing rules outside the ASCII range. Because these rules are generally confined to the alphabets used within those cultures, they are the least likely to cause problems.

Incorrect Code Example

Here is an example of incorrect code resulting from this problem:

static bool IsFileURI(String path) {
    return (String.Compare(path, 0, "file:", 0, 5, true) == 0);
}

This code will return the incorrect result when the thread culture is Turkish. When comparing against a constant string like this, the comparison should always be invariant. This is important for file names and Uniform Resource Identifiers in general, because Windows always treats these as invariant. The correct code is this:

static bool IsFileURI(String path) {
    return (String.Compare(path, 0, "file:", 0, 5, true, CultureInfo.InvariantCulture) == 0);
}
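To see the two variants side by side, the following sketch runs both under the Turkish culture. The class name, the method names, and the sample path are hypothetical; the culture-sensitive variant passes CultureInfo.CurrentCulture explicitly, which the article describes as equivalent to the shorter overload.

```csharp
using System;
using System.Globalization;
using System.Threading;

public static class FileUriCheck
{
    // Culture-sensitive: the answer depends on the current thread culture.
    public static bool IsFileUriCultureSensitive(string path)
    {
        return string.Compare(path, 0, "file:", 0, 5, true, CultureInfo.CurrentCulture) == 0;
    }

    // Invariant: the answer is the same regardless of the thread culture.
    public static bool IsFileUriInvariant(string path)
    {
        return string.Compare(path, 0, "file:", 0, 5, true, CultureInfo.InvariantCulture) == 0;
    }

    public static void Main()
    {
        string path = "FILE:///C:/temp/report.txt"; // hypothetical sample value
        CultureInfo original = Thread.CurrentThread.CurrentCulture;
        try
        {
            // Assumes Turkish culture data is available on the machine.
            Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
            Console.WriteLine("culture-sensitive: {0}", IsFileUriCultureSensitive(path));
            Console.WriteLine("invariant:         {0}", IsFileUriInvariant(path));
        }
        finally
        {
            Thread.CurrentThread.CurrentCulture = original;
        }
    }
}
```

The invariant line should print True on any configuration. Based on the Turkish output shown earlier in this article, the culture-sensitive line is expected to print False when Turkish casing rules are in effect.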

Directly Affected APIs

Care should be taken when using the following APIs. If writing an application intended for more than one culture, a recommended practice is to use versions of these APIs that make it clearer whether or not there will be cultural variation. This reduces the chances of errors, and is also good for long term maintenance because it shows someone has already deliberately made the choice.

There may appear to be a performance concern with these changes. For the most part, however, the suggested API usage here will not have a negative performance impact. This is because the recommended change is to pass in a CultureInfo to APIs that take it. The shorter forms of these APIs simply call the longer versions of themselves with CultureInfo.CurrentCulture anyway, so this can actually be faster because it saves an extra function call.

String.Compare

If you call String.Compare with ignoreCase set to true and do not already pass a CultureInfo, you should indicate whether the operation is invariant or culture-dependent by specifying either CultureInfo.InvariantCulture or CultureInfo.CurrentCulture. Unless both strings are user-localized text with no symbolic meaning, it is best to compare using InvariantCulture.

Another, less common error which can also occur with String.Compare is with sorting. If doing a less-than or a greater-than comparison on the result, even if this comparison is case-sensitive, it should be investigated whether the sorting order should be allowed to change with the culture. For example, strings in a list box should probably pick up the culture, while doing a binary search on a hard-coded list of strings should use the invariant.

Suppose you have a case-sensitive comparison, like this:

if (string.Compare(stringLeft, stringRight) == 0) {
    DoSomething();
}

If you want this comparison to be independent of the culture, change it to this:

if (string.Compare(stringLeft, stringRight, false, CultureInfo.InvariantCulture) == 0) {
    DoSomething();
}

If you do want it to be culturally-aware, it is recommended to change it to this equivalent form:

if (string.Compare(stringLeft, stringRight, false, CultureInfo.CurrentCulture) == 0) {
    DoSomething();
}

String.CompareTo

This makes a case-sensitive call through to String.Compare. It is recommended for clarity to replace this with a call to String.Compare, specifying either CultureInfo.InvariantCulture or CultureInfo.CurrentCulture.

If you have a call like this:

if (stringLeft.CompareTo(stringRight) == 0) {
    DoSomething();
}

Replace it with this to prevent variation:

if (string.Compare(stringLeft, stringRight, false, CultureInfo.InvariantCulture) == 0) {
    DoSomething();
}

String.ToUpper and String.ToLower

Any use of the parameterless overloads of these functions is suspect, because the results may vary. Frequently, strings are forced to a standard case for easier lookup later. These uses should mostly use the invariant culture, because the thread culture could change between the time the string is stored and the time it is looked up.

We recommend you always use the version that takes the culture explicitly. If you have a routine that does some sort of identifier lookup like this:

static object LookupKey(string key) {
    return internalHashtable[key.ToLower()];
}

Change it to this:

static object LookupKey(string key) {
    return internalHashtable[key.ToLower(CultureInfo.InvariantCulture)];
}
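The lookup is only reliable if the keys were stored with the same normalization. A minimal sketch of both halves follows; the KeyTable class, the Normalize helper, and the StoreKey method are hypothetical names added for illustration.

```csharp
using System;
using System.Collections;
using System.Globalization;

public static class KeyTable
{
    private static readonly Hashtable internalHashtable = new Hashtable();

    // Single place where keys are forced to a standard case, using the invariant culture.
    private static string Normalize(string key)
    {
        return key.ToLower(CultureInfo.InvariantCulture);
    }

    public static void StoreKey(string key, object value)
    {
        internalHashtable[Normalize(key)] = value;
    }

    public static object LookupKey(string key)
    {
        return internalHashtable[Normalize(key)];
    }

    public static void Main()
    {
        StoreKey("TITLE", "first entry");
        Console.WriteLine(LookupKey("title")); // prints "first entry"
    }
}
```

Routing both the store and the lookup through one helper makes it harder for a later change to normalize the two sides differently.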

Char.ToUpper and Char.ToLower

These have the same characteristics as the methods on String, although in this case Turkish and Azeri are the only affected cultures because they are the only ones with single-character casing differences. Again, we recommend using the version that takes the culture explicitly.

Secondarily Affected APIs

These APIs make use of the above APIs and could therefore pass on the problem to people using them.

CaseInsensitiveComparer and CaseInsensitiveHashCodeProvider

The default case-insensitive comparer uses the above APIs to do comparison. Any APIs using them are subject to the same results. If this creates problems, the solution is to initialize a CaseInsensitiveComparer with the InvariantCulture. The same applies to CaseInsensitiveHashCodeProvider.

For example, you might create a case-insensitive Hashtable like this:

internalHashtable = new Hashtable(CaseInsensitiveHashCodeProvider.Default, CaseInsensitiveComparer.Default);

To eliminate culture-dependent behavior, do this instead:

internalHashtable = new Hashtable(
    new CaseInsensitiveHashCodeProvider(CultureInfo.InvariantCulture),
    new CaseInsensitiveComparer(CultureInfo.InvariantCulture));

CollectionsUtil.CreateCaseInsensitiveHashtable

This API is a short cut for creating a case-insensitive Hashtable. It will use the current culture by default, however. To make the keys invariant, change to the Hashtable constructor given above.

SortedList

If using a SortedList with strings as the keys, the sorting and lookup can be affected by the culture. It is recommended you initialize them with custom comparers that compare using the InvariantCulture. Here is an invariant comparer class that can be used for this:

    internal class InvariantComparer : IComparer {
        private CompareInfo m_compareInfo;
        internal static readonly InvariantComparer Default = new InvariantComparer();
        
        internal InvariantComparer() {
            m_compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        }
  
        public int Compare(Object a, Object b) {
            String sa = a as String;
            String sb = b as String;
            if (sa != null && sb != null)
                return m_compareInfo.Compare(sa, sb);
            else
                return Comparer.Default.Compare(a,b);
        }
    }

In general, if you use a SortedList on strings without using some sort of comparer, changing the culture after the list has been populated can invalidate the list.
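As a usage sketch, the program below builds a SortedList with a comparer of the kind described above. So that the block compiles on its own, it declares a public comparer under the hypothetical name InvariantStringComparer; the keys and values are sample data.

```csharp
using System;
using System.Collections;
using System.Globalization;

// Stand-in for the invariant comparer class shown in the article.
public sealed class InvariantStringComparer : IComparer
{
    private readonly CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;

    public int Compare(object a, object b)
    {
        string sa = a as string;
        string sb = b as string;
        if (sa != null && sb != null)
        {
            return compareInfo.Compare(sa, sb);
        }
        // Fall back to the default comparer for non-string keys.
        return Comparer.Default.Compare(a, b);
    }
}

public static class SortedListDemo
{
    // The comparer is fixed at construction time, so later changes to the
    // thread culture do not affect the order or the lookups.
    public static SortedList CreateInvariantList()
    {
        return new SortedList(new InvariantStringComparer());
    }

    public static void Main()
    {
        SortedList list = CreateInvariantList();
        list.Add("title", 1);
        list.Add("index", 2);
        list.Add("file", 3);
        foreach (DictionaryEntry entry in list)
        {
            Console.WriteLine("{0} = {1}", entry.Key, entry.Value);
        }
    }
}
```

The entries are enumerated in invariant order (file, index, title) whatever the thread culture is when the list is populated or read.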

Array.Sort and ArrayList.Sort

By default, strings will sort using the current culture, so they are vulnerable to being invalid in cultures with different sort orders. To avoid this, use the invariant comparer class described under SortedList.

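Array.BinarySearch and ArrayList.BinarySearch

BinarySearch has the same issue when searching strings as Sort has when sorting them. The search is only valid if it uses the same ordering that was used to sort the data.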
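A minimal sketch of sorting and searching with one shared invariant ordering follows. It uses Comparer.DefaultInvariant from System.Collections, which this article does not mention, as a ready-made invariant comparer; the class and method names are hypothetical.

```csharp
using System;
using System.Collections;

public static class InvariantSearch
{
    // Returns a sorted copy so the caller's array is left untouched.
    public static string[] SortInvariant(string[] items)
    {
        string[] copy = (string[])items.Clone();
        Array.Sort(copy, Comparer.DefaultInvariant);
        return copy;
    }

    // The array must already be sorted with the same comparer.
    public static bool ContainsSorted(string[] sortedItems, string value)
    {
        return Array.BinarySearch(sortedItems, value, Comparer.DefaultInvariant) >= 0;
    }

    public static void Main()
    {
        string[] tags = SortInvariant(new string[] { "title", "file", "index" });
        Console.WriteLine(string.Join(", ", tags));        // prints "file, index, title"
        Console.WriteLine(ContainsSorted(tags, "index")); // prints "True"
        Console.WriteLine(ContainsSorted(tags, "body"));  // prints "False"
    }
}
```

Passing the same comparer to both calls is the important part; a custom invariant comparer class would serve equally well.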

Regular Expressions

System.Text.RegularExpressions uses the current culture when doing operations involving case. This might be desirable if searching human-readable text.

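If you don't want this variation, either convert the input and the pattern to a single case with the invariant culture beforehand and then do a case-sensitive search, or temporarily switch the thread to the invariant culture as in the SwitchExample routine under Alternative Techniques.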
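Another possibility, not covered in this article, is the RegexOptions.CultureInvariant flag, which asks the regular expression engine to ignore cultural differences when matching. The sketch below combines it with RegexOptions.IgnoreCase; the class name, method name, and sample inputs are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvariantRegex
{
    // Case-insensitive match that does not depend on the thread culture.
    public static bool StartsWithFileScheme(string text)
    {
        return Regex.IsMatch(text, "^file:",
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithFileScheme("FILE:///C:/temp/report.txt")); // prints "True"
        Console.WriteLine(StartsWithFileScheme("http://example.com/"));        // prints "False"
    }
}
```

This keeps the choice local to the one expression, so the thread culture does not have to be changed.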

Alternative Techniques

If this change in behavior is troublesome for a particular application, you can just switch the thread culture to invariant at startup:

static void Main(string[] args) {
    Thread.CurrentThread.CurrentCulture = CultureInfo.InvariantCulture;
    . . .
}

If you have an isolated set of operations that should ignore the culture, try this:

static void SwitchExample() {
    CultureInfo originalCulture = Thread.CurrentThread.CurrentCulture;
    Thread.CurrentThread.CurrentCulture = CultureInfo.InvariantCulture;
    try {
        DoSomething();
    }
    finally {
        Thread.CurrentThread.CurrentCulture = originalCulture;                
    }
}

These options are not possible in semi-trusted code such as a downloaded control, because changing the thread culture requires thread control permission.

What About Native Code?

It is possible to introduce this sort of error in native code, as well. This is much less common, for the following reasons:

  1. Many casing and comparison routines used in native code are ordinal routines that are part of runtime libraries and do not exhibit any variation.
  2. There are some Win32 APIs that do exhibit variation, such as CompareString; however, they force you to pass in the culture to use as a parameter, so it is harder to make the same mistake.
  3. The .NET Framework is the first API library to handle Turkish I comparisons in a linguistically correct way. Previously, only the two-letter casing variations and the sorting variations were different, and these are much harder to hit in real code.

Conclusion

Developers will find this varying string behavior either a blessing or a curse, depending on the sort of application being written. If writing an application to run on multiple locales, even if there is no localization or globalization logic in the application, developers must be aware of these issues to ensure correctly functioning code.