New Recommendations for Using Strings in Microsoft .NET 2.0

 

Dave Fetterman
Microsoft Corporation

May 2005

Applies to:
   Microsoft .NET 2.0
   Microsoft .NET Framework

Summary: Code owners previously using the InvariantCulture for string comparison, casing, and sorting should strongly consider using a new set of String overloads in Microsoft .NET 2.0. Specifically, data that is designed to be culture-agnostic and linguistically irrelevant should begin specifying overloads using either the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase members of the new StringComparison enumeration. These enforce a byte-by-byte comparison similar to strcmp that not only avoids bugs from linguistic interpretation of essentially symbolic strings, but provides better performance. (15 printed pages)

Contents

Introduction
Recommendations for String Use
Overview and Rationale for New Types
Choosing a StringComparison Member for Your Method Call
The Motivation: The Turkish-I Problem
Common String Comparison Methods in the Framework
A List of New Whidbey APIs for Correct String Comparison
Notes for Native Code
What About the Earlier Recommendation for Invariant Culture?
Conclusion

Introduction

The Microsoft .NET Framework enables developers to create software fully ready for localization and internationalization, through comprehensive machinery under the hood designed for, among other tasks, correctly interpreting strings given the current locale. This aids in quickly creating and using solutions designed for a broad range of cultures. But when culturally irrelevant string data is interpreted by these methods, code can exhibit subtle bugs on untested cultures and run more slowly than necessary.

Strings are the canonical example of a culturally aware type, and changing the culture under which they are interpreted sometimes causes unexpected results. The same strings can sort, case, and compare differently under different Thread.CurrentCulture settings. Sometimes strings should be allowed to vary according to the user's culture (for display data), but for most strings internal to an application, such as XML tags, user names, file paths, and system objects, the interpretation should be consistent throughout all cultures. Additionally, when strings represent such symbolic information, comparison operations should be interpreted entirely non-linguistically.

Recommendations for String Use

When developing with the 2.0 version of the .NET Framework, keeping a few very simple recommendations in mind will resolve most confusion about using strings.

  • DO: Use StringComparison.Ordinal or OrdinalIgnoreCase for comparisons as your safe default for culture-agnostic string matching.
  • DO: Use StringComparison.Ordinal and OrdinalIgnoreCase comparisons for increased speed.
  • DO: Use StringComparison.CurrentCulture-based string operations when displaying the output to the user.
  • DO: Switch current use of string operations based on the invariant culture to use the non-linguistic StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase when the comparison is linguistically irrelevant (symbolic, for example).
  • DO: Use ToUpperInvariant rather than ToLowerInvariant when normalizing strings for comparison.
  • DON'T: Use overloads for string operations that don't explicitly or implicitly specify the string comparison mechanism.
  • DON'T: Use StringComparison.InvariantCulture-based string operations in most cases; one of the few exceptions would be persisting linguistically meaningful but culturally-agnostic data.

Many new and recommended String method overloads consume a StringComparison parameter, making these choices explicit:

Example 1:

String protocol = MyGetUrlProtocol(); 

if (String.Compare(protocol, "ftp", StringComparison.Ordinal) != 0)
{
   throw new InvalidOperationException();
}

Example 2:

String filename = args[0];
StreamReader reader = null;

// EndsWith is an instance method, so it is called on the file name itself.
if (filename.EndsWith("txt", StringComparison.OrdinalIgnoreCase))
{
   reader = File.OpenText(filename);   
}

Overview and Rationale for New Types

String comparison is at the heart of many string-related operations, most importantly sorting and equality.

Strings sort in a determined order: If string "my" appears before "string" in a sorted list of strings, it must be the case that in a string comparison, "my" compares "less than or equal to" "string." Additionally, comparison implicitly defines equality, as well, since this comparison operation will produce zero for any strings it deems equal; a good interpretation would be that neither string is 'less' than the other. Most meaningful operations involving strings include one or both of these procedures: comparing with another string, and executing a well-defined sort.

For many overloads, Thread.CurrentCulture dictates the default behavior for string comparisons in the .NET Framework. However, the comparison and casing behavior necessarily varies when the culture changes, either when run on a machine with a different culture than that on which the code was developed, or when the executing thread itself changes culture. This behavior is intended but remains non-obvious to many developers.

Correctly interpreting Strings given varying culture information becomes much easier with new overloads of existing APIs, plus a few new types like the System.StringComparison enumeration.

Whidbey introduces a clear new type that alleviates much of the confusion surrounding correct string comparisons: the StringComparison enumeration in mscorlib.dll.

namespace System
{
   public enum StringComparison
   {
      CurrentCulture,
      CurrentCultureIgnoreCase,
      InvariantCulture,
      InvariantCultureIgnoreCase,
      Ordinal,
      OrdinalIgnoreCase
   }
}

This gets to the core of how a particular string should be interpreted. Many string operations, most importantly String.Compare and String.Equals, now expose an overload consuming a StringComparison parameter. Explicitly setting this parameter in all cases, rather than choosing the default String.Compare(string strA, string strB) or String.Equals(string a, string b), makes your code clearer and easier to maintain, and is highly recommended. Furthermore, code specifying the new StringComparison.Ordinal and StringComparison.OrdinalIgnoreCase settings for these APIs gains the greatest speed and, most often, correctness benefits.

The StringComparison members are explained next.

Ordinal String Operations

Specifying the StringComparison.Ordinal or StringComparison.OrdinalIgnoreCase setting signifies a non-linguistic comparison; that is, the features of any natural language are ignored when making comparison decisions here. APIs run with these settings base string operation decisions on simple byte comparisons, rather than on casing or equivalence tables parameterized by culture. In the majority of cases, this best fits the intended interpretation of strings, while making your code faster and more reliable.

  • Ordinal comparisons are string comparisons in which each byte of each string is compared without linguistic interpretation. This is essentially a C runtime strcmp. Thus, "windows" would not match "Windows." Where the context dictates that strings should be matched exactly, or demands conservative matching policy, this comparison should be used. Additionally, ordinal comparisons are the fastest because they apply no linguistic rules when determining a result.
  • Case insensitive ordinal comparisons are the next most conservative, and ignore most casing. Thus, "windows" would match "Windows." When dealing with ASCII characters, this policy is equivalent to that of StringComparison.Ordinal, but with the usual ASCII casing ignored. Thus, any character in [A, Z] (\u0041-\u005A) matches the corresponding one in [a, z] (\u0061-\u007A). Casing outside the ASCII range uses the invariant culture's tables; thus, calling
String.Compare(strA, strB, StringComparison.OrdinalIgnoreCase) 

is equivalent to (but faster than) calling

String.Compare(strA.ToUpperInvariant(), strB.ToUpperInvariant(),
   StringComparison.Ordinal).  

These comparisons are still very fast.

Both types of ordinal comparisons use the equivalence of binary values directly, and are best suited for matching. When in doubt about your comparison settings, use one of these two values. However, since they operate by byte comparison, they sort not by a linguistic sort order (like an English dictionary) but by a binary sort order, which may look odd if displayed to users in most contexts.
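The difference between matching and sorting under ordinal semantics can be seen in a small sketch. The class and method names below (OrdinalSortDemo, SortOrdinal) are hypothetical and exist only for illustration; the sample assumes nothing beyond the StringComparer and StringComparison members described in this article.

```csharp
using System;

static class OrdinalSortDemo
{
    // Returns a copy of the input sorted in binary (ordinal) order.
    public static string[] SortOrdinal(string[] items)
    {
        string[] copy = (string[])items.Clone();
        Array.Sort(copy, StringComparer.Ordinal);
        return copy;
    }

    static void Main()
    {
        string[] sorted = SortOrdinal(new string[] { "apple", "Zebra", "Banana" });

        // Uppercase ASCII letters have smaller values than lowercase ones,
        // so this prints: Banana, Zebra, apple
        Console.WriteLine(String.Join(", ", sorted));

        // Exact matching distinguishes case; OrdinalIgnoreCase does not.
        Console.WriteLine(String.Equals("windows", "Windows",
            StringComparison.Ordinal));            // False
        Console.WriteLine(String.Equals("windows", "Windows",
            StringComparison.OrdinalIgnoreCase));  // True
    }
}
```

The sorted output places "Zebra" before "apple", which is correct for symbolic data but would look wrong in a list shown to a user.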

For those String.Equals overloads not consuming a StringComparison argument (including ==), ordinal semantics are the default. It is recommended that the StringComparison be specified in any case.

String Operations Using the Current Culture

For linguistically relevant data, which should be interpreted differently between cultures, use these operations:

  • CurrentCulture comparisons use the thread's current culture or 'locale'; if not set by the user, these default to the setting in the Regional Options window in the Control Panel. These should be used for culture-sensitive user interaction. If the current culture were set to U.S. English ("en-US"), "visualStudio" would appear in sort order before "windows," as in a U.S. English phone book. If it were Swedish ("sv-SE"), the opposite would be true.
  • Case-insensitive CurrentCulture comparisons are the same as the previous, except they ignore case as dictated by the thread's current culture. This behavior may manifest itself in sort orders, as well.

Comparisons using CurrentCulture semantics are the default for the String.Compare overloads that do not consume a StringComparison. It is recommended that the StringComparison be specified in any case.

String Operations Using the Invariant Culture

InvariantCulture comparisons use the static CultureInfo.InvariantCulture's CompareInfo property for comparison information. This behavior is the same on all systems and translates any characters outside its range into what it believes are 'equivalent' invariant characters. This policy can be useful for maintaining one set of string behavior across cultures, but often provides unexpected results.

Case insensitive InvariantCulture comparisons use the static CultureInfo.InvariantCulture's CompareInfo property for comparison information as well. Any case differences among these translated characters are ignored.

InvariantCulture has very few properties that make it useful for comparison.

It does comparison in a linguistically-relevant manner, which prevents it from guaranteeing full symbolic equivalence, but is not the choice for display in any culture. Perhaps one of the only real reasons to use InvariantCulture for comparison would be persisting ordered data for a cross-culturally identical display; if a large data file containing, say, a list of sorted identifiers for display were to accompany an application, adding to this list would require an insertion with invariant-style sorting.

Choosing a StringComparison Member for Your Method Call

When comparing strings in .NET, there are a few pitfalls such as the Turkish-I problem described later in this article. However, most of these can be quickly eliminated by accompanying your string with meaningful comparison semantics. For a given context, the appropriate choice of comparison style often becomes clear.

Table 1 outlines the mapping from semantic string context to a StringComparison enumeration member:

Table 1

Data meaning:
  • Case-sensitive internal identifiers
  • Case-sensitive identifiers in standards like XML and HTTP
  • Case-sensitive security-related settings
Data behavior: A non-linguistic identifier, where bytes match exactly.
Corresponding StringComparison value: Ordinal

Data meaning:
  • Case-insensitive internal identifiers
  • Case-insensitive identifiers in standards like XML and HTTP
  • File paths
  • Registry keys/values
  • Environment variables
  • Resource identifiers (handle names, for example)
  • Case-insensitive security-related settings
Data behavior: A non-linguistic identifier, where case is irrelevant, especially a piece of data stored in most Microsoft Windows system services.
Corresponding StringComparison value: OrdinalIgnoreCase

Data meaning:
  • Some persisted linguistically-relevant data
  • Display of linguistic data requiring a fixed sort order
Data behavior: Culturally-agnostic data, which still is linguistically relevant.
Corresponding StringComparison value: InvariantCulture or InvariantCultureIgnoreCase

Data meaning:
  • Data displayed to the user
  • Most user input
Data behavior: Data that requires local linguistic customs.
Corresponding StringComparison value: CurrentCulture or CurrentCultureIgnoreCase
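The mapping in Table 1 can be sketched as a set of small helpers, one per kind of data. The class and method names (ComparisonChoices, IsSameXmlTag, IsSamePath, CompareForDisplay) are hypothetical and serve only to show where each StringComparison member would plausibly be used.

```csharp
using System;

static class ComparisonChoices
{
    // Case-sensitive identifier (for example, an XML element name): Ordinal.
    public static bool IsSameXmlTag(string a, string b)
    {
        return String.Equals(a, b, StringComparison.Ordinal);
    }

    // Windows-style file path, where case is irrelevant: OrdinalIgnoreCase.
    public static bool IsSamePath(string a, string b)
    {
        return String.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    // Text ordered for display to the user: CurrentCulture.
    public static int CompareForDisplay(string a, string b)
    {
        return String.Compare(a, b, StringComparison.CurrentCulture);
    }

    static void Main()
    {
        Console.WriteLine(IsSameXmlTag("Item", "item"));                     // False
        Console.WriteLine(IsSamePath(@"C:\Temp\a.txt", @"c:\temp\A.TXT"));   // True
        // The sign of this result depends on the current culture.
        Console.WriteLine(CompareForDisplay("apple", "Banana"));
    }
}
```

Wrapping each choice in a named helper keeps the intended interpretation visible at the call site, which is the main point of specifying the StringComparison explicitly.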

The Motivation: The Turkish-I Problem

These new recommendations and APIs exist to alleviate misguided assumptions about the behavior of default string APIs. The canonical example of bugs emerging where non-linguistic string data is interpreted linguistically is the "Turkish-I" problem.

For nearly all Latin alphabets, including U.S. English, the character i (\u0069) is the lowercase version of the character I (\u0049). This casing rule quickly becomes the default for someone programming in such a culture. However, in Turkish ("tr-TR"), there exists a capital "I with a dot" character (\u0130), which is the capital version of i. Similarly, in Turkish, there is a lowercase "i without a dot" character (\u0131), which capitalizes to I. This behavior occurs in the Azeri culture ("az") as well.

Therefore, assumptions normally made about capitalizing i or lowercasing I are not valid among all cultures. If the default overloads for string comparison routines are used, they will be subject to variance between cultures. For non-linguistic data, as in the following example, this can produce undesired results:

Thread.CurrentThread.CurrentCulture = new CultureInfo("en-US");
Console.WriteLine("Culture = {0}",
   Thread.CurrentThread.CurrentCulture.DisplayName);
Console.WriteLine("(file == FILE) = {0}", 
   (String.Compare("file", "FILE", true) == 0));

Thread.CurrentThread.CurrentCulture = new CultureInfo("tr-TR");
Console.WriteLine("Culture = {0}",
   Thread.CurrentThread.CurrentCulture.DisplayName);
Console.WriteLine("(file == FILE) = {0}", 
   (String.Compare("file", "FILE", true) == 0));

Because i and I compare differently under Turkish casing rules, the results of the comparisons change when the thread culture is changed. This is the output:

Culture = English (United States)
(file == FILE) = True
Culture = Turkish (Turkey)
(file == FILE) = False

This could cause real problems if the culture is inadvertently used in security-sensitive settings:

static bool IsFileURI(String path) {
    return (String.Compare(path, 0, "FILE:", 0, 5, true) == 0);
}

Something like IsFileURI("file:") would return true with a current culture of U.S. English, but false if the culture is Turkish. Thus, on Turkish systems, one could likely circumvent security measures to block access to case-insensitive URIs beginning with "FILE:". Because "file:" is meant to be interpreted as a non-linguistic, culture-insensitive identifier, the code should instead be written this way:

static bool IsFileURI(String path) {
    return (String.Compare(path, 0, "FILE:", 0, 5,
      StringComparison.OrdinalIgnoreCase) == 0);
}

The Original Turkish-I Solution and Its Deficiencies

Because of the Turkish-I problem, the .NET team originally recommended using InvariantCulture as the primary cross-culture comparison type. The previous code would then read:

static bool IsFileURI(String path) {
   return (String.Compare(path, 0, "FILE:", 0, 5, true,
      CultureInfo.InvariantCulture) == 0);
}

For equality checks on typical ASCII identifiers, comparisons using InvariantCulture and Ordinal give the same result; however, InvariantCulture will make linguistic decisions that might not be appropriate for strings that need to be interpreted as a set of bytes.

Using the CultureInfo.InvariantCulture.CompareInfo, certain sets of characters are made equivalent under Compare(). For example, the following equivalence holds under the invariant culture:

InvariantCulture: a + ◌̊ = å

The "latin small letter a" (\u0061) character a, when next to the "combining ring above" (\u030a) character ◌̊, will be interpreted as the "latin small letter a with ring above" (\u00e5) character å.

Example 3:

string separated = "\u0061\u030a";
string combined = "\u00e5";
      
Console.WriteLine("Equal sort weight under InvariantCulture? {0}",
   String.Compare(separated, combined, 
      StringComparison.InvariantCulture) == 0);

Console.WriteLine("Equal sort weight under Ordinal? {0}",
   String.Compare(separated, combined, 
      StringComparison.Ordinal) == 0);

This prints out:

Equal sort weight under InvariantCulture? True
Equal sort weight under Ordinal? False

So, when interpreting file names, cookies, or anything else where something like the å combination can appear, ordinal comparisons still offer the most transparent and fitting behavior.

Common String Comparison Methods in the Framework

The following methods are those most commonly used for string comparison; each is accompanied by notes on its use.

String.Compare

Default interpretation: CurrentCulture

As the operation most central to string interpretation, all instances of these method calls should be examined to determine whether strings should be interpreted according to the current culture, or dissociated from the culture (symbolically). Typically, it is the latter, and an Ordinal comparison should be used.

The System.Globalization.CompareInfo class, provided as the CompareInfo property on all System.Globalization.CultureInfo objects, also offers a Compare method which provides a great deal of matching options (ordinal, ignoring white space, ignoring kana type, and so on) by means of the CompareOptions flags enumeration.
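As a minimal sketch of the CompareInfo route, the following helper passes a CompareOptions value to CompareInfo.Compare on the invariant culture. The class and method names (CompareInfoDemo, AreEqual) are hypothetical; only CompareOptions.IgnoreCase and CompareOptions.Ordinal are exercised here, and the other flags mentioned above are left out.

```csharp
using System;
using System.Globalization;

static class CompareInfoDemo
{
    // Compares two strings using the invariant culture's CompareInfo
    // and the caller's choice of CompareOptions.
    public static bool AreEqual(string a, string b, CompareOptions options)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        return compareInfo.Compare(a, b, options) == 0;
    }

    static void Main()
    {
        Console.WriteLine(AreEqual("windows", "Windows",
            CompareOptions.IgnoreCase));  // True
        Console.WriteLine(AreEqual("windows", "Windows",
            CompareOptions.Ordinal));     // False
    }
}
```

Note that CompareOptions.Ordinal cannot be combined with other flags, so it is passed on its own.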

String.CompareTo

Default interpretation: CurrentCulture

This API does not currently offer an overload specifying a StringComparison type. It is usually possible to convert these methods to the recommended String.Compare(string, string, StringComparison) form.

Implementing the IComparable interface will necessarily use this method. Since it does not offer the option of a StringComparison argument, types implementing this often allow the user to specify a StringComparer in their constructor (see the following hash table example).
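One way a type might accept a StringComparer in its constructor is sketched here. NameIndex is a hypothetical type invented for this illustration, not a Framework class; it assumes only List<string>.BinarySearch and the StringComparer members listed later in this article.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type for illustration; not part of the Framework.
class NameIndex
{
    private readonly List<string> names = new List<string>();
    private readonly StringComparer comparer;

    // The caller states the string interpretation once, at construction.
    public NameIndex(StringComparer comparer)
    {
        this.comparer = comparer;
    }

    public void Add(string name)
    {
        // Keep the list sorted under the chosen comparer.
        int index = names.BinarySearch(name, comparer);
        if (index < 0)
        {
            names.Insert(~index, name);
        }
    }

    public bool Contains(string name)
    {
        return names.BinarySearch(name, comparer) >= 0;
    }

    static void Main()
    {
        NameIndex index = new NameIndex(StringComparer.OrdinalIgnoreCase);
        index.Add("Readme.TXT");
        Console.WriteLine(index.Contains("readme.txt")); // True
    }
}
```

Because the same comparer is used for both insertion and lookup, the sort invariant of the list cannot be broken by a later change of thread culture.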

String.Equals

Default interpretation: Ordinal

The String class's equality methods include the static Equals, the static operator ==, and the instance method Equals. All of these operate by default in an ordinal fashion. Using an overload explicitly stating the StringComparison type is still recommended, even if you desire an ordinal comparison; in this way, searching code for a certain string interpretation becomes easier.

String.ToUpper and String.ToLower

Default interpretation: CurrentCulture

Users should most certainly be careful when using these functions, since forcing a string to a certain case is often used as a small normalization for comparing strings irrespective of case. If so, consider using a case-insensitive comparison.

ToUpperInvariant and ToLowerInvariant are also available. ToUpperInvariant is the standard way to normalize case. Comparisons made using OrdinalIgnoreCase are behaviorally the composition of two calls: calling ToUpperInvariant on both string arguments, and doing an Ordinal comparison.

Overloads are available for converting to upper- and lowercase given a CultureInfo parameter as well.
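The normalization pattern described above, and its suggested replacements, can be sketched as follows. The class and method names (CaseNormalization, EqualsViaCurrentCultureUpper, EqualsIgnoreCase, NormalizeKey) are hypothetical and chosen only for this illustration.

```csharp
using System;

static class CaseNormalization
{
    // Fragile: ToUpper() uses the current culture, so the result can
    // change with the thread's culture (see the Turkish-I problem).
    public static bool EqualsViaCurrentCultureUpper(string a, string b)
    {
        return a.ToUpper() == b.ToUpper();
    }

    // Preferred: state the intended comparison directly.
    public static bool EqualsIgnoreCase(string a, string b)
    {
        return String.Equals(a, b, StringComparison.OrdinalIgnoreCase);
    }

    // If a normalized key must be stored, uppercase it with the
    // invariant culture rather than the current one.
    public static string NormalizeKey(string key)
    {
        return key.ToUpperInvariant();
    }

    static void Main()
    {
        Console.WriteLine(EqualsIgnoreCase("FILE", "file")); // True
        Console.WriteLine(NormalizeKey("content-type"));     // CONTENT-TYPE
    }
}
```

The first method is kept only to show the pattern being replaced; its result for strings containing i or I depends on the culture in effect when it runs.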

Char.ToUpper and Char.ToLower

Default interpretation: CurrentCulture

These work similarly to the comparable String methods described earlier.

String.StartsWith and String.EndsWith

Default interpretation for StartsWith(String): CurrentCulture

New overloads of these methods consume a StringComparison type.

String.IndexOf and String.LastIndexOf

IndexOf(string,) Default interpretation: CurrentCulture

IndexOf(char,) Default interpretation: Ordinal

At the time of writing (just after .NET 2.0 Beta 2), these currently do not expose an overload using StringComparison. They will be provided with the full release of .NET 2.0. One might also consider using the overloads exposed by the System.Globalization.CompareInfo class.
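Both routes mentioned above can be sketched side by side. This sample assumes the IndexOf(String, StringComparison) overload that shipped in the released .NET Framework 2.0, which was not yet available in the Beta 2 build; the class and method names (IndexOfDemo, FindOrdinal, FindOrdinalViaCompareInfo) are hypothetical.

```csharp
using System;
using System.Globalization;

static class IndexOfDemo
{
    // Ordinal search through the String overload that takes a
    // StringComparison (assumed: released .NET 2.0 or later).
    public static int FindOrdinal(string source, string value)
    {
        return source.IndexOf(value, StringComparison.Ordinal);
    }

    // The same search expressed through CompareInfo and CompareOptions.
    public static int FindOrdinalViaCompareInfo(string source, string value)
    {
        CompareInfo compareInfo = CultureInfo.InvariantCulture.CompareInfo;
        return compareInfo.IndexOf(source, value, CompareOptions.Ordinal);
    }

    static void Main()
    {
        Console.WriteLine(FindOrdinal("key=value", "="));               // 3
        Console.WriteLine(FindOrdinalViaCompareInfo("key=value", "=")); // 3
    }
}
```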

A List of New Whidbey APIs for Correct String Comparison

The following is a reiteration of the set of new Whidbey APIs, which directly enable users to explicitly specify string comparison types.

public class String
{
   bool Equals(String value, StringComparison comparisonType)
   static bool Equals(String a, String b, StringComparison comparisonType)    

   static int Compare(String strA, String strB, StringComparison 
      comparisonType)
   static int Compare(String strA, int indexA, String strB, int indexB, int 
      length, StringComparison comparisonType)

   bool StartsWith(String value, StringComparison comparisonType)
   bool StartsWith(String value, bool ignoreCase, CultureInfo culture)

   bool EndsWith(String value, StringComparison comparisonType)
   bool EndsWith(String value, bool ignoreCase, CultureInfo culture)

   string ToLowerInvariant()
   string ToUpperInvariant()
}

public abstract class StringComparer : IComparer, IEqualityComparer,
IComparer<string>, IEqualityComparer<string>
{      

   public static StringComparer InvariantCulture        
   public static StringComparer InvariantCultureIgnoreCase
   public static StringComparer CurrentCulture 
   public static StringComparer CurrentCultureIgnoreCase
   public static StringComparer Ordinal
   public static StringComparer OrdinalIgnoreCase
   public static StringComparer Create(CultureInfo culture,
         bool ignoreCase)
   public int Compare(object x, object y)
   public new bool Equals(Object x, Object y)
   public int GetHashCode(object obj) 
   public abstract int Compare(String x, String y);
   public abstract bool Equals(String x, String y);        
   public abstract int GetHashCode(string obj);     
 }

Examples of Secondarily Affected APIs: System.Collections

Some non-String APIs that have string comparison as a central operation consume the new StringComparer type. This type encapsulates in a clear way the comparisons offered by the StringComparison enumeration.
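As a brief sketch of a collection consuming a StringComparer, the generic Dictionary accepts one through its IEqualityComparer<string> constructor parameter. The HeaderTable class and its Create method are hypothetical names used only for this illustration.

```csharp
using System;
using System.Collections.Generic;

static class HeaderTable
{
    // Creates a table whose keys are matched without regard to case,
    // using non-linguistic (ordinal) rules.
    public static Dictionary<string, string> Create()
    {
        return new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
    }

    static void Main()
    {
        Dictionary<string, string> headers = Create();
        headers["Content-Type"] = "text/xml";

        Console.WriteLine(headers.ContainsKey("content-type")); // True
    }
}
```

Because the comparer is fixed when the dictionary is constructed, hashing and equality stay consistent for the lifetime of the collection, whatever the thread culture.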

Array.Sort and Array.BinarySearch

Default interpretation: CurrentCulture

If storing any data to a collection, or reading persisted data from a file or database to a collection, note that switching culture can invalidate the invariants inherent in the collection. Array.BinarySearch assumes that the underlying contents are already sorted; if the contents are strings, String.Compare is used by Array.Sort for this ordering. Using a culture-sensitive comparer can be dangerous, in case the culture changes between sorting the array and searching its contents.

Example 4 (incorrect):

string[] storedNames;

public void StoreNames(string[] names)
{
   int index = 0;
   storedNames = new string[names.Length];

   foreach (string name in names)
   {
      this.storedNames[index++] = name;
   }

   Array.Sort(this.storedNames); // line A
}

public bool DoesNameExist(string name)
{
   return (Array.BinarySearch(this.storedNames, name) >= 0); // line B
}

Storage and retrieval here operate on the comparer provided by Thread.CurrentCulture. If the culture is expected to change between calls to StoreNames and DoesNameExist, especially if the contents are persisted somewhere in between, the binary search may fail.

A correct replacement for the lines marked A and B is shown in the following code.

Example 4 (correct):

Array.Sort(this.storedNames, StringComparer.Ordinal); // line A
// ...
Array.BinarySearch(this.storedNames, name, StringComparer.Ordinal); // line B

If this data is persisted and moved across cultures, and sorting is used to present this data to the user, one might even consider a rare use of InvariantCulture, which operates linguistically for better user output, but is unaffected by changes in culture:

Array.Sort(this.storedNames, StringComparer.InvariantCulture); // line A
// ...
Array.BinarySearch(this.storedNames, name, StringComparer.InvariantCulture); // line B

Collections Example: Hashtable Constructor

Hashing strings is a secondary example of an operation affected, at the core, by the string comparison interpretation.

It should also be noted that the string behavior of the file system, registry, and environment variables is best represented by OrdinalIgnoreCase. The following example demonstrates the use of the members of the new abstract StringComparer class, which wrap up comparison information for passing to existing APIs exposing an IComparer parameter.

const int initialTableCapacity = 100;
Hashtable h; // field, so that both methods below can use the table

public void PopulateFileTable(string directory)
{
   h = new Hashtable(initialTableCapacity, 
      StringComparer.OrdinalIgnoreCase);
         
   foreach (string file in Directory.GetFiles(directory))
         h.Add(file, File.GetCreationTime(file));
}

public void PrintCreationTime(string targetFile)
{
   Object dt = h[targetFile];
   if (dt != null)
   {
      Console.WriteLine("File {0} was created at time {1}.",
         targetFile, 
         (DateTime) dt);
   }
   else
   {
      Console.WriteLine("File {0} does not exist.", targetFile);
   }
}

Notes for Native Code

Native code is susceptible to similar types of errors, but they occur much less commonly. Default behaviors of string operations are not based on the locale, but are typically ordinal-based (strcmp or wcscmp, for example). Our recommendations for using managed code mirror this behavior. Finally, where linguistic flexibility is desirable, culture parameters can typically be passed in (see CompareString).

What About the Earlier Recommendation for Invariant Culture?

Comparisons made using InvariantCulture were a previously recommended standard for avoiding culture-sensitive bugs. Ordinal comparisons operate without regard to culture, just as InvariantCulture comparisons do; however, they have the added benefit that none of the implicit linguistic conversions made by InvariantCulture, which developers often overlook, can affect the comparison result.

It is recommended, and will be encoded as an upcoming FxCop rule, that:

  • All string comparison operations use the provided overloads specifying the string comparison type.
  • All users of InvariantCulture in the previously described overloads strongly consider using Ordinal or OrdinalIgnoreCase.
  • All calls to ToLowerInvariant should be avoided if used for string normalization.

Conclusion

String comparison and casing are central to conditional operations on strings, including sorting and equality. Carefully considering the context in which strings should be compared and cased is one of the best ways to make your application faster and more correct. Choosing whether your string should be treated as a symbolic set of bytes (an ordinal interpretation) or should vary over culture (a culture-sensitive interpretation) becomes clearer with the new APIs in .NET 2.0. Users should take care to specify which interpretation is correct. Additionally, using an ordinal string interpretation is often the best way to ensure code operates as intended.