CUSTOM CULTURES

Extend Your Code's Global Reach With New Features In The .NET Framework 2.0

Michael Kaplan and Cathy Wissink

This article is based on a prerelease version of the .NET Framework 2.0. All information herein is subject to change.

This article discusses:

  • Custom cultures and regions with CultureAndRegionInfoBuilder
  • Serialization of culture and region data
  • CultureInfo support for SQL Server 2005
  • International domain name mapping
  • Unicode properties and normalization
This article uses the following technologies:
Visual Basic, C#, .NET Framework 2.0, Unicode

Contents

Custom Cultures and Regions
Custom Cultures and the LCID
CultureAndRegionInfoBuilder Serialization with LDML
Cultures in Windows and the .NET Framework
CultureInfo Support for SQL Server 2005
Updates to the Encoding Classes
Ordinal, Invariant, and Current Cultures
Unicode Normalization
New IDN Mapping APIs
Getting Unicode Property Information
Conclusion

Extensibility is crucial to international users today. Users want the option to customize the data as appropriate for their needs. What if the built-in support for a particular language or culture is not adequate or appropriate, or the cultural data is missing entirely? The application of globalization standards (most obviously the Unicode Standard, but others as well) provides a common, non-proprietary approach to international text. Migrating data away from proprietary models and toward a commonly used industry standard allows users to share their work across platforms and applications around the world. The upcoming Microsoft® .NET Framework 2.0 adds a number of globalization features that address the important issues of extensibility, standards support, and migration.

Custom Cultures and Regions

One of the biggest bottlenecks in extending globalization support in the .NET Framework has been the inability to add user-defined cultures. While methods of creating user-defined CultureInfo objects have been published, they are only workarounds, fraught with problems that stem from the Framework's inherent lack of support for them. First, user-defined cultures created this way do not work across AppDomains. Second, some relevant classes in the System.Globalization namespace are sealed, preventing those classes from being extended. Third, some classes in the Framework internally create CultureInfo objects, and they pick up the built-in cultures rather than the custom, user-defined cultures. Finally, these workarounds are all code-based solutions that are not easily shared between different applications.

The lack of custom culture support is addressed in the .NET Framework 2.0 through the new CultureAndRegionInfoBuilder (CARIB) class. The CARIB class, part of the System.Globalization namespace and exposed from the sysglobl DLL, allows you to create a new culture that you can use or deploy to others. Specifically, it lets you create three different kinds of cultures.

First, you can replace an existing culture so that future attempts to create the culture use your updated culture. For example, if a company in the United States wants all employees to use a particular short date format that is not the default for en-US, the company could create a custom culture that replaces the short date format with the preferred version and deploy it to all users.

You can also create a new culture that is based on an existing culture; in effect, the new culture uses the old culture as a template. Since you can leverage the extant properties in the old culture as needed, you can avoid creating all of the different properties from scratch. An example of this is creating an English language culture for India. You could use many of the existing property values (those for day and month names, and for language name), but then update the properties specific to the region (region name and currency, for example).

Finally, you can create a new culture from scratch. This would be the case when you are creating a culture that has never been supported on Microsoft platforms, so there is nothing to use as a template. An example of this is the Hausa language spoken in Nigeria. A user would need to create both language- and region-specific data from scratch, since no applicable data is currently supported in the .NET Framework. Once you have created and populated the CARIB object, you can call its Register method to create the culture and region and make it available to any managed application (this process stores the culture and region as a file on the local computer in a system directory and can be undone using the static Unregister method on the CARIB class). The code for each task is straightforward (see Figure 1).

Figure 1 Using a Custom Culture

Visual Basic

Dim carib As CultureAndRegionInfoBuilder = _
    New CultureAndRegionInfoBuilder("de-DE-Kultur", _
    CultureAndRegionModifiers.None)
' Load up all of the existing data for German and for Germany
carib.LoadDataFromCultureInfo(New CultureInfo("de-DE", False))
carib.LoadDataFromRegionInfo(New RegionInfo("de"))
' Change an arbitrary property and register the culture on the machine
carib.RegionEnglishName = carib.RegionNativeName
carib.Register()
' Use the new culture, just as you would a built-in one
Dim ci As CultureInfo = New CultureInfo("de-DE-Kultur")

C#

CultureAndRegionInfoBuilder carib = new CultureAndRegionInfoBuilder(
    "de-DE-Kultur", CultureAndRegionModifiers.None);
// Load up all of the existing data for German and for Germany
carib.LoadDataFromCultureInfo(new CultureInfo("de-DE", false));
carib.LoadDataFromRegionInfo(new RegionInfo("de"));
// Change an arbitrary property and register the culture on the machine
carib.RegionEnglishName = carib.RegionNativeName;
carib.Register();
// Use the new culture, just as you would a built-in one
CultureInfo ci = new CultureInfo("de-DE-Kultur");

As you can see, there is a great deal of flexibility in the way these custom cultures are named. There are, however, best practices you should follow to ensure good interaction between new custom cultures, other managed applications on the machine, and any external components with which your application communicates. You are creating new cultures that other applications will be able to leverage, so your application will need to handle those issues and interactions gracefully. See the documentation for the .NET Framework 2.0 for more information about these best practices.

Custom Cultures and the LCID

The Locale Identifier (LCID) has long been the way by which people refer to locales on Windows® APIs. Although culture names are the preferred way to refer to cultures and regions in the .NET Framework, the Locale Identifier has always been available for interoperability with Windows.

However, the process by which LCIDs are allocated is closed, and it could not be opened up to support custom cultures and regions without causing an entirely new set of interoperability problems. Because of this, the LCID is not a property that can be set on the CARIB object. That said, the LCID will be preserved in the case of replacement cultures, which are designed to stay compatible with the original culture in key aspects.

CultureAndRegionInfoBuilder Serialization with LDML

It's much easier to use custom cultures across different machines if there is a way to serialize the culture data to XML. To accomplish this, the .NET Framework takes advantage of Locale Data Markup Language (LDML), a format whose schema is defined by the Unicode Consortium in Unicode Technical Standard #35.

There are two basic supported operations: serializing a CARIB object as an LDML file, and deserializing an LDML file back into a CARIB. In theory, since LDML is a standard, any LDML file can be used, not just those created through the .NET Framework. In practice, however, there is a great deal of variation among the requirements and features of locales on different platforms, and in most cases some tweaking will be needed before the Register method succeeds on a culture loaded from an LDML file. A good example is an LDML file that does not contain information about all of the date and time formats; this information is an important part of the culture and must be defined prior to registration. In cases such as these, the LDML file acts as a useful template, and you must fill in the remaining details.

The code to serialize a custom culture using LDML is also straightforward, as you can see in Figure 2. This example shows how you could create a custom culture for Turkish in Germany.

Figure 2 Using LDML to Serialize a Custom Culture

Visual Basic

Dim file1 As String = Path.GetTempFileName()
File.Delete(file1)
Dim ci As CultureInfo = New CultureInfo("tr-TR")
Dim ri As RegionInfo = New RegionInfo("de-DE")
Dim carib As CultureAndRegionInfoBuilder = _
    New CultureAndRegionInfoBuilder("tr-DE-custom", _
    CultureAndRegionModifiers.None)
carib.LoadDataFromCultureInfo(ci)
carib.LoadDataFromRegionInfo(ri)
carib.Save(file1)
carib = CultureAndRegionInfoBuilder.CreateFromLdml(file1)
carib.Register()

C#

string file1 = Path.GetTempFileName();
File.Delete(file1);
CultureInfo ci = new CultureInfo("tr-TR");
RegionInfo ri = new RegionInfo("de-DE");
CultureAndRegionInfoBuilder carib = new CultureAndRegionInfoBuilder(
    "tr-DE-custom", CultureAndRegionModifiers.None);
carib.LoadDataFromCultureInfo(ci);
carib.LoadDataFromRegionInfo(ri);
carib.Save(file1);
carib = CultureAndRegionInfoBuilder.CreateFromLdml(file1);
carib.Register();

Cultures in Windows and the .NET Framework

The .NET Framework was released at an interesting point in the history of Windows: after Windows XP was released, but before Windows Server™ 2003. As a result, the list of cultures available in the .NET Framework matched the locales included in Windows XP (and provided a superset of the locales included in previous versions of Windows). Developers didn't have to consider the consequences of new locales on Windows. There would be no issue with how these new Windows locales would interact with a version of the .NET Framework that tries to, for example, base its culture settings on the choices available in the operating system. The .NET Framework has always maintained its own data so that it could return the same results on all possible platforms, and until Windows XP SP2, this had never caused any difficulties.

The globalization development team had to address this problem, however, after Windows XP SP2 shipped with 25 new locales. Imagine our surprise when one of our testers discovered that you could not even start a managed application when installing an early build of SP2 and using one of those new locales as the default user locale! This was clearly an issue we needed to address immediately in earlier versions of the .NET Framework, and fix more fully in the .NET Framework 2.0.

Future Windows service packs may include additional locales, and Windows Vista™ (formerly codenamed "Longhorn") is expected to ship with locales beyond what has been supported to date. This presents a very real possibility that an installed version of Windows will include locales that are not recognized cultures in the .NET Framework. Therefore, it's imperative that the .NET Framework gracefully handle Windows locales in a managed environment. Figure 3 shows Francois Liger's Culture Explorer, which illustrates how the .NET Framework 2.0 picks up the new locales in Windows Vista through the Windows-only cultures.

Figure 3 Viewing Windows-Only Cultures

 

The .NET Framework can now handle previously unrecognized Windows locales by using the Win32® API to synthesize a CultureInfo object any time a locale supported in Windows has no corresponding culture in the .NET Framework. These cultures can be created either by name or by LCID, just like any other culture. The following code enumerates the Windows-only cultures by name (new locales on Windows XP SP2 include mt-MT, bs-BA-Latn, smn-FI, smj-NO, smj-SE, sms-FI, sma-NO, sma-SE, quz-BO, quz-EC, quz-PE, ml-IN, bn-IN, cy-GB, and more):

' Visual Basic
For Each ci As CultureInfo In CultureInfo.GetCultures( _
    CultureTypes.WindowsOnlyCultures)
    Console.WriteLine(ci.Name)
Next

// C#
foreach(CultureInfo ci in CultureInfo.GetCultures(
    CultureTypes.WindowsOnlyCultures))
{
    Console.WriteLine(ci.Name);
}

This is obviously a break from the typical practice in the .NET Framework of giving the same results independent of the platform. However, given the choice between failing completely and succeeding when there is a way to retrieve the data, the option of handling Windows-only cultures successfully provides a better solution for developers who expect some type of culture data returned by the .NET Framework for these Windows-only locales.

You'll notice in the previous code snippet that the CultureInfo.GetCultures static method was used to retrieve a collection of Windows-only cultures. While GetCultures and the CultureTypes enumeration existed in previous versions of the Framework, the .NET Framework 2.0 rounds out the enumeration with more options in order to provide better support for custom and replacement cultures. One of these new values is WindowsOnlyCultures. Figure 4 provides a comparison of the various culture types.

Figure 4 Comparison of Different Culture Types

Key: A = has an LCID; B = allows substitution of other CompareInfo and TextInfo objects; C = allows changes to other cultures and related fields; D = available when supported by the OS; E = always available; F = supports formatting and parsing.

NeutralCultures      Cultures that are associated with a language but are not specific to a country/region. (A, D, E)
SpecificCultures     Cultures that are specific to a country/region. (A, D, E, F)
ReplacementCultures  Custom cultures created by the user that replace cultures shipped with the .NET Framework. (A, C, D, E, F)
UserCustomCultures   Custom cultures created by the user. (B, C, D, E, F)
WindowsOnlyCultures  Cultures installed in the Windows system but not the .NET Framework. (A, D, F)

CultureInfo Support for SQL Server 2005

SQL Server™ 2005 is the first ship vehicle for the .NET Framework 2.0. Implementing globalization features across both products has resulted in interesting design trade-offs, since they employ very different models for handling locale.

The SQL Server locale semantics define one setting for UI and formatting and another setting for collation and encoding. For example, if you were using a French UI setting and Thai_CS_AS collation on SQL Server, the user interface would be set to French, and the dates would be formatted using settings from the fr-FR culture. However, the data would be sorted as expected by the th-TH culture, and the code page for non-Unicode applications would be set to 874 (the Thai Windows code page). The locale/culture semantics of Windows and the .NET Framework, however, have one setting for UI and another setting for formatting and collation. In Windows and the .NET Framework, setting a French UI culture (UI language) with a Thai current culture (locale) will set the UI to French, while all other settings (formatting, sorting, and code page) will be Thai.

To solve the locale model conflict between SQL Server, the .NET Framework, and Windows, a special override of the GetCultureInfo method was created. This override takes two CultureInfo names for the two distinct SQL Server settings, and creates a special CultureInfo object that matches the semantics SQL Server is expecting. Here's the code for creating the object just described:

' Visual Basic
Dim ci As CultureInfo = CultureInfo.GetCultureInfo("fr-FR", "th-TH")

// C#
CultureInfo ci = CultureInfo.GetCultureInfo("fr-FR", "th-TH");

The ci variable will contain a culture that you'll be able to use in SQL Server 2005 stored procedures that will match the SQL Server locale behavior as shown in Figure 5.

Figure 5 Mapping Multiple Cultures

Setting                              SQL Server          .NET Framework
Resource retrieval (UI)              Language (fr-FR)    CurrentUICulture (fr-FR)
Number formatting, date formatting   Language (fr-FR)    CurrentCulture (th-TH)
Sorting (collation)                  Collation (th-TH)   CurrentCulture (th-TH)

Updates to the Encoding Classes

In prior versions of the .NET Framework, encoding support consisted of thin wrappers around operating system functionality, which could lead to configuration problems if necessary code pages were either not available on the platform or simply not installed. Starting with the .NET Framework 2.0, encoding support is built into the Framework itself. This approach improves performance, provides greater flexibility, and gives more consistent results across supported platforms. It also allows for the new Encoding.GetEncodings method, which enumerates the available encodings.

Other new features include UTF-32 support (little endian and big endian) and UTF-16 big endian support. These encodings, while not used extensively on Windows, can be crucial in some cross-platform scenarios involving systems with different default ways of handling Unicode—for example, the versions of UNIX that use UTF-32, or databases that run on big-endian platforms and need to match this with the encoding for performance reasons.
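The .NET types involved here are UTF32Encoding and the big-endian variant of UnicodeEncoding. As a platform-neutral illustration of why byte order matters (a Python sketch, not the .NET API), the same two code points serialize to different byte sequences under each scheme:

```python
# The same two code points serialized under several Unicode encoding
# schemes; the byte order (and code unit width) is what differs.
text = "A\u00e9"  # "Aé": U+0041, U+00E9

print(text.encode("utf-16-le").hex())  # 4100e900
print(text.encode("utf-16-be").hex())  # 004100e9
print(text.encode("utf-32-le").hex())  # 41000000e9000000
print(text.encode("utf-32-be").hex())  # 00000041000000e9
```

A big-endian database reading little-endian UTF-16 (or vice versa) would see entirely different characters, which is why matching the platform's byte order can matter for both correctness and performance.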

Another feature added to encodings is encoding and decoding fallback. This allows the developer to define the behavior of a conversion when the code page does not have an explicit conversion for the bytes or characters in question. The different options for fallback behavior are described in Figure 6. The ability to create custom fallback behavior is itself worthy of a future article.

OrdinalIgnoreCase

Even if you decide to use the binary comparison behavior provided by Ordinal comparisons, you may want to have the option of ignoring differences between the case of letters within the two strings being compared. Why? This is the behavior that Windows itself uses in many of the places where it ignores case for symbolic identifiers: file names, registry keys, environment variables, and the names of objects like mutexes and pipes. All of these items use an operation that uppercases the identifiers and then does a binary comparison. You may well want similar behavior with your own symbolic identifiers.

You could go to the trouble of doing a ToUpper call on each of the strings and then doing an Ordinal comparison. This would, however, allocate extra strings, and that is best avoided for performance reasons. Therefore, the .NET Framework 2.0 adds comparison operators that allow you to perform Ordinal operations that ignore case distinctions for symbolic identifiers in the same fashion as Windows.
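The uppercase-then-binary technique can be sketched in a few lines; this is a Python illustration, not the .NET API, and note that Python's str.upper applies full Unicode case mapping while the Windows operation is a simpler one-to-one uppercase (OrdinalIgnoreCase also avoids the intermediate string allocations this sketch makes):

```python
# Emulate ordinal-ignore-case: uppercase both strings, then compare
# them code point by code point (a binary comparison, no collation).
def ordinal_ignore_case_equals(a: str, b: str) -> bool:
    return a.upper() == b.upper()

print(ordinal_ignore_case_equals("README.TXT", "readme.txt"))    # True
print(ordinal_ignore_case_equals("Pipe\\Name1", "PIPE\\name2"))  # False
```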

This functionality does not provide proper linguistic results (any more than Ordinal comparisons provide proper results), but it is a great way to emulate the underlying behavior of the operating system for managed and unmanaged symbolic identifiers in your managed code.

Figure 6 Encoding Fallback Behaviors

Behavior Description
Exception Throws an exception any time unknown bytes or characters are found.
Best fit Provides the traditional Windows best fit support, which supplies fallbacks for characters that do not exist on the code page, like replacing ä with a.
Replacement Provides limited support for the replacement of the bytes that cannot be mapped with something else (as defined by the developer).
Custom Provides full support for the replacement of data, since the developer can inherit from and define their own EncoderFallback and related classes, creating whatever behavior they find appropriate.
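The exception, replacement, and custom behaviors in Figure 6 have rough counterparts in Python's codec error handlers, which makes for a compact, runnable illustration (the handler names below are Python's, not the .NET EncoderFallback classes):

```python
import codecs

text = "a\u00e4b"  # "aäb"; ä has no mapping in the ASCII code page

# Exception-style fallback: the default "strict" handler raises.
try:
    text.encode("ascii")
except UnicodeEncodeError:
    print("strict: raised")

# Replacement-style fallback: substitute a fixed character.
print(text.encode("ascii", "replace"))    # b'a?b'

# Custom-style fallback: register a handler of your own design that
# decides what to emit for each unmappable character.
def bracketed(err):
    return "[U+%04X]" % ord(err.object[err.start]), err.end

codecs.register_error("bracketed", bracketed)
print(text.encode("ascii", "bracketed"))  # b'a[U+00E4]b'
```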

Ordinal, Invariant, and Current Cultures

Developers often use the invariant culture's comparison methods to compare strings, not realizing that they return the same default collation-table results as English, German, and many other languages, and that they ignore characters that are not defined in the .NET Framework's sorting tables. There are also times when developers instead want to use a specific culture's preferences, or a truly binary comparison.

The invariant culture in the .NET Framework, like the invariant locale added to Windows XP, is intended to provide cultural data that will not vary when user-defined settings change. For collation purposes, the default collation table of the .NET Framework is used. This table provides linguistically expected results for many different languages, including English, German, Hebrew, and Russian, and it handles many of the interesting cases of strings that are canonically equivalent in Unicode. It therefore provides a reasonable baseline for usage that will not vary when the settings within the .NET Framework are changed.

Ordinal comparisons, on the other hand, are meant to provide a true binary (or code point) comparison of the two strings. Ordinal comparison, of course, does not take any linguistic or standards-based factors into account. Even if the string contains characters not found in the .NET Framework sorting tables, and even if it contains strings that look like they are identical, ordinal comparison will treat the two strings as being different if the underlying code points are different.

To decide which you should use in a given situation, it is important to look at what you are trying to accomplish. Would you want two strings that are equivalent according to the Unicode normalization rules to be treated as equal, like you might in a sorted list of names? This is probably a good time to use the current culture. Are you trying to list items in the same order, no matter how the user's preferences are configured? Using Invariant here is probably the best plan. Or should any difference in the underlying strings be cause for treating the strings as different, whether the difference is visible or not (like in a password)? Ordinal comparisons are probably the best plan here.
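The password-versus-name distinction can be made concrete with two canonically equivalent spellings of "å". The sketch below uses Python's stdlib unicodedata module rather than the .NET comparison APIs, but the underlying Unicode behavior is the same:

```python
import unicodedata

s1 = "\u00e5"   # "å", precomposed
s2 = "a\u030a"  # "a" followed by U+030A COMBINING RING ABOVE

# Ordinal view: the underlying code points differ, so the strings differ.
print(s1 == s2)  # False

# Linguistic view: normalize first, and the strings compare equal.
print(unicodedata.normalize("NFC", s1) ==
      unicodedata.normalize("NFC", s2))  # True
```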

There are two enhancements to the collation functionality in the .NET Framework 2.0 which help developers properly use this functionality. First, a new IsSortable method was added to the CompareInfo object. This method will return false if the string contains any code points that are not yet defined in the collation tables.

Second, an OrdinalIgnoreCase comparison semantic was added to all the overrides that currently allow an Ordinal comparison, such as String.Compare and CompareInfo.Compare (see the sidebar on OrdinalIgnoreCase for more information on its usefulness).

Many of the environments where you might currently use Invariant would benefit from one of these two methods, allowing you to more effectively match the underlying behavior of the file system. Examples of using some of these different types can be seen in Figure 7.

Figure 7 Old and New Collation Features

Visual Basic

Dim stTest1 As String = "æ, IamAString"
Dim stTest2 As String = "STRING"
Dim stTest3 As String = "AE"
' Returns true
Console.WriteLine(stTest1.EndsWith(stTest2, _
    StringComparison.InvariantCultureIgnoreCase))
' Returns true
Console.WriteLine(stTest1.EndsWith(stTest2, _
    StringComparison.OrdinalIgnoreCase))
' Returns true
Console.WriteLine(stTest1.StartsWith(stTest3, _
    StringComparison.InvariantCultureIgnoreCase))
' Returns false
Console.WriteLine(stTest1.StartsWith(stTest3, _
    StringComparison.OrdinalIgnoreCase))

C#

string stTest1 = "æ, IamAString";
string stTest2 = "STRING";
string stTest3 = "AE";
// Returns true
Console.WriteLine(stTest1.EndsWith(stTest2,
    StringComparison.InvariantCultureIgnoreCase));
// Returns true
Console.WriteLine(stTest1.EndsWith(stTest2,
    StringComparison.OrdinalIgnoreCase));
// Returns true
Console.WriteLine(stTest1.StartsWith(stTest3,
    StringComparison.InvariantCultureIgnoreCase));
// Returns false
Console.WriteLine(stTest1.StartsWith(stTest3,
    StringComparison.OrdinalIgnoreCase));

Note how InvariantCultureIgnoreCase and OrdinalIgnoreCase return the same results for the first two tests. For the second two, however, they return different results, and this is where deciding which comparison you need becomes crucial. If you are comparing file names or other symbolic identifiers, OrdinalIgnoreCase makes the most sense. If you are handling strings in the user interface or comparing canonically equivalent strings, such as "å" (U+00E5) and "a" followed by a combining ring above (U+0061 U+030A), use the current culture, or, more rarely, InvariantCultureIgnoreCase, depending on whether you want culture-specific behavior that varies with the user's preferences or invariant behavior that is unaffected by them.

Some additional changes were also made, including the corrected Serbian collation (fixed in Windows XP SP2, but inadvertently missed in the .NET Framework) and generally better handling of ignored or ignorable characters in IndexOf, LastIndexOf, IsPrefix, and IsSuffix calls. (These changes primarily involve searching for a zero-length string or a preceding or trailing NULL character, determining whether unsortable characters exist in a string, and cases where symbols are present but ignored, such as with CompareOptions.IgnoreSymbols.)

Unicode Normalization

The Unicode Consortium defines normalization as a method whereby "equivalent text (canonical or compatibility) will have identical binary representations. When implementations keep strings in a normalized form, they can be assured that equivalent strings have a unique binary representation." The Unicode version of normalization is described in Unicode Standard Annex (UAX) #15. The .NET Framework adds two methods to the String class with two overrides each, and one enumeration giving the different forms:

String.IsNormalized()
String.IsNormalized(NormalizationForm normalizationForm)
String.Normalize()
String.Normalize(NormalizationForm normalizationForm)

NormalizationForm enumeration: FormC, FormD, FormKC, FormKD

To provide you with an example of different normalization forms, let's take a look at an arbitrary string:

õĥµ¨ (00F5 0068 0302 00B5 00A8)

Figure 8 shows how this arbitrary string appears in the four different Unicode normalization forms. In collation, õĥµ¨ = õĥµ¨ = õĥμ¨ = õĥμ¨.

Figure 8 Normalized Unicode Strings

Form Description Normalized Form
FormC Canonical Decomposition, followed by Canonical Composition õĥµ¨ (U+00F5 U+0125 U+00B5 U+00A8)
FormD Canonical Decomposition õĥµ¨ (U+006F U+0303 U+0068 U+0302 U+00B5 U+00A8)
FormKC Compatibility Decomposition, followed by Canonical Composition õĥμ¨ (U+00F5 U+0125 U+03BC U+0020 U+0308)
FormKD Compatibility Decomposition õĥμ¨ (U+006F U+0303 U+0068 U+0302 U+03BC U+0020 U+0308)
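These results can be reproduced with any conformant Unicode implementation. For instance, here is a Python sketch using the stdlib unicodedata module, whose NFC/NFD/NFKC/NFKD names correspond to FormC/FormD/FormKC/FormKD:

```python
import unicodedata

# The sample string from the article: 00F5 0068 0302 00B5 00A8
s = "\u00f5h\u0302\u00b5\u00a8"

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    out = unicodedata.normalize(form, s)
    # Print each result as a sequence of code points.
    print(form, " ".join("U+%04X" % ord(c) for c in out))
```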

New IDN Mapping APIs

The .NET Framework 2.0 also includes the IdnMapping class for using non-ASCII characters in domain names. The IdnMapping class has two properties (AllowUnassigned and UseStd3AsciiRules) and two methods (GetAscii and GetUnicode) that are worth a closer look.

The AllowUnassigned property indicates whether unassigned Unicode code points may be used in operations performed by members of this class. UseStd3AsciiRules specifies whether operations are restricted to standard (non-internationalized) naming conventions, which limit names in the US-ASCII range to the characters A-Z, a-z, 0-9, the hyphen, and the period.

GetAscii takes one or more domain name labels in Unicode characters and returns the domain labels encoded in the US-ASCII character range (also known as Punycode). GetUnicode takes one or more domain name labels encoded according to the internationalized domain name (IDN) standard and returns the labels encoded as Unicode.
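Python's stdlib "idna" codec implements the same IDNA (Punycode-based) mapping, so the GetAscii/GetUnicode round trip can be sketched as follows (the domain name is hypothetical):

```python
# ToASCII / ToUnicode round trip for an internationalized domain name.
name = "b\u00fccher.example"           # "bücher.example"

ascii_form = name.encode("idna")       # encode each label to Punycode
print(ascii_form)                      # b'xn--bcher-kva.example'
print(ascii_form.decode("idna"))       # back to "bücher.example"
```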

As an example, we can start with the following (arbitrary) string containing international characters:

www.日本語-on-the-Web.com

With just a little bit of code, this domain name becomes the following ASCII string:

xn--on-the-Web-tj8w5jvv8n.com

The following code uses IdnMapping to encode the original IDN label as ASCII characters:

' Visual Basic
Dim idn As IdnMapping = New IdnMapping()
Dim st As String = "www." & ChrW(&H65E5) & ChrW(&H672C) & _
    ChrW(&H8A9E) & "-on-the-web.com"
Console.WriteLine(idn.GetAscii(st, 0, st.Length))

// C#
IdnMapping idn = new IdnMapping();
string st = "www.\u65e5\u672c\u8a9e-on-the-web.com";
Console.WriteLine(idn.GetAscii(st, 0, st.Length));

Be aware, however, that IDN mapping has been a controversial topic lately due to the security risks involved with spoofing and characters that can be easily confused with other characters that look the same. The Unicode Technical Committee is working to define a standard to deal with these problems, but such a standard does not yet exist.

Microsoft will evaluate the best way to integrate such a standard once it is made available by the Unicode Consortium. Until that time arrives, these classes implement the standard as it is currently defined and can be used as part of a careful strategy to support international domain names.

Getting Unicode Property Information

Methods on the System.Char structure, like IsDigit and IsWhiteSpace, are mostly derived from Unicode, but many of the values match legacy behavior used by previous versions of Visual Basic®. The need for a new class that returns the actual Unicode property values was clear, and it has been met by the new CharUnicodeInfo class. This class does much more than the methods on Char, and it uses the official data from the Unicode Character Database, based on Unicode 4.1. Figure 9 shows some of the methods that have been added to get additional Unicode property information.

Figure 9 CharUnicodeInfo Members

Method Description
IsWhiteSpace Is it whitespace by the Unicode definition?
GetNumericValue Returns a double so that any number in Unicode (even a fraction) can have its value parsed out, or -1 if the character is not a number.
GetDigitValue Returns an int so that any number representable as an int can have its value parsed out, or -1 if the character is not a digit.
GetDecimalDigitValue Returns an int so that any digit 0 to 9 in any Unicode script can have its value parsed out, or -1 if the character does not have the Nd (Number, Decimal Digit) property.
GetUnicodeCategory Returns the actual Unicode General Category.
GetBidiCategory Returns the Unicode Bidi Category.
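For comparison, Python's stdlib unicodedata module reads the same Unicode Character Database fields, which makes it easy to spot-check what these methods should return (a sketch for illustration, not the .NET API):

```python
import unicodedata

print(unicodedata.category("A"))       # 'Lu' - Letter, uppercase
print(unicodedata.bidirectional("A"))  # 'L'  - left-to-right
print(unicodedata.numeric("\u00bd"))   # 0.5  - VULGAR FRACTION ONE HALF
print(unicodedata.digit("\u2460"))     # 1    - CIRCLED DIGIT ONE
print(unicodedata.decimal("\u0967"))   # 1    - DEVANAGARI DIGIT ONE (Nd)
```

Note how the three numeric views differ: ½ has a numeric value but no digit value, and the circled digit has a digit value but is not a decimal (Nd) digit.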

Each of these methods operates on a single Char (a UTF-16 code unit), but supplementary character support is also available through new override signatures that accept a string and an integer index. The index points to the first character of a surrogate pair when you are looking for information about a supplementary character. (This is crucial for the instances where one character is actually made up of two UTF-16 code units.) These overrides are available for all of the IsXxx methods of the Char structure, as well as the GetUnicodeCategory and GetNumericValue methods. There are also new ConvertToUtf32 and ConvertFromUtf32 methods, which make it easy to move between surrogate pairs and the actual UTF-32 code point values.
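The surrogate-pair arithmetic behind ConvertToUtf32 and ConvertFromUtf32 is fixed by the Unicode standard, so it can be sketched in a few lines (Python here, purely for illustration):

```python
import struct

ch = "\U00010400"  # U+10400 DESERET CAPITAL LETTER LONG I (supplementary)

# In UTF-16 this character occupies two code units: a surrogate pair.
hi, lo = struct.unpack("<2H", ch.encode("utf-16-le"))
print(hex(hi), hex(lo))        # 0xd801 0xdc00

# ConvertToUtf32 equivalent: recombine the pair into one code point.
cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
print(hex(cp))                 # 0x10400

# ConvertFromUtf32 equivalent: chr goes straight back to the character.
print(chr(cp) == ch)           # True
```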

Another related update, in the StringInfo class, makes it easier to work with supplementary characters and other text elements. (A text element is one or more Unicode [UTF-16] code points that together make up what a user would consider a single character.) The changes add a constructor that takes a string, the String and LengthInTextElements properties, and the SubstringByTextElements method.

StringInfo.String retrieves the string that the StringInfo was created with, or changes the string without creating a new object. StringInfo.LengthInTextElements returns the number of text elements in the string. StringInfo.SubstringByTextElements returns a substring starting at the nth text element in the string.

To get their results, these new members use the ParseCombiningCharacters method that already existed in the .NET Framework, a method many developers do not find intuitive to use directly; hence these friendlier additions. You can see them in action in this section of code:

' Visual Basic
Dim si As StringInfo = New StringInfo( _
    "A" & ChrW(&H300) & ChrW(&H301) & ChrW(&H300) & _
    "e" & ChrW(&H300) & ChrW(&H301) & ChrW(&H300))
For ich As Integer = 0 To si.LengthInTextElements - 1
    Console.WriteLine(si.SubstringByTextElements(ich, 1))
Next

// C#
StringInfo si = new StringInfo("A\u0300\u0301\u0300e\u0300\u0301\u0300");
for (int ich = 0; ich < si.LengthInTextElements; ich++)
{
    Console.WriteLine(si.SubstringByTextElements(ich, 1));
}
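For comparison, the grouping that SubstringByTextElements performs on base-plus-combining-mark sequences can be approximated in a few lines of Python. This sketch is our own simplified logic (it handles only combining marks, not every text-element rule), grouping each base character with the marks that follow it:

```python
import unicodedata

def text_elements(s):
    """Group a string into base-character + combining-mark runs, roughly
    what StringInfo treats as text elements."""
    elements = []
    for ch in s:
        if elements and unicodedata.combining(ch):
            elements[-1] += ch      # attach the combining mark to its base
        else:
            elements.append(ch)     # start a new text element
    return elements

s = "A\u0300\u0301\u0300e\u0300\u0301\u0300"
assert len(text_elements(s)) == 2   # two user-perceived characters
```

The eight code points in the sample string collapse into just two text elements, which is exactly the count the LengthInTextElements property reports for the loop above.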

Conclusion

Many small but important enhancements were made to classes in the System.Globalization namespace. More details about these changes are available in the .NET Framework 2.0 documentation. Figure 10 provides a brief summary of changes not otherwise discussed in this article.

What Are Genitive Months?

Genitive months are used in certain cultures where the month name (and its string representation) varies depending on use. The .NET Framework cultures include two strings for each month: one for the standalone version of a month name (used in calendars or outside of dates), and the other for use within dates. The version of the month used within dates needs to inflect—or change its string—when used contextually. In many cultures, this contextual use is referred to by its linguistic case: the genitive case.

For example, in Russian, the Framework will give you two month names for January: Январь is the standalone name used in calendars, and января is the inflected version of January, used in dates, like 6 января 2005 г.
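The distinction amounts to selecting one of two strings for the same month depending on context, which is what MonthGenitiveNames makes possible. This can be sketched as a simple lookup; the dictionary and function below are hypothetical, using the two Russian strings from the example above:

```python
# Hypothetical sketch: the two forms of "January" in Russian, as the
# standalone/genitive distinction would surface them.
RU_JANUARY = {"standalone": "Январь", "genitive": "января"}

def month_name(in_date_context: bool) -> str:
    """Pick the inflected (genitive) form inside a date, else standalone."""
    return RU_JANUARY["genitive" if in_date_context else "standalone"]

assert f"6 {month_name(True)} 2005 г." == "6 января 2005 г."
assert month_name(False) == "Январь"   # calendar heading form
```

A real formatter would consult the culture's genitive name array whenever the month name appears next to a day number, and the standalone array otherwise.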

Figure 10 New System.Globalization Features

Feature Description
TextInfo.CultureName Contains the underlying culture from which the TextInfo was created.
TextInfo.LCID Contains the LCID of the underlying culture from which the TextInfo was created; it corresponds to the CultureName unless the culture is a custom culture.
DateTimeFormatInfo.ShortestDayNames Contains an array of the shortest possible name for the days, which is ideal for small calendars.
DateTimeFormatInfo.MonthGenitiveNames Contains an array of the genitive month names in a culture. (See the sidebar "What are Genitive Months?")
DateTimeFormatInfo.AbbreviatedMonthGenitiveNames Contains an array of the abbreviated genitive month names in a culture.
NumberFormatInfo.NativeDigits Contains a ten-element array with the digits 0 through 9 in the culture underlying the NumberFormatInfo. Not currently used by parsing and formatting, but exposed so that developers can use the data to build such support on the .NET Framework 2.0.
NumberFormatInfo.DigitSubstitution Value that describes how a user might expect native digits to be used. (Not currently a part of parsing and formatting, but again allows developers to make use of the data to add such support.)
CultureInfo.IsCustomCulture Identifies the culture as a custom culture, if applicable.
CultureInfo.IetfLanguageTag Has a tag that can be used to determine the alternate, standards-based name for the culture.
CultureInfo.CultureTypes Contains all the new culture types: replacement, custom, Windows-only, and so on.
CultureInfo.GetCultureInfo Creates a cached, read-only CultureInfo object to improve performance. Since many people only need a read-only instance or an identifier name, the performance benefit over creating a new CultureInfo can be substantial over the course of many instances.
CultureInfo.GetCultureInfoByIetfLanguageTag Creates a new CultureInfo from an IETF language tag value.
RegionInfo A RegionInfo can now be created from a full culture name, which lets custom cultures support custom regions.
RegionInfo.GeoId Returns the GEOID value of the region, which can be useful for interoperability scenarios that use the GEO APIs in Windows and the GEO values in MapPoint.
RegionInfo.NativeName Contains the native name of the region.
RegionInfo.CurrencyEnglishName Contains the English name of the region's currency.
Encoding UTF-8 and UTF-16 handling in the .NET Framework has been tightened to meet the current Unicode definitions. Invalid sequences, such as unpaired surrogates, will no longer be encoded or decoded.
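The tightened Encoding behavior in the table above matches what strict Unicode codecs do elsewhere. For instance, Python's UTF-8 codec also refuses unpaired surrogates in both directions, as this sketch demonstrates:

```python
# Encoding a lone high surrogate fails.
try:
    "\ud800".encode("utf-8")
    raise AssertionError("lone surrogate should not encode")
except UnicodeEncodeError:
    pass

# Decoding the byte sequence for a lone surrogate (ED A0 80) also fails.
try:
    b"\xed\xa0\x80".decode("utf-8")
    raise AssertionError("surrogate bytes should not decode")
except UnicodeDecodeError:
    pass

# A well-formed sequence, such as U+10384 encoded as four bytes, round-trips.
assert "\U00010384".encode("utf-8") == b"\xf0\x90\x8e\x84"
assert b"\xf0\x90\x8e\x84".decode("utf-8") == "\U00010384"
```

Rejecting such sequences at the codec boundary is what keeps invalid data from silently propagating between systems that each assume well-formed Unicode.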

As we've described here, the .NET Framework 2.0 aims to provide a number of new international features that will help developers better customize their applications to meet the varied needs of worldwide customers. Features like custom cultures and regions, the ability to handle Windows-only locales, and customizable fallback behavior all give the developer greater control over a number of international settings that provide a more culturally authentic experience to users.

Developers will have even greater access in the .NET Framework 2.0 to a diverse set of standards used in the internationalization community today, including Unicode character information, Unicode normalization, LDML, updates to the encoding classes, and exposure of language tags. While there are still a number of internationalization standards that need to be added to the .NET Framework in the future, these current additions demonstrate a long-term commitment to the internationalization standards community, most notably Unicode.

Michael Kaplan is a technical lead at Microsoft, working on both Windows and the .NET Framework, particularly on collation, keyboards, locales, and Unicode support. He is the developer/owner of MSKLC, the Microsoft Keyboard Layout Center, written in C#. He can be reached at https://blogs.msdn.com/michkap.

Cathy Wissink is the Group Program Manager for the Language and Market Roadmap group in the Windows Globalization team at Microsoft. You can reach her at cwissink@microsoft.com.