This page is specific to
Microsoft Visual Studio 2008/.NET Framework 3.5

.NET Framework Class Library
Char.GetUnicodeCategory Method

Updated: November 2007

Categorizes a Unicode character into a group identified by one of the UnicodeCategory values.

Name    Description
GetUnicodeCategory(Char)    Categorizes a specified Unicode character into a group identified by one of the UnicodeCategory values.
GetUnicodeCategory(String, Int32)    Categorizes the character at the specified position in a specified string into a group identified by one of the UnicodeCategory values.
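
As a minimal sketch of the two overloads listed above, the following C# program classifies a few sample characters. The class name and the sample values are chosen only for illustration.

```csharp
using System;
using System.Globalization;

class GetUnicodeCategoryDemo
{
    static void Main()
    {
        // Overload 1: classify a single Char value.
        UnicodeCategory upper = Char.GetUnicodeCategory('A');
        UnicodeCategory digit = Char.GetUnicodeCategory('7');

        // Overload 2: classify the Char at a given index in a string.
        string text = "a b";
        UnicodeCategory space = Char.GetUnicodeCategory(text, 1);

        Console.WriteLine(upper);  // UppercaseLetter
        Console.WriteLine(digit);  // DecimalDigitNumber
        Console.WriteLine(space);  // SpaceSeparator
    }
}
```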
Community Content
WARNING: Chars don't make sense in many languages      Shawn Steele - MSFT

It is worth mentioning that the "char" type represents a single 16-bit value. In Unicode, some characters consist of 2 UTF-16 code units (a surrogate pair), so in that case a "char" cannot represent a complete "character". This does not happen in English text, but many Chinese and other characters exist outside of the BMP (i.e., they require 2 chars to represent the Unicode code point).
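
To make the surrogate-pair point concrete, here is a small sketch. U+20000, a CJK ideograph outside the BMP, is picked purely as an example, and the class name is hypothetical.

```csharp
using System;
using System.Globalization;

class SurrogatePairDemo
{
    static void Main()
    {
        // U+20000 lies outside the BMP, so UTF-16 stores it as two chars.
        string s = "\U00020000";

        Console.WriteLine(s.Length);                              // 2
        Console.WriteLine(Char.IsSurrogatePair(s, 0));            // True

        // Looked at alone, the first char is only half of a character.
        Console.WriteLine(Char.GetUnicodeCategory(s[0]));         // Surrogate

        // Combining both chars recovers the actual code point.
        Console.WriteLine("U+{0:X}", Char.ConvertToUtf32(s, 0));  // U+20000
    }
}
```

Passing a lone `s[0]` to the Char overload can therefore only ever report Surrogate, never the category of the ideograph itself.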

Also note that the notion of a "character" is itself flexible. Many people think of characters as "glyphs", but many "glyphs" require multiple code points. For example, ä can be "a" + U+0308 (combining diaeresis) or the single code point "ä" (U+00E4). In some languages, not every "letter/character/glyph" can be represented correctly by a single Unicode code point; some require multiple code points.
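
The two spellings of ä can be compared with a short sketch like the one below. It uses String.Normalize and StringInfo from the framework; the class and variable names are illustrative only.

```csharp
using System;
using System.Globalization;
using System.Text;

class CombiningSequenceDemo
{
    static void Main()
    {
        string decomposed = "a\u0308"; // "a" followed by combining diaeresis
        string composed = "\u00E4";    // precomposed ä

        Console.WriteLine(decomposed.Length);  // 2
        Console.WriteLine(composed.Length);    // 1

        // The second char of the decomposed form is a mark, not a letter.
        Console.WriteLine(Char.GetUnicodeCategory(decomposed[1]));  // NonSpacingMark

        // StringInfo treats the base letter plus mark as one text element.
        Console.WriteLine(new StringInfo(decomposed).LengthInTextElements);  // 1

        // Ordinal comparison sees different strings until they are normalized.
        Console.WriteLine(decomposed == composed);                                    // False
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC) == composed); // True
    }
}
```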

Additionally, some concepts get confused by this behavior. For example, there is a ΰ (U+03B0, Greek small letter upsilon with dialytika and tonos), but there is no equivalent capital letter. Calling ToUpper() returns the same value, although you could perhaps argue for Ϋ́ (U+03AB + U+0301, Greek capital letter upsilon with dialytika followed by a combining tonos). Some other operating systems and environments choose that as the ToUpper() value for U+03B0, so a single "char" ends up with a 2 "char" uppercase form.
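
A quick way to check this behavior on a given runtime is a sketch along these lines. It only prints what the runtime returns rather than assuming a result, and the class name is hypothetical.

```csharp
using System;

class ToUpperDemo
{
    static void Main()
    {
        char lower = '\u03B0'; // Greek small letter upsilon with dialytika and tonos

        // A char-to-char mapping can only ever return a single char.
        char upper = Char.ToUpperInvariant(lower);
        Console.WriteLine("U+{0:X4} -> U+{1:X4}", (int)lower, (int)upper);
        Console.WriteLine("Unchanged: {0}", lower == upper);

        // A string-to-string mapping is at least free to change length.
        string upperString = "\u03B0".ToUpperInvariant();
        Console.WriteLine("Uppercased string length: {0}", upperString.Length);
    }
}
```

Because Char.ToUpperInvariant returns a char, it cannot express a multi-char uppercase form even in principle, which is one reason to prefer string-level operations for linguistic content.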

Another example is when combinations of characters cause their form to change. This isn't common in Latin characters, but it is somewhat like æ (U+00E6) looking like a and e crammed together, or, in German, ß being the equivalent of ss. In some scripts the form changes a lot depending on the subsequent letters. An oversimplification would be to describe it as a kind of hyperactive cursive in which the letters connect in different ways depending on the following letters.

There are many other cases where the "character" concept breaks down, so use caution. Strings are generally preferable for representing linguistic content.


CharUnicodeInfo is preferred      Shawn Steele - MSFT
For backward-compatibility reasons, CharUnicodeInfo and Char behave slightly differently for GetUnicodeCategory. CharUnicodeInfo has the more "correct" behavior.
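
The note does not say which characters are affected. One way to find out on a particular runtime is to compare the two methods over every char value, as in this sketch (the class name is illustrative, and the output depends on the framework version in use):

```csharp
using System;
using System.Globalization;

class CategoryDifferenceDemo
{
    static void Main()
    {
        // Use an int counter so the loop ends after char.MaxValue instead of wrapping.
        for (int i = 0; i <= char.MaxValue; i++)
        {
            char c = (char)i;
            UnicodeCategory fromChar = Char.GetUnicodeCategory(c);
            UnicodeCategory fromInfo = CharUnicodeInfo.GetUnicodeCategory(c);

            // Report only the code units where the two methods disagree.
            if (fromChar != fromInfo)
            {
                Console.WriteLine("U+{0:X4}: Char={1}, CharUnicodeInfo={2}",
                                  i, fromChar, fromInfo);
            }
        }
    }
}
```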
© 2008 Microsoft Corporation. All rights reserved.