Share via


This content is no longer actively maintained. It is provided as is, for anyone who may still be using these technologies, with no warranties or claims of accuracy with regard to the most recent product version or service release.

Aa155630.offvba(en-us,office.10).gif

VBAhacker

LEVEL: New Programmer VB | VBA | Sorting Algorithms

A Better Shell Sort: Part I

How to Sort Arrays of Any Data Type

By Romke Soldaat

As a developer, you probably deal with arrays a lot. If so, organizing them in ascending or descending order must be an important issue. On the surface, it looks like you're spoilt for choice. Many algorithms let you sort arrays: Bubble Sort, Quick Sort, Select Sort, Shell Sort, Insert Sort, and Heap Sort are among the more popular. Some work better with large arrays than others, and some perform at their best (or worst!) if the array is already in near-sorted order. Many books and articles have been written about the "ideal" algorithm. Impressive benchmark tests are being used to prove that one algorithm sorts a million numbers a few seconds faster than another one.

Don't get carried away by those statistics; the ideal algorithm doesn't exist. Most sorting routines were created long ago. Shell Sort was launched in 1959, and even the "modern" Quick Sort dates from the early sixties - twenty years before the birth of the PC. You can't blame their inventors for not anticipating the demands of computer users in the third millennium, but that doesn't change the fact that all sorting algorithms have serious shortcomings.

In this two-part series we'll try to find out how we can make them work the way they should. In this installment we'll look at string and date sorting. In Part II, we'll tackle the sorting of multi-dimensional arrays. I'll also introduce a class module that adds powerful sorting capabilities to your VB and VBA applications.

Introducing the Shell Sort Algorithm

Shell Sort (named after its inventor Donald Shell) is an algorithm that scores quite well under most circumstances. Its beauty is that it's simple, and easy to adapt. Shell Sort may not always be as fast as Quick Sort, but it's very memory efficient. (One of the drawbacks of Quick Sort is that it works by recursion, which, in Visual Basic, often leads to stack space problems. And what's the point in gaining a few fractions of a second if your application crashes before your array is sorted?)

FIGURE 1 shows the listing of a basic Shell Sort variation. Like all sorting algorithms, it works by comparing two values, and swapping them if one value is smaller than the other. Shell Sort starts the comparisons at a calculated distance (aka "gap") from the top of the array. Robert Sedgewick, one of many people who have tried to optimize the algorithm, recommends gaps in which each increment is three times its predecessor plus one: 1, 4, 13, 40, 121, 364, 1093, and so on. The initial (one-gap-too-many) GapSize value is calculated in lines 8 through 10, and then integer-divided by three in line 12.

   1.   Sub ShellSort(vArray AsVariant)
 2.   Dim TempVal As Variant
 3.   Dim i As Long, GapSize As Long, CurPos As Long
 4.   Dim FirstRow As Long, LastRow As Long, NumRows As Long
 5.   FirstRow = LBound(vArray) 
 6.   LastRow = UBound(vArray) 
 7.   NumRows = LastRow - FirstRow + 1
 8.   Do
 9.     GapSize = GapSize * 3 + 1
 10.   Loop Until GapSize > NumRows
 11.   Do
 12.     GapSize = GapSize \ 3
 13.     For i = (GapSize + FirstRow) To LastRow
 14.       CurPos = i
 15.       TempVal = vArray(i) 
 16.       Do While CompareResult( _                    vArray(CurPos - GapSize),TempVal) 
 17.         vArray(CurPos) = vArray(CurPos - GapSize) 
 18.         CurPos = CurPos - GapSize
 19.         If (CurPos - GapSize) < FirstRow Then Exit Do
 20.       Loop
 21.       vArray(CurPos) = TempVal
 22.     Next
 23.   Loop Until GapSize = 1
 24.   End Sub
 25.   Private Function CompareResult( _
 26.     Value1 As Variant, Value2 As Variant)
 27.     CompareResult = (Value1 > Value2) 
 28.   End Function

FIGURE 1: The basic Shell Sort algorithm is good for numbers, bad for strings.

Inside the For loop starting on line 13, and the Do loop starting on line 16, each element beyond the current gap index (TempVal) is compared against an element at the bottom of the array (TestVal). If TempVal is smaller, it replaces TestVal (line 21). The CompareResult function (lines 25 through 28) compares the two values, and returns True if TestVal is larger than TempVal. It's not strictly necessary to do this comparison externally, but it will become useful if we have to deal with different types of data, as you'll see later.

Once the last array element has been processed, the next GapSize is calculated (the controlling Do loop runs from line 11 to 23), and the search-and-swap procedure starts again. By the time GapSize is down to 1, the array is fully sorted. Well, sort of...

What's Wrong with Sorting Algorithms?

How does a sorting algorithm determine that one value is smaller than another? If the array contains just numbers, it's easy. You hardly need a computer to determine that 4 is smaller than 5. Any algorithm will sort numerical values properly.

But what if the array contains strings, and two of these strings are "Four" and "Five"? In that case, the comparison becomes binary. The smaller value is the one that starts with the character that's lowest in the ANSI character table. If the first characters are identical, the second character of each value is examined, and so on. In our example, "Five" is smaller than "Four" because "i" comes before "o". (The character values in the ANSI table are 105 and 111, respectively.)

So far, so good. Now take the string values "Four" and "five". All of a sudden, each sorting algorithm will tell you that "Four" is the smaller one, because the character value of "F" is 70, and the value of "f" is 102. That's where sorting becomes a dodgy business. In the ANSI table, the entire uppercase alphabet is listed before the lowercase alphabet. If your array contains the strings "Algeria", "antelope", "Zimbabwe", and "zebra", they end up sorted like this:

  Algeria
Zimbabwe
antelope
zebra

That's wrong. Many programmers attempt to solve that problem by converting the strings to uppercase (or lowercase) before they are compared. In FIGURE 1, you could replace line 27 with:

  CompareResult = (UCase$(Value1) > UCase$(Value2)) 

Theoretically, this looks like an elegant solution. But it's wrong, too. Even though lowercase and uppercase versions of characters are now treated equally, the sorting is still based on ANSI-dictated character positions. That works fine with strings that contain only the standard A-Z characters (the original 7-bit ASCII character set, which occupies the first half of the ANSI table), but fails with diacritics, such as é, ñ, ï, ù, and å, which can appear anywhere between positions 128 and 255. All sorting algorithms consider these characters larger than z, and move them to an incorrect position in the array. In the case of Unicode strings, it gets even worse: Characters can be scattered over a range that covers more than 65,000 positions. Try sorting that with an algorithm that recognizes only a 26-character alphabet!

Culture Shock

There's no universal solution to this problem. Sorting conventions can differ dramatically between languages. Not only does the alphabetical order (and even the alphabet) vary, but also conventions for ordering items in dictionaries and phone books are often different. For example, in most languages, diacritic versions of a character are sorted together with the "plain" version of the same character. Imagine an array consisting of the characters:

   a z à á
â ã ä å æ

Americans and Western-Europeans expect them to be sorted as shown in the first row of FIGURE 2. The rest of that table shows that the expectations are quite different in other parts of the world.

English a á à â ä ã å æ z
Czech a ä á â ã z æ å à
Danish a á à â ã z æ ä å
Finnish a á à â ã æ z å ä
Icelandic a à â ä ã å á z æ
Swedish a á à â ã æ z å ä

FIGURE 2: Don't make the mistake of thinking that the alphabet stops at z. Here's how some characters are sorted in different languages.

Here are a few more examples that illustrate the complexity of alphabetical sorting:

  • In traditional Spanish, "ch" is considered a unique compression that must be sorted between "c" and "d," but in Czech the same compression is sorted after "h."
  • In English, "æ" is equivalent to "ae," but in many other languages it's a single character that must be sorted after "z."
  • In Lithuanian, there's no difference between "i" and "y," and in Swedish, "ú" and "y" are considered equivalent, as are "v" and "w."
  • Some languages include non-Latin characters that have special sorting rules, such as the Icelandic "Þ," which comes between "d" and "e."

It's obvious that any algorithm that sorts strings on the basis of character positions in the ANSI or Unicode table is utterly useless, and that we have to look at the lexical value of a character. But where do we find that information?

Windows to the Rescue

Fortunately, we don't have to look far. As a matter of fact, we don't have to worry about it at all. Windows is an operating system that's used all over the world and is designed to cope with the lexical aspects of almost any written language. When you installed Windows, you were prompted to select a locale - the country in which you live or work. At that point, the operating system geared itself up to deal with all the intricacies of that locale, such as currency, date and time formats, and alphabetical sorting rules. If you open the Control Panel to select a different locale, all regional standards are instantly adjusted, and any well-behaved application will follow the new rules.

API or VB?

The Windows API contains a number of functions that let you make politically correct string comparisons, such as lstrcmp, lstrcmpi, and CompareString. You can use them in any sorting algorithm, but, as an Office developer, you don't have to. VB has its own implementation of the lstrcmpi function, named StrComp. (According to my tests, the VB version works even faster than the Windows function.) When used with the vbTextCompare switch, StrComp compares strings based on the letter order for the current locale without making a distinction between uppercase and lowercase characters. The function returns -1 if the first string is smaller than the second one, +1 in the reverse case, and 0 if both strings are considered equal.

That gives us the opportunity to make line 27 in FIGURE 1 locale-friendly:

  CompareResult =_
   (StrComp(Value1, Value2, vbTextCompare) = 1) 

Obviously, this modification disqualifies the algorithm for numeric sorting, but that's fairly easy to fix, as we'll see later.

Sorting Dates

Apart from strings and numbers (which covers the Integer, Long, Decimal, Single, Double, and Currency data types), VB supports one odd type of variables: dates.

Although VB usually recognizes a date from many different date expressions (date literals, numbers that look like dates, strings that look like dates, etc.), date variables are internally stored as 64-bit Double values. The value to the left of the decimal represents a date, the value to the right represents a time. (Negative numbers represent dates between January 1, 100 and December 30, 1899.) If a date value is stored in a variable of the Date type, you can convert it to a Double value with the CDbl function. The following prints "37281.53125" in the Immediate window:

  Dim d As Date
d = "Jan 25, 2002 12:45" 
Debug.Print CDbl(d) 

That's good news, because it means we can compare two dates as if they were numbers. All we have to do is make sure that the variables contain valid date values. The following lines of code do exactly that:

  If IsDate(Value1) And IsDate(Value2) Then
  CompareResult = (CDate(Value1) > CDate(Value2)) 
End If

The IsDate function returns True if an expression can be converted to a date; CDate converts any valid date expression to a value in the Date format. If the date in Value1 is larger (i.e. later) than the one in Value2, the value of CompareResult is True.

FIGURE 3 shows a new version of the CompareResult function that handles all three data types. It uses the IsDate and IsNumeric functions to analyze the data type of the two arguments. If the values are dates or numbers, the > operator is used to compare them; otherwise StrComp is used to make a case-insensitive text comparison. If you replace lines 25 to 28 in FIGURE 1 with this new function, the Shell Sort algorithm is fully equipped to sort numeric, date, and string arrays.

  Private Function CompareResult(Value1 As Variant, _
  Value2 As Variant, Optional Descending as Boolean)
     If IsDate(Value1) And IsDate(Value2)
Then
    CompareResult = (CDate(Value1) > CDate(Value2)) 
  ElseIf IsNumeric(Value1) And IsNumeric(Value2) Then
    CompareResult = (Value1 > Value2) 
  Else
    CompareResult = _
       (StrComp(Value1, Value2, vbTextCompare) = 1) 
  End If
  CompareResult = CompareResult Xor Descending
End Function

FIGURE 3: Comparing numeric, alphabetical, and date values.

Sorting in Descending Order

In most cases you'll probably want to sort your arrays in ascending order, but sometimes you may wish to reverse the arrangement so numbers are sorted from highest to lowest, and dates from oldest to most recent. If you look again at FIGURE 3, you'll see that the CompareResult function takes an optional third Boolean argument: Descending.

If this argument is set to True, the function must return True if the first value isn't larger than the second one. This decision is taken in the last instruction. The Xor operator evaluates to True, if one - and only one - of the expressions is True. So, if Value1>Value2 evaluates to True, and Descending is also True, the comparison returns False. And if Descending is False, the comparison can only return True if Value1>Value2 evaluates to True. That's the beauty of logical operators!

Part II: The Final Touch

We're not there yet. In Part II, we'll tweak the Shell Sort algorithm so that it can process multi-dimensional arrays. We'll also introduce a class module that takes care of all the nitty-gritty, so you can create a simple object that copes with all your sorting needs. Watch this space for more VBA hacks!

Dutchman Romke Soldaat has been a human word-processor since he started as a copywriter in the sixties. After the birth of the PC, he used and customized every text-processing package on the market, adding menus and mouse support to WordPerfect years before WP itself introduced them. In 1988 he co-founded the Microsoft International Product Group in Dublin, Ireland, and wrote his first macros when WinWord was still a prototype. Since his retirement in 1992, he has created a number of successful add-ons for Office. He now lives with his wife and two dogs in Italy. Romke can be contacted at mailto:romke@soldaat.com.