Understanding Counting Sort
There exist a few specialized sorting algorithms that make use of extra knowledge about input data. This means: they exist, and you should know about them, but they work only in certain specialized circumstances. The input data for any sorting algorithm is normally defined like this:
- Input:
Lx
is a list ofn
objects for which some sort of ordering relationship is defined
Using this you can sort, for example, strings, integers and stereotypes. Counting Sort has a more restricted definition:
- Input:
Lx
is a list ofn
objects for which some sort of ordering relationship is defined, and - You know in advance the size
m
of the object universeUO
input items are taken from, and - There exists a relationship
r(x)
that maps an object to be sorted to an index, and m
is reasonably small.
Hm. Conditions 1-3 are fairly straightforward and can be defined for most input data. For example, when sorting strings:
- Input:
Lx
is a list ofn
strings m
is the size of the object universeUO
of all strings. Assume your input uses only the 26 letters of the alphabet, and you have a string length of 10 would be 2610 = 141.167.095.653.376- There exists a relationship
r(x)
that maps an object to be sorted to an index: lexographically orderUO
, and then use the index in that list
However, for many interpretations of reasonably, 141.167.095.653.376 is not a number that is reasonably small. Right? On the other hand, let's assume you have lots of integers in the range 0..100:
- Input:
Lx
is a list ofn
small integers m
is the size of the object universeUO
of all integers, which is 100- There exists a relationship
r(x)
that maps an object to be sorted to an index: the integer value itself m
is indeed reasonably small.
Given that "reasonably small" is not an exact definition, you'll have to try and come up with numbers that work on your system, with your data.
The basic idea behind Counting Sort
- Assume you have
m
integers in the range0..n-1
. - Create
n
integer counters, starting at 0 - For each item in the input array: increase the counter associated with the integer value
- At the end, recreate the sorted array from information sorted in the counters
For example, assume you have input numbers in the range 0..4, like this:
[3, 2, 0, 2, 2, 0, 2, 2, 1, 1]
If you count the numbers of 0s, 1s, 2s and 3s, you get this:
Number of 0s: 2
Number of 1s: 2
Number of 2s: 5
Number of 3s: 1
A sorted array can now be created using this information:
[0, 0, 1, 1, 2, 2, 2, 2, 2, 3]
^^^^
^^^^
^^^^^^^^^^^^^
^
This should give you an important type of insight: you can use this to recreate sorted arrays, but only if the type of array can be recreated. For example, you couldn't really use this to sort objects in memory, unless you have integer based keys for these objects.
The algorithm
The above lends itself to a straight-forward implementation:
def counting_sort(A):
# create counters initialized to zero
frequencies = [0] * (max(A)+1)
# count each item
for value in A:
frequencies[value] += 1
# recreate sorted array
writepos = 0
for value, count in enumerate(frequencies):
for i in range(count):
A[writepos] = value
writepos += 1
return A
But how does it perform
Here are the results of comparing "sort 1 million randomized integers in the range 0..9999":
Method | Python | Javascript | C# | C/C++ |
---|---|---|---|---|
Builtin array-sort | 847 ms | 443 ms | 84 ms | 110 ms |
Counting sort | 553 ms | 353 ms | 5 ms | 0 ms |
It is somewhat depressing to see how bad Python performance actually is... Also, I was surprised by C#, I had thought it to be much worse...