String Searching

The problem is almost trivial: given a huge input string (the "haystack"), find a small substring ("needle"). And the default search algorithm is so simple that it takes creativity to imagine something better. Now, I don't know about you but after years of seeing and using that trivial solution, it was harder and harder to think of alternatives. So one day I decided I really need to understand this better, and dug into the classic algorithms.

Ask yourself: Do I understand the problem?

First input is an array of characters, called haystack: the string we are searching in. In languages like C we typically do not know the size of the array up front, in other languages like Python we do.
Second input is an array of characters, called needle: the string we are searching for. We don't even know if the string is contained in haystack, neither where it could be found.
Output is the index where needle can be found in haystack.

Actually, there are more formal definitions, but we'll skip them for now: it would only serve to distract from the point I am trying to make.

Setting up a test harness

As usual, our first step is to prepare a testbed. Our code strategy for the test will be this:

We will implement a couple of functions for the same task. As usual, we are going to store them in a dictionary along with their name, so that we can call them, time their performance and print out results
We will verify that all algorithms return the same output for the same input.
Our haystack is going to be the online text of H.G.Wells 'War of the Worlds', available at Project Gutenberg
Actually, to make things a little bit more interesting: we're going to use 10 times the text, thus blowing up its size. And we'll deliberately put a unique needle right at the end of the text, so we can see how different algorithms treat skipping over huge amounts of text.

Therefor, our test code looks like this:

#! -*- Encoding: Latin-1 -*-

def read_text_file(filename, needle):
    data = list(open(filename).read())
    
    for index, char in enumerate(data):
        o = ord(char) - ord('a')
        if o < 0 or o >= 26:
            data[index] = ' '
    
    text = "".join(data).lower()
    return 10 * text + needle

def selftest():
    import time
    needle = "%"
    haystack = read_text_file(<Filename for H.G.Wells' War of the Worlds<, needle)

    functions = {
        'builtin search' : builtin_search,
        'naive_search' : naive_search,
        'robin_karp' : robin_karp,
        'knuth_morris_pratt' : knuth_morris_pratt, 
        'naive_search_variant' : naive_search_variant,
        'naive_search_pythonesque' : naive_search_pythonesque,
    }

    expected_result = None
    for function_name in functions:
        function = functions[function_name]
        t0 = time.time()
        result = function(needle, haystack)
        t0 = time.time() - t0
        print "%32s: %12.2f for result %6d" % (function_name, t0, result, ) 

        if expected_result is None:
            expected_result = result
        else:
            assert result == expected_result

if __name__ == "__main__":
    selftest()

Brute Force

The brute force algorithm works like this:

Enumerate all possible characters in haystack
Check if the substring at this position matches haystack, by checking all characters in the needle against the respective char in the haystack.
If there is a match, we're done

def naive_search(needle, haystack):

    # Enumerate all possible characters in haystack
    for i in xrange(len(haystack)):

        # Check if the substring at this position matches haystack
        found = True
        for j in xrange(len(needle)):
            if haystack[i+j] != needle[j]:
                found = False
                break

        # If there is a match, we're done
        if found:
            return i

    # no match: return -1 indicates failure
    return -1

The logic to compare the substring is very explicit, but also very un-pythonesque. A more pythonic code could read:

def naive_search_pythonesque(needle, haystack):
    n = len(needle)

    # Enumerate all possible characters in haystack
    for i in xrange(len(haystack)):

        # Check if the substring at this position matches haystack
        if haystack[i:i+n] == needle:

            # If there is a match, we're done
            return i

    # no match: return -1 indicates failure
    return -1

Robin-Karp

Now, the Robin Karp string search algorithm gets a bad press, but the idea behind it is really smart and generalizes to a lot of related problems. It was also the first "advanced" string search algorithm that I "got", so I have a kind of sentimental attachment to it.

The basic idea is this:

Treat the haystack not as an array of characters, but as an array of numbers. That is trivial in C, and in Python you can use 'ord' to map characters to ASCII (things are a bit more tricky once you head over to Unicode - as you should! - but for now we'll keep it simple). For example, the following python line: h = map(ord, haystack) converts a string into an array of ASCII ordinals. Example:
```
>>> test = "this is a test"
>>> map(ord, test)
[116, 104, 105, 115, 32, 105, 115, 32, 97, 32, 116, 101, 115, 116]
```
Do the same for the needle. Now you have two arrays of numbers. This reduces the problem of finding a substring to the problem of finding an array of numbers in another array of numebrs. Doesn't sound like too much of a reduction yet, but bear with me.
Now, observe the any substring in haystack matching needle must, by definition, be of length len(needle).
Observe also that since we have numbers, we can calculate a hash value for the needle.
Then, if we calculate a hash value for a subarray of length n in haystack, all we need to compare is the two hash values to find out if the strings match. Great: comparing two integers gives us a comparison of arbitrarily long strings, at a lot faster speed. There is only one minor problem: calculating the hash faster than string comparison...
Now, this is where the really smart thing comes in: choice of hash function:
- We need a hash function that will slide a window of n = len(needle) integers across all integers in haystack.
- In particular, this means: one character is added in the front, one character at the end is removed to update the hash.
- So, for the needle, we'll simply use the sum of all input values. We will use the very simple Python function
```
key = sum(map(ord, needle))
```
  . This wouldn't work in C, but we are here at pyalda.com, not at calda.com.
- For the haystack, we do the same thing, but we keep two indices in haystack:, one for the character in the front added to the sum, and one for the character and the end, being removed from the hash. This means updating the hash is now O(2) rather than O(N).
The beauty is this: rather than having to match potentially all m characters over and over again, we just compare two integers (fast!) and know if we have. But do we know? Actually, there could be a collision between two hashes. Because this can happen, we'll compare the actual substring once the hashes match, to skip spurious matches.

The following code translates this idea:

def robin_karp(needle, haystack):    
    # needle is treated as an arry of integers (not chars)
    # hash of the needle = sum of all integers in the needle
    key = sum(map(ord, needle))

    # minimum length of a subarray matching needle hash
    n = len(needle)

    # current hash
    act = 0

    # start position for current array
    j   = -n

    # haystack is treated as an arry of integers (not chars)
    h = map(ord, haystack)

    # enumerate all positions in haystack
    for i in range(len(haystack)):

        # update hash with new character
        act += h[i]

        # correct hash if we're in the positive region
        if j >= 0:
            act -= h[j]

        # update end-of-hash position
        j += 1

        # compare hash of substring, needle
        if act == key:

            # if the hashes match, compare the substrings anyway for want of hash collision
            if haystack[j:i+1] == needle:
                return j

    return -1

The code looks longer than it really is, what with all the comments and all that. Anyway, for me RK definitely proved enlightening: a new tack on an old problem. Comments:

It should be clear that this algorithm gives bad performance if the needle is really small - because of the overhead associated. For example, surely brute-force will be better than robin-karp when looking for a single character. The same is true of the other big idea (knuth-morris-pratt): all the smart string searching strings are smart only for certain needles.
A nice thing about this algorithm is that you can compare multiple strings with the same performance: just keep track of multiple hashes. This makes it an algorithm of choice when you match a lot of small text against one large text (for example, this is used when doing plagiarism detection)
The basic idea of the algorithm can be relaxed to improve the brute force algorithm (and we'll see just how good a relaxed algorithm is at the end in the performance shootout): instead of calculating a long hash, why not use a "micro-hash" of comparing equality for the single first character? Like this:

def naive_search_variant(needle, haystack):
    d = needle[0]
    for i, c in enumerate(haystack):

        # if the mini-hash (first character) matches, assume there is a collision and test the whole string now
        if c == d:
            found = True
            for j in range(len(needle)):
                if haystack[i+j] != needle[j]:
                    found = False
                    break

            if found:
                return i

    return -1

Knuth-Morris-Pratt

The venerable Mr. Knuth has written large tomes of computer science that are arguably more often bought than read. Virtually everyone recommends them (and I haven't read them yet, but one day I will), so it is tempting to oppose them, just to go against the grain - but then you'd probably have to still read them first to be able to meaningfully talk about them. So no rant here. Instead, a smart algorithm: Knuth-Morris-Pratt, or short KMP. How does it work?

Take a long look at these two words

Apfelstrudel
Apfelsine
      ^^^

Note that the first 6 characters are the same in both. But once you are the position of the t in Apfelstrudel, you see that t doesn't occur in Apfelsine at all, which means, you don't need to compare the sequences

Apfelstrudel <-> Apfelsine
 pfelstrudel <-> Apfelsine
  felstrudel <-> Apfelsine
   elstrudel <-> Apfelsine
    lstrudel <-> Apfelsine
     strudel <-> Apfelsine
      trudel <-> Apfelsine
       rudel <-> Apfelsine
        .... <-> Apfelsine

but only

Apfelstrudel <-> Apfelsine
      trudel <-> Apfelsine
       rudel <-> Apfelsine
        .... <-> Apfelsine

Note that the brute force algorithm above did not take this into account: it compares every possible position with the needle.

OK, so we know what happens when we reach characters that are not in needle: we can skip the whole sequence and start searching from the next character. But what about if we reach a character that is - at another position - in the needle?

dcabadcabadce
dcabadce
^^^^^^^x

We cannot start after the mismatch, but actually in the middle of the string, here:

dcabadcabadce
     dcabadce

So basically what we need is this: a mapping table that, for each input character, tells us the skip distance. This is the basic idea of the venerable KMP algorithm, the rest is just mechanics:

def knuth_morris_pratt(needle, haystack):

    # allow indexing into needle and protect against change during yield
    needle = list(needle)

    n = len(needle)

    # build table of shift amounts
    shifts = [1] * (n + 1)
    shift = 1
    for pos in range(n):

        while shift <= pos and needle[pos] != needle[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in haystack:
        while matchLen == n or (matchLen >= 0 and needle[matchLen] != c):
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == n:
            return startPos
    return -1

Builtin Function

For completeness, we also have to throw in the builtin str.find() method in to the test. The following small test function suffices:

def builtin_search(needle, haystack):
    return haystack.find(needle)

But how do they actually perform?

Method	Runtime
Naive Search	1.67
Knuth-Morris-Pratt	1.40
Robin-Karp	1.06
Pythonesque Naive-Search	0.33
Variant of Naive Search	0.27
Builtin `find()`	0.00

It is no surprise that naive search performs worst: that was expected.
It is no surprise that the builtin search is the fastest - it always is, it is written in fine tuned C rather than interpreted Python
It is a surprise that robin-karp performs better than KMP: this may be due to the RK code being shorter and more suited to pythonisms.
It is a big surprise that the two variants on brute force perform much better than the "smart" algorithms. I remember (but cannot find in google right now) an article I read about that a couple of years ago: these days, for many scenarios these two algorithms don't really outperform brute-force.
Moral of the story: when you read in an algorithms book that a particular algorithm is really well suited to a particular problem, test it yourself, in your environment, in your programming language. You might be in for a surprise!