String Searching

The problem is almost trivial: given a huge input string (the "haystack"), find a small substring ("needle"). And the default search algorithm is so simple that it takes creativity to imagine something better. Now, I don't know about you but after years of seeing and using that trivial solution, it was harder and harder to think of alternatives. So one day I decided I really need to understand this better, and dug into the classic algorithms.

Ask yourself: Do I understand the problem?

Actually, there are more formal definitions, but we'll skip them for now: it would only serve to distract from the point I am trying to make.

Setting up a test harness

As usual, our first step is to prepare a testbed. Our code strategy for the test will be this:

Therefor, our test code looks like this:

#! -*- Encoding: Latin-1 -*-

def read_text_file(filename, needle):
    data = list(open(filename).read())
    
    for index, char in enumerate(data):
        o = ord(char) - ord('a')
        if o < 0 or o >= 26:
            data[index] = ' '
    
    text = "".join(data).lower()
    return 10 * text + needle

def selftest():
    import time
    needle = "%"
    haystack = read_text_file(<Filename for H.G.Wells' War of the Worlds<, needle)

    functions = {
        'builtin search' : builtin_search,
        'naive_search' : naive_search,
        'robin_karp' : robin_karp,
        'knuth_morris_pratt' : knuth_morris_pratt, 
        'naive_search_variant' : naive_search_variant,
        'naive_search_pythonesque' : naive_search_pythonesque,
    }

    expected_result = None
    for function_name in functions:
        function = functions[function_name]
        t0 = time.time()
        result = function(needle, haystack)
        t0 = time.time() - t0
        print "%32s: %12.2f for result %6d" % (function_name, t0, result, ) 

        if expected_result is None:
            expected_result = result
        else:
            assert result == expected_result

if __name__ == "__main__":
    selftest()

Brute Force

The brute force algorithm works like this:

def naive_search(needle, haystack):

    # Enumerate all possible characters in haystack
    for i in xrange(len(haystack)):

        # Check if the substring at this position matches haystack
        found = True
        for j in xrange(len(needle)):
            if haystack[i+j] != needle[j]:
                found = False
                break

        # If there is a match, we're done
        if found:
            return i

    # no match: return -1 indicates failure
    return -1

The logic to compare the substring is very explicit, but also very un-pythonesque. A more pythonic code could read:

def naive_search_pythonesque(needle, haystack):
    n = len(needle)

    # Enumerate all possible characters in haystack
    for i in xrange(len(haystack)):

        # Check if the substring at this position matches haystack
        if haystack[i:i+n] == needle:

            # If there is a match, we're done
            return i

    # no match: return -1 indicates failure
    return -1

Robin-Karp

Now, the Robin Karp string search algorithm gets a bad press, but the idea behind it is really smart and generalizes to a lot of related problems. It was also the first "advanced" string search algorithm that I "got", so I have a kind of sentimental attachment to it.

The basic idea is this:

  1. Treat the haystack not as an array of characters, but as an array of numbers. That is trivial in C, and in Python you can use 'ord' to map characters to ASCII (things are a bit more tricky once you head over to Unicode - as you should! - but for now we'll keep it simple). For example, the following python line: h = map(ord, haystack) converts a string into an array of ASCII ordinals. Example:
    >>> test = "this is a test"
    >>> map(ord, test)
    [116, 104, 105, 115, 32, 105, 115, 32, 97, 32, 116, 101, 115, 116]
    
  2. Do the same for the needle. Now you have two arrays of numbers. This reduces the problem of finding a substring to the problem of finding an array of numbers in another array of numebrs. Doesn't sound like too much of a reduction yet, but bear with me.
  3. Now, observe the any substring in haystack matching needle must, by definition, be of length len(needle).
  4. Observe also that since we have numbers, we can calculate a hash value for the needle.
  5. Then, if we calculate a hash value for a subarray of length n in haystack, all we need to compare is the two hash values to find out if the strings match. Great: comparing two integers gives us a comparison of arbitrarily long strings, at a lot faster speed. There is only one minor problem: calculating the hash faster than string comparison...
  6. Now, this is where the really smart thing comes in: choice of hash function:
    • We need a hash function that will slide a window of n = len(needle) integers across all integers in haystack.
    • In particular, this means: one character is added in the front, one character at the end is removed to update the hash.
    • So, for the needle, we'll simply use the sum of all input values. We will use the very simple Python function
      key = sum(map(ord, needle))
      
      . This wouldn't work in C, but we are here at pyalda.com, not at calda.com.
    • For the haystack, we do the same thing, but we keep two indices in haystack:, one for the character in the front added to the sum, and one for the character and the end, being removed from the hash. This means updating the hash is now O(2) rather than O(N).
  7. The beauty is this: rather than having to match potentially all m characters over and over again, we just compare two integers (fast!) and know if we have. But do we know? Actually, there could be a collision between two hashes. Because this can happen, we'll compare the actual substring once the hashes match, to skip spurious matches.

The following code translates this idea:

def robin_karp(needle, haystack):    
    # needle is treated as an arry of integers (not chars)
    # hash of the needle = sum of all integers in the needle
    key = sum(map(ord, needle))

    # minimum length of a subarray matching needle hash
    n = len(needle)

    # current hash
    act = 0

    # start position for current array
    j   = -n

    # haystack is treated as an arry of integers (not chars)
    h = map(ord, haystack)

    # enumerate all positions in haystack
    for i in range(len(haystack)):

        # update hash with new character
        act += h[i]

        # correct hash if we're in the positive region
        if j >= 0:
            act -= h[j]

        # update end-of-hash position
        j += 1

        # compare hash of substring, needle
        if act == key:

            # if the hashes match, compare the substrings anyway for want of hash collision
            if haystack[j:i+1] == needle:
                return j

    return -1

The code looks longer than it really is, what with all the comments and all that. Anyway, for me RK definitely proved enlightening: a new tack on an old problem. Comments:

def naive_search_variant(needle, haystack):
    d = needle[0]
    for i, c in enumerate(haystack):

        # if the mini-hash (first character) matches, assume there is a collision and test the whole string now
        if c == d:
            found = True
            for j in range(len(needle)):
                if haystack[i+j] != needle[j]:
                    found = False
                    break

            if found:
                return i

    return -1

Knuth-Morris-Pratt

The venerable Mr. Knuth has written large tomes of computer science that are arguably more often bought than read. Virtually everyone recommends them (and I haven't read them yet, but one day I will), so it is tempting to oppose them, just to go against the grain - but then you'd probably have to still read them first to be able to meaningfully talk about them. So no rant here. Instead, a smart algorithm: Knuth-Morris-Pratt, or short KMP. How does it work?

Take a long look at these two words

Apfelstrudel
Apfelsine
      ^^^

Note that the first 6 characters are the same in both. But once you are the position of the t in Apfelstrudel, you see that t doesn't occur in Apfelsine at all, which means, you don't need to compare the sequences

Apfelstrudel <-> Apfelsine
 pfelstrudel <-> Apfelsine
  felstrudel <-> Apfelsine
   elstrudel <-> Apfelsine
    lstrudel <-> Apfelsine
     strudel <-> Apfelsine
      trudel <-> Apfelsine
       rudel <-> Apfelsine
        .... <-> Apfelsine

but only

Apfelstrudel <-> Apfelsine
      trudel <-> Apfelsine
       rudel <-> Apfelsine
        .... <-> Apfelsine

Note that the brute force algorithm above did not take this into account: it compares every possible position with the needle.

OK, so we know what happens when we reach characters that are not in needle: we can skip the whole sequence and start searching from the next character. But what about if we reach a character that is - at another position - in the needle?

dcabadcabadce
dcabadce
^^^^^^^x

We cannot start after the mismatch, but actually in the middle of the string, here:

dcabadcabadce
     dcabadce

So basically what we need is this: a mapping table that, for each input character, tells us the skip distance. This is the basic idea of the venerable KMP algorithm, the rest is just mechanics:

def knuth_morris_pratt(needle, haystack):

    # allow indexing into needle and protect against change during yield
    needle = list(needle)

    n = len(needle)

    # build table of shift amounts
    shifts = [1] * (n + 1)
    shift = 1
    for pos in range(n):

        while shift <= pos and needle[pos] != needle[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in haystack:
        while matchLen == n or (matchLen >= 0 and needle[matchLen] != c):
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == n:
            return startPos
    return -1

Builtin Function

For completeness, we also have to throw in the builtin str.find() method in to the test. The following small test function suffices:

def builtin_search(needle, haystack):
    return haystack.find(needle)

But how do they actually perform?

Method Runtime
Naive Search 1.67
Knuth-Morris-Pratt 1.40
Robin-Karp 1.06
Pythonesque Naive-Search 0.33
Variant of Naive Search 0.27
Builtin find() 0.00