String Searching
The problem is almost trivial: given a huge input string (the "haystack"), find a small substring ("needle"). And the default search algorithm is so simple that it takes creativity to imagine something better. Now, I don't know about you but after years of seeing and using that trivial solution, it was harder and harder to think of alternatives. So one day I decided I really need to understand this better, and dug into the classic algorithms.
Ask yourself: Do I understand the problem?
- First input is an array of characters, called
haystack
: the string we are searching in. In languages like C we typically do not know the size of the array up front, in other languages like Python we do. - Second input is an array of characters, called
needle
: the string we are searching for. We don't even know if the string is contained inhaystack
, neither where it could be found. - Output is the index where
needle
can be found inhaystack.
Actually, there are more formal definitions, but we'll skip them for now: it would only serve to distract from the point I am trying to make.
Setting up a test harness
As usual, our first step is to prepare a testbed. Our code strategy for the test will be this:
- We will implement a couple of functions for the same task. As usual, we are going to store them in a dictionary along with their name, so that we can call them, time their performance and print out results
- We will verify that all algorithms return the same output for the same input.
- Our haystack is going to be the online text of H.G.Wells 'War of the Worlds', available at Project Gutenberg Oh, and we will need to preprocess this text, to strip it of crap punctuation and make it all lowercase - just to make the demo cases more easy to understand.
- Actually, to make things a little bit more interesting: we're going to use 10 times the text, thus blowing up its size. And we'll deliberately
put a unique
needle
right at the end of the text, so we can see how different algorithms treat skipping over huge amounts of text.
Therefor, our test code looks like this:
#! -*- Encoding: Latin-1 -*-
def read_text_file(filename, needle):
data = list(open(filename).read())
for index, char in enumerate(data):
o = ord(char) - ord('a')
if o < 0 or o >= 26:
data[index] = ' '
text = "".join(data).lower()
return 10 * text + needle
def selftest():
import time
needle = "%"
haystack = read_text_file(<Filename for H.G.Wells' War of the Worlds<, needle)
functions = {
'builtin search' : builtin_search,
'naive_search' : naive_search,
'robin_karp' : robin_karp,
'knuth_morris_pratt' : knuth_morris_pratt,
'naive_search_variant' : naive_search_variant,
'naive_search_pythonesque' : naive_search_pythonesque,
}
expected_result = None
for function_name in functions:
function = functions[function_name]
t0 = time.time()
result = function(needle, haystack)
t0 = time.time() - t0
print "%32s: %12.2f for result %6d" % (function_name, t0, result, )
if expected_result is None:
expected_result = result
else:
assert result == expected_result
if __name__ == "__main__":
selftest()
Brute Force
The brute force algorithm works like this:
- Enumerate all possible characters in
haystack
- Check if the substring at this position matches
haystack
, by checking all characters in the needle against the respective char in the haystack. - If there is a match, we're done
def naive_search(needle, haystack):
# Enumerate all possible characters in haystack
for i in xrange(len(haystack)):
# Check if the substring at this position matches haystack
found = True
for j in xrange(len(needle)):
if haystack[i+j] != needle[j]:
found = False
break
# If there is a match, we're done
if found:
return i
# no match: return -1 indicates failure
return -1
The logic to compare the substring is very explicit, but also very un-pythonesque. A more pythonic code could read:
def naive_search_pythonesque(needle, haystack):
n = len(needle)
# Enumerate all possible characters in haystack
for i in xrange(len(haystack)):
# Check if the substring at this position matches haystack
if haystack[i:i+n] == needle:
# If there is a match, we're done
return i
# no match: return -1 indicates failure
return -1
Robin-Karp
Now, the Robin Karp string search algorithm gets a bad press, but the idea behind it is really smart and generalizes to a lot of related problems. It was also the first "advanced" string search algorithm that I "got", so I have a kind of sentimental attachment to it.
The basic idea is this:
- Treat the
haystack
not as an array of characters, but as an array of numbers. That is trivial in C, and in Python you can use 'ord' to map characters to ASCII (things are a bit more tricky once you head over to Unicode - as you should! - but for now we'll keep it simple). For example, the following python line:h = map(ord, haystack)
converts a string into an array of ASCII ordinals. Example:>>> test = "this is a test" >>> map(ord, test) [116, 104, 105, 115, 32, 105, 115, 32, 97, 32, 116, 101, 115, 116]
- Do the same for the
needle
. Now you have two arrays of numbers. This reduces the problem of finding a substring to the problem of finding an array of numbers in another array of numebrs. Doesn't sound like too much of a reduction yet, but bear with me. - Now, observe the any substring in
haystack
matchingneedle
must, by definition, be of lengthlen(needle)
. - Observe also that since we have numbers, we can calculate a hash value for the
needle
. - Then, if we calculate a hash value for a subarray of length n in
haystack
, all we need to compare is the two hash values to find out if the strings match. Great: comparing two integers gives us a comparison of arbitrarily long strings, at a lot faster speed. There is only one minor problem: calculating the hash faster than string comparison... - Now, this is where the really smart thing comes in: choice of hash function:
- We need a hash function that will
slide a window of
n = len(needle)
integers across all integers inhaystack
. - In particular, this means: one character is added in the front, one character at the end is removed to update the hash.
- So, for the
needle
, we'll simply use the sum of all input values. We will use the very simple Python function. This wouldn't work in C, but we are here at pyalda.com, not at calda.com.key = sum(map(ord, needle))
- For the
haystack
, we do the same thing, but we keep two indices in haystack:, one for the character in the front added to the sum, and one for the character and the end, being removed from the hash. This means updating the hash is now O(2) rather than O(N).
- We need a hash function that will
slide a window of
- The beauty is this: rather than having to match potentially all
m
characters over and over again, we just compare two integers (fast!) and know if we have. But do we know? Actually, there could be a collision between two hashes. Because this can happen, we'll compare the actual substring once the hashes match, to skip spurious matches.
The following code translates this idea:
def robin_karp(needle, haystack):
# needle is treated as an arry of integers (not chars)
# hash of the needle = sum of all integers in the needle
key = sum(map(ord, needle))
# minimum length of a subarray matching needle hash
n = len(needle)
# current hash
act = 0
# start position for current array
j = -n
# haystack is treated as an arry of integers (not chars)
h = map(ord, haystack)
# enumerate all positions in haystack
for i in range(len(haystack)):
# update hash with new character
act += h[i]
# correct hash if we're in the positive region
if j >= 0:
act -= h[j]
# update end-of-hash position
j += 1
# compare hash of substring, needle
if act == key:
# if the hashes match, compare the substrings anyway for want of hash collision
if haystack[j:i+1] == needle:
return j
return -1
The code looks longer than it really is, what with all the comments and all that. Anyway, for me RK definitely proved enlightening: a new tack on an old problem. Comments:
- It should be clear that this algorithm gives bad performance if the
needle
is really small - because of the overhead associated. For example, surely brute-force will be better than robin-karp when looking for a single character. The same is true of the other big idea (knuth-morris-pratt): all the smart string searching strings are smart only for certain needles. - A nice thing about this algorithm is that you can compare multiple strings with the same performance: just keep track of multiple hashes. This makes it an algorithm of choice when you match a lot of small text against one large text (for example, this is used when doing plagiarism detection)
- The basic idea of the algorithm can be relaxed to improve the brute force algorithm (and we'll see just how good a relaxed algorithm is at the end in the performance shootout): instead of calculating a long hash, why not use a "micro-hash" of comparing equality for the single first character? Like this:
def naive_search_variant(needle, haystack):
d = needle[0]
for i, c in enumerate(haystack):
# if the mini-hash (first character) matches, assume there is a collision and test the whole string now
if c == d:
found = True
for j in range(len(needle)):
if haystack[i+j] != needle[j]:
found = False
break
if found:
return i
return -1
Knuth-Morris-Pratt
The venerable Mr. Knuth has written large tomes of computer science that are arguably more often bought than read. Virtually everyone recommends them (and I haven't read them yet, but one day I will), so it is tempting to oppose them, just to go against the grain - but then you'd probably have to still read them first to be able to meaningfully talk about them. So no rant here. Instead, a smart algorithm: Knuth-Morris-Pratt, or short KMP. How does it work?
Take a long look at these two words
Apfelstrudel
Apfelsine
^^^
Note that the first 6 characters are the same in both. But once you are the position of the t
in Apfelstrudel
, you see that t
doesn't
occur in Apfelsine
at all, which means, you don't need to compare the sequences
Apfelstrudel <-> Apfelsine
pfelstrudel <-> Apfelsine
felstrudel <-> Apfelsine
elstrudel <-> Apfelsine
lstrudel <-> Apfelsine
strudel <-> Apfelsine
trudel <-> Apfelsine
rudel <-> Apfelsine
.... <-> Apfelsine
but only
Apfelstrudel <-> Apfelsine
trudel <-> Apfelsine
rudel <-> Apfelsine
.... <-> Apfelsine
Note that the brute force algorithm above did not take this into account: it compares every possible position with the needle
.
OK, so we know what happens when we reach characters that are not in needle
: we can skip the whole sequence and start searching from the next character.
But what about if we reach a character that is - at another position - in the needle
?
dcabadcabadce
dcabadce
^^^^^^^x
We cannot start after the mismatch, but actually in the middle of the string, here:
dcabadcabadce
dcabadce
So basically what we need is this: a mapping table that, for each input character, tells us the skip distance. This is the basic idea of the venerable KMP algorithm, the rest is just mechanics:
def knuth_morris_pratt(needle, haystack):
# allow indexing into needle and protect against change during yield
needle = list(needle)
n = len(needle)
# build table of shift amounts
shifts = [1] * (n + 1)
shift = 1
for pos in range(n):
while shift <= pos and needle[pos] != needle[pos-shift]:
shift += shifts[pos-shift]
shifts[pos+1] = shift
# do the actual search
startPos = 0
matchLen = 0
for c in haystack:
while matchLen == n or (matchLen >= 0 and needle[matchLen] != c):
startPos += shifts[matchLen]
matchLen -= shifts[matchLen]
matchLen += 1
if matchLen == n:
return startPos
return -1
Builtin Function
For completeness, we also have to throw in the builtin str.find() method in to the test. The following small test function suffices:
def builtin_search(needle, haystack):
return haystack.find(needle)
But how do they actually perform?
Method | Runtime |
---|---|
Naive Search | 1.67 |
Knuth-Morris-Pratt | 1.40 |
Robin-Karp | 1.06 |
Pythonesque Naive-Search | 0.33 |
Variant of Naive Search | 0.27 |
Builtin find() |
0.00 |
- It is no surprise that naive search performs worst: that was expected.
- It is no surprise that the builtin search is the fastest - it always is, it is written in fine tuned C rather than interpreted Python
- It is a surprise that robin-karp performs better than KMP: this may be due to the RK code being shorter and more suited to pythonisms.
- It is a big surprise that the two variants on brute force perform much better than the "smart" algorithms. I remember (but cannot find in google right now) an article I read about that a couple of years ago: these days, for many scenarios these two algorithms don't really outperform brute-force.
- Moral of the story: when you read in an algorithms book that a particular algorithm is really well suited to a particular problem, test it yourself, in your environment, in your programming language. You might be in for a surprise!