Finding the longest common subsequence
You have two large, potentially huge array of objects, in a random order. You want to find the longest subsequence common to both arrays. The term subsequence means that the items of the sequence must retain the original order, but are not necessarily next to each other.
For example, take these arrays:
A = [14, 57, 32, 8, 17, 27, 20, 18, 1, 36]
B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 17]
There are many valid subsequences; for example, [14, 8, 18, 1, 36]
is a valid subsequence of A,
and [99, 14]
is one of B. But what is the longest common subsequence? It turns out to be
[14, 8, 27]
.
Ask yourself: Do I understand the problem?
- Input are two arrays of N values.
- Without loss of generality, let's assume all values are integers.
- Subsequence is ascending number of items of the original array. It can be any number of items long.
- The output should be the length of the longest common subsequence (not the subsequence itself - that is a separate problem discussed further down below as a side)
Setting up a test harness
As usual, our first step is to prepare a testbed. Our code strategy for the test will be this:
- We create two arrays of random values
- We use each of n different methods to calcualte the longest common subsequence
- We verify that each of those returns the same result
The code for this function is:
#! -*- Encoding: Latin-1 -*-
def selftest():
A = [random.randrange(100) for i in range(20)]
B = [random.randrange(100) for i in range(20)]
methods = { 'lookup dictionary' : lcs_lookup_dict,
'naive recursive backwards' : lcs_recursive_backwards,
'native recursive forwards' : lcs_recursive_forwards,
'recursive forwards indices' : lcs_recursive_forwards_indices,
'iterative' : lcs_iterative,
'iterative with sequence' : lcs_iterative_with_sequence,
'iterative space optimal' : lcs_iterative_space_optim }
expected_result = None
for method_name in methods.keys():
t0 = time.time()
result = methods[method_name](A, B, 0, 0)
print "%30s: %s in %.4f sec" % (method_name, result, time.time() - t0, )
if expected_result is None:
expected_result = result
else:
assert expected_result == result
if __name__ == "__main__":
selftest()
Brute-Force approach
There is no brute-force approach. Really, there is one, but it is not feasible: there are n! possible subsequences for A and m! for B. If you want to compare all n! to all m! subsequences, you are in for a really long haul. We need something better.
Recursive approach using Optimal Substructure
What does the term Optimal Substructure mean? Informally, it means this: We can construct an optimal solution for n elements out of the optimal solutions for (n-1) elements. This is straight forward recursive: solution(x) = f(solution(x-1)).
Let's see how this relates to our problem.
- Take a look at the following two sequences:
Both A and B end in 36. So any possible substructure in
A = [14, 57, 32, 8, 17, 27, 20, 18, 1, 36] B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 36]
A[1..n-1]
andB[1..n-1]
can be extended by addingA[n]=36
. In other words: if both A and B end in the same item: the result is 1 + the longest common subsequence ofA[:-1]
andB[:-1]
. Now, what happen's if they DO NOT match? -
If they do not match, then either
A = [14, 57, 32, 8, 17, 27, 20, 18, 1, 36] B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 22]
A[-1]
orB[-1]
must go. So the result will be the maximum of the longest common subsequence ofA[-1]
, and the longest common subsequence ofB[-1]
.
Once you understand this definition, it is trivial to write a recursive implementation:
def lcs_recursive_backwards(A, B):
# if one of either A or B is empty, this means that there is no possible common subsequence left
if not A or not B:
return 0
# case 1 of the recursion: A and B end in the same value
if A[-1] == B[-1]:
return 1 + lcs_recursive_backwards(A[:-1], B[:-1])
# case 2 of the recursion: they don't match, so we need to test deleting either one
return max(
lcs_recursive_backwards(A, B[:-1]),
lcs_recursive_backwards(A[:-1], B))
It is also straightforward to see that the direction of this recursion doesn't matter.
- Take a look at the following two sequences:
Both A and B begin with 99. So any possible substructure in
A = [99, 57, 32, 8, 17, 27, 20, 18, 1, 36] B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 84]
A[1..n]
andB[1..n]
can be extended by addingA[0]=99
at the beginning. In other words: if both A and B begin in the same item: the result is 1 + the longest common subsequence ofA[1:]
andB[1:]
. And if they do not match: -
If they do not match, then either
A = [14, 57, 32, 8, 17, 27, 20, 18, 1, 36] B = [99, 24, 14, 5, 8, 22, 30, 60, 27, 22]
A[0]
orB[0]
must go. So the result will be the maximum of the longest common subsequence ofA[1:]
, and the longest common subsequence ofB[1:]
.
Which gives us this code:
def lcs_recursive_forwards(A, B):
if not A or not B:
return 0
if A[0] == B[0]:
return 1 + lcs_recursive_forwards(A[1:], B[1:])
return max(
lcs_recursive_forwards(A, B[1:]),
lcs_recursive_forwards(A[1:], B))
This code still leaves a lot to be desired: all that harmless list slice statements like A[1:]
and B[1:]
all
allocate new temporary arrays. We can do better by utilizing indices instead: rather than actually cutting the array, we
just pass the original array and the new end position, like this:
def lcs_recursive_backwards_indices(A, B, i = 0, j = 0):
if i >= len(A) or j >= len(B):
return 0
elif A[i] == B[j]:
return 1 + lcs_recursive_backwards_indices(A, B, i+1, j+1)
return max(
lcs_recursive_backwards_indices(A, B, i, j+1),
lcs_recursive_backwards_indices(A, B, i+1, j))
Recursive Approach with Backtracking
If you run the code, you'll see pretty soon that it is awfully slow. The main reason is that it calculates the same subsequence over and over and over again. For example, I did a test run with 10 (ten) items: one of the (i, j) combinations was called in total 48.620 times. And if you add a single item, so you now test with 11 (eleven) items, that number goes up to a mind-boggling 152.416 times. Clearly, the lcs_recursive definition isn't very useful in its naive implementation. What we need is memoization: keeping track of combinations already seen, and storing them in memory. In Python this is trivial (since dictionaries can have tuples as indices):
# lookup dictionary
k = {}
def lcs_lookup_dict(A, B, i, j):
# try dictionary first
try:
return k[(i, j)]
except KeyError:
pass
# quite similar to lcs_recursive_backwards_indices, with one change:
# - the result is stored in a temp variable, so that it can be kept in memory
if i >= len(A) or j >= len(B):
result = 0
elif A[i] == B[j]:
result = 1 + lcs_lookup_dict(A, B, i+1, j+1)
else:
result = max(
lcs_lookup_dict(A, B, i, j+1),
lcs_lookup_dict(A, B, i+1, j))
k[(i, j)] = result
return result
This gives a much much better performance. The following is an example comparing two sequences with a whopping 12 (twelve) items each:
lcs_recursive_backwards_indices: 1 in 1.4740 sec naive recursive backwards: 1 in 1.4490 sec native recursive forwards: 1 in 1.3290 sec lookup dictionary: 1 in 0.0010 sec
So, some surprises here.
- Dictionary lookup speeds up the algorithm so much, it isn't even funny. Basically, without memoization, you can forget searching even a measly 20 items. But with memoization, the code doesn't even blink.
- The "optimization" we introduced above to get rid of list slice syntax actually slowed down the code. Once again, proof that if in Python, do as Pythonistas do.
So, the algorithm with memoization is OK fast. Here are some results:
- Comparing two sequences with 100 items each: 0.0220 sec
- Comparing two sequences with 500 items each: 0.6830 sec
- Comparing two sequences with 1000 items each: whoa!
RuntimeError: maximum recursion depth exceeded
So we need something better still.
Iterative Solution with Memoization
It turns out that the recursive solution checks all combinations anyway. So we might as well do those explicitly, by enumerating all possible positions, storing temporary results in a cache. We only need to do this in a way that ensures we never read any cache value before we have written it. It turns out that this is pretty simple:
def lcs_iterative(A, B, i, j):
def empty_list():
return [0] * (len(B)+1)
K = [empty_list() for temp in range(len(A)+1)]
for i in range(len(A)-1,-1,-1):
for j in range(len(B)-1,-1,-1):
if A[i] == B[j]:
K[i][j] = 1+K[i+1][j+1]
else:
K[i][j] = max(K[i][j+1], K[i+1][j])
return K[0][0]
Compare the code logic to the one from the last recursive solution:
if i >= len(A) or j >= len(B):
result = 0
elif A[i] == B[j]:
result = 1 + lcs_lookup_dict(A, B, i+1, j+1)
else:
result = max(
lcs_lookup_dict(A, B, i, j+1),
lcs_lookup_dict(A, B, i+1, j))
You can see that they are basically identical:
K[i][j]
is a memoization cache (here: a list of lists, giving us a multidimensional array)- We don't need the recursive call because the code accesses
i, j
from the back, thus ensuring that all values are initialized before they are being read. - The main logic is: either the positions are equal, then use 1+next pos, or they are not, then use the max.
- So this is really just a recursive -> iterative translation
It is pretty easy to extend the solution to actually give us the sequence members (not just the sequence length), if we do this:
- As per
lcs_iterative
, build up the lookup array - Once done, instead of returning the maximum item, restore the maximum item from the information stored in the lookup array
The following code does this:
def lcs_iterative_with_sequence(A, B, i, j):
def empty_list():
return [0] * (len(B)+1)
# Step 1: build lookup array to determine LCS
K = [empty_list() for temp in range(len(A)+1)]
for i in range(len(A)-1,-1,-1):
for j in range(len(B)-1,-1,-1):
if A[i] == B[j]:
K[i][j] = 1+K[i+1][j+1]
else:
K[i][j] = max(K[i][j+1], K[i+1][j])
# Step 2: restore sequence from information in lookup array
S = []
i = 0
j = 0
while (i < len(A)) and (j < len(B)):
if A[i] == B[j]:
S.append(A[i])
i += 1
j += 1
elif K[i+1][j] >= K[i][j+1]:
i += 1
else:
j += 1
return S
One final optimization: in the code above, we have a two-dimensional array. But it turns out that if we don't need to reconstruct the array, then we don't need all that variables either. Two arrays with lookups for A and B suffice. The following code implements this idea:
def lcs_iterative_space_optim(A, B, i, j):
X = [0] * (len(A)+1)
Y = [0] * (len(B)+1)
m = min(len(Y), len(X))
for i in range(len(A)-1,-1,-1):
for j in range(len(B)-1,-1,-1):
if A[i] == B[j]:
X[j] = 1+Y[j+1]
else:
X[j] = max(Y[j], X[j+1])
for n in range(m):
Y[n] = X[n]
return X[0]
OK, given all that: how do these variants perform? On an input of 1.000 items (that we failed to analyse using the recursive solution), we get this:
iterative with sequence: 176 in 0.4360 sec iterative: 176 in 0.4360 sec iterative space optimal: 176 in 0.4220 sec
Much better.