Finding the longest nondescending subsequence

You have a large, potentially huge array of objects, in a random order. You want to find the longest nondescending subsequence. The term subsequence means that the items of the sequence must retain the original order, but are not necessarily next to each other.

For example, take this array:

[14, 57, 32, 8, 17, 27, 20, 18, 1, 36]

A nondescending subsequence is [14, 57], another is [14, 20, 36]. It so happens that in this array, the longest nondescending subsequence has length 4 and could be [8, 17, 27, 36] - but this is not a unique solution: for example [8, 17, 20, 36] also has length 4, but is a different subsequence.

Ask yourself: Do I understand the problem?

Input is an array of N values. We don't know N yet, but obviously the algorithm, whatever it is, shouldn't depend on a particular N.
Without loss of generality, let's assume all values are integers. We shall check at the end what happens if our values are different things - like for files, complex numbers or words in a document.
Subsequence is ascending number of items of the original array. It can be any number of items long.
The output should be the length of the longest nondescending subsequence (not the subsequence itself)

Setting up a test harness

As usual, our first step is to prepare a testbed. Our code strategy for the test will be this:

We create an array of random values
We call a method to find the longest subsequence
We verify that it is indeed the longest subsequence

Sounds harmless, doesn't it? But

#! -*- Encoding: Latin-1 -*-

def selftest():
    A = [random.randrange(1860) for i in range(10)]
    result = our-method-here(A)
    verify_is_longest_subsequence(result)

if __name__ == "__main__":
    selftest()

The question is: how do we verify that? Well, it turns out that the only real means of verification is by testing all possible subsequences and verifying that there are, indeed, none longer than our result.

Now it turns out that this is more or less exactly what the brute force solution does in the first place, so how could we turn up an error in the brute-force solution? What we'll do is this:

Define a list of different solutions to the problem. (It'll be a short list with only two solutions, but the intent counts :)
Call them each, and ensure that each of them returns the same result

we'll throw in a timing measure and have something like this:

def selftest():
    A = [random.randrange(1860) for i in range(21)]
    
    methods = { 'name 1' : method 1,
                'name 2' : method 2,
                ...
                'name n' : method n }
    
    result_expected = None
    for method_name in methods.keys():
        t0 = time.time()
        result = methods[method_name](A)
        print "%20s: took %.2f sec to find %r" % (method_name, time.time() - t0, result, )
        if result_expected is None:
            result_expected = result
        else:
            assert result == result_expected

Brute-Force approach

The brute force algorithm tests all possible subsequences, checks their lengths and finds the longest one. Now, how many such subsequences are there? A whopping 2ⁿ. At this point you should already have alarm bells ringing: testing 2ⁿ subsequences can only mean that our algorithm will be at best of O(2ⁿ) and that is unusable for any real world n. But let's try to implement it anyway.

How do we find the 2ⁿ subsequences? Note that this problem is equivalent to the following:

Given a set S of n elements, how do you find the powerset P(S) of all subsets of S?

You need to realize that the integer numbers from 0 .. 2ⁿ already encode all possible subsets. Let's make an example with a small set of S = {a, b, c}. The size |S| is 3, so we need to look at the numbers 0 .. 2³ = 0 .. 7



    
            Integer Bitmask Set equivalent
    
    
               0     0 0 0 {}
    
    
               1     0 0 1 { c }
    
    
               2     0 1 0 { b }
    
    
               3     0 1 1 { b, c }
    
    
               4     1 0 0 { a }
    
    
               5     1 0 1 { a, c }
    
    
               6     1 1 0 { a, b }
    
    
               7     1 1 1 { a, b, c }

Integer	Bitmask	Set equivalent
0	0 0 0	{}
1	0 0 1	{ c }
2	0 1 0	{ b }
3	0 1 1	{ b, c }
4	1 0 0	{ a }
5	1 0 1	{ a, c }
6	1 1 0	{ a, b }
7	1 1 1	{ a, b, c }

The brute force solution will enumerate all numbers in the range 0..2ⁿ, which is to say: all possible bitmasks. Then it will select a sequence matching that subset and check if it is, in fact, nondescending. If it is, its length will be compared to all other lengths.

def lns_brute_force(A):
    result = None
    nmax = 2 ** len(A)
    i = 0
    while i < nmax:
        n = # TODO: check subsequence and find its length (A, i)
        if (result is None) or (n > result):
            result = n
        i += 1
    return result

One note: we do not use range(2 ** len(A)), because this will quickly exceed the available memory. For example, for testing an input array of 32 numbers, we'd be allocating 4 billion bitmasks - which for example exceeds available memory on all 32-bit environments. So therefor we don't allocate that range, but instead use a single number that is increased time and again.

Now, the bit that is missing here is the checking of the subsequence. The code is straight-forward, but a bit lengthy:

def length_if_is_nondecreasing(A, bitmask):

    # setup variables
    length_of_this_sequence = 1
    last_known = None

    # test each bit in the bitmask
    for bit in range(len(A)):

        # if this bit is set
        if bitmask & (1 << bit) != 0:

            # either we have the initial candidate
            if last_known is None:
                last_known = A[bit]

            # or we have a new length
            elif A[bit] >= last_known:
                last_known = A[bit]
                length_of_this_sequence += 1

            # or the sequence is non-ascending, in which case we abort
            else:
                break

    return length_of_this_sequence

This algorithm is terribly inefficient. As mentioned before, the main loop as 2ⁿ repeats, and for each of those we're testing up to n values, giving us time complexity of O(N 2^N). To give you an idea what this means:

Number of items	Runtime in seconds
20	1.75
21	3.60
22	7.15

This is so bad, it borders on the useless.

Using Dynamic Programming

Assume you somehow can construct an array S, where S[i] is the length of the longest subsequence ending at index i. Then, clearly, the longest such subsequence is the result we are looking for. In pseudocode:

def longest_nondecreasing_subsequence(A):
    S = ... MAGIC! ...       
    return max(S)

is the essence of our solution. So now we only need to construct the subsequence!

What can we say about S?

Well, surely len(S) must be len(A), because by definition we want to construct all the longest subsequence lengths, for each index in A. Also, it is clear that the minimum subsequence length for each of those indices is going to be 1 (the "sequence" begining and ending with that element itself). So:

def longest_nondecreasing_subsequence(A):
    S = [1] * len(A)
    ...
    return max(S)

OK, so say you have such a sequence S for all numbers 0..i-1. You want to construct S[i]. What can you say about S[i]?

Per definition S[i] is the length of the longest subsequence ending at index i, you are not looking at any subsequences ending in indices bigger than i. That seems trivial, but what it really means that if you are looking at constructing S[i], then you need to consider only S[0..(i-1)] and A[0..(i-1)]. This in turn means our main loop is going to look something like this:
```
def longest_nondecreasing_subsequence(A):
    S = [1] * len(A)
    for i in range(1,len(A)):
        for j in range(i):
            ...
    return max(S)
```
So now we are looking at A[j] for all j < i. If A[j] is > A[i], then clearly it cannot be part of a subsequence ending in A[i]. For example, if you look at the sequence 0 5 1, the 5 cannot be part of a nondescending subsequence ending in 1, as it is bigger than 1. This in turn means we need a clause like this:
```
def longest_nondecreasing_subsequence(A):
    S = [1] * len(A)
    for i in range(1,len(A)):
        for j in range(i):
            if A[j] <= A[i]:
                ...
    return max(S)
```

OK, we know that because of this if-clause, A[j] can be part of a subsequence ending in i. And we also know the length of the longest subsequence ending in A[j]. Therefor, the minimum subsequence ending in our i is - the length of the subsequence ending in j+1. Voila!

def longest_nondecreasing_subsequence(A):
    S = [1] * len(A)
    for i in range(1,len(A)):
        for j in range(i):
            if A[j] <= A[i]:
                if S[j]+1 > S[i]:
                    S[i] = S[j]+1
    return max(S)

Now, let's take a peek inside this algorithm, for the number sequence

  i    0   1   2   3   4   5   6   7   8   9
A[i]   0   8   4  12   2  10   6  14   1   9
S[i]   1   2   2   3   2   1  ...

Assume you are looking at i=5: A[i] is 10, and S[i] starts with 1 because that is what the array was initialized with.

Now, let's look at each j<i in turn:

j=0: A[j] = 0, S[j] = 1
Clearly, the longest subsequence ending in the first element is - 1. No surprise there. However, assume that in the worst case all numbers in the range 1..4 are descending, then there still would be a subsequence of length 2 ending in A[5]: the sequence A[0], A[5]. Therefor, S[5] is updated to 2.
j=1: A[j] = 8, S[j] = 2
Again, since 0 < 8 < 10, the "minimum longest subsequence" ending at A[5]=10 is 3. So S[5] is updated again.
j=2: A[j] = 4, S[j] = 2
Again, since 0 < 4 < 10, the "minimum longest subsequence" ending at A[5]=10 is 3. S[5] is not really updated, because it is already 3.
j=3: A[j] = 12, S[j] = 3
Again, since 12 > 10 there can be no subsequence including both A[3] and A[5], so S[3] is ignored.
j=4: A[j] = 2, S[j] = 2
No big surprise here either.

Abstraction time: from int to arbitrary objects

In the beginning of this section we restricted ourselves to arrays of integers. This being python, the code will work pretty much the same for other numeric types - like decimals, or float numbers. It will not work for complex numbers - but this is not a fault of our algorithm, this is so because complex numbers have no natural ordering and subsequently there is no such thing as a "nondescending list of complex numbers".

But what else could we relax? I honestly think we ought to sit down calmly, take a stress pill, and think things over.

Well, we could provide additional output: for example, it would obviously be interesting to find the numbers contained in such a long sequence.

Adding the actual subsequence data to the output

This is actually quite simple.

def longest_nondecreasing_subsequence(A):
    S = [1] * len(A)
    # initially, all results are lists of single elements: the item itself
    R = [[A[i]] for i in range(len(A))]
    
    for i in range(1,len(A)):
        for j in range(i):
            if A[j] <= A[i]:
                if S[j]+1 > S[i]:
                    S[i] = S[j]+1
                    
                    # Remember, the condition above meant that we're using the subsequence at j instead.
                    # so this means we create a new subsequence at i combining j and the current item
                    R[i] = R[j][:] + [A[i]]
    
    for index, item in enumerate(S):
        if index == 0:
            max_item = item
            max_seq  = R[index]
        elif item > max_item:
            max_item = item
            max_seq  = R[index]
    return max_item, max_seq

And once you see it like this, you might as well reduce it even more: you can see that you don't need the array S if you have the array R:

def longest_nondecreasing_subsequence(A):
    R = [[A[i]] for i in range(len(A))]
    
    for i in range(1,len(A)):
        for j in range(i):
            if A[j] <= A[i]:
                if len(R[j])+1 > len(R[i]):
                    R[i] = R[j][:] + [A[i]]
    
    for index, item in enumerate(R):
        if index == 0:
            max_item = len(item)
            max_seq  = R[index]
        elif len(item) > max_item:
            max_item = len(item)
            max_seq  = R[index]
    return len(max_seq), max_seq