Finding the equilibrium point of an array
You have an array of n random, unsorted numbers. Choose an index that splits the array in two. The goal is to choose the index in such a way that the sums of the elements in the two parts are as close as possible. Put differently: minimize the difference between the element sums of the two subarrays.
For example, take this array:
[2, 2, 3, 1, 4, 1, 4, 3, 1, 2]
If you choose to split the array in two at i=3, you would have
A = [2, 2, 3] -> sum(A) is 7
B = [1, 4, 1, 4, 3, 1, 2] -> sum(B) is 16
If instead you choose to split the array in two at i=5, you would have
A = [2, 2, 3, 1, 4] -> sum(A) is 12
B = [1, 4, 3, 1, 2] -> sum(B) is 11
This is as good as it gets for this array: the difference between the two sums is 1. You can convince yourself that if you chose i=6 instead, sum(A) would be 13 and sum(B) would be 10, so A and B would be 3 apart.
Ask yourself: Do I understand the problem?
- Input is an array of N values. We don't know N yet, but obviously the algorithm, whatever it is, shouldn't depend on a particular N.
- Without loss of generality, let's assume all values are integers. We shall check at the end what happens if our values are different things - like files, complex numbers, or words in a document.
- Output is the index that splits the array into two halves.
- The order of the array is to be preserved. This rules out approaches like sorting or greedy algorithms, which typically rearrange the input data.
Setting up a test harness
As usual, our first step is to prepare a testbed. Now, the success criterion is fairly complex: it asserts that the difference between the sums of the two array halves is minimal - even writing that down makes you cringe, ever so slightly. So let's do it like this: construct an array where we know the optimal split in advance, because the two halves have exactly the same sum. Therefore, our strategy for the test will be this:
- Create an array of N/2 random values.
- Duplicate the array. By definition, the expected split lies at N/2.
- Assert that the result splits the array such that the sums of both halves are equal.
This is the code:
# -*- coding: latin-1 -*-
def selftest():
    import random
    A = [random.randrange(6000) for i in range(10)]
    A = A[:] + A[:]
    n = our_test_function_here(A)
    assert sum(A[:n]) == sum(A[n:])

if __name__ == "__main__":
    selftest()
Brute-Force approach
The brute-force solution tries every possible split: for each one, it calculates the two sums and compares them. If the result is better than the best result seen so far, it becomes the new best. The straightforward implementation looks like this:
def find_equilibrium_brute_force(data):
    min_distance = None
    for i in range(len(data)):
        # split the array in two parts
        A, B = data[:i], data[i:]
        # check the distance of this split
        distance = abs(sum(A)-sum(B))
        # use it, if it is better than the last best split
        if (min_distance is None) or (distance < min_distance):
            min_distance = distance
            min_distance_index = i
    return min_distance_index
The runtime of this algorithm is O(N²): for each of the N possible split points, it sums all N elements. Surely we can do better than that!
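As a sanity check before optimizing, here is the brute-force function applied to the array from the introduction (the function body is repeated so the snippet runs standalone):

```python
def find_equilibrium_brute_force(data):
    min_distance = None
    for i in range(len(data)):
        A, B = data[:i], data[i:]            # split at i
        distance = abs(sum(A) - sum(B))      # O(N) work per split point
        if (min_distance is None) or (distance < min_distance):
            min_distance = distance
            min_distance_index = i
    return min_distance_index

data = [2, 2, 3, 1, 4, 1, 4, 3, 1, 2]
i = find_equilibrium_brute_force(data)
print(i, sum(data[:i]), sum(data[i:]))   # 5 12 11
```

This matches the i=5 split worked out by hand above.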
Using dynamic programming
The basic observation about the above algorithm is that it calculates the sum of each half of the array over and over. We can do better by keeping a running total for the lower half and updating it as we go. Our first attempt looks like this:
def find_equilibrium_slightly_better(A):
    lower_sum = 0
    min_distance = None
    for i in range(len(A)):
        # update lower sum immediately
        lower_sum += A[i]
        # calculate upper sum
        upper_sum = sum(A[i+1:])
        # check the distance of this split
        distance = abs(lower_sum-upper_sum)
        # use it, if it is better than the last best split
        if (min_distance is None) or (distance < min_distance):
            min_distance = distance
            min_distance_index = i+1
    return min_distance_index
Note that the minimum-distance index is not i but i+1, because element i is included in the lower array.
Now, how can we get rid of the slow calculation for the upper sum? By recognizing that it is actually the sum of all elements in the array, minus the sum of all lower elements. Duh!
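The identity behind this trick is easy to verify numerically: at every split point, the upper sum equals the total minus the running lower sum. A quick check on the example array:

```python
data = [2, 2, 3, 1, 4, 1, 4, 3, 1, 2]
total = sum(data)
lower = 0
for i, item in enumerate(data):
    lower += item
    # sum(data[i+1:]) never needs to be recomputed from scratch
    assert sum(data[i+1:]) == total - lower
print("identity holds for all", len(data), "split points")
```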
def find_equilibrium_best(data):
    upper_sum = sum(data)
    lower_sum = 0
    min_distance = None
    for index, item in enumerate(data):
        # calculate both sums immediately
        lower_sum += item
        upper_sum -= item
        # check the distance of this split
        distance = abs(lower_sum-upper_sum)
        # use it, if it is better than the last best split
        if (min_distance is None) or (distance < min_distance):
            min_distance = distance
            min_distance_index = index+1
        if min_distance == 0:
            break
    return min_distance_index
This algorithm is O(N), which means it is pretty much the optimum for this problem.
Could we do much better than that? No: any correct algorithm has to read every element at least once - a single unread value could change the optimal split - so O(N) is a lower bound for this problem.
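For completeness: the same O(N) scan can be written very compactly with itertools.accumulate, which yields the running lower sums directly. This is a sketch of an equivalent formulation, not code from the walkthrough above; ties are broken toward the smaller index, just like the strict < comparison in the loop version:

```python
from itertools import accumulate

def find_equilibrium_accumulate(data):
    total = sum(data)
    # accumulate(data) yields sum(data[:i+1]); the split distance is
    # |upper - lower| = |total - 2 * lower|
    return min(
        (abs(total - 2 * lower), i + 1)
        for i, lower in enumerate(accumulate(data))
    )[1]

print(find_equilibrium_accumulate([2, 2, 3, 1, 4, 1, 4, 3, 1, 2]))   # 5
```

Note that this variant cannot bail out early on a perfect split, so the explicit loop remains preferable when early exits matter.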
How does the performance look in real life? For an array of 10000 random numbers, this is what the test driver prints:
find_equilibrium_brute_force: result 4974 after 1.00 secs
find_equilibrium_slightly_better: result 4974 after 0.48 secs
find_equilibrium_best: result 4974 after 0.00 secs
If you increase the array size to 1 million, the results are far more dramatic: find_equilibrium_slightly_better takes up to 5 minutes, while find_equilibrium_best still takes only 0.3 seconds.
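The test driver itself is not shown above; a minimal version might look like the sketch below (timings will of course vary by machine). It times each implementation on the same random array - here only the brute-force and best variants, in compacted form, so the snippet runs standalone:

```python
import random
import time

def find_equilibrium_brute_force(data):
    # compacted tuple-min version of the O(N^2) scan
    return min(
        (abs(sum(data[:i]) - sum(data[i:])), i) for i in range(len(data))
    )[1]

def find_equilibrium_best(data):
    upper_sum, lower_sum = sum(data), 0
    min_distance = None
    for index, item in enumerate(data):
        lower_sum += item
        upper_sum -= item
        distance = abs(lower_sum - upper_sum)
        if (min_distance is None) or (distance < min_distance):
            min_distance = distance
            min_distance_index = index + 1
        if min_distance == 0:
            break
    return min_distance_index

data = [random.randrange(6000) for _ in range(2000)]
for fn in (find_equilibrium_brute_force, find_equilibrium_best):
    start = time.perf_counter()
    result = fn(data)
    elapsed = time.perf_counter() - start
    print("%s: result %d after %.2f secs" % (fn.__name__, result, elapsed))
```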
Abstraction time: from int to arbitrary objects
At the beginning of this section we restricted ourselves to arrays of integers. This being Python, the code works pretty much unchanged for other numeric types - like decimals or complex numbers. So there is no real restriction there.
But could we relax the condition? I honestly think we ought to sit down calmly, take a stress pill, and think things over.
What we are looking for in the original problem is to split the array into two halves and enforce a condition on that split. The original condition - minimize the difference between the sums of the two halves - is just one of many possibilities. Let us think of alternatives.
First idea: Choose the split in such a way that the difference is maximal. That should be a trivial change: rather than testing whether a distance is smaller than all previous distances, we test whether it is larger.
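A sketch of that variant (find_max_difference_split is a hypothetical name, not from the text). One caveat: for arrays of nonnegative numbers the answer degenerates to the outermost split, since |sum(A) - sum(B)| can never exceed the total - the variant only gets interesting with mixed-sign data:

```python
def find_max_difference_split(data):
    # identical O(N) scan, with the comparison flipped from < to >
    upper_sum, lower_sum = sum(data), 0
    max_distance = None
    max_distance_index = None
    for index, item in enumerate(data):
        lower_sum += item
        upper_sum -= item
        distance = abs(lower_sum - upper_sum)
        if (max_distance is None) or (distance > max_distance):
            max_distance = distance
            max_distance_index = index + 1
    return max_distance_index

# with mixed signs, the maximum can land in the middle of the array:
print(find_max_difference_split([3, -5, 4, -2, 6]))   # 2
```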
Second idea: Choose the split in such a way that the number of odd numbers in both halves is similar. Or the number of primes. Or the average of the numbers.
If you think about this second idea, you realize it is really two slightly different abstractions:
- Condition 1: We evaluate a function of all items in each half - as in our original example, or when we compare averages, where every number contributes.
- Condition 2: We count the items in each half matching a certain attribute. For example, the attribute could be 'is-odd': we count how many items in each half are odd, and compare those counts. One key difference: the attribute function looks at a single item in isolation, not - as in our original example! - at all items in a subarray.
It turns out that Condition 2 transforms our problem into something completely different, so we will look at it in the section TODO. For now, let us briefly look at relaxing Condition 1.
Let's revisit the code of find_equilibrium_best, with comments marking its structure (the condition parameter is not used yet):
def find_equilibrium_condition(data, condition):
    # initialize
    upper_sum = sum(data)
    lower_sum = 0
    min_distance = None
    for index, item in enumerate(data):
        # do stuff with the item
        lower_sum += item
        upper_sum -= item
        distance = abs(lower_sum-upper_sum)
        # use it, if it is better than the last best split
        if (min_distance is None) or (distance < min_distance):
            min_distance = distance
            min_distance_index = index+1
        if min_distance == 0:
            break
    return min_distance_index
Look at the comments. If you see it this way, then the code has this structure:
def find_conditioned_split(data, condition):
    condition.initialize(data)
    for index, item in enumerate(data):
        if not condition.visit(index, item):
            break
    return condition.result
Let's try to formulate our initial criterion - equilibrium - as such a condition object.
class split_by_sum(object):
    def initialize(self, data):
        # This is basically the code we had initially: set up the two sums
        self.upper_sum = sum(data)
        self.lower_sum = 0
        self.min_distance = None
        self.result = None

    def visit(self, index, item):
        # This is the loop body we had previously
        self.upper_sum -= item
        self.lower_sum += item
        distance = abs(self.lower_sum-self.upper_sum)
        # use it, if it is better than the last best split
        if (self.min_distance is None) or (distance < self.min_distance):
            self.min_distance = distance
            self.result = index+1
        if self.min_distance == 0:
            return False
        return True
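Wiring the condition object into the generic driver gives back the original behaviour. A quick end-to-end check, using the same duplicated-array trick as the test harness (both definitions are repeated here so the snippet runs standalone):

```python
class split_by_sum(object):
    def initialize(self, data):
        self.upper_sum = sum(data)
        self.lower_sum = 0
        self.min_distance = None
        self.result = None

    def visit(self, index, item):
        self.upper_sum -= item
        self.lower_sum += item
        distance = abs(self.lower_sum - self.upper_sum)
        if (self.min_distance is None) or (distance < self.min_distance):
            self.min_distance = distance
            self.result = index + 1
        if self.min_distance == 0:
            return False          # perfect split found, stop scanning
        return True

def find_conditioned_split(data, condition):
    condition.initialize(data)
    for index, item in enumerate(data):
        if not condition.visit(index, item):
            break
    return condition.result

half = [2, 2, 3, 1, 4]
data = half + half   # by construction, the perfect split is at index 5
n = find_conditioned_split(data, split_by_sum())
print(n, sum(data[:n]), sum(data[n:]))   # 5 12 12
```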
That wasn't too bad. Let's reformulate that to split by the average:
class split_by_average(object):
    def initialize(self, data):
        self.upper_sum = sum(data)
        self.lower_sum = 0
        self.upper_count = len(data)
        self.lower_count = 0
        self.min_distance = None
        self.result = None

    def visit(self, index, item):
        self.upper_sum -= item
        self.lower_sum += item
        self.lower_count += 1
        self.upper_count -= 1
        if self.upper_count:
            lower_avg = self.lower_sum / self.lower_count
            upper_avg = self.upper_sum / self.upper_count
            distance = abs(lower_avg-upper_avg)
            if (self.min_distance is None) or (distance < self.min_distance):
                self.min_distance = distance
                self.result = index+1
            if self.min_distance == 0:
                return False
        return True
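A quick check for the average condition (again self-contained; the array [8, 2, 7, 3, 5] is an arbitrary example chosen because a perfect average split exists):

```python
class split_by_average(object):
    def initialize(self, data):
        self.upper_sum = sum(data)
        self.lower_sum = 0
        self.upper_count = len(data)
        self.lower_count = 0
        self.min_distance = None
        self.result = None

    def visit(self, index, item):
        self.upper_sum -= item
        self.lower_sum += item
        self.lower_count += 1
        self.upper_count -= 1
        if self.upper_count:   # skip the degenerate empty upper half
            distance = abs(self.lower_sum / self.lower_count
                           - self.upper_sum / self.upper_count)
            if (self.min_distance is None) or (distance < self.min_distance):
                self.min_distance = distance
                self.result = index + 1
            if self.min_distance == 0:
                return False
        return True

def find_conditioned_split(data, condition):
    condition.initialize(data)
    for index, item in enumerate(data):
        if not condition.visit(index, item):
            break
    return condition.result

n = find_conditioned_split([8, 2, 7, 3, 5], split_by_average())
print(n)   # 2: [8, 2] and [7, 3, 5] both average 5.0
```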
So, going back to our original intent: by separating the scanning logic from the split condition, we can split the array using arbitrarily complex criteria.