Heaps
A heap is a common data structure that can be used to maintain a priority queue, implement a sorting mechanism, and generally keep track of extremal (minimal or maximal) values in an efficient manner. Note that the Python Standard Library contains a heap implementation in heapq, but we're going to write our own first and then take a peak inside the reference implementation.
The Wikipedia article on Heaps contains pointers to a couple of specialized heaps with additional features, but we're going to start off easy with a very simple one: a min heap.
A simple Min Heap implementation
We're going to start with a min heap, and then check how we can generalize it to a MaxHeap. We need a capacity for the heap, and since each level is going to have 2**n items maximum, it makes sense to start off with a capacity that is a power of 2.
class MinHeap(object):
def __init__(self, capacity = 2 ** 20):
# the capacity should be a power of two - since all levels in a heap
# are densely populated, and why would you want to waste space?
# the default is 1MB of arbitrary data
self.data = [None] * capacity
self.used = 0
Inserting an item is a two-step process:
- Insert the item at the end of the heap
- Push the item up until we've reached a proper insert position
class MinHeap(object):
...
def Insert(self, value):
assert self.used < len(self.data)
self.data[self.used] = value
self.used += 1
self.PushUp(self.used-1)
That was the easy part. Now for the fun part: the item is at the end of the heap. we need to push it up to a position, where the heap property is satisfied.
class MinHeap(object):
...
def PushUp(self, item_position):
assert item_position >= 0
assert item_position < self.used
# the value to be pushed up the tree
item_value = self.data[item_position]
# when does the loop end? when the item has reached
# the top position - there is no place higher than the top!
while item_position > 0:
# get index and value of parent
parent_position = GetIndexOfParent(item_position)
parent_value = self.data[parent_position]
# if the parent is less than this value, we need to push ourselves up
if parent_value > item_value:
self.data[parent_position] = item_value
self.data[item_position] = parent_value
item_position = parent_position
else:
# the parent is larger, so we are in a good position
break
return item_position
This code is missing an important function: GetIndexOfParent
. Now, in the literature (e.g. Cormen et al) you'll find the following definitions:
PARENT(i) = i/2
LEFT(i) = 2i
RIGHT(i) = 2i+1
This won't work with Python, because Python arrays are zero-based. So i
is really (i+1)
, and from the result we need
to subtract 1 to get to the actual position. Sounds complicated? Let's try to insert it:
PARENT(i) = PARENT(n+1) - 1 = (n+1)/2 - 1 = (n+1)/2 - (2/2) = (n+1-2)/2 = (n-1)/2
LEFT(i) = LEFT(n+1) - 1 = 2*(n+1) - 1 = 2*n + 2- 1 = 2*n + 1
RIGHT(i) = RIGHT(n+1) - 1 = 2*(n+1) - 1 + 1 = 2 * n + 2
Therefor we have:
def GetIndexOfParent(n):
return (n-1)/2
def GetIndexOfLeftChild(n):
return (2*n)+1
def GetIndexOfRightChild(n):
return (2*n)+2
Getting the minimum is trivial:
class MinHeap(object):
...
def GetMin(self):
assert self.used > 0
return self.data[0]
Extracting the minimum is another important operation: we special case for a heap that has only one element (so there is no need to invoke Heapify()).
class MinHeap(object):
...
def ExtractMin(self):
assert self.used > 0
result = self.data[0]
if self.used > 1:
# move item from the end of the heap to the top, and heapify it (move it down
# until the heap property is satisified again)
self.data[0] = self.data[self.used-1]
self.used -= 1
self.Heapify()
else:
self.used = 0
return result
def Heapify(self):
# start at the top
index_of_this_node = 0
value_of_this_node = self.data[index_of_this_node]
# loop end criteria: we've reached the last most item in the heap
while index_of_this_node < self.used:
# get left child (or None)
index_of_left_node = GetIndexOfLeftChild(index_of_this_node)
value_of_left_node = value_of_this_node
if index_of_left_node < self.used:
value_of_left_node = self.data[index_of_left_node]
# get right child (or None)
index_of_right_node = GetIndexOfRightChild(index_of_this_node)
value_of_right_node = value_of_this_node
if index_of_right_node < self.used:
value_of_right_node = self.data[index_of_right_node]
# find index of the node that is bigger than ours
index_of_max_node = index_of_left_node
value_of_max_node = value_of_left_node
if value_of_right_node < value_of_max_node:
value_of_max_node = value_of_right_node
index_of_max_node = index_of_right_node
# if there is a match, swap the nodes
if value_of_max_node < value_of_this_node:
self.data[index_of_this_node] = value_of_max_node
self.data[index_of_max_node] = value_of_this_node
index_of_this_node = index_of_max_node
else:
break;
Let's add some more helper functions:
class MinHeap(object):
...
def IsEmpty(self):
return self.used == 0
def GetCount(self):
return self.used
def GetDepth(self):
if self.used < 2:
return self.used
level = 1
pos = self.used -1
while pos > 0:
pos = GetIndexOfParent(pos)
level += 1
return level
Pretty-Printing a Heap
One of the more interesting functions to write for a heap class is a repr()
method to print
the heap as a string. We're going to go for a simple plaintext function. For example, a heap with 15 items
could look like this:
3
15 9
17 15 12 34
44 22 40 32 46 23 47
The code for this has several steps:
- We enumerate all items
- Convert the items on this level to strings and store them in a list
- As we go along, keep track of the maximum string size needed to represent a single element.
- We expect 2i items on level i
- A new level starts when the items on the current level have reached the expected size.
- We enumerate the list of string lists in reverse order
- For each item in the current level:
- Center it around the expected cell size for this level, adding spaces if necessary
- Convert the list of cells to a single string
- Adjust the expected cell size: cell size grows by 2n
- Finally, enumerate all strings again from the top down: adding newlines to combine all the lines
class MinHeap(object):
...
def __repr__(self):
# special case: empty heap
if self.used == 0:
return "<MinHeap: None>"
max_cell_width = 1
# We expect 2<sup>i</sup> items on level i
expected_size_of_this_level = 1
levels = []
reprs_on_this_level = None
# We enumerate all items
for i in range(self.used):
if reprs_on_this_level is None:
reprs_on_this_level = []
levels.append(reprs_on_this_level)
# Convert the items on this level to strings and store them in a list
k = repr(self.data[i])
reprs_on_this_level.append(k)
# As we go along, keep track of the maximum string size needed to represent a single element.
if len(k) > max_cell_width:
max_cell_width = len(k)
if len(reprs_on_this_level) == expected_size_of_this_level:
# We expect 2**i items on level i
expected_size_of_this_level *= 2
reprs_on_this_level = None
max_cell_width += 1
# Center a single word around the expected cell size for this level, adding spaces if necessary
def center(word, max_cell_width):
result = [' '] * max_cell_width
word = list(word)
insert_pos = (max_cell_width - len(word))/2
for c in word:
result[insert_pos] = c
insert_pos += 1
return "".join(result)
# We enumerate the list of string lists in reverse order
for i in range(len(levels)-1,-1,-1):
level = levels[i]
for k in range(len(level)):
level[k] = center(level[k], max_cell_width)
# Convert the list of cells to a single string
levels[i] = "".join(levels[i])
# Adjust the expected cell size: cell size grows by 2**n
max_cell_width *= 2
# Finally, enumerate all strings again from the top down: adding newlines to combine all the lines
return "\n".join(levels)
Comparing our code with the reference implementation
As mentioned in the beginning, the python standard library contains a reference implementation of the heap data structure. You can read the source on the web at http://svn.python.org/projects/python/trunk/Lib/heapq.py. A couple of comments on that code:
- The code starts out with a very interesting and thorough discussion of heaps - much better than I can explain it ;)
- The code doesn't define a class - it imposes the heap structure on items in an existing list. It is vaguely interesting but ultimately pointless to discuss which approach is better for reuse of code...
heappush
is essentially the same as theInsert()
method above: insert at the end, then move the item to the correct insert positionheappop
also has the same essential structure, but_siftup
(the equivalent ofHeapify()
above) is much more dense (and probably considerably faster). But note the extensive comment describing the operation