Tries

Tries are one of the lesser known data structures. When you look up Trie on wikipedia, you'll see an imposing picture and long sample code. But let's try to make an informal definition, and come up with some simple code to see what actually happens.

A Trie is a data structure that typically encodes strings. It could also be used to encode lists of objects, but we're going to restrict ourselves to strings for now.
A string is essentially an ordered list of characters. A trie encodes each character in a list on its own level. For example:
- The first level will be an array (or dictionary) encoding the first character of any string inserted.
- Each item in that array will be an array (or dictionary) encoding the second character of any string inserted.
- and so on, right up until the maximum string length.

Dictionary Based Tries

In Python, a dictionary based trie is both simple to understand, space efficient and fast. Let's look at the code and try to understand it:

class DictBasedTrie(object):

    # the constructor creates an empty trie
    def __init__(self):
        self.data = {}

    # add a word by adding each character on different levels
    def Add(self, word):

        # k is the current level
        k = self.data
        
        # for each character in the word
        for c in word:

            # if we've seen this character on this level before, use it
            try:
                k = k[c]
            except KeyError:
                # we haven't => create a new sub dict
                k[c] = {}
                k = k[c]

To see how this works, let's take the following ingenious sentence that has its own wikipedia article:

James while John had had had had had had had had had had had a better effect on the teacher

Let's write a small test code to add the words in this sentence to a trie, and then look at the trie:

#! -*- Encoding: Latin-1 -*-
...
def selftest():
    t = DictBasedTrie()
    for word in ("James while John had had had had had had had had had had had a better effect on the teacher").split():
        t.Add(word.lower())
        
    pprint.pprint(t.data)

if __name__ == "__main__":
    selftest()

This code outputs the following:

{'a': {},
 'b': {'e': {'t': {'t': {'e': {'r': {}}}}}},
 'e': {'f': {'f': {'e': {'c': {'t': {}}}}}},
 'h': {'a': {'d': {}}},
 'j': {'a': {'m': {'e': {'s': {}}}}, 'o': {'h': {'n': {}}}},
 'o': {'n': {}},
 't': {'e': {'a': {'c': {'h': {'e': {'r': {}}}}}}, 'h': {'e': {}}},
 'w': {'h': {'i': {'l': {'e': {}}}}}}

Another way of looking at this is by looking at the sub-dict for key 'j':

 'j' : 
     {'a': {'m': {'e': {'s': {}}}}, 
      'o': {'h': {'n': {}}}
     }

You can see that both james and john have the same suffix, j, and the rest of the words are different.

List Based Tries

An alternative representation of a trie uses lists. You cannot use chars as indices in a list, but you can use char ordinals. Together with the fact that there are 26 characters (excluding punctuation, and assuming that the characters are not case-sensitive) you can encode each level of a trie in an array with 26 elements. This will use up massively more memory than a dict-based trie, but it can be made to work nonethelesss:

class ListBasedTrie(object):

    # the constructor creates an empty trie
    def __init__(self):
        self.data = [None] * 26

    # add a word by adding each character on different levels
    def Add(self, word):

        # k is the current level
        k = self.data

        # for each character in the word
        for c in word:
        
            # get index of list by mapping ord(c) to 0..26
            o = ord(c) - ord('a')
            assert (o >= 0) and (o < 26)

            # if we have not seen this character on this level before, create a new level
            if k[o] is None:
                k[o] = [None] * 26

            # move to the next level
            k = k[o]

Autocomplete - A Trie Problem

Many text editors, operating system functions, web pages (Google!) have an auto-complete function, where once you type a few characters, it will come up with a list of suggestions for complete words. Think about how you would encode that.

A trie could be a very efficient solution: Once you've matched the first n characters, you automatically get a list of words that also match the same suffix.

Your mission, should you choose to accept it:

Create a list of words by using the text of H.G.Wells' War Of The Worlds. This is a free download at Project Gutenberg.
Add all the words in the text
Find a auto-complete list for words starting with 'impl'.

The main body of the program is a direct match of these instructions:

#! -*- Encoding: Latin-1 -*-
...
def selftest():
    words = read_words_from_file(filename)
    trie = insert_words_in_trie(words)
    pprint.pprint(trie.AutoCompleteList("impl"))

if __name__ == "__main__":
    selftest()

To read the words from the file, you can just read the file in a string and split it. But since the input document contains punctuation and special characters, you should clean the data before processing it. The following small code will do just that:

def read_words_from_file(filename):
    data = list(open(filename).read())
    
    for index, char in enumerate(data):
        o = ord(char) - ord('a')
        if o < 0 or o >= 26:
            data[index] = ' '
    
    data = "".join(data)
    return data.split()

Inserting the words in the trie is also very simple:

def insert_words_in_trie(words):
    trie = DictBasedTrie()
    for word in words:
        trie.Add(word)
    return trie

That leaves us with two small problems:

Looking up a word in the trie
Creating a list of all words matching the same suffix

The first task is really easy - essentially the same as the insertion code. The second task requires a bit of recursive thinking (and the logic is actually quite close to the one used when generating permutations).

class ListBasedTrie(object):
    ...
    def AutoCompleteList(self, word):
        
        # lookup word in trie - essentially the same as add word to trie, but without the adding part
        k = self.data
        for c in word:
            try:
                k = k[c]
            except KeyError:
                # word is not found in list -> no result
                return []

        # creating the list of words matching the same suffix, recursively
        result = []
        current_word = [None] * 60
        def recurse(k, index):
            if not k:
                result.append( word + "".join(current_word[:index]) )
                
            else:
                for c in k.keys():
                    current_word[index] = c
                    recurse(k[c], index+1)
        recurse(k, 0)
        return result

The results for 'impl' are ('implied', 'implored'), the results for 'imp' are ('impassable', 'impact', 'imparted', 'impatience', 'impatient', 'impetus', 'impersonal', 'imperceptibly', 'imperfect', 'impediment', 'impeded', 'impenetrably', 'impenetrable', 'impinged', 'implied', 'implored', 'impossibility', 'impossible', 'imposed', 'important', 'importunities', 'impotent', 'imprisoned', 'imprisonment', 'impressions', 'impressively', 'impressed', 'improbable', 'impulsively', 'impulse'). Improbably impressive, I should say!