next up previous
Next: Stage 6 Up: Stage 5 Previous: How we Found the

Description of Solution Method

Here we will give a more detailed description of the program cwizard used to solve the book cipher. For a given text all possible offsets are tested, and the characters are counted both from the beginning and from the end of the text. The program considers four ways to count letters in a text.
Count all characters as letters.
Count separators but not whitespace as letters.
Count whitespace but not separators as letters.
Count neither whitespace nor separators as letters.
For each text all four ways to count are tested.

Given a hypothetical plaintext we use an automatic sanity check, using an English dictionary. The dictionary is stored in a trie. A trie is a tree-like data structure where each node represents a letter. Any path from the root down the tree thus makes a potential word. If the word obtained by following a path from the root to a node v is in the dictionary, we mark the node v as a word-ending node. With this data structure it is possible to see if a word is present in the dictionary in time linear in the length of the word.

The potential plaintext is evaluated using dynamic programming. A word w is given the score | w|2, if the length of the word | w| > 2and if w is present in the dictionary. Otherwise the score of the word is 0. Our aim is to partition the plaintext into disjoint words so that the sum of the score for each word is maximized. This is easy to do with dynamic programming. Let S(i) be the maximum score for the first i letters in the plaintext, and let s(i, j) be the score for a word beginning at index i and ending at index j in the plaintext. Then

S(i) = max1 $\scriptstyle \leq$ k $\scriptstyle \leq$ i{S(k) + s(k, i)}    

This function can be computed efficiently in a bottom-up fashion, and the optimal score for the plaintext is given by S(n) where n is the length of the plaintext. The plaintext was finally output if the score S(n) was at least 3.2n.

The reason for using a quadratic scoring function was that we wanted to reward longer words more than shorter words. Short real words tend to be present, en masse, in nonsense text as well. With a quadratic function and the selected threshold the number of false hits was reasonably small. The hope was that the correct text be detected with this methodology. This was also eventually confirmed.

next up previous
Next: Stage 6 Up: Stage 5 Previous: How we Found the