2012-01-01

Generating Anagrams the Efficient and Easy Way

Generating Anagrams The Efficient and Easy Way

I was interviewing at a company that gives take home problems before the phone interview. One of the problems was to write a program that would generate anagrams. The question specifically asked for all combinations to be generated.

// Returns a list of all words that use the same
 // letters as the input string.
 list < string > generateAnagrams(const string& input);

(Note the ambiguity: is the function returning multiple words per list element, or a single one?)

The easiest solution is to strip out non-alphabetic characters from the input, and then call the STL's "permute" function. Check each version against the entries in the dictionary and you are done. Except that this version won't return multi-word anagrams, which is rather important in that it makes for a much more challenging problem.

It is also very slow. For an N-letter long input (assuming no repeat letters), the number of permutations (not counting that we only have 26 letters and ordering is unimportant for multiples of the same letter) is n factorial, n!:













N N!
1 1
2 2
3 6
4 24
5 120
6 720
10 3628800
15 1307674368000
20 2432902008176640000
25 15511210043330985984000000
30 265252859812191058636308480000000

I use a dictionary with 45,424 words (a spell-checking file from Linux, but with single letter words removed). That is equivalent to about eight letters of permutations. Of course, you'd need to have access to all the letters before you could hope to make every word in the dictionary. This is to show how explosive the possibilities become with only a few letters. The possibilities can be severely curtailed by refusing to continue permuting when the word so far does not match any existing words.

After about seven letters, the number of possible words that can be made out of the permutation actually starts to go down. Without limiting constructions to valid starts of words, the number of permutations being handled would still be going up.

I tried to get a metric on how much this improved the performance, but the slow version was too slow to run on a letter length that was long enough to get a non-zero number from the fast version. For "percentage speed boost" I'll jot that one down as "infinite".







Phrase Number Produced Slow Version Fast Version
two words 14 1 second 0 seconds
two words a 326 16 seconds 0 seconds
two words ab 522 319 seconds 0 seconds
three words hu 21,122 too slow 3 seconds
SETEC Astronomy 1,800,508 did not attempt 161 seconds

I could have made it even faster by trimming the dictionary by any letters that were not in the input string. Just think about how much smaller the dictionary would be for a fifteen letter input when we can take out all words from the dictionary that use even one of the letters that aren't in the input string! However, this optimization has a reducing benefit as the sizes get larger (and more of the alphabet is filled out).

There is a much faster way to do this than permuting the input. Instead, sort the input, and all words in the dictionary, to obtain the "key" for an arbitrary series of letters. Then, you can find the items that share the input key. Ideally, store the dictionary of words as a map (or Python dict) using the key. That way you can easily turn the key from an input string into all the words in the dictionary that use it.







Wordkey
partaptr
prataptr
tarpaptr
trapaptr
raptaptr

However, this technique doesn't really turn out to be much easier for the multi-word case.

But there is an even faster way. I've never seen this explained anywhere, although a long time ago I did read a science fiction story with a similar element. I'm going to try to walk into this so that readers can have their own "a-ha!" or "Eureka!" moment.

The concept of keys are useful. They make the order of letters unimportant, which they are. Completely unimportant, both in the input and output words. Only the count of them are important. Why then, do I store a string at all? Perhaps I'd prefer to have an array of counts of each letter:



ab...opqrstu...
10...0101010...

But checking one versus another would be so difficult. For each entry in the word dictionary I'd have to iterate along all 26 characters just to verify that everything the word requires is in my source anagram. That might not be an improvement. What might work better? Have a think about that.


But if I store it as a bitfield, then I can just do a bit-operation to verify the presence of all the items! Each letter becomes a bit in a 32-bit word. Characters available in my source string are set in a 32-bit integer, and each entry in the word dictionary has a 32-bit word key set. In order to determine if I have all the letters necessary for a word in the dictionary, I simply do a binary AND of the two keys and compare to the prospective word. Of course, "sap" and "pass" are not anagrams of one another, so there has to be some additional tests. How can I solve that one?

We could expand the bitfield, so that each location was able to store a count of letters. This is not infinitely scalable, and somewhat hacky. So the next problem becomes, how can I store the word in an easily manipulatable form that will store both the characters, and their counts? I can't just add the key for "pas" and "s", because then I'd have two "s"s, which sum to the key for the next letter "t". How can we fix that?


If you assign a number to each letter of the alphabet, and added all of the numbers for a word together, then you could take out letters as you constructed anagram words, and still have the rest of the letters represented. This is a very poor representation, however, since you wouldn't be able to tell the difference between "aa" (1+1) and "b" (2). No matter how far apart you spaced the numbers for the letters, you could still get there eventually. You could pick a maximum word length, and start at a sufficiently high number. But how do you know if an arbitrary letter is part of the number without finding all the letters?


If you multiply instead of add the numbers, then you can try to divide the letter number from the product of the input letters, and if there is no remainder in the division then that number went in. But then, how can you tell the difference between "ab" (2*3) and "g" (6)? You need to be careful to choose numbers for each letter than cannot be produced by any combination of a previous number. What is the set of numbers with this property? Prime numbers.

Here's the algorithm, and there is a copy at the end of this post as well. (A few performance steps are marked with an asterisk.):

  • Assign prime numbers to each letter.
    • * Read in the dictionary and count the number of times each letter is seen.
    • For the most common letter, assign it to "2".
    • For each next most common letter, assign the next prime number.
    • For each word in the dictionary, calculate the product of the primes associated with each letter of the word. Store all words in a list, where each list element starts with a key, and is followed by the words that use that key.
    • * Sort the list by the key, so that the smallest values are first.
  • For each input to be turned into anagrams:
    • Calculate the product of the input letters.
    • * Filter the list of words to remove any keys which cannot divide into the input product.
    • Look at each key for words.
      • If it evenly divides the input number, then add that list of words to a temporary list of "matched" words, reduce the input string, and recursively continue processing later keys in the list.
      • If the key ever matches the whatever is left of the input key, then a complete multi-word-list match is found. Print out the various lists of words that have been assembled and continue to look for other matches.
      • When the dictionary is run through, pop back to the state before the previous match.

The result will be a list of lists of lists of words. For instance the result for "lumberjack" is:

lumberjack:
Result: [22, 1437443, 2359823] - re  jack  Blum
Result: [94, 331639, 2393863] - em,me  bark  Cluj
Result: [94, 552299, 1437443] - em,me  blur,burl  jack
Result: [118, 264187, 2393863] - be  mark  Cluj
Result: [329, 155782, 1456061] - am  jerk  club
Result: [1739, 40061, 1071202] - mu  jab  clerk
Result: [2162, 7469, 4621411] - elm,Mel  jar  buck
Result: [2162, 24013, 1437443] - elm,Mel  rub  jack
Result: [3358, 31913, 696377] - elk  jam  curb
Result: [4669, 102601, 155782] - Lac  bum  jerk
Result: [31913, 2338433966L] - jam  buckler
Result: [163009, 457805662] - rack  jumble
Result: [1437443, 51916106] - jack  lumber,rumble
Result: [2188978, 34091911] - Merck  Jubal
Result: [2393863, 31174066] - Cluj  embark
Timings:
0.753687 - Assign Primes to the alphabet.
1.065416 - Create signatures dictionary.
0.068219 - Solve anagrams for "lumberjack"
Done.

Each line indicates a family of anagrams that can be produced. The commas separate words which are anagrams of each other and are interchangeable within a produced phrase. The order of the groups that make up a family are unimportant. They are shown with the lowest number first.

For instance, the second line can produce "em bark Cluj" or "me bark Cluj". It can also produce any rearrangement of those words, such as "Cluj bark me".

The basic algorithm shown above, without limiting the input keys, took 932 seconds to produce "SETEC Astronomy" anagrams. The time was reduced to 21 seconds with the code change to filter the dictionary to include only words which can evenly divide the input phrase. Another change to re-filter at every level of recursion took the time down to 4.9 seconds.

This handily beats my best submission for the interview, which was a C++ program that took 161 seconds on the same anagram input. Of course, I could implement this in C++ and it would probably be faster yet.

But, it must be noted, the C++ implementation did not have the dictionary reducing optimization, so it would likely be even faster. The C++ method worked fundamentally differently. It would permute the input string, recursively adding new letters so long as there was a word in the dictionary that could match.

Finally, a practical use for prime numbers! (Well, assuming one day I can find a practical use for Anagrams...) The most recent version can be found here.

#!/usr/bin/python

#Copyright 2011, 2012 Mark Santesson
#
#   Licensed under the Apache License, Version 2.0 (the "License");
#   you may not use this file except in compliance with the License.
#   You may obtain a copy of the License at
#
#       http://www.apache.org/licenses/LICENSE-2.0
#
#   Unless required by applicable law or agreed to in writing, software
#   distributed under the License is distributed on an "AS IS" BASIS,
#   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
#   See the License for the specific language governing permissions and
#   limitations under the License.



'''
Anagram generator. Use prime numbers to represent the phrase given.
Each letter has a unique prime number. A phrase's signature is
the product of all the prime numbers assigned to the component
letters.  The signatures for any set of phrases which are anagrams
of each other will all be the same number.
By Mark Santesson.
'''

import time, os, sys, collections


timings = []
last_time = time.clock()

def elapsedTime():
    '''Return the number of seconds that has passed since the last time this
    function was called. The first time it is called it will return the time
    since the function was defined.'''
    global last_time
    this_time = time.clock()
    elap_time = this_time - last_time
    last_time = this_time
    return elap_time

def addTiming(desc):
    timings.append( ( elapsedTime(), desc, ) )



def nextprime(x):
    """Quick prime number calculator. Pass in a number and it will return
    the smallest prime strictly larger than the input. It is very
    inefficient."""

    # Create a function which returns the next prime larger than the
    # passed value. A faster way to do this is to keep track of all
    # primes seen so far and then start trying primes incrementally
    # higher than that, dividing only by the other primes already
    # found. You can also stop at sqrt(x). I'm only going to generate
    # 26 of them, so this easy method is fine.

    while True:    # On the theory that there are an infinite number of primes...
        x += 1
        for pot_fac in range(2,x):    # Not even using the sqrt(x) or even numbers optimizations!
            if x % pot_fac == 0:
                break
        else:
            return x



def prime_generator():
    '''Returns a generator that produces the next largest prime as
    compared to the one returned from this function the last time
    it was called. The first time it is called it will return 2.'''
    lastprime = 1
    while True:
        lastprime = nextprime(lastprime)
        yield lastprime






class AnagramGenerator:
    '''The AnagramGenerator class produces all anagram combinations
    that are possible from the input phrase using the given dictionary.
    If no dictionary is given to the constructor (as a filename)
    then it assumes "words" is the filename.'''
    def __init__(self, dictionary_filename='words'):
        '''Make a AnagramGenerator object. There is no way to
        switch dictionaries; create a new instance instead.'''
        self.dictionary_filename = dictionary_filename
        self.__loadDictionary()

    def __getPrime(self, c):
        '''Returns the prime number that represents the passed
        in character. The character must be lower case.'''
        # assert c==c.lower()
        return self.chars_list[ ord(c) - ord('a') ]

    def __getSignature(self, s):
        '''Returns the integer that represents the characters in
        a string.  The integer is the product of the prime number
        representation of each character in the string.'''
        s = s.lower()
        sig = reduce(lambda x,y: x * self.__getPrime(y),
            [ch for ch in s if ord('a') <= ord(ch) <= ord('z')], 1)
        assert sig >= 1, repr( ('Product of primes produced non-positive number',s, sig, [ (z,self.__getPrime(z)) for z in s if ord('a') <= ord(z) <= ord('z') ]) )
        return sig

    def __loadDictionary(self):
        '''Load a dictionary from the given input file.  The file
        should have one word per line.'''

        addTiming('Start of loading dictionary from "%s"' % (self.dictionary_filename,))

        # Count chars in the dictionary. Create a dict of lower
        # case character to the number of instances found in the
        # words dictionary. It might be useful to break this
        # function into two parts to enable custom dictionaries
        # passed in as a list/string instead of just specifying
        # a dictionary filename. However, since I'm testing
        # the performance of this script, I need to combine the
        # counting and the line loading to prevent a second
        # loop. There are ways around that, but it isn't a
        # focus for now. If loading from something other than
        # a dictionary file is needed then this is the place
        # to make that change.

        all_lines = list()
        counts = dict([ (chr(x),0) for x in range(ord('a'), ord('z')+1) ])
        for line in open( self.dictionary_filename ).readlines():
            all_lines.append(line.strip())
            for c in line.lower():
                if ord('a') <= ord(c) <= ord('z'):
                    counts[ c ] += 1


        # Construct a dict mapping from lower case character to the
        # prime number for it. Initialize it in the order of most
        # used character first, so that the letters most frequently
        # used get lower numbers in the hopes that the products
        # will be smaller numbers.

        primes = prime_generator() # get the prime number generator
        chars_map = dict( [ (x,primes.next()) for x,y in \
            sorted( counts.items(), key=lambda i:i[1],
                reverse=True ) ] )

        # Recast the dict as a list, where the index is the index
        # in the alphabet. The value is the prime number for that
        # character.

        self.chars_list = [ chars_map[chr(x)] for x in range(ord('a'), ord('z') + 1) ]

        addTiming( 'Assign Primes to the alphabet.' )


        # Insert all the dictionary words into the list
        self.sigs = collections.defaultdict( list )
        for word in all_lines:
            sig = self.__getSignature( word )
            if sig > 1:    # ignore lines without letters
                self.sigs[ sig ].append( word )


        # We need the keys in list form so we can have it sorted.
        # Lookups into the signatures dictionary are pretty rare.
        self.sigs_keys = sorted( self.sigs.keys() )
        addTiming( 'Create signatures dictionary.' )

    def __getSignatureKeys(self):
        '''Returns the sorted list of all signature keys. This is
        used to populate a list of all available words.'''
        return self.sigs_keys

    def formulateAnagramPhraseCombo(self, res):
        '''Render an anagram "phrase" into text for display. Each key
        is separate by double spaces, and each key displays as a comma
        separated list of the words choices.'''
        return "  ".join( [ ','.join(self.sigs[s]) for s in res ] )



    def __solveAnagramPhrase(self, letters, unreduced_keys, start=0, \
                so_far=[], display_progress=False):
        '''This is the recursive function that helps produce anagrams.
        It should not be called directly.'''

        # letters: Product of signatures of letters remaining.
        # unreduced_keys: Keys representing items in the dictionary
        #    that can non necessarily be reached by the remaining
        #    letters.
        # start: The index of unreduced_keys to start considering.
        #    keys before this do not need to be considered in
        #    combination with this pick because earlier keys
        #    will be considered separately and they will consider
        #    combining with this pick as part of their recursion.
        #    The start value should be considered again in case
        #    the same key can be used multiple times.
        # so_far: A list containing the keys that have been picked
        #    so far in this chain of recursions.
        # display_progress: Flag, if true then results are displayed
        #    as they are discovered.

        if letters == 1:
            # There are no letters left.
            if display_progress:
                print "Result: %s - %s" % \
                    ( repr(so_far),
                    self.formulateAnagramPhraseCombo(so_far) )
            return [ so_far ]

        # Filter list of keys to remove any that can no longer
        # be constructed using some of the input letters.
        reduced_keys = [ x for x in unreduced_keys[start:] if letters % x == 0 ]
        result = []

        # Recurse on all items remaining in the dictionary.
        for index,sig in enumerate(reduced_keys):
            remaining_letters = letters // sig
            result += self.__solveAnagramPhrase(
                    letters        = remaining_letters,
                    unreduced_keys    = reduced_keys,
                    start         = index,
                    so_far        = so_far + [ sig ],
                    display_progress= display_progress
                    )

        return result


    def anagrams( self, phrase_string, display_progress=False ):
        '''This function takes an input phrase string and returns
        a list of all anagrams that can be generated from it. The
        return value is a list of lists of lists. The inner lists
        represent the individual words which are interchangeable
        in an anagram. For instance, "tarp" and "trap" are
        anagrams of each other. Whenever one can be used, the
        other can substitute. They will be presented together
        in the innermost list. An example output for "lumberjack"
        might be:
        [ [ ['me'], ['bark'], ['Cluj'] ],
          [ ['me'], ['blur','burl'], ['jack'], ],
          [ ['be'], ['mark'], ['Cluj'] ]
          [ ['am'], ['jerk'], ['club'] ]
          [ ['elm','Mel'], ['jar'], ['buck'] ]
          ...
          ]
        '''
        all_keys = self.__getSignatureKeys()
        if options.dumpdict:
            for k,v in sorted(sigs.items(), key=lambda x: x[0]):
                print repr(k),'->',repr(v)

        r = self.__solveAnagramPhrase(
                self.__getSignature(phrase_string),
                all_keys,
                display_progress=display_progress )

        r = [ [ self.sigs[s] for s in row ] for row in r ]
        addTiming( 'Solve anagrams for "%s": %d rows' % (phrase_string, len(r),) )
        return r






if __name__ == "__main__":
    import optparse

    description = '''\
Anagram generator by Mark Santesson. This program rearranges letters
in any non-option command line parameters and output the anagrams
that can be generated from them. The potential words are filtered
through a dictionary. To generate anagrams for more than one phrase,
separate the phrases with (escaped) semicolons.
'''

    parser = optparse.OptionParser(
                usage='usage: %prog [options] args',
                description=description )
    parser.add_option('-t', '--time', action='store_true', default=False,
            dest='timing', help='Display timing information but not anagrams generated.')
    parser.add_option('-d', '--dictionary', default='words',
            dest='dictionary_name',
            help='Specify a location for the dictionary. Default is "%default".')
    parser.add_option('--dumpdict', action='store_true', default=False,
            dest='dumpdict',
            help='Dump dictionary after load, for debug purposes.')

    options, test = parser.parse_args()

    if len(test) >= 1:
        test = ' '.join(test).split(';')
    else:
        test = [ "face", "astronomy", "SETEC Astronomy" ]

    ag = AnagramGenerator( options.dictionary_name )
    for x in test:
        print "%s:" % x
        ag.anagrams( x, not options.timing )

    if options.timing:
        print "Timings:"
        for et,desc in timings:
            print "%f - %s" % (et, desc,)
    print "Done.\n"


# 932 seconds to solve SETEC Astronomy without limiting dictionary.
# The C version took 161 secs.
# 21 seconds after reducing the keys from the dict.
# With dictionary reduction at every recursion, it goes down to 4.25 seconds.


Creative Commons License Generating Anagrams the Efficient and Easy Way by Mark Santesson is licensed under a Creative Commons Attribution 3.0 Unported License.

No comments:

Post a Comment