I was interviewing at a company that gives take home problems before the phone interview. One of the problems was to write a program that would generate anagrams. The question specifically asked for all combinations to be generated.
// Returns a list of all words that use the same
// letters as the input string.
list < string > generateAnagrams(const string& input);
(Note the ambiguity: is the function returning multiple words per list element, or a single one?)
The easiest solution is to strip out non-alphabetic characters from the input, and then call the STL's "permute" function. Check each version against the entries in the dictionary and you are done. Except that this version won't return multi-word anagrams, which is rather important in that it makes for a much more challenging problem.
It is also very slow. For an N-letter long input (assuming no repeat letters), the number of permutations (not counting that we only have 26 letters and ordering is unimportant for multiples of the same letter) is n factorial, n!:
I use a dictionary with 45,424 words (a spell-checking file from Linux, but
with single letter words removed).
That is equivalent to about eight letters of permutations. Of course,
you'd need to have access to all the letters before you could hope to
make every word in the dictionary. This is to show how explosive
the possibilities become with only a few letters. The
possibilities can be severely curtailed by refusing to continue permuting
when the word so far does not match any existing words.
I tried to get a metric on how much this improved the performance, but the slow version was too slow to run on a letter length that was long enough to get a non-zero number from the fast version. For "percentage speed boost" I'll jot that one down as "infinite".
|Phrase||Number Produced||Slow Version||Fast Version|
|two words||14||1 second||0 seconds|
|two words a||326||16 seconds||0 seconds|
|two words ab||522||319 seconds||0 seconds|
|three words hu||21,122||too slow||3 seconds|
|SETEC Astronomy||1,800,508||did not attempt||161 seconds|
I could have made it even faster by trimming the dictionary by any letters that were not in the input string. Just think about how much smaller the dictionary would be for a fifteen letter input when we can take out all words from the dictionary that use even one of the letters that aren't in the input string! However, this optimization has a reducing benefit as the sizes get larger (and more of the alphabet is filled out).
There is a much faster way to do this than permuting the input. Instead, sort the input, and all words in the dictionary, to obtain the "key" for an arbitrary series of letters. Then, you can find the items that share the input key. Ideally, store the dictionary of words as a map (or Python dict) using the key. That way you can easily turn the key from an input string into all the words in the dictionary that use it.
However, this technique doesn't really turn out to be much easier for the multi-word case.
But there is an even faster way. I've never seen this explained anywhere, although a long time ago I did read a science fiction story with a similar element. I'm going to try to walk into this so that readers can have their own "a-ha!" or "Eureka!" moment.
The concept of keys are useful. They make the order of letters unimportant, which they are. Completely unimportant, both in the input and output words. Only the count of them are important. Why then, do I store a string at all? Perhaps I'd prefer to have an array of counts of each letter:
But checking one versus another would be so difficult. For each entry
in the word dictionary I'd have to iterate along all 26 characters
just to verify that everything the word requires is in my source
anagram. That might not be an improvement. What might work better? Have a
think about that.
But if I store it as a bitfield, then I can just do a bit-operation to verify the presence of all the items! Each letter becomes a bit in a 32-bit word. Characters available in my source string are set in a 32-bit integer, and each entry in the word dictionary has a 32-bit word key set. In order to determine if I have all the letters necessary for a word in the dictionary, I simply do a binary AND of the two keys and compare to the prospective word. Of course, "sap" and "pass" are not anagrams of one another, so there has to be some additional tests. How can I solve that one?
We could expand the bitfield, so that each location was able to store a count of
letters. This is not infinitely scalable, and somewhat hacky. So the next problem
becomes, how can I store the word in an easily manipulatable form that
will store both the characters, and their counts? I can't just add
the key for "pas" and "s", because then I'd have two "s"s, which
sum to the key for the next letter "t". How can we fix that?
If you assign a number to each letter of the alphabet, and added all of
the numbers for a word together, then you could take out letters as
you constructed anagram words, and still have the rest of the letters
represented. This is a very poor representation, however, since you wouldn't
be able to tell the difference between "aa" (1+1) and "b" (2). No matter how
far apart you spaced the numbers for the letters, you could still get there
eventually. You could pick a maximum word length, and start at a sufficiently
high number. But how do you know if an arbitrary letter is part of the
number without finding all the letters?
If you multiply instead of add the numbers, then you can try to divide the letter number from the product of the input letters, and if there is no remainder in the division then that number went in. But then, how can you tell the difference between "ab" (2*3) and "g" (6)? You need to be careful to choose numbers for each letter than cannot be produced by any combination of a previous number. What is the set of numbers with this property? Prime numbers.
Here's the algorithm, and there is a copy at the end of this post as well. (A few performance steps are marked with an asterisk.):
- Assign prime numbers to each letter.
- * Read in the dictionary and count the number of times each letter is seen.
- For the most common letter, assign it to "2".
- For each next most common letter, assign the next prime number.
- For each word in the dictionary, calculate the product of the primes associated with each letter of the word. Store all words in a list, where each list element starts with a key, and is followed by the words that use that key.
- * Sort the list by the key, so that the smallest values are first.
- For each input to be turned into anagrams:
- Calculate the product of the input letters.
- * Filter the list of words to remove any keys which cannot divide into the input product.
- Look at each key for words.
- If it evenly divides the input number, then add that list of words to a temporary list of "matched" words, reduce the input string, and recursively continue processing later keys in the list.
- If the key ever matches the whatever is left of the input key, then a complete multi-word-list match is found. Print out the various lists of words that have been assembled and continue to look for other matches.
- When the dictionary is run through, pop back to the state before the previous match.
The result will be a list of lists of lists of words. For instance the result for "lumberjack" is:
lumberjack: Result: [22, 1437443, 2359823] - re jack Blum Result: [94, 331639, 2393863] - em,me bark Cluj Result: [94, 552299, 1437443] - em,me blur,burl jack Result: [118, 264187, 2393863] - be mark Cluj Result: [329, 155782, 1456061] - am jerk club Result: [1739, 40061, 1071202] - mu jab clerk Result: [2162, 7469, 4621411] - elm,Mel jar buck Result: [2162, 24013, 1437443] - elm,Mel rub jack Result: [3358, 31913, 696377] - elk jam curb Result: [4669, 102601, 155782] - Lac bum jerk Result: [31913, 2338433966L] - jam buckler Result: [163009, 457805662] - rack jumble Result: [1437443, 51916106] - jack lumber,rumble Result: [2188978, 34091911] - Merck Jubal Result: [2393863, 31174066] - Cluj embark Timings: 0.753687 - Assign Primes to the alphabet. 1.065416 - Create signatures dictionary. 0.068219 - Solve anagrams for "lumberjack" Done.
Each line indicates a family of anagrams that can be produced. The commas separate words which are anagrams of each other and are interchangeable within a produced phrase. The order of the groups that make up a family are unimportant. They are shown with the lowest number first.
For instance, the second line can produce "em bark Cluj" or "me bark Cluj". It can also produce any rearrangement of those words, such as "Cluj bark me".
The basic algorithm shown above, without limiting the input keys, took 932 seconds to produce "SETEC Astronomy" anagrams. The time was reduced to 21 seconds with the code change to filter the dictionary to include only words which can evenly divide the input phrase. Another change to re-filter at every level of recursion took the time down to 4.9 seconds.
This handily beats my best submission for the interview, which was a C++ program that took 161 seconds on the same anagram input. Of course, I could implement this in C++ and it would probably be faster yet.
But, it must be noted, the C++ implementation did not have the dictionary reducing optimization, so it would likely be even faster. The C++ method worked fundamentally differently. It would permute the input string, recursively adding new letters so long as there was a word in the dictionary that could match.
Finally, a practical use for prime numbers! (Well, assuming one day I can
find a practical use for Anagrams...)
#!/usr/bin/python #Copyright 2011, 2012 Mark Santesson # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. ''' Anagram generator. Use prime numbers to represent the phrase given. Each letter has a unique prime number. A phrase's signature is the product of all the prime numbers assigned to the component letters. The signatures for any set of phrases which are anagrams of each other will all be the same number. By Mark Santesson. ''' import time, os, sys, collections timings =  last_time = time.clock() def elapsedTime(): '''Return the number of seconds that has passed since the last time this function was called. The first time it is called it will return the time since the function was defined.''' global last_time this_time = time.clock() elap_time = this_time - last_time last_time = this_time return elap_time def addTiming(desc): timings.append( ( elapsedTime(), desc, ) ) def nextprime(x): """Quick prime number calculator. Pass in a number and it will return the smallest prime strictly larger than the input. It is very inefficient.""" # Create a function which returns the next prime larger than the # passed value. A faster way to do this is to keep track of all # primes seen so far and then start trying primes incrementally # higher than that, dividing only by the other primes already # found. You can also stop at sqrt(x). I'm only going to generate # 26 of them, so this easy method is fine. while True: # On the theory that there are an infinite number of primes... x += 1 for pot_fac in range(2,x): # Not even using the sqrt(x) or even numbers optimizations! if x % pot_fac == 0: break else: return x def prime_generator(): '''Returns a generator that produces the next largest prime as compared to the one returned from this function the last time it was called. The first time it is called it will return 2.''' lastprime = 1 while True: lastprime = nextprime(lastprime) yield lastprime class AnagramGenerator: '''The AnagramGenerator class produces all anagram combinations that are possible from the input phrase using the given dictionary. If no dictionary is given to the constructor (as a filename) then it assumes "words" is the filename.''' def __init__(self, dictionary_filename='words'): '''Make a AnagramGenerator object. There is no way to switch dictionaries; create a new instance instead.''' self.dictionary_filename = dictionary_filename self.__loadDictionary() def __getPrime(self, c): '''Returns the prime number that represents the passed in character. The character must be lower case.''' # assert c==c.lower() return self.chars_list[ ord(c) - ord('a') ] def __getSignature(self, s): '''Returns the integer that represents the characters in a string. The integer is the product of the prime number representation of each character in the string.''' s = s.lower() sig = reduce(lambda x,y: x * self.__getPrime(y), [ch for ch in s if ord('a') <= ord(ch) <= ord('z')], 1) assert sig >= 1, repr( ('Product of primes produced non-positive number',s, sig, [ (z,self.__getPrime(z)) for z in s if ord('a') <= ord(z) <= ord('z') ]) ) return sig def __loadDictionary(self): '''Load a dictionary from the given input file. The file should have one word per line.''' addTiming('Start of loading dictionary from "%s"' % (self.dictionary_filename,)) # Count chars in the dictionary. Create a dict of lower # case character to the number of instances found in the # words dictionary. It might be useful to break this # function into two parts to enable custom dictionaries # passed in as a list/string instead of just specifying # a dictionary filename. However, since I'm testing # the performance of this script, I need to combine the # counting and the line loading to prevent a second # loop. There are ways around that, but it isn't a # focus for now. If loading from something other than # a dictionary file is needed then this is the place # to make that change. all_lines = list() counts = dict([ (chr(x),0) for x in range(ord('a'), ord('z')+1) ]) for line in open( self.dictionary_filename ).readlines(): all_lines.append(line.strip()) for c in line.lower(): if ord('a') <= ord(c) <= ord('z'): counts[ c ] += 1 # Construct a dict mapping from lower case character to the # prime number for it. Initialize it in the order of most # used character first, so that the letters most frequently # used get lower numbers in the hopes that the products # will be smaller numbers. primes = prime_generator() # get the prime number generator chars_map = dict( [ (x,primes.next()) for x,y in \ sorted( counts.items(), key=lambda i:i, reverse=True ) ] ) # Recast the dict as a list, where the index is the index # in the alphabet. The value is the prime number for that # character. self.chars_list = [ chars_map[chr(x)] for x in range(ord('a'), ord('z') + 1) ] addTiming( 'Assign Primes to the alphabet.' ) # Insert all the dictionary words into the list self.sigs = collections.defaultdict( list ) for word in all_lines: sig = self.__getSignature( word ) if sig > 1: # ignore lines without letters self.sigs[ sig ].append( word ) # We need the keys in list form so we can have it sorted. # Lookups into the signatures dictionary are pretty rare. self.sigs_keys = sorted( self.sigs.keys() ) addTiming( 'Create signatures dictionary.' ) def __getSignatureKeys(self): '''Returns the sorted list of all signature keys. This is used to populate a list of all available words.''' return self.sigs_keys def formulateAnagramPhraseCombo(self, res): '''Render an anagram "phrase" into text for display. Each key is separate by double spaces, and each key displays as a comma separated list of the words choices.''' return " ".join( [ ','.join(self.sigs[s]) for s in res ] ) def __solveAnagramPhrase(self, letters, unreduced_keys, start=0, \ so_far=, display_progress=False): '''This is the recursive function that helps produce anagrams. It should not be called directly.''' # letters: Product of signatures of letters remaining. # unreduced_keys: Keys representing items in the dictionary # that can non necessarily be reached by the remaining # letters. # start: The index of unreduced_keys to start considering. # keys before this do not need to be considered in # combination with this pick because earlier keys # will be considered separately and they will consider # combining with this pick as part of their recursion. # The start value should be considered again in case # the same key can be used multiple times. # so_far: A list containing the keys that have been picked # so far in this chain of recursions. # display_progress: Flag, if true then results are displayed # as they are discovered. if letters == 1: # There are no letters left. if display_progress: print "Result: %s - %s" % \ ( repr(so_far), self.formulateAnagramPhraseCombo(so_far) ) return [ so_far ] # Filter list of keys to remove any that can no longer # be constructed using some of the input letters. reduced_keys = [ x for x in unreduced_keys[start:] if letters % x == 0 ] result =  # Recurse on all items remaining in the dictionary. for index,sig in enumerate(reduced_keys): remaining_letters = letters // sig result += self.__solveAnagramPhrase( letters = remaining_letters, unreduced_keys = reduced_keys, start = index, so_far = so_far + [ sig ], display_progress= display_progress ) return result def anagrams( self, phrase_string, display_progress=False ): '''This function takes an input phrase string and returns a list of all anagrams that can be generated from it. The return value is a list of lists of lists. The inner lists represent the individual words which are interchangeable in an anagram. For instance, "tarp" and "trap" are anagrams of each other. Whenever one can be used, the other can substitute. They will be presented together in the innermost list. An example output for "lumberjack" might be: [ [ ['me'], ['bark'], ['Cluj'] ], [ ['me'], ['blur','burl'], ['jack'], ], [ ['be'], ['mark'], ['Cluj'] ] [ ['am'], ['jerk'], ['club'] ] [ ['elm','Mel'], ['jar'], ['buck'] ] ... ] ''' all_keys = self.__getSignatureKeys() if options.dumpdict: for k,v in sorted(sigs.items(), key=lambda x: x): print repr(k),'->',repr(v) r = self.__solveAnagramPhrase( self.__getSignature(phrase_string), all_keys, display_progress=display_progress ) r = [ [ self.sigs[s] for s in row ] for row in r ] addTiming( 'Solve anagrams for "%s": %d rows' % (phrase_string, len(r),) ) return r if __name__ == "__main__": import optparse description = '''\ Anagram generator by Mark Santesson. This program rearranges letters in any non-option command line parameters and output the anagrams that can be generated from them. The potential words are filtered through a dictionary. To generate anagrams for more than one phrase, separate the phrases with (escaped) semicolons. ''' parser = optparse.OptionParser( usage='usage: %prog [options] args', description=description ) parser.add_option('-t', '--time', action='store_true', default=False, dest='timing', help='Display timing information but not anagrams generated.') parser.add_option('-d', '--dictionary', default='words', dest='dictionary_name', help='Specify a location for the dictionary. Default is "%default".') parser.add_option('--dumpdict', action='store_true', default=False, dest='dumpdict', help='Dump dictionary after load, for debug purposes.') options, test = parser.parse_args() if len(test) >= 1: test = ' '.join(test).split(';') else: test = [ "face", "astronomy", "SETEC Astronomy" ] ag = AnagramGenerator( options.dictionary_name ) for x in test: print "%s:" % x ag.anagrams( x, not options.timing ) if options.timing: print "Timings:" for et,desc in timings: print "%f - %s" % (et, desc,) print "Done.\n" # 932 seconds to solve SETEC Astronomy without limiting dictionary. # The C version took 161 secs. # 21 seconds after reducing the keys from the dict. # With dictionary reduction at every recursion, it goes down to 4.25 seconds.
Generating Anagrams the Efficient and Easy Way by Mark Santesson is licensed under a Creative Commons Attribution 3.0 Unported License.