These days I was given a reeeeally interesting homework at the university. I was given a set of MD5 hashes, calculated from single words (taken from Libre Office’ dictionaries) with a given sault. And the task was to find all those words.
So, the first idea which came to my mind was using an internet service for MD5 breaking. But… aaarrrggghhh! There’s a sault, so the webservice, looking for words over a dictionary fails to find mines…
So the second idea was to take that dictionary from Libre Office and iterate through it. At the end, it worked =) And worked reeeally fast. But that is not an interesting part.
I wandered if I could find those words in my dictionary, generated by my own code.
I was thinking of writing it functionally. Really functionally, in Clojure. And, highly motivated by how good Clojure is at parallelism (sarcasm?..), I started thinking. After many tries and even installing Clojure REPL on my phone and writing code all the way home from university and work, I asked my dear friend to help me with this.
And so we went non-functional way first… We wrote a Java class, having two methods for generating
words recursively. The idea was, to make a new word by appending it with each letter, then appending
each of the alphabet’ letters to this word and so on, until we get all the words of
When our pretty class was created and working (for some cases), we decided to do it more
Java-correct and created two more classes, implementing a single interface, doing the same thing
but differently each. And decided to skip writing
context.xml and going deeper with Spring =)
The solution was made. It worked reaaaally loooooong for words with six letters. I decided to start
working on Clojure implementation of this idea. I started writing a function, generating a list of
words. Word in this code was represented by a list of numbers from 0 to 25, associated with
corresponding letters of latin alphabet. That allowed me not to bore myself with generating a
list of chars and use a built-in Clojure’
But ended up with the same results as for Java on six letters. Moreover, my implementation
gave me strangely-nested list, which caused writing special flattening functions; Clojure’s
flatten gave me a plain list, where all the words were merged with others.
I needed performance boost!
I tried using
pmap instead of
map to parallelize the list generation process, but that
did not helped. I also tried generating a list of
future objects, so each list element will be
calculated only when needed:
But that caused my futures never to be finished. And then my collegue reminded me I can create a function, which will only rely on a previously generated word and will return the next one, not a list of words. “That might save me lot of memory!”, I thought and started coding.
And finished with this pretty implementation:
See, in this implementation no word is being saved nowhere except the return value or an input argument (well, actually the output from a function is then being used as an input for itself).
I converted all the word representations into words with this over-complicated function:
And then I created a function, generating a list of those words.
Yeah, again! But not a plain list or vector, as previously, nooo. I generated a lazy-sequence! This is where Clojure meets ES6 generators =) See, lazy sequence in Clojure is a (possibly) endless collection, each element of which could be evaluated (or assigned a value) when it is needed. Besides that, collection takes almost no memory and its creation costs almost no time. Saying almost I mean the time, needed to construct and store an object in memory. Just a simple, plain class’ instance.
But that was not an option - it took nearly
45 minutes to find all the 6-letter words even
pmap and eaten SO much damn memory!..
However, idle REPL eats much memory too:
Then I enhanced this implementation eliminating the
lazy-seq completely. So, there was no
collection creation at all! All the results were printed onto screen right away when found.
This could be replaced with any other storage - file, database, anything! Printing results
on the screen is a habit from my student years…
The memory consumption got minimal:
And the time consumption was not that good, though…
Running it from non-REPL environment helped a bit:
But where’s the Saint Graal for brute-forcing like that? How to speed that algorithm? I don’t know, actually =) Well, I know how to do it with OpenMP, I have a couple of ideas on how this task could be distributed, but those desire a separate article.