Loooong lists with Clojure

16 Oct 2015

Whaaaaat?

These days I was given a reeeeally interesting homework at the university. I was given a set of MD5 hashes, calculated from single words (taken from Libre Office' dictionaries) with a given sault. And the task was to find all those words.

So, the first idea which came to my mind was using an internet service for MD5 breaking. But... aaarrrggghhh! There's a sault, so the webservice, looking for words over a dictionary fails to find mines...

So the second idea was to take that dictionary from Libre Office and iterate through it. At the end, it worked =) And worked reeeally fast. But that is not an interesting part.

I wandered if I could find those words in my dictionary, generated by my own code.

Hoooow?

I was thinking of writing it functionally. Really functionally, in Clojure. And, highly motivated by how good Clojure is at parallelism (sarcasm?..), I started thinking. After many tries and even installing Clojure REPL on my phone and writing code all the way home from university and work, I asked my dear friend to help me with this.

And so we went non-functional way first... We wrote a Java class, having two methods for generating words recursively. The idea was, to make a new word by appending it with each letter, then appending each of the alphabet' letters to this word and so on, until we get all the words of N characters.

When our pretty class was created and working (for some cases), we decided to do it more Java-correct and created two more classes, implementing a single interface, doing the same thing but differently each. And decided to skip writing context.xml and going deeper with Spring =)

import java.util.*;
import java.util.stream.*;

interface IWordGenerator {
    public List<String> get(final int length);
}

class MooWordGenerator implements IWordGenerator {
    // Best practices in naming convention =)

    /** not necessary */
    public MooWordGenerator() {}

    public void completeWords(final int length, List<Character> cs, List<String> words) {
        if (length == 0) {
            String word = cs.stream().map(Object::toString).collect(Collectors.joining()); // what the fuuck???

            words.add(word);

            return;
        }

        for (char c = 'a'; c <= 'z'; c++) {
            List<Character> ncs = new ArrayList<>();

            ncs.addAll(cs);
            ncs.add(c);

            completeWords(length - 1, ncs, words);
        }
    }

    @Override
    public List<String> get(final int length) {
        List<String> words = new ArrayList<>();
        List<Character> cs = new ArrayList<>();

        completeWords(length, cs, words);

        return words;
    }
}

class FooWordGenerator implements IWordGenerator {
    /** */
    private StringBuilder sb;
    /** */
    private List<String> words;

    /**
     * Creates list of strings containing all possible words with given length.
     *
     * @param length - the length of words to generate.
     * @return List<String> words.
     *
     */
    @Override
    public List<String> get(final int length) {
        List<String> allWords = getAllWords(length);
        sb = null;
        words = null;
        return allWords;
    }

    /**
     * Creates list of strings containing all possible words with given length.
     * Possible to run this method ONLY ONCE! Next calls will
     * give the wrong result.
     *
     * @param length - the length of words to generate.
     * @return List<String> words.
     *
     */
    private List<String> getAllWords(final int length) {
        for (char c = 'a'; c <= 'c'; c++) { //doesn't work
            getSB().append(c);

            if (getSB().length() < length) {
                getAllWords(length);
            } else {
                getWords().add(getSB().toString());
            }

            getSB().deleteCharAt(getSB().length() - 1);
        }

        return getWords();
    }

    /**
     * Get string builder. create new if null.
     * @return sb.
     */
    private StringBuilder getSB() {
        if (sb == null) {
            sb = new StringBuilder();
        }

        return sb;
    }

    /**
     * Get words array. create new if null.
     * @return words.
     */
    private List<String> getWords() {
        if (words == null) {
            words = new ArrayList<String>();
        }

        return words;
    }
}

class Main {
    public static void main(String[] args) {
        /** D: go smoke
         *
         *
         *
         *
         **/
        IWordGenerator gen1 = new MooWordGenerator();
        IWordGenerator gen2 = new FooWordGenerator();

        System.out.println("Moo:");

        gen1.get(6).stream().forEach(System.out::println);

        System.out.println("Foo:");

        gen1.get(6).stream().forEach(System.out::println);
    }
}

The solution was made. It worked reaaaally loooooong for words with six letters. I decided to start working on Clojure implementation of this idea. I started writing a function, generating a list of words. Word in this code was represented by a list of numbers from 0 to 25, associated with corresponding letters of latin alphabet. That allowed me not to bore myself with generating a list of chars and use a built-in Clojure' range function.

(defn gen-words [len cs]
    (if (= 0 len)
        cs
        (map #(gen-words (dec len) (conj cs %)) (range 26))))

But ended up with the same results as for Java on six letters. Moreover, my implementation gave me strangely-nested list, which caused writing special flattening functions; Clojure's flatten gave me a plain list, where all the words were merged with others.

I needed performance boost!

I tried using pmap instead of map to parallelize the list generation process, but that did not helped. I also tried generating a list of future objects, so each list element will be calculated only when needed:

(defn gen-words [len cs]
    (if (= 0 len)
        cs
        (map #(future (gen-words (dec len) (conj cs %))) (range 26))))

But that caused my futures never to be finished. And then my collegue reminded me I can create a function, which will only rely on a previously generated word and will return the next one, not a list of words. "That might save me lot of memory!", I thought and started coding.

And finished with this pretty implementation:

(defn next-word [prev-word]
    (let [
            l (drop-while (partial = 25) prev-word)
            l-size (- (count prev-word) (count l))
            zeroes (repeat l-size 0)
            new-head (inc (first l))
            new-tail (rest l)
        ]
        (concat zeroes [new-head] new-tail)))

See, in this implementation no word is being saved nowhere except the return value or an input argument (well, actually the output from a function is then being used as an input for itself).

I converted all the word representations into words with this over-complicated function:

(defn word-to-str [w] (clojure.string/join (map #(char (+ (int \a) (int %))) w)))

And then I created a function, generating a list of those words.

(defn words
    ([n] (words n (repeat n 0)))
    ([n w]
        (let [str-w (word-to-str w)]
            (if (every? (partial = 25) w)
                [str-w]
                (lazy-seq (cons str-w (words n (next-word w))))))))

Yeah, again! But not a plain list or vector, as previously, nooo. I generated a lazy-sequence! This is where Clojure meets ES6 generators =) See, lazy sequence in Clojure is a (possibly) endless collection, each element of which could be evaluated (or assigned a value) when it is needed. Besides that, collection takes almost no memory and its creation costs almost no time. Saying almost I mean the time, needed to construct and store an object in memory. Just a simple, plain class' instance.

But that was not an option - it took nearly 45 minutes to find all the 6-letter words even when using pmap and eaten SO much damn memory!..

However, idle REPL eats much memory too:

(defn seq-contains? [coll target]
    (some #(= target %) coll))

(defn decode-1 [n]
    (let [ ws (filter #(seq-contains? input-hashes (encode %)) (words n)) ]
        (pmap (fn [w] [w (encode w)]) ws)))

Then I enhanced this implementation eliminating the lazy-seq completely. So, there was no collection creation at all! All the results were printed onto screen right away when found. This could be replaced with any other storage - file, database, anything! Printing results on the screen is a habit from my student years...

(defn decode-2
    ([n] (decode-2 n (repeat n 0)))
    ([n prev-word]
        (let [
                prev-word-str (word-to-str prev-word)
            ]
            (if (seq-contains? input-hashes (encode prev-word-str))
                (println (encode prev-word-str) prev-word-str)
                nil)
            (if (every? (partial = 25) prev-word)
                nil
                (recur n (next-word prev-word))))))

The memory consumption got minimal:

And the time consumption was not that good, though...

task1.core=> (time (decode-2 5))
4adbc43d3e1b432aea5e657cd57016de rampa
bba60169d41b2dce7d0b37b2f9d637e0 kolej
7914da949837cbdecf35cbf5951ad518 argon
"Elapsed time: 92879.138203 msecs"
nil

Running it from non-REPL environment helped a bit:

╰─$ lein run
4adbc43d3e1b432aea5e657cd57016de rampa
bba60169d41b2dce7d0b37b2f9d637e0 kolej
7914da949837cbdecf35cbf5951ad518 argon
"Elapsed time: 73796.575567 msecs"

Profit?

But where's the Saint Graal for brute-forcing like that? How to speed that algorithm? I don't know, actually =) Well, I know how to do it with OpenMP, I have a couple of ideas on how this task could be distributed, but those desire a separate article.