“Clothed all in the morning you know by the river” – A Lite Introduction to Markov Chains and Nonsense

(Poorly written and badly structured Python code for this blog post can be found here, this makes quite a fun and quick project if you’re trying to learn a new language)

Imagine reading the following half of a sentence in a sherlock holmes novel:

"My name is Sherlock".....

What is the probability the next word will be “Holmes”? Well, thanks to project gutenberg who host public domain e-books, and the wonders of “find” in nearly any text editing software, we can see that that collection of words appears exactly twice in all of the Sherlock Holmes books, and more importantly for us, 100% of the time it appears it will be followed by the word “Holmes.”

Now lets look at the following fragment of that previous sentence:

"name is"

We can probably assume the range of words that are likely to follow this fragment within the books is greater than just "Sherlock",and with a bit of tedious searching, we can find exactly what words appear and the number of times that they do.

From this table of figures it is now possible to derive a series of probabilties, so that now when you read the fragment "name is" in a Sherlock Holmes novel you can thoughtfully pause before reading the next word and take an informed guess at what it will be.

What if we now take the pair of words:

"My name"

What is the range of words that could follow that fragment within the books? Or even better, what if we take the entirety of the Sherlock Holmes novels, split it all down into consecutive pairs of words and, for every pair, calculate the probabilites for the next word? What we’d have would be a model we could use to predict the next word for any given pair of words within the books of Sherlock Holmes.

Lets abstract this idea a bit:

We have a model of a system, which given a state of the system, we can use to predict the next state.

In our case the “system” is the books, the “state” is a pair of words, and the “model” is a massive list of all the pairs of words in the books with a corresponding table for every pair showing the possible following words (the “next state”). An important thing to note is the predictions are based purely on the current state, no previous states are considered.

Pretty cool eh? but what to do with it? Well, there’s probably some pretty interesting analysis/classification problems this could help with, but I’m just going to use it to make nonsense sentences.

To make nonsense we first pick a random pair of consecutive words from our book as a starting point. We then look at all the possible following words in our model and roll a virtual weighted die to determine what the next word should be. We then take the 2nd word from out first pair and our new word, form them into a new pair and find the possible words that could follow this new pair….and so on.

The result of this can be quite interesting, sentences can seem to have a structure and a semblance of meaning, yet still make no sense. This is due in part to some pairs of words having very few possible words following them (for example "is Sherlock"), while other pairs of words have many possible words that could follow them (for example "it is").

Where can we go from here? Well, there are a number of possible options, instead of using whole words, why not use letters to make nonsense words? Change the number of consecutive words used to create the model and see what happens (my guess is you’ll reach a point where trying to get a random sentence will result in just reciting parts of the book word for word), how about feeding two different books into the model instead of just one? (feeding Sherlock Holmes and the Wizard of Oz in results in some pretty weird sentences that seem to “flow” between Oz and Sherlock).

11 responses to ““Clothed all in the morning you know by the river” – A Lite Introduction to Markov Chains and Nonsense

  1. Thanks for your blog post – it finally explained Markov chains pretty clearly and I managed to knock up a basic implementation in CoffeeScript in an hour.

    It’s rough and ready, but does the job – hopefully I can improve it over the weekend and release the code as well.

    • Glad you found it useful! Let me know if you release the code, would be nice to have a look through someone else’s implementation (mine was pretty shoddy, have been meaning to spruce it up a bit).

  2. “…and crime records, unique! Violin player, boxer, swordsman, lawyer, and self poisoner by cocaine, and ambition. The drowsiness of the sort.”

    (punctuation added)

  3. Pingback: Tweets that mention “Clothed all in the morning you know by the river” – A Lite Introduction to Markov Chains and Nonsense | Shotton Electric -- Topsy.com

  4. Very cool. Markov chains can lots of interesting stuff. I used them to generate pronounceable passwords: http://exyr.org/2011/random-pronounceable-passwords/

  5. programmingpraxis

    I did an exercise on Markov chains at my blog Programming Praxis.

  6. Here’s my C++ implementation, and a writeup about it from some time ago, specifically looking at using Markov chains to generate random inputs for testing software.

  7. Great post. A basic “here’s the point” sort of intro to a topic is really hard to write but makes the rigorous explanation much more accessible.

    I wanted to try changing the number of words in the ‘state’ so I changed around your implementation a bit to make this easier. I’ve found that at 3 or 4 words full sentence quotes start showing up.

    Other than making thematic lorem ipsum generators what are the most common uses for this method?

    • Excellent, thank you! Do you mind if I push your changes to the github repo?I’ll add you in the readme as a contributor or something.

      I’ve heard talk of this method being used to produce spam that’ll get through spam filters better due to the sentence structure and word distribution being similar to genuine emails. Beyond that I’m afraid at haven’t really looked into possible uses. Although I’ve heard hidden markov models have loads of uses in natural language processing.

      • Feel free to use my changes, no need to worry about attribution. I’ll say that it has some known buggyness though: “it’s” going to get turned to [“it”, ” s”], the word splitter doesn’t respect sentence boundaries, etc.

        If you are learning Python and interested in the implementation differences I suspect two things will be of particular relevance to you: the defaultdict from the standard library and the use of negative numbers as indexes. Those two dramatically simplify your code. The somewhat strange ' '.join([...]) idiom is totally anti-zen in my opinion but widely regarded as correct.

        Always good to know that SPAM is in with a fighting chance… I’ll keep an eye out for your post on hidden Markov models.

  8. Pingback: 3D Scanner Mk 3 – Lasering Time! « Shotton Electric

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s