Monthly Archives: February 2011

“Clothed all in the morning you know by the river” – A Lite Introduction to Markov Chains and Nonsense

(Poorly written and badly structured Python code for this blog post can be found here. It makes quite a fun and quick project if you’re trying to learn a new language.)

Imagine reading the following half of a sentence in a Sherlock Holmes novel:

"My name is Sherlock".....

What is the probability the next word will be “Holmes”? Well, thanks to Project Gutenberg, which hosts public domain e-books, and the wonders of “find” in nearly any text editing software, we can see that this sequence of words appears exactly twice in all of the Sherlock Holmes books and, more importantly for us, 100% of the time it appears it is followed by the word “Holmes.”

Now let’s look at the following fragment of that previous sentence:

"name is"

We can probably assume the range of words likely to follow this fragment within the books is greater than just "Sherlock", and with a bit of tedious searching we can find exactly which words appear and the number of times that they do.

From this table of figures it is now possible to derive a series of probabilities, so that now when you read the fragment "name is" in a Sherlock Holmes novel you can thoughtfully pause before reading the next word and take an informed guess at what it will be.
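The counting step can be sketched in a few lines of Python. This is only an illustration: the sample text is made up rather than taken from the books, and the function name `follower_probabilities` is hypothetical, not something from the code linked above.

```python
from collections import Counter


def follower_probabilities(text, fragment):
    """Return {next_word: probability} for the words that follow `fragment`."""
    words = text.split()
    target = fragment.split()
    n = len(target)
    counts = Counter()
    # Slide over the text; whenever the fragment matches, record the next word.
    for i in range(len(words) - n):
        if words[i:i + n] == target:
            counts[words[i + n]] += 1
    total = sum(counts.values())
    # Turn raw counts into probabilities (empty dict if the fragment never occurs).
    return {word: count / total for word, count in counts.items()}


# Made-up sample text, not a quotation from the novels.
sample = "my name is Sherlock Holmes and his name is Watson but my name is Sherlock"
probs = follower_probabilities(sample, "name is")
```

In this toy text "name is" is followed by "Sherlock" twice and "Watson" once, so the informed guess would be "Sherlock" with probability 2/3.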

What if we now take the pair of words:

"My name"

What is the range of words that could follow that fragment within the books? Or even better, what if we take the entirety of the Sherlock Holmes novels, split it all up into consecutive pairs of words and, for every pair, calculate the probabilities for the next word? What we’d have would be a model we could use to predict the next word for any given pair of words within the books of Sherlock Holmes.

Let’s abstract this idea a bit:

We have a model of a system which, given the current state of the system, we can use to predict the next state.

In our case the “system” is the books, the “state” is a pair of words, and the “model” is a massive list of all the pairs of words in the books with a corresponding table for every pair showing the possible following words (the “next state”). An important thing to note is that the predictions are based purely on the current state; no previous states are considered. That “memoryless” property is what makes this a Markov chain.
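One possible way to hold such a model in Python is a dictionary keyed by word pairs, where each value is a tally of the words seen after that pair. This is a rough sketch under my own assumptions (the name `build_model` and the sample text are made up), not necessarily how the linked code does it.

```python
from collections import Counter, defaultdict


def build_model(text):
    """Map each consecutive word pair to a Counter of the words that follow it."""
    words = text.split()
    model = defaultdict(Counter)
    # Walk the text three words at a time: (pair of words) -> following word.
    for first, second, following in zip(words, words[1:], words[2:]):
        model[(first, second)][following] += 1
    return model


# Made-up sample text, not a quotation from the novels.
sample = "my name is Sherlock Holmes and his name is Watson but my name is Sherlock"
model = build_model(sample)
```

Storing counts rather than probabilities keeps the model simple; the probabilities can be recovered at any time by dividing each count by the total for its pair.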

Pretty cool eh? But what to do with it? Well, there are probably some pretty interesting analysis/classification problems this could help with, but I’m just going to use it to make nonsense sentences.

To make nonsense we first pick a random pair of consecutive words from our book as a starting point. We then look at all the possible following words in our model and roll a virtual weighted die to determine what the next word should be. We then take the 2nd word from our first pair and our new word, form them into a new pair and find the possible words that could follow this new pair….and so on.

The result can be quite interesting: sentences can seem to have a structure and a semblance of meaning, yet still make no sense. This is due in part to some pairs of words having very few possible words following them (for example "is Sherlock"), while other pairs of words have many possible words that could follow them (for example "it is").
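The generation loop described above might look something like the following. Treat it as a sketch under assumptions: the function names are hypothetical, the `rng` parameter exists only so the randomness can be seeded, and stopping when a pair has no known followers is my own choice for handling the end of the text.

```python
import random
from collections import Counter, defaultdict


def build_model(words):
    """Map each consecutive word pair to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for first, second, following in zip(words, words[1:], words[2:]):
        model[(first, second)][following] += 1
    return model


def generate(words, length, rng=random):
    """Produce up to `length` words of nonsense from a list of source words."""
    model = build_model(words)
    # Pick a random pair of consecutive words as the starting point.
    start = rng.randrange(len(words) - 1)
    output = [words[start], words[start + 1]]
    while len(output) < length:
        followers = model.get((output[-2], output[-1]))
        if not followers:
            break  # this pair only occurs at the very end of the text
        # The "weighted die": candidates are weighted by how often they were seen.
        candidates = list(followers)
        weights = list(followers.values())
        output.append(rng.choices(candidates, weights=weights)[0])
    return " ".join(output)


# Made-up sample text; a real run would read a whole book from a file instead.
sample = "my name is Sherlock Holmes and his name is Watson but my name is Sherlock"
sentence = generate(sample.split(), 12)
```

With a text this small the output mostly parrots the input; the interesting nonsense only shows up once the model is built from a whole book.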

Where can we go from here? Well, there are a number of possible options. Instead of using whole words, why not use letters to make nonsense words? Or change the number of consecutive words used to create the model and see what happens (my guess is you’ll reach a point where trying to get a random sentence will just result in reciting parts of the book word for word). How about feeding two different books into the model instead of just one? (Feeding in Sherlock Holmes and the Wizard of Oz results in some pretty weird sentences that seem to “flow” between Oz and Sherlock.)
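Two of those variations fall out almost for free if the model builder takes any sequence of tokens and a configurable state size. The sketch below assumes a hypothetical `order` parameter (the number of consecutive tokens per state) and uses toy inputs; it is one way to do it, not the only one.

```python
from collections import Counter, defaultdict


def build_model(tokens, order=2):
    """Map each run of `order` consecutive tokens to a Counter of what follows."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        state = tuple(tokens[i:i + order])
        model[state][tokens[i + order]] += 1
    return model


# Words as tokens, with a state of a single word instead of a pair.
word_model = build_model("the cat sat on the cat mat".split(), order=1)

# Letters as tokens, for making nonsense words.
letter_model = build_model(list("banana"), order=2)
```

Mixing two books could then be as simple as concatenating their word lists before building the model, at the cost of one artificial transition where the first book ends and the second begins.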

On a personal note….

Half of all the blog posts I write seem to begin with me semi-apologetically mentioning how long it’s been since I’ve last written something down here (apologising to who, I’m not sure) or about the 3D scanner (which is still making progress). Well this one shan’t (except in this round-about sort of way). This one shall begin with me stating:

I’ve Been Busy.

Trying to stay true to my New Year’s resolution of “doing more stuff”, I’ve taken up rock climbing again and I’m thoroughly enjoying it. I’ve also done a bit of snowboarding, which was immensely fun, and to top it off I’ve drunkenly agreed to go along to one of the tango lessons my landlady teaches…. although I may be slightly regretting that.

Kindle!

I love books, and for a long time I despised the idea of e-books. They seemed so impersonal and transient compared to a solid physical object that you could hold, see and smell (who doesn’t like the smell of fresh print, or old dusty and worn books?). Until 4 days ago I had read the grand total of one e-book (the excellent Makers by Cory Doctorow, a massive inspiration for the 3D scanner project). If it wasn’t for how good the book was, my total would have stood at 0 e-books: the experience of endless scrolling and reading a backlit screen for extended periods of time is both tiring and uncomfortable.

I’d occasionally eyed kindles and other e-ink display readers in shops and over people’s shoulders on the tube, and apart from being mildly impressed by how nice the screens are to read, I was never terribly taken by them. “Why not?” I started asking myself. I have a stack of books sitting at home with the spines un-cracked, I have piles of books I’ve impulse bought, read once and then left in a box to collect dust, and I’m forever ferrying books between London and Manchester to read on the train. These books are doing nothing more than taking up space and/or weighing my backpack down; an electronic device that could hold all of them would be perfect.

So I bought myself a kindle.

The moment I knew it was awesome was when I forgot I was reading a book on a kindle. While reading books normally I often get so “into it” I don’t consciously realise I’m turning the pages, or that my eyes are skipping along lines of ink, the physical book just sort of melts away. That’s the same experience I had reading my first book on the kindle.

I’ve read two books on it so far and it’s been excellent. The screen is easy to read, they’ve refined the button placement almost perfectly on the 3rd generation, and the ability to download any book at almost any time over 3G is incredibly useful (although the web browser leaves a little to be desired, but hey, it’s not what it’s aimed at).

There are a few downsides. There have been many times I’ve looked up a book only to find it not available for the kindle (although hopefully this situation will only improve with time), the keyboard is pretty mediocre, and I don’t like the restrictive DRM they put on books downloaded from the amazon store (I’ve heard rumours that the e-book manager calibre has some helpful plug-ins to deal with this). I can also imagine it not working very well with books that use the physical properties of the book as part of the narrative (such as Albert Angelo by B.S. Johnson), but with that in mind the form of the kindle, along with amazon working on an SDK, has the potential to open up a whole new realm of narrative possibilities (keeping my fingers crossed for a revival of choose your own adventure books).

I’ve also jail-broken it so I can do custom screen-savers (some of the built-in kindle ones are a bit freaky), and although it’s not exactly original I’ve done this to it:

I mean, come on. Portable, connected to a free world wide network (I bought the 3G version) with access to wikipedia/wikitravel…..

Bloody hell……. I was going to write about some other stuff I’ve been doing but this post has practically turned into me spouting my opinions of the kindle. Well Amazon, enjoy your free publicity on this grand blog with its staggering readership of about 2 (hi Lawrence!).