(Poorly written and badly structured Python code for this blog post can be found here, this makes quite a fun and quick project if you’re trying to learn a new language)
Imagine reading the following half of a sentence in a sherlock holmes novel:
"My name is Sherlock".....
What is the probability the next word will be “Holmes”? Well, thanks to project gutenberg who host public domain e-books, and the wonders of “find” in nearly any text editing software, we can see that that collection of words appears exactly twice in all of the Sherlock Holmes books, and more importantly for us, 100% of the time it appears it will be followed by the word “Holmes.”
Now lets look at the following fragment of that previous sentence:
We can probably assume the range of words that are likely to follow this fragment within the books is greater than just
"Sherlock",and with a bit of tedious searching, we can find exactly what words appear and the number of times that they do.
From this table of figures it is now possible to derive a series of probabilties, so that now when you read the fragment
"name is" in a Sherlock Holmes novel you can thoughtfully pause before reading the next word and take an informed guess at what it will be.
What if we now take the pair of words:
What is the range of words that could follow that fragment within the books? Or even better, what if we take the entirety of the Sherlock Holmes novels, split it all down into consecutive pairs of words and, for every pair, calculate the probabilites for the next word? What we’d have would be a model we could use to predict the next word for any given pair of words within the books of Sherlock Holmes.
Lets abstract this idea a bit:
We have a model of a system, which given a state of the system, we can use to predict the next state.
In our case the “system” is the books, the “state” is a pair of words, and the “model” is a massive list of all the pairs of words in the books with a corresponding table for every pair showing the possible following words (the “next state”). An important thing to note is the predictions are based purely on the current state, no previous states are considered.
Pretty cool eh? but what to do with it? Well, there’s probably some pretty interesting analysis/classification problems this could help with, but I’m just going to use it to make nonsense sentences.
To make nonsense we first pick a random pair of consecutive words from our book as a starting point. We then look at all the possible following words in our model and roll a virtual weighted die to determine what the next word should be. We then take the 2nd word from out first pair and our new word, form them into a new pair and find the possible words that could follow this new pair….and so on.
The result of this can be quite interesting, sentences can seem to have a structure and a semblance of meaning, yet still make no sense. This is due in part to some pairs of words having very few possible words following them (for example
"is Sherlock"), while other pairs of words have many possible words that could follow them (for example
Where can we go from here? Well, there are a number of possible options, instead of using whole words, why not use letters to make nonsense words? Change the number of consecutive words used to create the model and see what happens (my guess is you’ll reach a point where trying to get a random sentence will result in just reciting parts of the book word for word), how about feeding two different books into the model instead of just one? (feeding Sherlock Holmes and the Wizard of Oz in results in some pretty weird sentences that seem to “flow” between Oz and Sherlock).