There’s nothing as bittersweet as finishing a really great book. What do you do when you complete a great series and are left yearning for more?
If you are like me, you train a machine to read the books and let it generate new (unseen) passages from the book!
About the book
Let me introduce you to my favourite book - The Hitchhiker’s Guide to the Galaxy.
A sci-fi comedy like no other, a trilogy in five parts.
The book is smart, funny, well-written and full of wonderful commentary on the human condition.
Plus, the book has some of the best quotes EVER!
Why would you not read it?
Generating new passages
- The model is trained on the first two books: The Hitchhiker’s Guide to the Galaxy and The Restaurant at the End of the Universe
- Once I obtained the text version of the books, I divided the text into individual sentences.
- Each sentence was then split into words, handling punctuation along the way
- I used the word2vec algorithm to generate word embeddings for each word in the text’s vocabulary. (Word embeddings are numerical representations of contextual similarities between words.) A preprocessing sketch follows this list.
- Generating the dataset: iterate through the entire text and divide it into overlapping, semi-redundant chunks of 31 words each. The first 30 words act as X (the input), while the last word serves as Y (the target). A code sketch follows the worked example below.
- Train an LSTM model on the (X, Y) pairs constructed earlier. (Note that the last layer of the model should be a softmax layer with one unit per word in the vocabulary.) Essentially, the model outputs the probability of every possible next word given a 30-word context. A model sketch follows the example below.
- Once trained, start from a string of exactly 30 words (preferably taken from the text itself), pass it to the LSTM model, and select a random word according to the probabilities the model outputs. Append that word, slide the window forward, and repeat! A sampling sketch follows the example below.
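To make the pipeline concrete, here is a minimal preprocessing sketch (illustrative, not my exact code). It assumes the combined text of the two books sits in a local file, hypothetically named hg2g.txt, and uses NLTK for sentence splitting and gensim for the word2vec step:

```python
import re

import nltk
from gensim.models import Word2Vec

nltk.download("punkt")  # sentence-tokenizer models

# Hypothetical file containing the text of both books.
with open("hg2g.txt", encoding="utf-8") as f:
    raw_text = f.read().lower()  # the model works on lowercase only

# Split into sentences, then into words, stripping punctuation.
sentences = []
for sent in nltk.sent_tokenize(raw_text):
    words = re.findall(r"[a-z']+", sent)  # keep apostrophes ("don't")
    if words:
        sentences.append(words)

# Train word2vec on the tokenized sentences. The vector size and window
# here are illustrative defaults, not necessarily what I used.
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(w2v.wv.most_similar("towel"))  # words that appear in similar contexts
```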
Example: Consider the text:
“Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.” The (X, Y) pairs generated from this text are:
| X | Y |
|---|---|
| “Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think” | “digital” |
| “this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital” | “watches” |
| “at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches” | “are” |
| … | … |
| “roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat” | “idea” |
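The windowing itself fits in a few lines. This sketch reuses the tokenized `sentences` from the preprocessing sketch above; `WINDOW` and `STEP` are illustrative names:

```python
WINDOW = 30  # context words per training example
STEP = 1     # a step > 1 gives the "semi-redundant" chunks mentioned earlier

# Flatten the tokenized sentences into one continuous stream of words.
tokens = [word for sentence in sentences for word in sentence]

X, Y = [], []
for i in range(0, len(tokens) - WINDOW, STEP):
    X.append(tokens[i : i + WINDOW])  # 30 words of context
    Y.append(tokens[i + WINDOW])      # the word that follows them

# For the example sentence above, X[0] would end with "... still think"
# and Y[0] would be "digital", matching the first row of the table.
```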
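Here is a minimal Keras sketch of the model itself, continuing from the windowing code. The layer sizes and training settings are illustrative assumptions, not necessarily what the deployed model uses:

```python
import numpy as np
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.models import Sequential

# Map every word to an integer index so the windows can be fed to Keras.
vocab = sorted(set(tokens))
word_to_idx = {w: i for i, w in enumerate(vocab)}

X_idx = np.array([[word_to_idx[w] for w in window] for window in X])
Y_idx = np.array([word_to_idx[w] for w in Y])

model = Sequential([
    Embedding(input_dim=len(vocab), output_dim=100),  # could be seeded with the word2vec vectors
    LSTM(128),
    Dense(len(vocab), activation="softmax"),  # one probability per vocabulary word
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
model.fit(X_idx, Y_idx, batch_size=128, epochs=20)
```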
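And finally, a sketch of the sampling loop, reusing `model`, `word_to_idx`, and `WINDOW` from above. It assumes every seed word is in the vocabulary; sampling from the softmax (rather than taking the argmax) is what makes each run come out different:

```python
idx_to_word = {i: w for w, i in word_to_idx.items()}

def generate(seed, n_words=50):
    """Sample n_words continuations of a seed string (30 words or fewer)."""
    output = seed.lower().split()            # assumes all seed words are in vocab
    context = [word_to_idx[w] for w in output][-WINDOW:]
    for _ in range(n_words):
        probs = model.predict(np.array([context]), verbose=0)[0]
        probs = probs / probs.sum()          # guard against floating-point drift
        nxt = int(np.random.choice(len(probs), p=probs))  # sample, don't argmax
        output.append(idx_to_word[nxt])
        context = (context + [nxt])[-WINDOW:]  # slide the 30-word window
    return " ".join(output)

print(generate("the heart of gold"))
```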
Usage
- Visit http://hg2g-42.herokuapp.com/ (it’s pretty much self-explanatory)
- The top section displays a random one-liner from the original trilogy (of 5 parts).
- You can enter some seed text for the model to use as a starting point (leave it empty for an arbitrary starting sentence).
- The length of the generated text can also be varied (between 5 and 75 words).
- Since the sampling mechanism is random, you will get different (but similar) results even for the same seed text.
- For simplicity, the algorithm works only on lowercase letters, so the case of the seed text is irrelevant.
Some good starting points (found after hours of experimentation ;p)
- A nice cup …
- Once again, Marvin was …
- The Heart of Gold …
- The mice …
- The total perspective vortex …
- The Sirius Cybernetics Corporation …