
New passages from an old book.

There’s nothing as bittersweet as finishing a really great book. What do you do when you complete a great series and are left yearning for more?
If you are like me, you train a machine to read the books and let it generate new (unseen) passages from the book!

About the book

Let me introduce you to my favourite book - The Hitchhiker’s Guide to the Galaxy.

A sci-fi comedy unlike any other, a trilogy in five parts.
The book is smart, funny, well-written and full of wonderful commentary on the human condition.
Plus, the book has some of the best quotes EVER!
Why would you not read it?

Generating new passages

  • The model is trained on the first 2 books: The Hitchhiker’s Guide to the Galaxy and The Restaurant at the End of the Universe.
  • Once I obtained the text version of the books, I divided the text into individual sentences.
  • Each sentence was then subdivided into words (dealing with punctuation as well).
  • I used the word2vec algorithm to generate a word embedding for each word in the text’s vocabulary.
    (Word embeddings are numerical vectors that capture contextual similarities between words.)
  • Generating the dataset:
    • Iterate through the entire text and divide it into overlapping, semi-redundant chunks of 31 words each, sliding the window forward one word at a time. The first 30 words of each chunk act as X (the input) while the last word serves as Y (the target).
  • Train an LSTM model using the X, Y dataset constructed earlier. (Note that the last layer of the model should be a softmax layer whose size equals the vocabulary size, i.e. the number of distinct words.) Essentially, the model outputs a probability for every possible next word, given the previous 30 words.
  • Once trained, pick a random sequence of exactly 30 words (preferably from the text itself), pass it to the LSTM model, and sample the next word according to the probabilities given by the model. Append that word, drop the first word so the input is 30 words long again, and repeat!
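The last two steps can be sketched in a few lines of plain Python. This is only an illustration of the sampling loop, not the code behind the app: `toy_model` is a hypothetical stand-in for the trained LSTM (it just returns one made-up score per vocabulary word), and the tiny vocabulary, the `softmax` helper and the `generate` function are assumptions made for the example.

```python
import math
import random

CONTEXT = 30  # number of words the model sees at each step, as described above


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(context, vocab):
    """Hypothetical stand-in for the trained LSTM: one raw score per vocabulary word.

    It simply scores recently used words lower, so the distribution is not uniform.
    """
    recent = set(context[-5:])
    return [0.0 if word in recent else 1.0 for word in vocab]


def generate(seed, vocab, length, rng):
    """Sample `length` new words, sliding the 30-word context forward each step."""
    assert len(seed) == CONTEXT, "the model expects exactly 30 words"
    context = list(seed)
    output = []
    for _ in range(length):
        probs = softmax(toy_model(context, vocab))
        # Draw one word at random, weighted by the predicted probabilities.
        word = rng.choices(vocab, weights=probs, k=1)[0]
        output.append(word)
        context = context[1:] + [word]  # drop the oldest word, append the new one
    return output


# A tiny made-up vocabulary and seed, purely for illustration.
vocab = ["a", "nice", "cup", "of", "tea", "marvin", "was", "the", "mice"]
seed = [vocab[i % len(vocab)] for i in range(CONTEXT)]
print(" ".join(generate(seed, vocab, 10, random.Random(42))))
```

Because each word is drawn at random from the predicted distribution rather than always taking the most likely one, two runs with differently seeded generators will usually produce different text from the same starting words.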

Example: Consider the text:
“Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat idea.”

The X, Y pairs generated from this text are:

X: “Orbiting this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think”
Y: “digital”

X: “this at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital”
Y: “watches”

X: “at a distance of roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches”
Y: “are”

…

X: “roughly ninety-two million miles is an utterly insignificant little blue green planet whose ape-descended life forms are so amazingly primitive that they still think digital watches are a pretty neat”
Y: “idea”
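The pairs in this example can be reproduced with a short sliding-window sketch. The `tokenize` and `make_pairs` helpers below are hypothetical names, and the tokenizer makes two assumptions that the post does not spell out: text is lowercased (the Usage section says the model works on lowercase text) and punctuation is dropped, while hyphenated words such as “ninety-two” stay in one piece.

```python
import re

WINDOW = 31  # 30 input words (X) plus 1 target word (Y)


def tokenize(text):
    """Lowercase the text and split it into words.

    Assumption: punctuation is dropped, but hyphens and apostrophes inside a
    word are kept, so "ninety-two" counts as a single word.
    """
    return re.findall(r"[a-z0-9]+(?:[-'][a-z0-9]+)*", text.lower())


def make_pairs(words, window=WINDOW):
    """Slide a window over the words one step at a time and split it into (X, Y)."""
    pairs = []
    for start in range(len(words) - window + 1):
        chunk = words[start:start + window]
        pairs.append((chunk[:-1], chunk[-1]))  # first 30 words, then the 31st
    return pairs


sentence = (
    "Orbiting this at a distance of roughly ninety-two million miles is an "
    "utterly insignificant little blue green planet whose ape-descended life "
    "forms are so amazingly primitive that they still think digital watches "
    "are a pretty neat idea."
)

words = tokenize(sentence)
pairs = make_pairs(words)

# Print a shortened view of each pair: start and end of X, then Y.
for x, y in pairs:
    print(" ".join(x[:3]), "...", " ".join(x[-2:]), "->", y)
```

With this tokenizer the sentence has 37 words, so it yields 7 windows whose targets are “digital”, “watches”, “are”, “a”, “pretty”, “neat” and “idea”.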

Usage

  • Visit http://hg2g-42.herokuapp.com/ (it’s pretty much self-explanatory)
    • The top section displays a random one-liner from the original trilogy (of 5 parts).
    • You can enter some seed text that my model can use as a starting point (leave it empty for an arbitrary starting sentence).
    • The length of the generated text can also be varied (between 5 and 75 words).
  • Since the sampling mechanism is random, you will get different (but similar) results even for the same seed text.
  • For simplicity, the algorithm works only on lowercase text, so the case of the seed text is irrelevant.

Some good starting points (found after hours of experimentation ;p)

  • A nice cup …
  • Once again, Marvin was …
  • The heart of gold …
  • The mice …
  • The total perspective vortex …
  • The Sirius cybernetics corporation …
This post is licensed under CC BY 4.0 by the author.
