Generate inedited Beatles lyrics using Markov Chain
I love The Beatles so much, I wish they wrote more lyrics... wait, I'm a programmer, I can generate new lyrics based on their discography! Let's see how!

"๐ŸŽค Yesterday... all my troubles seemed so far awayyyyy"

Oh, wait, this is an article, you can't hear me sing! (luckily ๐Ÿ˜„)

After many years listening to my favourite artists, I know every single song and lyric. I wish that artists like Beatles, Queen, Nirvana wrote more songs!

That's what we're going to do, we will create a program that will take a bunch of lyrics as input and produces a new lyric every time that it's executed.

Incredibly skilled data scientists and programmers would end up by creating an AI with TensorFlow or stuff like that... but we want to keep it easy! We're gonna use an algorithm called Markov Chain.

#Why using Markov Chain algorithm?

I'm glad you asked!
Wikipedia describes Markov Chain as:

"a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event"

In few words, let's take a look at the following lyrics by The Beatles:

"Love is old, love is new, love is all, love is you"

let's break the sentence above in single words.

Now answer the following questions:

  1. How many times is the word "love" repeated in the above sentence?
  2. Which are the words coming immediately next to the word "love"?
  3. How many times is the word "is" repeated?
  4. Which are the words coming immediately next to the word "is"?

after we answer those questions, we can clearly visualize the following diagram:

Generate inedited Beatles lyrics using Markov Chain

So, given the diagram above, we can tell that:

  1. After the word "love", there's a 100% possibility to encounter the word "is"
  2. After the word "is", there's a 25% possibility to encounter one of the following word: "old", "new", "all", "you".

Now, imagine doing the same research on a more complex song, such as Hey Jude:

"Hey Jude, don't make it bad
Take a sad song and make it better
Remember to let her into your heart
Then you can start to make it better

Hey Jude, don't be afraid
You were made to go out and get her
The minute you let her under your skin
Then you begin to make it better

And anytime you feel the pain
Hey Jude, refrain
Don't carry the world upon your shoulders

For well you know that it's a fool
Who plays it cool
By making his world a little colder

Na na na...

Hey Jude, don't let me down
She has found you, now go and get her
Remember to let her into your heart
Then you can start to make it better

So let it out and let it in
Hey Jude, begin
You're waiting for someone to perform with
And don't you know that it's just you
Hey Jude, you'll do
The movement you need is on your shoulder
La, la, la, la...

Hey Jude, don't make it bad
Take a sad song and make it better
The minute you let her into your heart
Then you can start to make it better

Na na na..."

A Reddit user posted an interesting flowchart about this song that will save me a lot of work:

Generate inedited Beatles lyrics using Markov Chain

(source)

Can you see where I'm heading to?
We want to estimate the probability for each word to appear next to another one.
Given that value, we'll be able to compose new sentences based on the probability of a given word to pop up after another given word.

#Building the Markov Chain

We're gonna build the Markov Chain in TypeScript, so that it can run in our browsers after the compilation phase.

If you're not used to TypeScript syntax, don't worry! It will be very easy to understand.

We'll start by defining the Markov Chain interface:

interface IMarkovChain<T> {
  add:      (txt: string)     => T
  train:    ()                => T
  generate: (length?: number) => string
}

our MarkovChain class will have three public methods:

  1. add: it will let us to add more and more lyrics to our model.
  2. train: it will train our model. A model is basically an object  which will hold all the relationships between a given word to all the words that may come next to it.
  3. generate: it will return a new lyric of a given length based on our model.

Now we can start to build our MarkovChain class:

class MarkovChain implements IMarkovChain<MarkovChain> {
constructor() {} private data: string = ''; private model: {[key: string]: string[]} = {};
}

as you can see, inside our class we also want to specify two private variables.

The first one is data. It will store all the lyrics that we're adding as a unique string.

The second one is model. It will be an object which should look like this:

{
love: ['is'],
is: ['old', 'new', 'all', 'you'],
old: [],
'new': [],
all: [],
you: []
}

(if you're wondering why 'new' is quoted, it's because it is a reserved keyword in JS).

As you may have guessed, the model stores every single word in data, and all the words that will come next to them.

Let's start to implement the add method:

public add(txt: string): MarkovChain {
this.data += `${txt} `
return this
}

with this method, we just want to append the input string to the data variable. Please note that we're also adding a space after the txt variable, so that it won't concatenate the last word of the given lyrics with the first of the next one.

Now move on with the train method.
First of all, we want to make sure that we already inserted some data before starting the training. Then, we want to remove all the special characters (excluding the single quote) from our data.

public train(): MarkovChain {
if (!this.data.length) {
throw Error(`You should add some data before starting the training`) } const tokens = this.data .toLowerCase() .replace(/\n/gm, ' ') .replace(/[^(\w|\d|'|\s)]/gm, '') .split(' ') .filter(Boolean)
}

That way the following input string: 'hey there! I\'m trying the Markov Chain for the first time!!!' will be transformed into: ['hey', 'there', 'i\'m', 'trying', 'the', 'markov', 'chain', 'for', 'the', 'first', 'time'].

Now we want to train our model, so we basically want to loop over every single token. If the given token doesn't exist yet in our model as a key, we'll add it (and we'll add an empty array as a value). Otherwise, we'll push the next token as a value inside the current model array.

public train(): MarkovChain {
  if (!this.data.length) {
    throw Error(`You should add some data before starting the training`)
  }

  const tokens =
    this.data
      .toLowerCase()
      .replace(/\n/gm, ' ')
      .replace(/[^(\w|\d|'|\s)]/gm, '')
      .split(' ')
      .filter(Boolean)
for (let i = 0; i < tokens.length; i++) { const token = tokens[i] if (!(token in this.model)) { this.model[token] = [] }
const nextToken = tokens[i + 1];
if (Boolean(nextToken)) { this.model[token].push(nextToken) } } return this; }

Awesome! Now we can train our model.
Now we just need to add the generate method.

We'll also add a private helper function called getRandomElement, which will return a random element from a given array.

public generate(length: number = 100): string {
  const words = Object.keys(this.model)

  if (!words) {
    throw Error(`You should add some data and then train the model before generating a new sentence`)
  }

  let result     = ''
  let randomWord = this.getRandomElement(words)
    
  for (let i = 0; i < length; i++) {
    const newWord = this.getRandomElement(this.model[randomWord])
    result        += `${newWord} `
      
    if (!(newWord in this.model)) {
      randomWord = this.getRandomElement(words)
    } else {
      randomWord = newWord
    }
  }

  return result;
}

private getRandomElement<T>(input: T[]): T {
  return input[Math.floor(Math.random() * input.length)]
}

As you can see, these methods are pretty straightforward.
First of all, we want to check if the use has already  trained the model.
After that, we start by getting a random word from our model.
Then we want to loop x times, where x is the length of the final lyrics that we want to create.

Inside every loop we want to get a word that comes next to the previously instantiated newWord variable.

We then concatenate this word to the result string. Then we want to check if that word has its own successor inside our model.
If so, we just reinstantiate the randomWord variable with the successor, otherwise we take a new random word.

Once the loop is over, we just return the resulting string, and that's it!

Here's the full implementation:

interface IMarkovChain<T> {
  add:      (txt: string)     => T
  train:    ()                => T
  generate: (length?: number) => string
}

class MarkovChain implements IMarkovChain<MarkovChain> {
  
  constructor() {}

  private data:  string = '';
  private model: {[key: string]: string[]} = {};
  
  public add(txt: string): MarkovChain {
    this.data += `${txt} `
    return this;
  }

  public train(): MarkovChain {
    if (!this.data.length) {
      throw Error(`You should add some data before starting the training`)
    }

    const tokens =
      this.data
        .toLowerCase()
        .replace(/\n/gm, ' ')
        .replace(/[^(\w|\d|'|\s)]/gm, '')
        .split(' ')
        .filter(Boolean)

    for (let i = 0; i < tokens.length; i++) {
      const token = tokens[i]

      if (!(token in this.model)) {
        this.model[token] = []
      }

      const nextToken = tokens[i + 1];
      if (Boolean(nextToken)) {
        this.model[token].push(nextToken)
      }
    }

    return this;
  }

  public generate(length: number = 100): string {
    const words = Object.keys(this.model)

    if (!words) {
      throw Error(`You should add some data and then train the model before generating a new sentence`)
    }

    let result     = ''
    let randomWord = this.getRandomElement(words)
    
    for (let i = 0; i < length; i++) {
      const newWord = this.getRandomElement(this.model[randomWord])
      result        += `${newWord} `
      
      if (!(newWord in this.model)) {
        randomWord = this.getRandomElement(words)
      } else {
        randomWord = newWord
      }
    }

    return result;
  }

  private getRandomElement<T>(input: T[]): T {
    return input[Math.floor(Math.random() * input.length)]
  }

}

#Playing with our new toy

Now that our class is ready, we can play with it as follows:

const chain = new Markov Chain()

chain
.add('...')
.add('...')
.add('...')
.train()

console.log(chain.generate())

You can add as many lyrics you want!I
I've also created a little Codepen so that you can play with it ๐Ÿ˜„

You can also download the code from GitHub!
Enjoy!