The most straightforward way to extend the first-order Markov model is to add an order and make it a second-order model. This means that instead of matching token pairs it matches token triples. It's like using the two most recent words to predict the third. It's surprisingly effective. How would you complete these phrases?

You probably said "big bad wolf" (possibly "big bad bear" or "big bad dude") but it probably wasn't "big bad artichoke" and I'm certain it wasn't "big bad consider". The pattern set by two words is stronger than the pattern set by one. Tokens aren't quite the same thing. They are sometimes just parts of words and can include symbols and punctuation. But the idea still holds.

Similar to first-order models, the occurrences of all three-token sequences are counted up. To get from counts to probabilities, the sequence count is divided by total two-token prefix count. The probability that the next token after A and B will be C is given by

P(ABC | AB) = count(ABC) / count(AB)

In pseudo-Python the count array is initialized as a three-dimensional array of zeros

sequence_counts = zeros(dictionary_size, dictionary_size, dictionary_size)

and the counts are collected as the code walks the length of token ids that make up the full tokenized training corpus.

for i_pair in range(len(token_ids) - 2):
    sequence_counts[
        token_ids[i_pair],
        token_ids[i_pair + 1],
        token_ids[i_pair + 2]
    ] += 1

To get the probabilities, the counts are summed across all final tokens (axis 2) to get the total count of occurrences of the two-token prefix of the sequence. Then the count of each sequence is divided by the count of its prefix.

sequence_probabilities = sequence_counts / sum(sequence_counts, axis=2)

Memory management

Working with a three-token sequence means that the count and probability arrays become three-dimensional. That means that the total number of parameters jumps from the square of the tokenizer dictionary size to its cube. What was a manageable 20,000 squared (400 million) is now 20,000 cubed (8 trillion), and my laptop doesn't have the 32 TB of RAM to handle that many parameters.

With some trial and error I found that a dictionary size of 1,000, resulting in a 1 billion element array, is somthing I can handle with a handful of GB of memory. That means that the representation will be quite different than it would be for a 20,000-element dictionary tokenizer. But we can still run it on the evals and see how it stacks up.

Performance

model name capitalization spelling
random proof_01 (5) 8 (8) 16
FOMM_00 proof_02 (7) 98 (7) 98
SOMM_00 proof_03 (6) 100 (6) 98

Even with 1/20 th the dictionary, the second-order model performs on par with the first-order model. The 98% and 100% recall, coupled with the 6% precision, indicates that both the models are under-trained. Most sequences they encounters are entirely novel to them, and so get tagged as errors. They lack the eperience (training data) to competently distinguish correct from incorrect writing.

Doubling the training data

Currently, the models are all training on ten novels. It's telling to create new versions of the models that have been trained on an additional ten novels and compare the results.

I added the next ten most popular novels from the Project Gutenberg Science Fiction and Fantasy category:

and I added a volume to the evaluation data collection

Some of these I have never heard of before, unlike the first batch, all of which I was familiar with.

The resulting first-order model with double training data I numbered version 4 and assigned the second-order model version number 5.

The results on the doubled training data were underwhelming. Recall mostly held constant, and precision ticked up by a single percentage point. More training data is useful with these models, but it appears that it is going to the a metric ton of the stuff to make a dent in performance using these models.

model name capitalization spelling
random proof_01 (8) 12 (10) 16
FOMM_00 proof_02 (7) 98 (7) 98
SOMM_00 proof_03 (6) 100 (6) 98
FOMM_01 proof_04 (8) 100 (8) 98
SOMM_01 proof_05 (8) 100 (7) 96

Adding in an eval for punctuation

Writing evals is somewhat tedious so I'm spreading out the work, adding them in gradually. Here's what the updated results look like with punctuation added in.

model name capitalization punctuation spelling
random proof_01 (12) 26 (14) 20 (13) 24
FOMM_00 proof_02 (7) 98 (7) 98 (7) 98
SOMM_00 proof_03 (6) 100 (6) 94 (6) 98
FOMM_01 proof_04 (8) 100 (8) 98 (8) 98
SOMM_01 proof_05 (8) 100 (7) 90 (7) 96

The precision numbers for punctuation are consistent with the others, showing a huge number of false positives consistent with an under-trained language model. But the recall numbers are lower for the second-order Markov models, which also have a smaller tokenizer dictionary and shorter tokens. This suggests that some inaccuracies in punctuation only become clear when more context is taken into account. A missing comma, or an extra one, requires understanding the flow of a sentence. I'll get to test this in future verstions that incorporate more context, more tokens.

Adding a heartbeat

One thing about working with models and data sets that strain the capacity of your hardware is that they can be sloooow. A psychological trick for not driving yourself insane is to add a heartbeat to the code. Something, anything really, that prints to the terminal and gives a sense of continued progress. For instance, while training models on 20 novels, I started printing the name of each novel as it came up. Having something happen every second or two is all you need. It's also a useful indicator for when something has stopped working entirely. It can save you waiting for 90 minutes before you decide that the code has hung.

Next steps

It feels like this is progress, but the proofreading performance gains haven't hit yet. Time to try some new things.