Implicit Grammar: Language Modeling for Spell-Checking
Using not-very-small language models for spell-checking before LLMs became a big thing.
In my studies of natural language processing (NLP), I was always curious about how we can discover structure in linguistic data and how far relatively simple statistical models can go in capturing human knowledge. Recent years have shown that we can go pretty far with embarrassingly complex models, but let us keep it simple this time.
It is very convenient to double-check the spelling of a phrase using a search engine. Google, Bing, and others have gathered huge collections of documents in their indices, which serve as powerful indicators of how popular a word or a phrase is on the Internet. So, a correctly spelt phrase will appear often in the search index, and if it does not, something needs a closer look. For example, let us check how many documents in Google contain each of the two phrases:
this country is wonderful "as well as it's people": 6 results
this country is wonderful "as well as its people": 3,680,000 results
The more grammatical version of the sentence (its people instead of it’s people) appears in many more documents. This is the beauty of the statistical approach to NLP: we can build useful things without explicit linguistic knowledge. We defined no grammar rules and performed no parsing for spell-checking. Instead, we just queried a huge collection of texts (aka a corpus) called the World Wide Web.
What we observed above is an example of statistical language modeling. A bit more strictly, a language model computes the probability of a sequence of words in the language. A good language model assigns a high probability to a phrase that frequently occurs in the language and vice versa; it is a fundamental building block of statistical NLP. Although the idea is very simple, language models built on vast amounts of text demonstrate impressive results. They can not only estimate phrase probabilities but also generate new texts resembling those in the underlying corpus. Just look at the GPT family of neural language models with huge numbers of parameters and huge amounts of training data.
Can we avoid making manual requests to Google and have a more convenient way to spell-check the texts? When I was a Ph.D. student, I had a chance to try this idea during the hackathon at the 2014 Microsoft Research summer school in Moscow.
The dominant approach to language modeling in 2014 used n-grams. We split the input text into sequences of n consecutive words, counted their occurrences in a corpus, and emitted the joint probability. The main reason for using an n-gram language model is that the exact phrase can be missing from the corpus, while its shorter n-word fragments are far more likely to appear. Since products of many small probabilities quickly underflow, people use log probabilities in practice. Hence, the log-probability of a sequence of k words is:
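$$\log P(w_1, \dots, w_k) = \sum_{i=1}^{k} \log P(w_i \mid w_{i-n+1}, \dots, w_{i-1})$$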
The simplest way to compute the probability of an n-gram is to count its occurrences and divide this number by the number of (n−1)-gram occurrences:
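$$P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) = \frac{C(w_{i-n+1} \dots w_{i-1}\, w_i)}{C(w_{i-n+1} \dots w_{i-1})}$$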
For example, to compute the trigram (3-gram) probability of "money for nothing", we divide the count of "money for nothing" by the count of "money for" (using Google: 11,300,000 / 304,000,000 ≈ 0.04). Such an approach will not treat out-of-vocabulary words and long-range word relationships correctly, but it is still good enough (read the Speech and Language Processing book to learn more). An essential property of n-grams is that they count the actual occurrences of phrases in real texts in the corpus rather than estimating them with a neural network. This is very nice for spell-checking.
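A minimal sketch of such a count-based estimate in Python could look as follows; the toy corpus and the whitespace tokenization here are just for illustration:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of a token sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# A toy corpus; in practice, the counts come from a much larger collection.
corpus = "money for nothing and chicks for free money for nothing".split()

trigram_counts = Counter(ngrams(corpus, 3))
bigram_counts = Counter(ngrams(corpus, 2))

# P(nothing | money for) = C(money for nothing) / C(money for)
probability = trigram_counts[("money", "for", "nothing")] / bigram_counts[("money", "for")]
print(probability)  # 1.0 on this toy corpus; about 0.04 with the Google counts above
```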
A large search index and the underlying corpus are closely guarded resources. Fortunately, at that time, Microsoft Research maintained a service called Web N-Gram Service built on the entire Bing search index, much like OpenAI offers its completion APIs now. We built a mashup called Spellah that queried these large n-gram language models to verify spelling.
The spell-checker had two assumptions:
A single word (a unigram) should appear frequently in a language (because otherwise, it might be a typo).
A pair of words (a bigram) should appear together more frequently than its words would if they occurred independently (because otherwise, it might be a grammar error).
The first assumption checks only the unigram probability of a word (the threshold of 10⁻¹⁰ was found by trial and error):
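$$P(w_i) \geq 10^{-10}$$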
The second assumption compares the joint probability of a bigram to the product of unigram probabilities:
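$$P(w_i, w_{i+1}) \geq P(w_i) \cdot P(w_{i+1})$$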
The overall spell-checking algorithm was very simple. Initially, each word had an error score of zero. Each failed check increased the score of the affected word. Finally, words were highlighted according to their scores: green for zero, yellow for one, and red for two or more. Unfortunately, we did not have time to produce suggestions. Once one fixes the mistake, everything becomes nice and green.
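One possible reading of this scoring could be sketched in Python as follows, assuming hypothetical unigram_prob and bigram_prob functions backed by some language model (Spellah itself queried the Web N-Gram Service for these probabilities):

```python
THRESHOLD = 1e-10  # unigram probability threshold, found by trial and error

def error_scores(words, unigram_prob, bigram_prob):
    """Return an error score per word: 0 is green, 1 is yellow, 2 or more is red."""
    scores = [0] * len(words)

    # First check: a word on its own should be frequent enough in the language.
    for i, word in enumerate(words):
        if unigram_prob(word) < THRESHOLD:
            scores[i] += 1

    # Second check: a bigram should be more probable than
    # its two words taken independently.
    for i in range(len(words) - 1):
        if bigram_prob(words[i], words[i + 1]) < unigram_prob(words[i]) * unigram_prob(words[i + 1]):
            scores[i] += 1
            scores[i + 1] += 1

    return scores
```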
Spellah received the best overall solution award at the 2014 Microsoft Research Russia Summer School.
Best overall: this honor went to an interactive cloud service that takes a user’s English text and discovers subtle grammar errors. By building on Microsoft Web N-gram services, the winning team produced extremely impressive results.
I hope it counts as 9+ years of leadership experience with large language models, as some job advertisements require.
The source code for Spellah is available on GitHub: https://github.com/dustalov/spellah. Unfortunately, since the underlying Web N-Gram service has been discontinued, it is no longer possible to run the mashup without plugging in another language model. Also, one would probably get better results today by carefully prompting modern neural language models.
The funniest part of this event organized by Microsoft was that I got an iPad mini as a prize. At the time, the Microsoft Surface tablet was already on the shelves.
My command of English has surely improved since then.