An empirical study of smoothing techniques for language modeling

Abstract

We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.

https://dl.acm.org/doi/pdf/10.3115/981863.981904

This paper is about helping computers guess the next word in a sentence. Sometimes a computer can't tell which word comes next because it hasn't seen that word often enough before. To help, people use special tricks called "smoothing." This study tests different smoothing tricks to see which ones work best. It also checks whether things like how much practice text the computer gets, the kind of writing (like stories or news), and whether the computer looks at two or three words at a time make a difference.
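To make the "not seen enough before" problem concrete, here is a tiny hedged sketch on a toy corpus. It uses add-one (Laplace) smoothing, which is the simplest smoothing trick and is *not* one of the methods the paper compares; it just shows the idea that smoothing moves a little probability onto unseen words. The toy text and the pretend extra word "dog" are made up for illustration.

```python
from collections import Counter

# Toy "practice" text; the counts stand in for what the computer has seen.
counts = Counter("the cat sat on the mat".split())
total = sum(counts.values())
vocab_size = len(counts) + 1  # pretend one extra word, "dog", could also appear

def mle_prob(word):
    """No smoothing: a word never seen in training gets probability zero."""
    return counts[word] / total

def smoothed_prob(word):
    """Add-one (Laplace) smoothing: every word gets at least a little mass."""
    return (counts[word] + 1) / (total + vocab_size)

print(mle_prob("dog"))       # 0.0 -- the model is "sure" dog can never occur
print(smoothed_prob("dog"))  # small but nonzero
```

The whole point of the fancier techniques in the paper is doing this reallocation of probability mass more cleverly than just "add one everywhere."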

Here’s what they found:

Some tricks work better depending on the kind of writing or how much practice the computer gets.

More practice usually helps the computer guess better.

The researchers also made two new tricks:

  • A better version of an old trick.

  • A very simple trick that mixes things together.

Both new tricks worked better than the old ones, making the computer better at guessing words. This study helps us know which tricks to use and gives us new ways to make computers smarter at understanding words.
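The "very simple trick that mixes things together" is linear interpolation: blend the two-word (bigram) guess with the one-word (unigram) guess. Below is a minimal sketch in that spirit, plus the cross-entropy measurement the paper uses to score models. The lambda weights and the extra uniform term are illustrative choices of mine, not the paper's exact recipe — Jelinek-Mercer style methods estimate such weights from held-out data.

```python
import math
from collections import Counter

def train_counts(tokens):
    """Collect unigram and bigram counts from a list of training tokens."""
    return Counter(tokens), Counter(zip(tokens, tokens[1:]))

def interp_prob(prev, word, unigrams, bigrams, total, vocab,
                lambdas=(0.7, 0.2, 0.1)):
    """Mix bigram, unigram, and uniform estimates (weights sum to 1).

    The small uniform term guarantees every in-vocabulary word gets a
    nonzero probability, so the cross-entropy below stays finite.
    """
    l_bi, l_uni, l_unif = lambdas
    p_bi = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    p_uni = unigrams[word] / total
    p_unif = 1.0 / len(vocab)
    return l_bi * p_bi + l_uni * p_uni + l_unif * p_unif

def cross_entropy(test_tokens, unigrams, bigrams, total, vocab):
    """Average negative log2-probability per word: lower is better."""
    logs = [
        -math.log2(interp_prob(prev, word, unigrams, bigrams, total, vocab))
        for prev, word in zip(test_tokens, test_tokens[1:])
    ]
    return sum(logs) / len(logs)
```

For example, training on `"the cat sat on the mat the cat ate".split()` and scoring `"the cat sat".split()` yields a finite bits-per-word number even for word pairs the model never saw together, which is exactly what smoothing buys you.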