The answer is… the Zipf distribution!

Unless you’re a linguist, it’s unlikely you’ve heard of this before - so let us explain…

The Zipf distribution (named after the American linguist George Kingsley Zipf) describes a statistical law in natural language, often called Zipf's law, which states that the frequency of a word is inversely proportional to its rank in the frequency table. This means that the second most common word will appear roughly 1/2 as often as the most frequent word, the third roughly 1/3 as often, and so on.
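
To make the "1/2, 1/3 and so on" pattern concrete, here is a minimal Python sketch. It assumes an idealized Zipf law in which frequency is exactly proportional to 1/rank; the function name zipf_relative_frequencies is ours, purely for illustration, and real text only follows this pattern approximately.

```python
def zipf_relative_frequencies(n_ranks):
    """Frequency of each rank relative to the top-ranked word,
    assuming an idealized Zipf law (frequency proportional to 1/rank)."""
    return [1 / rank for rank in range(1, n_ranks + 1)]


relative = zipf_relative_frequencies(5)
for rank, freq in enumerate(relative, start=1):
    # Rank 1 is the most common word, so its relative frequency is 1.
    print(f"rank {rank}: {freq:.2f} times as frequent as the top word")
```

If the most common word appeared 1,000 times, this idealized model would predict about 500 occurrences for the second most common word and about 333 for the third.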

This essentially means that words other than the most common ones are used very infrequently. In fact, despite there being around 171,000 words in the English dictionary, about 50% of any substantially sized text data set is made up of just 170 or so of them (think of words like if, and, the, but, because, etc.).
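
As a rough sanity check on that figure, the sketch below estimates what share of all word occurrences the top 170 words would account for if a 171,000-word vocabulary followed an idealized Zipf law exactly. Both the exact 1/rank assumption and the helper name zipf_top_share are ours, for illustration only.

```python
def zipf_top_share(top_k, vocabulary_size):
    """Share of all word occurrences covered by the top_k most common words,
    assuming frequency is exactly proportional to 1/rank."""
    total = 0.0      # running sum of 1/rank over the whole vocabulary
    top_total = 0.0  # the same sum, captured once we reach rank top_k
    for rank in range(1, vocabulary_size + 1):
        total += 1 / rank
        if rank == top_k:
            top_total = total
    return top_total / total


share = zipf_top_share(170, 171_000)
print(f"Top 170 words cover about {share:.0%} of the text")
```

Under these idealized assumptions the share comes out at roughly 45%, in the same ballpark as the 50% figure above. Real text only follows the law approximately, so an exact match is not expected.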

Because we expect most words to appear in a data set very infrequently, even low frequencies can yield statistically significant results.
