The answer is… the Zipf distribution! 📉
Unless you’re a linguist, it’s unlikely you’ve ever heard of this before - so let us explain…
The Zipf distribution (often called Zipf's law) is a statistical pattern in natural language: the frequency of a word is roughly inversely proportional to its rank in the frequency table. This means that the second most common word will appear about 1/2 as often as the most frequent word, the third about 1/3 as often, and so on.
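To make that concrete, here's a minimal Python sketch of the idealised version of the rule. The function name `zipf_expected_counts` and the starting count of 1,200 are made up purely for illustration, and real text only follows this pattern approximately.

```python
def zipf_expected_counts(top_count, n_ranks):
    """Expected counts for ranks 1..n_ranks under an idealised Zipf's law.

    Assumes the count at a given rank is exactly top_count / rank.
    Real corpora only approximate this.
    """
    return [top_count / rank for rank in range(1, n_ranks + 1)]


# Hypothetical example: the most common word appears 1,200 times.
counts = zipf_expected_counts(top_count=1200, n_ranks=5)
print(counts)  # [1200.0, 600.0, 400.0, 300.0, 240.0]
```

Notice how quickly the numbers fall away: by rank 5 the expected count is already a fifth of the top word's.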
This essentially means that words other than the most common ones are used very infrequently. In fact, despite there being around 171,000 words in the English dictionary, roughly 50% of any decent-sized language set is made up of just 170 of them.
Because we expect most words to appear in a language set very infrequently, even low frequencies can yield statistically significant results. 💥