Relative Insight’s comparative approach to text analysis surfaces statistically significant differences and similarities between language sets. In doing so, our software directs your attention to the things that actually matter in a body of text.

The Relative difference metric is a measure of how much more prevalent a topic, phrase, word, emotion or grammar element is in one body of text compared to others. The platform also displays frequency and similarity metrics.

How is Relative difference calculated?

For each language set uploaded into the platform, Relative Insight conducts a detailed linguistic analysis. Our natural language processing algorithms ‘read’ the text, identifying topics, grammar, emotions, words and phrases. The frequency of each language element is then determined and normalised by the size of the language set, enabling ‘apples to apples’ comparisons between language sets of different sizes.

Relative difference is calculated by dividing the normalised frequency of a particular language element in one language set by the normalised frequency of the same element in the comparison language set(s).
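The calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the platform's actual code: the function names, the per-million-words normalisation and the example counts are all assumptions made for the illustration.

```python
import math


def normalised_frequency(count: int, total_words: int, per: int = 1_000_000) -> float:
    """Occurrences per `per` words (per-million is an assumed convention)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count / total_words * per


def relative_difference(count_a: int, total_a: int, count_b: int, total_b: int) -> float:
    """Normalised frequency in set A divided by normalised frequency in set B."""
    freq_a = normalised_frequency(count_a, total_a)
    freq_b = normalised_frequency(count_b, total_b)
    if freq_b == 0:
        # Present in A but absent from B: the ratio is unbounded (shown as ∞).
        # Absent from both: the ratio is undefined, so return NaN.
        return math.inf if freq_a > 0 else math.nan
    return freq_a / freq_b


# Hypothetical counts: 30 mentions in a 10,000-word set
# versus 20 mentions in a 40,000-word set.
print(relative_difference(30, 10_000, 20, 40_000))
```

In this made-up example the element has fewer raw mentions per word in the larger set, and after normalisation it turns out to be roughly six times more prevalent in the smaller one. That is why the frequencies are normalised before they are divided.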

Where the relative difference calculation reveals a difference, the platform applies an additional layer of statistical analysis to provide confidence that the difference has not arisen by chance. Log-likelihood calculations are performed to assess this possibility, and the platform only displays differences that meet a 99% confidence level. This means there is at most a 1% chance that a displayed difference was identified where one does not truly exist.
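For readers who want to see what a log-likelihood check can look like, here is a minimal sketch using a formulation that is common in corpus linguistics (the G² statistic for two corpora). Whether the platform uses exactly this formula is an assumption; the function names and example counts are hypothetical. The threshold of 6.635 is the chi-squared critical value for one degree of freedom at p = 0.01, which corresponds to the 99% level.

```python
import math

# Chi-squared critical value, 1 degree of freedom, p = 0.01 (99% level).
CRITICAL_VALUE_99 = 6.635


def log_likelihood(count_a: int, total_a: int, count_b: int, total_b: int) -> float:
    """G² statistic comparing an element's count across two language sets."""
    combined = count_a + count_b
    # Counts we would expect if the element were equally prevalent in both sets.
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    total = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            total += observed * math.log(observed / expected)
    return 2 * total


def is_significant(count_a: int, total_a: int, count_b: int, total_b: int) -> bool:
    """True when the difference clears the 99% threshold."""
    return log_likelihood(count_a, total_a, count_b, total_b) > CRITICAL_VALUE_99


# Hypothetical counts: 30 in 10,000 words versus 20 in 40,000 words.
print(is_significant(30, 10_000, 20, 40_000))
# Same rate in both sets: no difference to report.
print(is_significant(10, 10_000, 10, 10_000))
```

The statistic grows with both the size of the gap and the amount of evidence behind it, so a large ratio built on very few occurrences can still fall short of the threshold.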

Why should I trust insights based on low frequencies?

This is one of the most common questions we get from new users of Relative Insight.

The frequency of word usage follows what is called a Zipfian distribution. Zipf's law, an empirical observation about natural language, states that the frequency of a word is roughly inversely proportional to its rank in the frequency table. Put simply, the second most common word will appear about half as often as the most common, the third about one third as often, and so on. Because of this, most words are expected to occur very infrequently, and so even a few occurrences can result in a statistically significant finding.
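To get a feel for how steeply frequencies fall away, the short sketch below computes expected word counts under an idealised Zipf law. It is a simplification for illustration only: real text follows the law only approximately, and the vocabulary size and text length used here are arbitrary assumptions.

```python
def zipf_expected_counts(vocabulary_size: int, text_length: int) -> list[float]:
    """Expected count for each rank (1 = most common) under an idealised Zipf law."""
    # Normalising constant: the harmonic number 1 + 1/2 + ... + 1/V,
    # so that the expected counts add up to the length of the text.
    harmonic = sum(1 / rank for rank in range(1, vocabulary_size + 1))
    return [text_length / (rank * harmonic) for rank in range(1, vocabulary_size + 1)]


# Assumed sizes: a 10,000-word text drawn from a 10,000-word vocabulary.
counts = zipf_expected_counts(vocabulary_size=10_000, text_length=10_000)
rare_share = sum(1 for c in counts if c < 1) / len(counts)

print(f"rank 1: {counts[0]:.0f}, rank 2: {counts[1]:.0f}, rank 3: {counts[2]:.0f}")
print(f"{rare_share:.0%} of the vocabulary is expected to appear less than once")
```

Under these assumptions the large majority of the vocabulary is expected to appear less than once, so a word that shows up even a handful of times in one language set can stand out clearly against that baseline.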

The nature of dealing with words

Words are less precise than numbers. This means that even the most advanced text analysis solution may surface findings that don’t make perfect sense. Relative Insight is no exception. For example, consider the word ‘spring’, which has context-specific meanings: it can be a verb, a season or a mechanical component. This can pose a challenge when it comes to topical classifications. The ability to view verbatim examples from the text can help you overcome this and better understand the data you have analysed when things are not immediately clear.

Understanding infinite (∞) relative differences

When a language element is statistically significant in one language set and completely absent from the other, the relative difference will be shown as ∞. 

Bear in mind that an infinite relative difference doesn’t necessarily equal a relevant insight. Names and other source-specific words will commonly surface as having infinite relative differences.

For example, comparing a Harry Potter book to The Great Gatsby would almost certainly surface 'Hermione' as having an ∞ relative difference, but this doesn't tell us much about J.K. Rowling's writing style.

To discern whether a discovery you’ve made is interesting or insightful, it’s helpful to ask yourself whether you personally find it surprising. If yes, then you’re likely on to something... 😮 💡
