Relative Insight’s comparative approach to text analysis surfaces statistically significant differences and similarities between text data sets. In doing so, the platform directs your attention to the things that actually matter in a body of text.

What is relative difference?

Relative difference is a measure of how much more prevalent a linguistic feature (topic, phrase, word, emotion or grammar element) is in one data set compared to others. It is normalised to take into account variations in the size of data sets being compared.

How is relative difference calculated?

For each data set uploaded into the platform, Relative Insight conducts a detailed linguistic analysis. Our natural language processing algorithms ‘read’ the text, identifying topics, grammar, emotions, words and phrases.

The frequency (count) of each linguistic feature is then determined and normalised by the size of the data set to produce its relative frequency. Relative frequencies make it possible to conduct ‘apples to apples’ comparisons between data sets of different sizes.

Relative difference is then calculated by dividing the relative frequency of a particular linguistic feature in one data set by its relative frequency in the other.

[Figure: relative difference calculation]
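
As a rough illustration of the calculation described above, the following Python sketch normalises a count into a relative frequency and then divides one relative frequency by the other. It is not the platform's actual implementation: the function names, the example counts and the choice of total word count as the measure of data set size are all assumptions made for the example.

```python
def relative_frequency(count: int, dataset_size: int) -> float:
    """Normalise a raw count by the size of its data set.

    Assumption: dataset_size is the total number of words in the data set.
    The article only says counts are normalised by data set size.
    """
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    return count / dataset_size


def relative_difference(count_a: int, size_a: int, count_b: int, size_b: int) -> float:
    """How many times more prevalent a feature is in data set A than in B.

    This sketch assumes the feature occurs at least once in data set B.
    """
    return relative_frequency(count_a, size_a) / relative_frequency(count_b, size_b)


# Hypothetical counts: a word appears 30 times in a 10,000-word data set (A)
# and 10 times in a 20,000-word data set (B).
rd = relative_difference(30, 10_000, 10, 20_000)
print(f"{rd:.1f}x")  # prints "6.0x"
```

Although the word appears only three times as often in A in raw terms, A is half the size of B, so after normalisation it is around six times more prevalent.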

Infinite relative difference (∞)

When a linguistic feature is significantly present in one data set and completely absent from the other (zero occurrences), the relative difference will be shown as infinity (∞). 
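
The infinite case follows from the division itself: a feature with zero occurrences has a relative frequency of zero, and dividing by zero has no finite result. The Python sketch below shows one way this could be handled. The function names and example counts are hypothetical, the sketch ignores the statistical significance check, and returning "not a number" when the feature is absent from both data sets is an assumption rather than documented platform behaviour.

```python
import math


def relative_difference(count_a: int, size_a: int, count_b: int, size_b: int) -> float:
    """Relative difference of a feature in data set A versus data set B."""
    if count_b == 0:
        # Present in A, absent from B: shown as infinity.
        # Absent from both: undefined (assumption: return NaN).
        return math.inf if count_a > 0 else math.nan
    return (count_a / size_a) / (count_b / size_b)


def format_relative_difference(value: float) -> str:
    """Render the value the way it might appear in a results table."""
    return "∞" if math.isinf(value) else f"{value:.1f}x"


# Hypothetical counts: 12 occurrences in A, none in B.
print(format_relative_difference(relative_difference(12, 5_000, 0, 8_000)))  # prints "∞"
```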

Very large and infinite relative differences don't always point to the most interesting insights: names and other source-specific words commonly surface with high relative difference values.

For example, comparing a Harry Potter book to The Great Gatsby would almost certainly surface 'Hermione' as having an infinite relative difference, but this doesn't tell us anything we wouldn't have already known.

To discern whether a discovery is insightful, ask yourself if you're surprised. If you are, you’re likely on to something...💡

Aggregated relative difference (on insight cards)

Insight cards display the aggregated relative difference for all the included linguistic features.

This number is calculated using the same logic outlined previously: everything on the insight card is grouped together and compared against the same collection of linguistic features in the other data set. It is not a simple average of the relative differences for each linguistic feature.
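
To see why the aggregated figure differs from a simple average, the Python sketch below pools the counts of every feature on a card before dividing, and compares the result with the mean of the individual relative differences. The feature names, counts and data set sizes are invented for illustration, and summing the individual counts is only one plausible reading of "grouped together"; the platform may group features differently.

```python
import math
from statistics import mean


def relative_difference(count_a: int, size_a: int, count_b: int, size_b: int) -> float:
    """Ratio of relative frequencies; infinite if absent from data set B."""
    if count_b == 0:
        return math.inf
    return (count_a / size_a) / (count_b / size_b)


def aggregated_relative_difference(features, counts_a, size_a, counts_b, size_b) -> float:
    """Pool the counts of all features on the card, then compare once."""
    total_a = sum(counts_a.get(f, 0) for f in features)
    total_b = sum(counts_b.get(f, 0) for f in features)
    return relative_difference(total_a, size_a, total_b, size_b)


# Hypothetical insight card with three words, two 10,000-word data sets.
features = ["lag", "ping", "server"]
counts_a = {"lag": 40, "ping": 10, "server": 50}
counts_b = {"lag": 10, "ping": 10, "server": 5}

aggregated = aggregated_relative_difference(features, counts_a, 10_000, counts_b, 10_000)
simple_average = mean(
    relative_difference(counts_a[f], 10_000, counts_b[f], 10_000) for f in features
)
print(f"aggregated: {aggregated:.1f}x, simple average: {simple_average:.1f}x")
# prints "aggregated: 4.0x, simple average: 5.0x"
```

In this example the pooled comparison gives about 4x, while averaging the individual values (roughly 4x, 1x and 10x) gives about 5x, because the average lets a rare feature with a large ratio weigh as much as a common one.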

In the example shown below, the aggregated relative difference of 64.7x indicates that the topics, words and phrases on the card are collectively 64.7x more prevalent among British gamers than among Americans.
