Relative Insight’s comparative approach to text analysis surfaces statistically significant differences and similarities between text data sets. In doing so, the platform directs your attention to the things that actually matter in a body of text.
What is relative difference?
Relative difference is a measure of how much more prevalent a linguistic feature (topic, phrase, word, emotion or grammar element) is in one data set compared to others. It is normalized to take into account variations in the size of data sets being compared.
How is relative difference calculated?
For each data set uploaded into the platform, Relative Insight conducts a detailed linguistic analysis. Our natural language processing algorithms ‘read’ the text, identifying topics, grammar, emotions, words and phrases.
The frequency (count) of each linguistic feature is then determined and normalized by the size of the data set to produce its relative frequency. Relative frequencies make ‘apples to apples’ comparisons possible between data sets of different sizes.
Relative difference is then calculated by dividing the relative frequency of a particular linguistic feature in one data set by its relative frequency in the other.
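The calculation described above can be sketched in a few lines of Python. This is only an illustration, not the platform's actual implementation: the function names are made up for this example, and it assumes data set size is measured as a total word count, which the article does not specify.

```python
def relative_frequency(count: int, dataset_size: int) -> float:
    """Feature count normalized by the size of the data set."""
    return count / dataset_size


def relative_difference(count_a: int, size_a: int, count_b: int, size_b: int) -> float:
    """How many times more prevalent a feature is in data set A than in B.

    Assumes the feature occurs at least once in B.
    """
    return relative_frequency(count_a, size_a) / relative_frequency(count_b, size_b)


# Hypothetical numbers: a word appears 30 times in a 10,000-word data set (A)
# and 20 times in a 40,000-word data set (B).
rd = relative_difference(30, 10_000, 20, 40_000)
print(f"{rd:.1f}x")  # roughly 6 times more prevalent in A
```

In this made-up case the raw counts differ by only a factor of 1.5 (30 versus 20), but once each count is normalized by the size of its data set, the feature turns out to be about six times more prevalent in A.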
Infinite relative difference (∞)
When a linguistic feature is significantly present in one data set and completely absent from the other (zero occurrences), the relative difference will be shown as infinity (∞).
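As a rough sketch of why the value becomes infinite: with zero occurrences in the second data set, its relative frequency is zero, so the ratio has no finite value. The Python below is a hypothetical illustration with made-up function names. It does not model how the platform decides that a feature is "significantly present", and it assumes size is a word count.

```python
import math


def relative_difference_or_infinity(count_a: int, size_a: int,
                                    count_b: int, size_b: int) -> float:
    """Ratio of relative frequencies, or infinity when B never contains the feature.

    Assumes the feature occurs at least once in A.
    """
    if count_b == 0:
        return math.inf  # dividing by a relative frequency of zero
    return (count_a / size_a) / (count_b / size_b)


def format_relative_difference(value: float) -> str:
    """Render infinite values with the infinity symbol, finite ones as a multiplier."""
    return "∞" if math.isinf(value) else f"{value:.1f}x"


# Hypothetical numbers: 12 occurrences in A, none in B.
print(format_relative_difference(relative_difference_or_infinity(12, 5_000, 0, 8_000)))
```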
Very large and infinite relative differences don't always correspond to the most interesting insights: names and other source-specific words will commonly surface with high relative difference values.
For example, comparing a Harry Potter book to The Great Gatsby would almost certainly surface 'Hermione' as having an infinite relative difference, but this doesn't tell us anything we wouldn't have already known.
To discern whether a discovery is insightful, ask yourself if you're surprised. If you are, you’re likely on to something...
Aggregated relative difference (on insight cards)
Insight cards display the aggregated relative difference for all the linguistic features they include. You can also customize the figure shown by setting your preferred overall impact metric.
This number is calculated using the same logic outlined previously: everything on the insight card is grouped together and compared against the same collection of linguistic features in the other data set. It is not a simple average of the relative differences of the individual linguistic features.
With insight cards, you can combine different linguistic elements which may overlap, because we make sure that all overlaps are accounted for. For example, if we have the topic Food, the word "taco" and the phrase "taco Tuesday," the word "taco" would only count toward the overall metric once, even though it exists in more than one element.
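One way to picture the overlap handling is to track which positions in the text each element covers and then count the union of those positions, so a token matched by several elements is counted once. The Python below is a simplified sketch under that assumption, not the platform's actual method: the card structure, the representation of a topic as a word list, and all names are hypothetical.

```python
from typing import Iterable

Pattern = tuple[str, ...]  # a word is a 1-token pattern, a phrase a longer one


def covered_positions(tokens: list[str], patterns: Iterable[Pattern]) -> set[int]:
    """Indices of every token covered by at least one of the patterns."""
    covered: set[int] = set()
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == pattern:
                covered.update(range(i, i + n))
    return covered


def aggregated_relative_frequency(tokens: list[str], card: dict[str, list[Pattern]]) -> float:
    """Share of tokens covered by any element on the card, overlaps counted once."""
    positions: set[int] = set()
    for patterns in card.values():
        positions |= covered_positions(tokens, patterns)
    return len(positions) / len(tokens)


# Hypothetical insight card: a topic (as a word list), a word, and a phrase.
card = {
    "topic: Food": [("taco",), ("burrito",)],
    "word: taco": [("taco",)],
    "phrase: taco Tuesday": [("taco", "tuesday")],
}
tokens_a = "we love taco tuesday and a burrito".split()
tokens_b = "a taco is fine but we prefer pizza on most days".split()

aggregated = (aggregated_relative_frequency(tokens_a, card)
              / aggregated_relative_frequency(tokens_b, card))
print(f"{aggregated:.2f}x")
```

In the first sample text, "taco" is matched by all three elements, yet it contributes a single covered position. Summing the per-element matches instead would count it three times and inflate the aggregated figure.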