Once a text data set is uploaded to Relative Insight, it flows through a series of processes that culminate in the results being presented to users in the comparison view. This article and explainer video will help you understand and speak about the inner workings of the platform.
The natural language processing (NLP) pipeline
When a data set is fed into Relative Insight, the text first flows through a series of processes – the NLP pipeline. As the data passes through, the algorithms 'read' the text and transform it into something a computer can understand and perform further analysis on.
This involves a number of steps, including (but not limited to):
Breaking the text down into sub-components (sentences, phrases, words)
Labelling parts of speech (noun, pronoun, adjective, determiner etc.)
Identifying named entities (people, locations, companies etc.)
Topic identification – the process of the computer discerning what the text is about by looking at both individual words and phrases and the words that surround them
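To make the first step above more concrete, here is a minimal sketch of breaking text into sentences and words using only the Python standard library. It is an illustration, not Relative Insight's implementation: the function names `split_sentences` and `split_words` are hypothetical, and the splitting rules are deliberately naive. Part-of-speech labelling, named entity recognition and topic identification need trained language models and are not shown.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Naive rule (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def split_words(sentence: str) -> list[str]:
    # Lowercase the sentence and keep runs of letters/digits, allowing internal apostrophes.
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", sentence.lower())

text = "The serum feels light. It doesn't sting!"
sentences = split_sentences(text)
tokens = [split_words(sentence) for sentence in sentences]
```

Here `sentences` holds two sentences and `tokens` holds one list of lowercase words per sentence, which is the kind of structure later counting steps can work from.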
Once the text has passed through the NLP pipeline, the platform stores a record of the frequency of each identified linguistic feature in a particular data set (topics, phrases, words, emotions, grammar).
To enable objective comparisons that aren't distorted by differences in the size of data sets (word counts), the platform calculates the relative frequency of each linguistic feature. For example, if the word ‘beauty’ appears 5 times in a data set of 1,000 words, it has the same relative frequency (0.5%) as in a 2,000-word data set where it appears 10 times.
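The ‘beauty’ example can be checked with a short sketch. It assumes relative frequency means the count of a feature divided by the total word count of the data set; the function name `relative_frequencies` and the placeholder token `other` are hypothetical.

```python
from collections import Counter

def relative_frequencies(tokens: list[str]) -> dict[str, float]:
    # Relative frequency (assumed definition) = occurrences / total number of tokens.
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: count / total for word, count in counts.items()}

# Two toy data sets mirroring the example: 5 in 1,000 words and 10 in 2,000 words.
small = ["beauty"] * 5 + ["other"] * 995
large = ["beauty"] * 10 + ["other"] * 1990

small_rf = relative_frequencies(small)["beauty"]
large_rf = relative_frequencies(large)["beauty"]
```

Both values come out as 0.005 (0.5%), so the larger data set does not look like it talks about ‘beauty’ more simply because it contains more words.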
Once this is done, the relative frequencies for each linguistic feature are compared to determine the relative difference. Relative difference is calculated for each data set being compared.
A relative difference above 1.0 indicates that the linguistic feature is more prevalent in the data set being examined than in the others. The higher the value, the bigger the difference.
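The exact formula is not given in this article. One reading consistent with the description is a simple ratio: the relative frequency in the data set being examined divided by the relative frequency in the data set it is compared against. The sketch below works under that assumption; the function name `relative_difference` and the handling of zero frequencies are hypothetical choices.

```python
import math

def relative_difference(rel_freq_examined: float, rel_freq_other: float) -> float:
    # Assumed formula: ratio of the two relative frequencies.
    if rel_freq_other == 0:
        # Hypothetical convention: a feature absent from the other data set
        # is treated as infinitely more prevalent, or undefined if absent from both.
        return math.inf if rel_freq_examined > 0 else math.nan
    return rel_freq_examined / rel_freq_other

# 'beauty' at 1.0% in the examined data set versus 0.5% in the other one.
ratio = relative_difference(10 / 1000, 5 / 1000)

# Equal relative frequencies give a value of 1.0, meaning no difference.
no_change = relative_difference(5 / 1000, 10 / 2000)
```

Under this reading, a value of 2.0 means the feature is twice as prevalent in the data set being examined, and a value of 1.0 means it is equally prevalent in both.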
Statistical testing to determine differences and similarities
To ensure there is sufficient evidence that a difference has not been identified by chance, the platform calculates the probability that the observed relative difference would appear even though no true difference exists.
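This article does not say which statistical test the platform uses. Purely as an illustration, the sketch below uses the log-likelihood ratio (G-test), a test commonly applied to word frequencies in corpus linguistics, and converts the statistic to a p-value using the chi-square distribution with one degree of freedom. The function name `log_likelihood_p_value` and the choice of test are assumptions, not a description of Relative Insight.

```python
import math

def log_likelihood_p_value(count_a: int, total_a: int, count_b: int, total_b: int) -> float:
    # Expected counts if the feature were equally prevalent in both data sets.
    combined_rate = (count_a + count_b) / (total_a + total_b)
    expected_a = total_a * combined_rate
    expected_b = total_b * combined_rate

    # Two-term log-likelihood statistic: G2 = 2 * sum(observed * ln(observed / expected)).
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    g2 = max(2 * g2, 0.0)  # guard against tiny negative values from rounding

    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(g2 / 2))

# Same relative frequency in both data sets: no evidence of a difference.
p_same = log_likelihood_p_value(5, 1000, 10, 2000)

# 5% versus 0.5%: very unlikely to arise by chance.
p_diff = log_likelihood_p_value(50, 1000, 10, 2000)
```

A small p-value means the observed gap would rarely occur if the two data sets really used the feature at the same rate. Where to set the cut-off is a design decision that this article does not specify.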
When a linguistic feature returns a relative difference between 0.9 and 1.1 and does not meet the threshold for classification as a difference, this indicates a potential similarity. Function words (e.g. if, the, and) are removed, as these words occur with high frequency in any data set. As with differences, a statistical test is conducted to check that a similarity has not been identified where one doesn’t truly exist before the results are presented in the platform.
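The screening rule for potential similarities can be sketched as a small function. Several details are assumptions made for illustration: the 0.9 and 1.1 bounds are treated as inclusive, the function word list is a tiny illustrative subset, and the follow-up statistical test for similarities is not shown because this article does not describe it. The names `FUNCTION_WORDS` and `is_potential_similarity` are hypothetical.

```python
# Illustrative subset only; a real function word list would be much longer.
FUNCTION_WORDS = {"if", "the", "and"}

def is_potential_similarity(feature: str, relative_difference: float,
                            classified_as_difference: bool) -> bool:
    # Features already classified as differences are never similarities.
    if classified_as_difference:
        return False
    # Function words are frequent in any data set, so they are removed.
    if feature.lower() in FUNCTION_WORDS:
        return False
    # Assumed inclusive bounds on the relative difference.
    return 0.9 <= relative_difference <= 1.1

candidates = [
    ("beauty", 1.05, False),   # close to 1.0 and not a difference
    ("the", 1.00, False),      # function word, removed
    ("serum", 1.08, True),     # already classified as a difference
    ("texture", 1.30, False),  # outside the 0.9 to 1.1 band
]
similar = [name for name, rd, diff in candidates
           if is_potential_similarity(name, rd, diff)]
```

Of the four candidates, only "beauty" passes the screen. In the platform it would still need to pass the statistical test before being reported as a similarity.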
Displaying results in the comparison view
All of the processes described in this article happen in seconds, or possibly minutes for larger data sets. Once completed, the differences and similarities are presented to the user in the platform. From the comparison view, users can then pull the most interesting discoveries onto insight cards.