Once a text data set is uploaded to Relative Insight, it flows through a series of processes and visualizations that culminate in the results being presented to users in the Explore view.
In this article you will learn about:
The natural language processing (NLP) pipeline
When a data set is fed into Relative Insight, the text first flows through a series of processes – the NLP pipeline. As the data passes through, the algorithms read the text and transform it into something a computer can understand and perform further analysis on.
This involves a number of steps, including:
Breaking the text down into sub-components (sentences, phrases, words)
Labeling parts of speech (noun, pronoun, adjective, determiner etc.)
Identifying named entities (people, locations, companies etc.)
Topic Identification – the process of the computer discerning what the text is about by looking at both individual words and phrases as well as the words that surround them.
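As a rough illustration of the first step only, the sketch below breaks text into sentences and words using regular expressions. It is not Relative Insight's implementation: the function names and splitting rules are assumptions made for this example, and the later steps (part-of-speech labeling, named entity recognition, topic identification) rely on trained models that are not shown here.

```python
import re

def split_sentences(text):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Lowercase the sentence and extract runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

text = "Beauty is subjective. Is it? Many brands disagree!"
sentences = split_sentences(text)            # three sentences
tokens = [split_words(s) for s in sentences]  # one word list per sentence
```

A production pipeline would handle abbreviations, hyphenation, and multi-word phrases, which this simple splitter ignores.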
Comparison
After the text has passed through the NLP pipeline, the platform stores a record of the frequency of each identified linguistic feature in each data set.
To enable objective comparisons that aren't distorted by differences in word counts, the platform calculates the relative frequency of each linguistic feature.
For example, if the word ‘beauty’ appears 5 times in a data set of 1,000 words, it has the same relative frequency as it would in a 2,000-word data set where it appears 10 times.
Once this is done, the relative frequencies for each linguistic feature are compared to determine the relative difference. Relative difference is calculated for each data set being compared:
A relative difference above 1.0 indicates that the linguistic feature is more prevalent in the data set being examined than in the others. The higher the value, the bigger the difference.
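The exact formula is not reproduced in this article. One reading that is consistent with the 1.0 threshold is a ratio of relative frequencies, and the sketch below assumes that definition. The function name and the handling of a zero count in the comparison data set are hypothetical.

```python
import math

def relative_difference(count_a, total_a, count_b, total_b):
    """Ratio of a feature's relative frequency in data set A to that in data set B.

    Assumed definition: values above 1.0 mean the feature is more
    prevalent in A, values below 1.0 mean it is more prevalent in B.
    """
    freq_a = count_a / total_a
    freq_b = count_b / total_b
    if freq_b == 0:
        # Arbitrary choice for this sketch: a feature absent from B is
        # reported as infinitely more prevalent in A.
        return math.inf
    return freq_a / freq_b

# 'beauty' appears 30 times in 1,000 words (A) and 10 times in 2,000 words (B)
ratio = relative_difference(30, 1_000, 10, 2_000)
```

Here the ratio is about 6, meaning the word is roughly six times as prevalent in data set A as in data set B.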
Statistical testing to determine differences and similarities
To ensure that a difference is not being identified by chance, the platform calculates the probability that the observed relative difference would appear even though no true difference exists.
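The article does not name the statistical test the platform uses. Purely as an illustration, the sketch below applies a log-likelihood ratio (G2) test, a common choice for comparing word frequencies between two text collections. The test, the function names, and the one-degree-of-freedom chi-square approximation are all assumptions, not a description of Relative Insight.

```python
import math

def log_likelihood(count_a, total_a, count_b, total_b):
    """G2 statistic for one feature observed in two data sets (assumed test)."""
    combined = count_a + count_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return max(2.0 * g2, 0.0)  # guard against tiny negative rounding errors

def p_value(g2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(g2 / 2.0))

same_rate = p_value(log_likelihood(5, 1_000, 10, 2_000))   # identical relative frequencies
different = p_value(log_likelihood(30, 1_000, 10, 2_000))  # sixfold relative difference
```

With identical relative frequencies the statistic is zero and the probability is 1, so no difference would be reported. In the second case the probability is far below conventional thresholds such as 0.05, so the difference would be unlikely to be a chance finding.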
When a linguistic feature returns a relative difference between 0.9 and 1.1 and does not meet the threshold for classification as a difference, this indicates a potential similarity.
Function words (e.g. if, the, and) are removed, as these words occur with high frequency in any data set. As with differences, a statistical test is conducted to confirm that a similarity has not been identified where one doesn’t truly exist before the results are presented in the platform.
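The screening for potential similarities can be sketched as a simple filter. The function-word set below contains only the three examples given in this article, the bounds are treated as inclusive, and the function name is hypothetical. The sketch covers the screening step only, not the statistical test that follows it.

```python
# Illustrative subset only; a real list of function words would be much longer.
FUNCTION_WORDS = {"if", "the", "and"}

def candidate_similarities(relative_differences, low=0.9, high=1.1):
    """Keep features whose relative difference falls within [low, high],
    skipping function words."""
    return {
        feature: value
        for feature, value in relative_differences.items()
        if feature not in FUNCTION_WORDS and low <= value <= high
    }

scores = {"the": 1.0, "beauty": 1.02, "price": 2.4, "skin": 0.95}
candidates = candidate_similarities(scores)
```

In this example 'beauty' and 'skin' are kept as potential similarities, 'price' is excluded because its value lies well outside the band, and 'the' is excluded as a function word.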
Displaying results in the comparison view
All of the processes described in this article happen in minutes. Once completed, the visualizations with key trends and summaries appear, followed by a detailed output of differences and similarities.
From the Explore page, users can then pull the most interesting discoveries onto insight cards.