Standard English lets you get value from a single language source: instead of needing a second data set, you can compare your text against a general model of English to analyze unstructured data.
Notably, Data Discovery uses a Standard English comparison by default.
Standard English models general language use from a very wide range of topics and backgrounds.
In this article, you will learn what Standard English is, when to use it, and how to create comparisons against it.
What is Standard English?
Relative Insight’s Standard English model is a general representation of written English. It contains 9,954,331 words in total (175,954 of them unique) drawn from 100,760 different sources: a sample of Wikipedia articles and forum conversations on a wide variety of topics. This model is built into the platform and can be used for many sorts of comparisons.
While the best comparisons will most often be between similar data sources, there are several situations in which the Standard English model can be very useful.
When to use Standard English
1. To identify key themes within a data set
When analyzing a new data set, it is often helpful to do some preliminary analysis to identify key linguistic features. This kind of ‘baselining’ can help you determine potential ways to split your data based on the content of the text (topics, words, phrases, emotions, and grammar) to build additional comparisons.
This approach is also useful when your data set is too small to be split, or when you can’t get your hands on a suitable data set to compare against.
2. When you are interested in frequency analysis
Because it draws on a wide range of sources, Standard English is a good representation of the general distribution of words. If you’re trying to understand which words are over-indexing (i.e. appearing more frequently in your data set than in general English), it provides a suitable basis for comparison.
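To make the idea of over-indexing concrete, here is a minimal sketch in Python. It is not how Relative Insight computes its comparisons; it only illustrates the general principle of dividing a word's relative frequency in your data by its relative frequency in a baseline. The function names, the toy texts, and the `floor` value used for words missing from the baseline are all assumptions made for illustration.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def over_index(target_text, baseline_text, floor=1e-6):
    """Rank words by how much more frequent they are in the target
    than in the baseline. `floor` is a hypothetical stand-in frequency
    for words that never appear in the baseline (avoids dividing by zero)."""
    target = relative_frequencies(target_text)
    baseline = relative_frequencies(baseline_text)
    ratios = {
        word: freq / max(baseline.get(word, 0.0), floor)
        for word, freq in target.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Toy texts, invented for illustration only.
reviews = "the battery is great the battery lasts long"
baseline = "the cat sat on the mat and the dog is long gone"

ranked = over_index(reviews, baseline)
for word, ratio in ranked[:3]:
    print(word, round(ratio, 2))
```

A ratio near 1 means the word is about as common in your data as in the baseline, which is typical of function words such as "the". A much larger ratio marks a word that is distinctive of your data. With a real baseline as large as Standard English, far fewer words would be missing from it, so the floor value would matter much less than it does in this toy example.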
Creating comparisons against Standard English
When creating a comparison, click 'Explore your data'.
In the secondary language tab, select our standard model and click 'Select'.
The standard model in Relative Insight is also available in German, French, and Spanish.