How your customers perceive you and your products.

Semantic text analysis in R.

Dr. Marc Jacobs
19 min read · Feb 17, 2023


Products are purchased more and more over the internet, and whenever an item is purchased, the digital platform you used allows you to place a review about the product, so other people can see whether the product lived up to its specifications, or more precisely, to the expectations of the customer. Because, in the end, a review is not an objective perception of the item per se but a subjective appraisal. Which does not mean it has any less value. In fact, reviews can be extremely valuable, and so any company that wants to look more in-depth at the appraisal of its products would do well to assess how it is being assessed in the first place.

For a commercial client, I started analyzing the reviews coming from three major platforms. In this blog I will show how I used R to translate the Dutch reviews into English and then applied semantic analysis to gain a more in-depth understanding of how certain products and product families were appraised across time and place.

Although anybody could scrape review data from a digital platform, I will not share the data here, since this is work in progress coming from a commercial assignment. Nevertheless, the reader should be able to easily translate what I have done to their own texts. A lot of examples on text analysis already exist, but the majority come from books or from some other example that has been used way too often. This, on the other hand, is the real deal. And, as you will see, not that easy.

Let's get started by first loading the libraries and the data, and then doing some wrangling. The files are in Dutch, and so is the content, but once I get to the translation part it will turn to English. Because the translation is such a big part, I will leave the first part in the original Dutch language.

#### IMPORT LIBRARIES ----
library(readxl)
library(dplyr)
library(ggplot2)
theme_set(theme_bw())
library(janitor)
library(ggridges)
library(deeplr)
library(httr)
library(jsonlite)
library(stringr)
library(tidytext)
library(wordcloud)
library(reshape2)
library(igraph)
library(ggraph)
#### IMPORT DATA & WRANGLING ----

Reviews <- read_excel("Data/Reviewbestand tbv Reviews.xlsx")
dim(Reviews)
[1] 108496 39

The review file contains plenty of reviews and has almost no missing data, so that should be fun enough to play with. Of course, you never know what's in a dataset until you start unraveling it.

DataExplorer::plot_missing(Reviews)

Let's look at how many reviews we actually have.

Reviews <- Reviews %>% 
clean_names()%>%
dplyr::mutate(Date = as.Date(with(Reviews,paste(jaar,maand,dag,sep="-")),"%Y-%m-%d"))%>%
dplyr::mutate(rating = as.numeric(rating))%>%
dplyr::mutate(rating = if_else(bron=="Bol", rating*5, rating))

Reviews%>%
mutate(brand = recode_factor(brand, xxxx = "XXXX"),
jaar = as.factor(jaar))%>%
filter(brand == "XXXX")%>%
group_by(bron)%>%
select(bron)%>%
table()

Amazon Bol Fahrrad
5880 4655 485

So, that's about 10k reviews, of which the majority come from Amazon. Once again, it is not so much about how many reviews you have but much more about what is in them. The first problem we will see very early on (in the next couple of plots already) is that the majority of reviews, if not almost all, come with a star rating added by the reviewer. But, as we will see, these star ratings add very little information. If any.

Numbers are great, but text is better.

ggplot(Reviews, 
aes(x=rating, fill=bron))+
geom_density(alpha=0.5)+
facet_grid(~jaar)

Reviews%>%
filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
ggplot(.,
aes(x=rating, fill=bron))+
geom_density(alpha=0.5)+
facet_grid(~jaar)
You can see the development of the rating distribution, per source and per year. The density plot is a bit misused here, because the five-point rating scores are on an ordinal scale, not a ratio scale, so there really is nothing in between. But it does show that the majority of the reviews come with a five-star rating and that the number of reviews has increased over time.
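Since the ratings are ordinal, a plain count per rating level is arguably the more honest picture. A small sketch (not part of the original pipeline) showing the same comparison with bars instead of densities:

# Ordinal ratings: show counts per rating level instead of a smoothed density.
Reviews %>%
  mutate(rating = factor(rating)) %>%
  ggplot(aes(x = rating, fill = bron)) +
  geom_bar(position = "dodge") +
  facet_grid(~jaar)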

Let's dig a bit deeper into the reviews, and the development of the ratings.

Reviews%>%
filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
mutate(jaar = as.factor(jaar))%>%
ggplot(.,
aes(x=jaar, y=rating))+
geom_point()+
geom_jitter()+
labs(x="Year",
y="Rating")

Reviews%>%
mutate(brand = recode_factor(brand, xxx = "XXXX"),
jaar = as.factor(jaar))%>%
ggplot(.,
aes(x=jaar, y=rating, col=brand))+
geom_point(alpha=0.5)+
geom_jitter()+
labs(x="Year",
y="Rating")
The reviews per year, per major brand, and the accompanying ratings. As you can see, there is once again the clearly visible increase in the number of ratings over the years. And although the majority are five-star or four-star ratings, lower ratings can certainly be found. Once again, the value of the review needs to be assessed separately from the star.
Reviews%>%
mutate(brand = recode_factor(brand, xxxx = "XXXX"),
jaar = as.factor(jaar))%>%
ggplot(.,
aes(x=jaar, y=rating, col=brand))+
geom_point()+
geom_jitter()+
facet_wrap(~brand, ncol=4)+
labs(x="Year",
y="Rating",
col="Brand")+
theme(legend.position="none")

Reviews%>%
mutate(brand = recode_factor(brand, xxxx= "XXXX"),
jaar = as.factor(jaar),
rating = as.factor(rating))%>%
group_by(jaar, rating, brand)%>%
summarise(freq=n())%>%
ungroup()%>%
ggplot(.,
aes(x=jaar, y=rating, fill=log(freq)))+
geom_tile()+
scale_fill_viridis_c(option = "magma")+
facet_wrap(~brand, ncol=4)+
labs(x="Year",
y="Rating",
fill="Number of ratings (log scale)")+
theme(legend.position="bottom")
And, per brand, we can see the rating development in two ways: a jittered scatter plot or a heatmap. As you can see, some brands receive a steady stream of reviews whilst others struggle. For most brands, the majority of reviews consist of four- or five-star ratings.

We can also try and see if rating developments, across all brands, differ per source.

Reviews%>%
select(rating, jaar, bron)%>%
mutate(jaar = as.factor(jaar),
bron=as.factor(bron))%>%
ggplot(.,
aes(x=rating, y=jaar, fill=bron))+
ggridges::geom_density_ridges(alpha=0.4)+
facet_wrap(~bron, ncol=3)
That does not seem to be the case. In fact, according to these plots, little has changed over the years.

And now, let's look at the reviews of a particular brand and how it performs across time and across products within that brand.

Reviews%>%
filter(!product_family_xxxx%in%("-") & brand%in%c("xxxx", "XXXXX"))%>%
mutate(jaar = as.factor(jaar))%>%
group_by(jaar, rating, product_family_xxxx)%>%
summarise(freq=n())%>%
ggplot(.,
aes(y=rating,
x=jaar,
fill=freq))+
geom_tile()+
facet_wrap(~product_family_xxxx)+
labs(x="Year",
y="Rating",
fill="Number of ratings")+
theme(legend.position="bottom")
Splitting the data results in quite a few gaps, which is normal.

Now, I already mentioned that the reviews are in Dutch, but the majority of semantic lexicons are not in Dutch; they are in English. And, since I have reviews in multiple languages (of which the majority are in Dutch) and I want to cross-reference, it is best if I pick a single language. This means I need to translate the reviews, which I will do using the DeepL translation API. Up to a certain number of characters per month, the API is free after personal registration. You can then use your personal key to ask the API for translations of at most roughly 30k characters at a time, up to a maximum of 500k characters. Make no mistake, that limit is easily reached, and in fact I was not able to process all the data from the start. That is something I will need to follow up on.

my_key<-"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
langs <- deeplr::available_languages2(my_key)
as.data.frame(langs)
deeplr::usage2(my_key)

Reviews%>%
filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
pull(review)%>%
nchar(type = "bytes")%>%
sum() # total number of bytes that need to be translated

try<-Reviews%>%
dplyr::filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
select(review)%>%
as.character()
try%>%nchar(., type = "char")
try<-deeplr::split_text(try, max_size_bytes = 29000)
try1<-try%>%filter(segment_id%in%(c(1)))
try2<-try%>%filter(segment_id%in%(c(2)))
try3<-try%>%filter(segment_id%in%(c(3)))
try4<-try%>%filter(segment_id%in%(c(4)))
try5<-try%>%filter(segment_id%in%(c(5)))
try6<-try%>%filter(segment_id%in%(c(6)))
try7<-try%>%filter(segment_id%in%(c(7)))
try8<-try%>%filter(segment_id%in%(c(8)))
try9<-try%>%filter(segment_id%in%(c(9)))
try10<-try%>%filter(segment_id%in%(c(10)))
try11<-try%>%filter(segment_id%in%(c(11)))
try12<-try%>%filter(segment_id%in%(c(12)))
try13<-try%>%filter(segment_id%in%(c(13)))
try14<-try%>%filter(segment_id%in%(c(14)))
try15<-try%>%filter(segment_id%in%(c(15)))
try16<-try%>%filter(segment_id%in%(c(16)))
try17<-try%>%filter(segment_id%in%(c(17)))

Translation <-deeplr::translate2(
text = try,
target_lang = "EN",
auth_key = my_key)

Translation1<-Translation%>%
as.data.frame()%>%
unlist()%>%
stringr::str_split(string =.,
pattern =stringr::regex("\", \""))%>%
magrittr::extract2(3)%>%
as.data.frame()%>%
rename(., translated = .)%>%
mutate(segment_id = "1")%>%
as_tibble()

Translation2 <-deeplr::translate2(text = try2,target_lang = "EN",auth_key = my_key)%>%
as.data.frame()%>%
unlist()%>%
stringr::str_split(string =.,
pattern =stringr::regex("\", \""))%>%
magrittr::extract2(3)%>%
as.data.frame()%>%
rename(., translated = .)%>%
mutate(segment_id = "2")%>%
as_tibble()

comb_trans<-data.table::rbindlist(list(Translation2,Translation3,Translation4,
Translation5,Translation6,Translation7,
Translation8,Translation9,Translation10,
Translation11,Translation12,Translation13,
Translation14,Translation15,Translation1))
dim(comb_trans)
Reviews%>%
filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
dim() # I am missing about 400 reviews
write.csv(comb_trans, "comb_trans.csv")

comb_trans<-comb_trans%>%
mutate(segment_id = factor(segment_id, levels=c("1","2","3","4","5","6","7","8","9"
,"10","11","12","13","14","15")))%>%
arrange(segment_id)%>%
ungroup()%>%
mutate(translated = trimws(.$translated, which = c("both"), whitespace = "[ \t\r\n]"),
row_number = row_number())

comb_trans%>%select(segment_id)%>%table()

Reviews_bol_xxxx<-Reviews%>%
dplyr::filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
ungroup()%>%
mutate(row_number = row_number())

# Segment 1
Reviews_bol_xxxx%>%select(row_number, review)%>%filter(row_number==218)
comb_trans%>%select(row_number,segment_id,translated)%>%filter(row_number==218)
comb_trans%>%select(row_number,segment_id,translated)%>%filter(row_number==219) # row 218 reviews = rows 218 + 219 translated
Reviews_bol_xxxx%>%slice(n=1:218)
comb_trans%>%slice(n=1:219)
rev1<-Reviews_bol_xxxx%>%slice(n=1:218)
trans1<-comb_trans%>%slice(n=1:218)
dim(rev1);dim(trans1)
rev1%>%bind_cols(trans1)%>%select(review, translated)
comb1<-rev1%>%bind_cols(trans1)
# Segment 2
Reviews_bol_xxxx%>%select(row_number, review)%>%filter(row_number==407)
comb_trans%>%select(row_number,segment_id,translated)%>%filter(row_number==409)
rev2<-Reviews_bol_xxxx%>%slice(n=219:407)
trans2<-comb_trans%>%slice(n=220:408)
dim(rev2);dim(trans2)
rev2%>%bind_cols(trans2)%>%select(review, translated)
comb2<-rev2%>%bind_cols(trans2)
# Segment 3
Reviews_bol_xxxx%>%select(row_number, review)%>%filter(row_number==643)
comb_trans%>%select(row_number,segment_id,translated)%>%filter(row_number==646)
rev3<-Reviews_bol_xxxx%>%slice(n=408:643)
trans3<-comb_trans%>%slice(n=410:645)
dim(rev3);dim(trans3)
rev3%>%bind_cols(trans3)%>%select(review, translated)
comb3<-rev3%>%bind_cols(trans3)

............................

combined<-data.table::rbindlist(list(comb1,comb2,comb3,
comb4,comb5,comb6,
comb7,comb8,comb9,
comb10,comb11,comb12,
comb13,comb14,comb15))
combined<-combined%>%
as_tibble()%>%
select(-c(row_number...41, row_number...44))
write.csv(combined, "combined.csv")

combined%>%
select(review,
translated)%>%
filter(row_number()==2)

# A tibble: 1 x 2
review translated
<chr> <chr>
1 zeer mooie en toch handige afsluitende tas. Very nice and yet convenient locking bag.

Okay, so that was quite the endeavor. Apart from the obvious imperfections of my pipeline, I also struggled with the fact that the API cuts off reviews in the middle once the call limit is met. I should have thought about that early on, but I wanted to get as much out of the free subscription as possible. Next time I will set up a more straightforward pipeline, which will save me a lot of time. However, in the end, as you can see from the last row, I did end up with sensible translations of my reviews. A total of almost 4200 translated reviews.
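As a side note, the seventeen try objects and the per-segment Translation objects could also be produced in a single loop. A minimal sketch, assuming the unsplit review string is stored in full_text (what try held before it was overwritten by split_text()) and that split_text() returns the text of each segment in a column called segment_text; check str() of the result before relying on that name:

# Hypothetical loop over the DeepL segments instead of try1 ... try17 by hand.
segments <- deeplr::split_text(full_text, max_size_bytes = 29000)

translations <- lapply(split(segments, segments$segment_id), function(seg) {
  data.frame(
    segment_id = unique(seg$segment_id),
    translated = deeplr::translate2(
      text = seg$segment_text,   # assumed column name
      target_lang = "EN",
      auth_key = my_key
    )
  )
})

comb_trans_alt <- dplyr::bind_rows(translations)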

Let's see what I can do with these reviews.

So, first, I will need to unnest the review lines and start looking at separate words. Later on, I will come back to connecting two or three words, but the majority of text analysis revolves around single words and the emotions they carry. So let's see what exactly is in those reviews.

tidy_text<-combined %>%
select(jaar, rating, translated)%>%
mutate(linenumber = row_number())%>%
unnest_tokens(word, translated)%>%
anti_join(stop_words)

tidy_text %>%
count(word, sort = TRUE)

# A tibble: 3,001 x 2
word n
<chr> <int>
1 nice 1629
2 bag 1232
3 bike 1219
4 sturdy 776
5 quality 713
6 basket 711
7 easy 626
8 convenient 372
9 bags 366
10 spacious 365
# ... with 2,991 more rows

tidy_text %>%
group_by(jaar, rating)%>%
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) +
geom_col() +
facet_wrap(~rating)+
labs(y = NULL)
Well, a LOT of words.

I already mentioned that to create some kind of semantic analysis, a lexicon is needed. A lexicon contains emotional words, or sentiments, connecting a certain word to a number. So, for instance, on a scale of -5 to 5, the word 'happy' will score a 4 in one lexicon, and perhaps a 5 in another. I am not entirely sure how the lexicons were created, but consider them rankings of words on a sentiment scale.
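To make that concrete, you can look up a single word directly in a lexicon. A small sketch using the AFINN lexicon (on first use, get_sentiments() may ask to download the lexicon via the textdata package):

# Look up one word in the AFINN lexicon; values range from -5 to 5.
get_sentiments("afinn") %>%
  filter(word == "happy")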

Here, I will show you which words are part of the ‘joy’ chapter of the NRC lexicon. Quite a lot of words are connected to ‘joy’.

get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")

nrc <- get_sentiments("nrc")
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
nrc_joy
# A tibble: 687 x 2
word sentiment
<chr> <chr>
1 absolution joy
2 abundance joy
3 abundant joy
4 accolade joy
5 accompaniment joy
6 accomplish joy
7 accomplished joy
8 achieve joy
9 achievement joy
10 acrobat joy
# ... with 677 more rows

Since we have a lexicon now, we can also see how many times words connected to 'joy' occur. And, if we find the 'joy' part too restrictive, we can see how many words have any sentiment attached to them whatsoever, using the full NRC lexicon.

tidy_text %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)

Joining, by = "word"
# A tibble: 101 x 2
word n
<chr> <int>
1 satisfied 139
2 happy 133
3 beautiful 123
4 shopping 100
5 perfect 81
6 electric 79
7 closure 66
8 money 52
9 excellent 39
10 excited 38
# ... with 91 more rows

tidy_text %>%
inner_join(nrc) %>%
count(word, sort = TRUE)
Joining, by = "word"
# A tibble: 621 x 2
word n
<chr> <int>
1 sturdy 776
2 fits 538
3 happy 532
4 shopping 500
5 convenient 372
6 spacious 365
7 perfect 324
8 money 312
9 rack 302
10 top 300
# ... with 611 more rows

Let's dig a bit deeper into sentiment analysis and use a different lexicon: Bing. This lexicon classifies words as positive or negative, from which you can get a sentiment number by subtracting the number of negative words from the number of positive words. Although the words themselves have much more meaning than just being a step-up towards a number, it makes calculations easier. So, my advice would be to look at the words themselves, their semantic meaning and the sentiments attached to them. Combine all three parts, and do not just transform text to numbers because it is easier to calculate or build models.

tidy_text_sentiment <- tidy_text %>%
inner_join(get_sentiments("bing")) %>%
count(jaar, rating, index=linenumber, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(tidy_text_sentiment, aes(index, sentiment), col="black") +
geom_col(show.legend = FALSE, width=2)+
facet_grid(~jaar, scales="free", space="free")
And here we can see the sentiment, per review, per year. The majority of the reviews have an overall positive sentiment but, like I said, transforming text to numbers will get you in trouble if you do not take a look at the underlying text.

What I am curious about is whether the sentiment analysis can be connected to the ratings provided. I would expect that a higher star rating also leads to a higher sentiment, although, unlike the star rating, the sentiment score has no natural limit.
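Before plotting, a quick sanity check of that expectation: a rank correlation between the star rating and the net sentiment per review (a sketch, using the tidy_text_sentiment object created above):

# Spearman (rank) correlation between star rating and net sentiment per review.
tidy_text_sentiment %>%
  summarise(rho = cor(rating, sentiment, method = "spearman"))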

tidy_text_sentiment%>%
mutate(jaar = as.factor(jaar))%>%
group_by(rating, jaar)%>%
summarise(sentiment = median(sentiment))%>%
ggplot(.,
aes(x=rating, y=jaar, fill=sentiment, label=sentiment)) +
geom_tile()+
scale_fill_viridis_c(option="magma")+
geom_text()+
labs(x="Score",
y="Year",
fill="Sentiment (higher is better)",
title = "Sentiment per year and per score (median)")
As you can see, a higher score (or star rating) often leads to a higher overall (median) sentiment number, but the relationship is not linear. Nor is it the same across time. I am sure we could do more splits, but to me this shows that a single star rating may be indicative (as it should be, since it is given by the reviewer), but it does not abolish the need to dig deeper into the reviews themselves.

Let's look at the distribution of the sentiments in another way. Since we are subtracting negative from positive words, we can create density plots, but they can become quite fluid, meaning that the spikes you see have less meaning than the plot makes you believe.

tidy_text_sentiment%>%
mutate(jaar = as.factor(jaar))%>%
ggplot(., aes(x=sentiment, fill=jaar))+
geom_density(alpha=0.5)+
facet_grid(~rating, scales="free", space="free")+
labs(fill="Year")
The sentiment distribution per year and per star rating. From this plot it should again be very clear that the rating in itself is not that informative, nor will it directly show you what the content of the review will be. Even five-star ratings can have content which is less than positive.
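One way to make that visible is to pull out the five-star reviews whose net sentiment is negative and read them in full. A sketch that links the sentiment scores back to the translated text via the line number used earlier:

# Five-star reviews with a negative net sentiment, joined back to their text.
tidy_text_sentiment %>%
  filter(rating == 5, sentiment < 0) %>%
  left_join(combined %>%
              mutate(linenumber = row_number()) %>%
              select(linenumber, translated),
            by = c("index" = "linenumber")) %>%
  select(jaar, rating, sentiment, translated)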

Now I am curious if I can see some temporal development in the sentiments.

tidy_text_sentiment%>%
mutate(jaar = as.factor(jaar),
rating = as.factor(rating))%>%
ggplot(., aes(x=jaar, y=sentiment, col=rating))+
geom_point(alpha=0.5)+
geom_jitter()+
facet_grid(~rating, scales="free", space="free")+
labs(x="Year",
y="Sentiment (higher is more positive)",
title="Development of sentiment per year and per rating")
As you can see, once again, the sentiment analysis does not reveal a one-to-one relationship with the star ratings. Yes, the higher the star rating, the better the overall sentiment of the review, but there is much more information WITHIN a review than one might expect from looking at the star rating.

Let's move deeper into the words, now that we have seen that the overall sentiment score and the star rating have some of the expected relationship but do not even tell half the story. What I am going to do next is look at how many positive and negative words I can find for a particular product family of a particular brand. This should give me a better understanding of the ratio of positive to negative words, and it will also slowly take me away from the numbers.

combined %>%
select(jaar, product_family_xxxx, rating, translated)%>%
unnest_tokens(word, translated)%>%
anti_join(stop_words)%>%
inner_join(get_sentiments("bing")) %>%
filter(!product_family_xxxx%in%c("-"))%>%
count(product_family_xxxx, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)%>%
ggplot(.)+
geom_bar(aes(y=product_family_xxxx , x=positive, fill="positive"), stat="identity")+
geom_bar(aes(y=product_family_xxxx , x=negative, fill="negative"), stat="identity")+
labs(x="Number of words",
y="Product family",
fill="Positive / negative word",
title="Number of positive and negative words per product family")
You can easily see that some product families receive many more reviews (or have much longer reviews, since we are only counting words). But the reviews are for sure not all positive.

We can also check whether there is a temporal effect, per year.

combined %>%
select(jaar, product_family_xxxx, rating, translated)%>%
unnest_tokens(word, translated)%>%
anti_join(stop_words)%>%
inner_join(get_sentiments("bing")) %>%
filter(!product_family_xxxx%in%c("-"))%>%
count(jaar, product_family_xxxx, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)%>%
ggplot(.)+
geom_bar(aes(y=product_family_xxxx , x=positive, fill="positive"), stat="identity")+
geom_bar(aes(y=product_family_xxxx , x=negative, fill="negative"), stat="identity")+
facet_grid(~jaar, space="free", scale="free")+
labs(x="Number of words",
y="Product family",
fill="Positive / negative word",
title="Number of positive and negative words per product family and year")
From what I can see graphically, that does not necessarily seem to be the case. Still, although we are moving away from the star rating and back into words, we are still counting. So a lot of qualitative information is lost by shrinking text to numbers.

I mentioned before that there are different lexica that can be used, and they are not the same. So, let's see how much they differ, and what kind of impact that could have on our findings.

afinn <- tidy_text %>% 
inner_join(get_sentiments("afinn")) %>%
group_by(linenumber) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN",
index=linenumber)
bing_and_nrc <- bind_rows(
tidy_text %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
tidy_text %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative")))%>%
mutate(method = "NRC"))%>%
count(method, index=linenumber, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
Here you see the sentiment analysis per review across the different lexica; as you can see, it really does depend on which one you use. Not just per review, but also across reviews. Hence, it would actually be best to repeat any analysis you do across all three of them.
afinn <- tidy_text %>% 
inner_join(get_sentiments("afinn")) %>%
group_by(jaar, rating, linenumber) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN",
index=linenumber)
bing_and_nrc <- bind_rows(
tidy_text %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
tidy_text %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative")))%>%
mutate(method = "NRC"))%>%
count(method, jaar, rating, index=linenumber, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc) %>%
mutate(jaar = as.factor(jaar),
rating = factor(rating, levels = c("1","2","3","4","5")))%>%
ggplot(aes(jaar, sentiment, col = method)) +
geom_point() +
geom_jitter() +
geom_hline(yintercept=0, lty=2, col="black")+
facet_grid(~rating,scales = "free", space = "free")
The difference between them becomes even more striking. Although I have mostly used the Bing lexicon, it would be better to use all three and compare them. Because which one you use makes a difference, like using different rulers to measure distance.
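To put a number on how far the rulers diverge, the per-review scores of the three lexicons can be correlated against each other. A sketch using the afinn and bing_and_nrc objects built above:

# Pairwise correlation of per-review sentiment scores across the three lexicons.
bind_rows(afinn, bing_and_nrc) %>%
  ungroup() %>%
  select(method, index, sentiment) %>%
  tidyr::pivot_wider(names_from = method, values_from = sentiment) %>%
  select(-index) %>%
  cor(use = "pairwise.complete.obs")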

Now, let's move even deeper into the words by looking at the distribution of positive and negative words across reviews, per star rating, using the Bing lexicon.

tidy_text %>%
inner_join(get_sentiments("bing")) %>%
count(rating, word, sentiment, sort = TRUE) %>%
ungroup()%>%
group_by(sentiment) %>%
slice_max(n, n = 50) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~rating, scales = "free_x",ncol=5) +
labs(x = "Addition to sentiment",
y = "Word",
title="Distribution positive and negative words per review")
As expected, the majority of the words are positive and can be found in four- or five-star ratings. Nevertheless, even in those ratings negative words can be found, but you always need to look very closely at whether the word is indeed negative or not. Context is everything, especially when doing text analysis. Disentangling single words from text is a perfect way to lose context.
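A quick way to do that check is to pull up the full translated reviews containing a flagged word and read them in context. The word 'cheap' is used here purely as an illustrative example of a word a lexicon may mark as negative while the reviewer means it positively:

# Read a flagged word in its original context before trusting the label.
combined %>%
  filter(stringr::str_detect(tolower(translated), "\\bcheap\\b")) %>%
  select(rating, translated) %>%
  head(10)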

One of the most fun methods to apply in text analysis is the wordcloud, which will give you an idea of the words included in the reviews, how many times they are detected, and whether they are positive or negative. Although the wordcloud does not connect words, the human brain will have no other choice but to scan across it and look for patterns. And most of the time, some of them can be identified.

tidy_text %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 200))

tidy_text %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 500)
Two wordclouds: with (right) and without (left) sentiment.

The number of times a word has been detected only really makes sense if you know how many words were used in total. So we want to look at the ratio of a particular word to all words used for that particular rating, or even across all reviews. Just to get some context back.

review_words <- tidy_text %>%
count(rating, word, sort = TRUE)
total_words <- review_words %>%
group_by(rating) %>%
summarize(total = sum(n))
review_words <- left_join(review_words, total_words)
review_words%>%
rename(review=rating)%>%
mutate(rownumber = row_number())%>%
filter(rownumber<201)%>%
ggplot(., aes(y=word, x=n/total)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_wrap(~review, ncol = 5, scales = "free")
Here you can see, per star rating, the words most often encountered and their ratio relative to all words for that rating. As you can see, many of the high-frequency words only have a low ratio, meaning there is a lot going on in the reviews, especially those with a four- or five-star rating. It seems that just giving a star is not enough: text is being provided as well, and that text needs to be analyzed.
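The same ratio, weighted by how specific a word is to a rating, can be computed in one step with tidytext's bind_tf_idf(), treating each star rating as a 'document'. A sketch using the review_words object from above:

# tf-idf of single words, with the star rating acting as the document.
review_words %>%
  bind_tf_idf(word, rating, n) %>%
  arrange(desc(tf_idf))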

Up until now we have looked at ratings and single words, but it is now time to start connecting words, looking for the combinations that are most prevalent. I mentioned before that words need context, and the meaning of a word can change dramatically in the presence of the word before or after the word of interest. That is why a sentence has more meaning than the separate words that make it up in the first place.

Below you can see me looking for the most prevalent two-word combinations. It does not take long to realize that the majority of words are preceded by a word with very little meaning of its own, such as 'a', 'on', 'the' or 'it'. So one can best remove these words and connect to the next word in line that actually has some deeper meaning. If we do that, we will start to see some very interesting combinations. Which we can then, of course, also start connecting to the star ratings again, since they are still of interest. And if not because of the rating itself, then because of how the rating relates to the words within a rated review.

> bol_bigrams <- combined %>%
+ select(jaar, rating, translated)%>%
+ unnest_tokens(bigram, translated, token = "ngrams", n = 2) %>%
+ filter(!is.na(bigram))
> bol_bigrams %>%
+ count(bigram, sort = TRUE)
# A tibble: 24,796 x 2
bigram n
<chr> <int>
1 easy to 545
2 good quality 495
3 on the 379
4 nice and 329
5 the bike 306
6 it is 279
7 the bag 271
8 to use 266
9 to attach 262
10 a lot 210
# ... with 24,786 more rows
> bigrams_separated <- bol_bigrams %>%
+ tidyr::separate(bigram, c("word1", "word2"), sep = " ")
> bigrams_filtered <- bigrams_separated %>%
+ filter(!word1 %in% stop_words$word) %>%
+ filter(!word2 %in% stop_words$word)
> bigram_counts <- bigrams_filtered %>%
+ count(word1, word2, sort = TRUE)
> bigram_counts
# A tibble: 4,965 x 3
word1 word2 n
<chr> <chr> <int>
1 bike basket 144
2 nice design 124
3 bike bag 116
4 nice sturdy 110
5 meets expectations 105
6 bicycle bag 88
7 nice bag 73
8 electric bike 67
9 quality convenient 66
10 fits perfectly 63
# ... with 4,955 more rows
> bigrams_united <- bigrams_filtered %>%
+ tidyr::unite(bigram, word1, word2, sep = " ")
> bigrams_united
# A tibble: 9,996 x 3
jaar rating bigram
<dbl> <dbl> <chr>
1 2018 4 convenient locking
2 2018 4 locking bag
3 2016 5 bike bell
4 2016 5 bell meets
5 2016 5 meets expectations
6 2016 5 expectations fast
7 2016 5 fast delivery
8 2016 5 delivery nice
9 2016 5 nice classic
10 2016 5 classic design
# ... with 9,986 more rows

Let's take it up a notch. Let's find three-word combinations, and fix the second word to 'bag'. We can then see which two-word combinations are best connected to the word 'bag' and connect those combinations to the ratings.

combined %>%
select(jaar, rating, translated)%>%
unnest_tokens(trigram, translated, token = "ngrams", n = 3) %>%
filter(!is.na(trigram)) %>%
tidyr::separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)
bigrams_filtered %>%
filter(word2 == "bag") %>%
count(jaar, rating, word1, sort = TRUE)
bigram_tf_idf <- bigrams_united %>%
count(rating, bigram) %>%
bind_tf_idf(bigram, rating, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf%>%
group_by(rating)%>%
arrange(desc(tf_idf), .by_group = TRUE)%>%
top_n(., 4)%>%
mutate(rating = factor(rating))%>%
ggplot(., aes(x=tf_idf, y=bigram, fill=rating))+
geom_bar(stat="identity")+
facet_wrap(~rating, ncol=5)
On the y-axis you see the two-word combinations with the highest tf-idf per star rating, that is, the combinations most characteristic of each rating. One can see that some combinations are common in both the four- and the five-star ratings, but the majority can truly be found across ratings. To me, this is an indication that the rating has some meaning, but once again it only warrants a deeper look.

One of the best words to look for is the word 'not', because 'not' is a 180-degree context changer. Let me show you what happens if I look for combinations with the word 'not'.

bigrams_separated %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE)
AFINN <- get_sentiments("afinn");AFINN
not_words <- bigrams_separated %>%
group_by(rating)%>%
filter(word1 == "not") %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
my_colors <- c("darkgreen", "darkred")
not_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(100) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = my_colors) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"not\"")+
facet_wrap(~rating, ncol=5)
Across ratings, you can see the words preceded by the word 'not'. The colour hints at the 180-degree context change: a green bar marks a word with a positive sentiment value and a red bar one with a negative value, although in context the meaning is the opposite. That is because a word can have a positive sentiment but flips around when the word 'not' is attached to it. Words like 'not' or 'no' are very important to look out for.

Let's do that and look for both 'not' and 'no'.

negation_words <- c("not", "no")
negated_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE)
negated_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(200) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = my_colors) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"negation term\"")+
facet_wrap(~word1, ncol=2, scales="free")
It is easy to see which words are preceded by 'no' and 'not', their sentiment value and the number of times they occur. It's all about looking for patterns.
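A simple, hedged way to fold this negation effect back into the scoring is to flip the AFINN value of any word preceded by a negation term before summing per group. A rough sketch (it ignores other negations and longer-range context):

# Flip the AFINN value of words preceded by "not" or "no" before summing.
bigrams_separated %>%
  inner_join(AFINN, by = c(word2 = "word")) %>%
  mutate(value = ifelse(word1 %in% negation_words, -value, value)) %>%
  group_by(jaar, rating) %>%
  summarise(adjusted_sentiment = sum(value), .groups = "drop")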

The last thing I will do is try to connect words via correlation analysis, providing me with a network of words. From that network I should be able to get a glimpse of the value of the reviews as well as hints at future endeavors. A single review is really worthwhile in itself, especially if it has detailed information about a product. Then again, the combinations across reviews offer additional information and context on a product or product family. So it's like slicing an onion and looking for patterns that are interesting across layers and within a layer.

Let's move towards the correlation analysis, highlighting the combinations of words that are most frequently observed.

bigram_graph<-combined %>%
select(jaar, rating, translated)%>%
unnest_tokens(bigram, translated, token = "ngrams", n = 2) %>%
filter(!is.na(bigram))%>%
tidyr::separate(bigram, c("word1", "word2"), sep = " ")%>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)%>%
count(word1, word2, sort = TRUE)%>%
filter(n > 20) %>%
graph_from_data_frame()
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
The same network, but depicted differently, which leads to different placements within the network; the connections stay the same. The plot on the right is the most informative, since the transparency of the arrow reflects how often the combination occurs. Do keep in mind that in text analysis, correlations greater than 0.5 are rare. There is just too much going on, so what you are looking for is some connection that you did not foresee, which you can then use to dive deeper into the reviews. For instance, the combination 'meets expectations' keeps sticking out.

Like I did before with the words ‘not’ and ‘no’ (but using placement and not correlation), I can now look at words that are correlated most highly with certain other words. For this exercise, I will use the words ‘bag’, ‘bad’, ‘nice’ and ‘basket’.

bol_section_words <- combined %>%
select(jaar, rating, translated)%>%
group_by(rating)%>%
mutate(section = row_number() %/% 10) %>%
filter(section > 0) %>%
group_by(rating, section)%>%
unnest_tokens(word, translated) %>%
filter(!word %in% stop_words$word)
word_pairs <- bol_section_words %>%
ungroup()%>%
widyr::pairwise_count(word, section, sort = TRUE)
word_pairs
word_cors <- bol_section_words %>%
group_by(word) %>%
filter(n() >= 20) %>%
widyr::pairwise_cor(word, section, sort = TRUE)
word_cors
word_cors %>%
filter(item1 == "not")
word_cors %>%
filter(item1 %in% c("bag", "bad", "nice", "basket")) %>%
group_by(item1) %>%
slice_max(correlation, n = 50) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(item2, correlation)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip()
Here you can see the correlation coefficients on the x-axis — that is not a lot of correlation from a traditional point of view! In fact, in the social sciences it would not even be worth mentioning, but we are talking text analysis here.

To filter things down, let's look at all correlations exceeding the threshold of 0.4, which is an arbitrary level, I will readily admit.

word_cors %>%
filter(correlation > .4) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
Still, the results do carry information. There is information about delivery, which we could dive deeper into, since I want the reviews to talk about the product and not the delivery, which is not something a company can always control. Then there is something about the product being water-repellent, its capacity, and its looks. But not every combination seems to be positive, so here too we could start adding sentiment analysis again.
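A minimal sketch of what that could look like: colour the nodes of the same correlation network by their Bing sentiment, labelling words that are not in the lexicon as 'neutral'. The edges and nodes helper objects are hypothetical additions, not part of the pipeline above:

# Colour the correlation network nodes by Bing sentiment.
edges <- word_cors %>% filter(correlation > .4)

nodes <- tibble::tibble(name = unique(c(edges$item1, edges$item2))) %>%
  left_join(get_sentiments("bing"), by = c("name" = "word")) %>%
  distinct(name, .keep_all = TRUE) %>%
  mutate(sentiment = ifelse(is.na(sentiment), "neutral", sentiment))

igraph::graph_from_data_frame(edges, vertices = nodes) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(aes(color = sentiment), size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()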

Last but not least, I want to focus on the temporal development of certain words. These words have a semantic meaning for sure, but they can also be words on which a company may or must focus, because these are the words that matter. So, what I did here was focus on the words 'price', 'nice' and 'bad', and see how often we encounter them across reviews and time.

combined %>%
select(jaar, rating, translated)%>%
mutate(translated = gsub(x = translated,
pattern = "[0-9]+|[[:punct:]]|\\(.*\\)",
replacement = "")) %>%
mutate(linenumber = row_number())%>%
group_by(jaar)%>%
unnest_tokens(word, translated)%>%
anti_join(stop_words)%>%
count(word)%>%
group_by(jaar)%>%
bind_tf_idf(term=word, document=jaar, n) %>%
arrange(desc(tf_idf))%>%
mutate(year_total = sum(n),
jaar=zoo::as.Date(zoo::as.yearmon(jaar)))%>%
filter(word %in% c("nice", "bad", "price")) %>%
ggplot(aes(jaar, n / year_total)) +
geom_point(color="black") +
geom_smooth() +
facet_wrap(~ word, scales = "free_y") +
scale_y_continuous(labels = scales::percent_format()) +
labs(y = "% frequency of word in review")+
scale_x_date(date_breaks = "1 year")
As you can see, the word 'nice' certainly has a temporal component, although it is impossible to explain just from this plot. The other words do not seem to be affected across time.

Alright, we are at the end of an introductory post on the semantic analysis of text coming from product reviews. There is so much more we can do, but I hope it is already clear that adding a semantic pipeline to your customer insights or marketing department is a must.

Please let me know if anything is amiss or wrong, or if you have questions!

