How your customers perceive you and your products.
Semantic text analysis in R.
Product are purchased more and more over the internet and whenever an item is purchased the digital platform you used allows you to place a review about the product. So other people can see if the product lived up to its specifications, or more precisely, to the expectation of the customer. Because, in the end, a review is not an objective perception of the item per se but a subjective appraisal. Which does not mean it has any less value. In fact, reviews can be extremely valuable and so any company that wants to look more in-depth in the appraisal of their products would do well to assess how they are assessed in the first place.
For a commercial company, I started analyzing the reviews coming from three major platforms. Using R I will show in this blog how I translated the Dutch reviews into English and then applied semantic analysis to gain a more in-depth understanding of how certain products and product families were appraised across time and place.
Although everybody could scrape review data from a digital platform I will not share the data here, since it is a work in progress and coming from a commercial assignment. Nevertheless, the reader should be able to easily translate what I have done to their own texts. A lot of examples already exist on text analysis, but the majority come from books or some other example that has been used way to often. This, on the other hand, is the real deal. And you will see, not that easy.
Lets get started by first loading in the libraries and the data, and then do some wrangling. The files are in Dutch and so is the content but once I get to the translation part it will turn to English. Because the translation is such a big part, I will leave the first part in the original Dutch language.
#### IMPORT LIBRARIES ----
library(readxl)
library(dplyr)
library(ggplot2)
theme_set(theme_bw())
library(janitor)
library(ggridges)
library(deeplr)
library(httr)
library(jsonlite)
library(stringr)
library(tidytext)
library(wordcloud)
library(reshape2)
library(igraph)
library(ggraph)
#### IMPORT DATA & WRANGLING ----
Reviews <- read_excel("Data/Reviewbestand tbv Reviews.xlsx")
dim(Reviews)
[1] 108496 39
The review file consists of plenty of reviews, and the file has almost no missing data. So that should be fun enough to play with. Of course, you never know what's in the dataset until you start unraveling it.
DataExplorer::plot_missing(Reviews)
Lets look at how many reviews we actually have.
Reviews <- Reviews %>%
clean_names()%>%
dplyr::mutate(Date = as.Date(with(Reviews,paste(jaar,maand,dag,sep="-")),"%Y-%m-%d"))%>%
dplyr::mutate(rating = as.numeric(rating))%>%
dplyr::mutate(rating = if_else(bron=="Bol", rating*5, rating))
Reviews%>%
mutate(brand = recode_factor(brand, xxxx = "XXXX"),
jaar = as.factor(jaar))%>%
filter(brand == "XXXX")%>%
group_by(bron)%>%
select(bron)%>%
table()
Amazon Bol Fahrrad
5880 4655 485
So, thats about 10k reviews of which the majority is from Amazon. Once again, it is not so much about how many reviews yo have but much more what in them. The first problem we will see very early on (in the next coupld of plots already) is that the majority or reviews, if not almost all, allow the reviewer to add star rankings. But, as we will see, these star rankings add very little information. If any.
Numbers are great, but text is better.
ggplot(Reviews,
aes(x=rating, fill=bron))+
geom_density(alpha=0.5)+
facet_grid(~jaar)
Reviews%>%
filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
ggplot(.,
aes(x=rating, fill=bron))+
geom_density(alpha=0.5)+
facet_grid(~jaar)
Lets dig a bit deeper in the reviews, and the development of the ratings.
Reviews%>%
filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
mutate(jaar = as.factor(jaar))%>%
ggplot(.,
aes(x=jaar, y=rating))+
geom_point()+
geom_jitter()+
labs(x="Year",
y="Rating")
Reviews%>%
mutate(brand = recode_factor(brand, xxx = "XXXX"),
jaar = as.factor(jaar))%>%
ggplot(.,
aes(x=jaar, y=rating, col=brand))+
geom_point(alpha=0.5)+
geom_jitter()+
labs(x="Year",
y="Rating")
Reviews%>%
mutate(brand = recode_factor(brand, xxxx = "XXXX"),
jaar = as.factor(jaar))%>%
ggplot(.,
aes(x=jaar, y=rating, col=brand))+
geom_point()+
geom_jitter()+
facet_wrap(~brand, ncol=4)+
labs(x="Year",
y="Rating",
col="Brand")+
theme(legend.position="none")
Reviews%>%
mutate(brand = recode_factor(brand, xxxx= "XXXX"),
jaar = as.factor(jaar),
rating = as.factor(rating))%>%
group_by(jaar, rating, brand)%>%
summarise(freq=n())%>%
ungroup()%>%
ggplot(.,
aes(x=jaar, y=rating, fill=log(freq)))+
geom_tile()+
scale_fill_viridis_c(option = "magma")+
facet_wrap(~brand, ncol=4)+
labs(x="Year",
y="Rating",
fill="Number of ratings (log scale)")+
theme(legend.position="bottom")
We can also try and see if rating developments, across all brands, differ per source.
Reviews%>%
select(rating, jaar, bron)%>%
mutate(jaar = as.factor(jaar),
bron=as.factor(bron))%>%
ggplot(.,
aes(x=rating, y=jaar, fill=bron))+
ggridges::geom_density_ridges(alpha=0.4)+
facet_wrap(~bron, ncol=3)
And now, lets look at the reviews of a particular brand and how it performs across time and across products within that brand.
Reviews%>%
filter(!product_family_xxxx%in%("-") & brand%in%c("xxxx", "XXXXX"))%>%
mutate(jaar = as.factor(jaar))%>%
group_by(jaar, rating, product_family_xxxx)%>%
summarise(freq=n())%>%
ggplot(.,
aes(y=rating,
x=jaar,
fill=freq))+
geom_tile()+
facet_wrap(~product_family_xxx)+
labs(x="Year",
y="Rating",
fill="Number of ratings")+
theme(legend.position="bottom")
Now, I already mentioned that the reviews are in Dutch, but the majority of semantic lexicons are not in Dutch. They are in English. And, since I have reviews from multiple languages, (of which the majority are in Dutch) and I want to to cross-reference, it is best if I pick a single language. This means I need to translate the reviews which I will do using the DeepL translation API. Up to a certain number of characters per month, the API is free following your personal registration. You can then use your personal key to ask the API to come up with max 30k character translation for a maximum of 500k. Do not make no mistake, that is easily attained and in fact I was not able to process all the data from the start. That is something I will need to follow-up with.
my_key<-"XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
langs <- deeplr::available_languages2(my_key)
as.data.frame(langs)
deeplr::usage2(my_key)
Reviews%>%
filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
select(review)%>%
nchar(., type = "bytes")
try<-Reviews%>%
dplyr::filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
select(review)%>%
as.character()
try%>%nchar(., type = "char")
try<-deeplr::split_text(try, max_size_bytes = 29000)
try1<-try%>%filter(segment_id%in%(c(1)))
try2<-try%>%filter(segment_id%in%(c(2)))
try3<-try%>%filter(segment_id%in%(c(3)))
try4<-try%>%filter(segment_id%in%(c(4)))
try5<-try%>%filter(segment_id%in%(c(5)))
try6<-try%>%filter(segment_id%in%(c(6)))
try7<-try%>%filter(segment_id%in%(c(7)))
try8<-try%>%filter(segment_id%in%(c(8)))
try9<-try%>%filter(segment_id%in%(c(9)))
try10<-try%>%filter(segment_id%in%(c(10)))
try11<-try%>%filter(segment_id%in%(c(11)))
try12<-try%>%filter(segment_id%in%(c(12)))
try13<-try%>%filter(segment_id%in%(c(13)))
try14<-try%>%filter(segment_id%in%(c(14)))
try15<-try%>%filter(segment_id%in%(c(15)))
try16<-try%>%filter(segment_id%in%(c(16)))
try17<-try%>%filter(segment_id%in%(c(17)))
Translation <-deeplr::translate2(
text = try,
target_lang = "EN",
auth_key = my_key)
Translation1<-Translation%>%
as.data.frame()%>%
unlist()%>%
stringr::str_split(string =.,
pattern =stringr::regex("\", \""))%>%
magrittr::extract2(3)%>%
as.data.frame()%>%
rename(., translated = .)%>%
mutate(segment_id = "1")%>%
as_tibble()
Translation2 <-deeplr::translate2(text = try2,target_lang = "EN",auth_key = my_key)%>%
as.data.frame()%>%
unlist()%>%
stringr::str_split(string =.,
pattern =stringr::regex("\", \""))%>%
magrittr::extract2(3)%>%
as.data.frame()%>%
rename(., translated = .)%>%
mutate(segment_id = "2")%>%
as_tibble()
comb_trans<-data.table::rbindlist(list(Translation2,Translation3,Translation4,
Translation5,Translation6,Translation7,
Translation8,Translation9,Translation10,
Translation11,Translation12,Translation13,
Translation14,Translation15,Translation1))
dim(comb_trans)
Reviews%>%
filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
dim() # I am missing about 400 reviews
write.csv(comb_trans, "comb_trans.csv")
comb_trans<-comb_trans%>%
mutate(segment_id = factor(segment_id, levels=c("1","2","3","4","5","6","7","8","9"
,"10","11","12","13","14","15")))%>%
arrange(segment_id)%>%
ungroup()%>%
mutate(translated = trimws(.$translated, which = c("both"), whitespace = "[ \t\r\n]"),
row_number = row_number())
comb_trans%>%select(segment_id)%>%table()
Reviews_bol_xxxx<-Reviews%>%
dplyr::filter(!product_category%in%("-") & brand%in%c("xxxx", "XXXX"))%>%
ungroup()%>%
mutate(row_number = row_number())
# Segment 1
Reviews_bol_xxxx%>%select(row_number, review)%>%filter(row_number==218)
comb_trans%>%select(row_number,segment_id,translated)%>%filter(row_number==218)
comb_trans%>%select(row_number,segment_id,translated)%>%filter(row_number==219) # row 218 reviews = rows 218 + 219 translated
Reviews_bol_xxxx%>%slice(n=1:218)
comb_trans%>%slice(n=1:219)
rev1<-Reviews_bol_xxxx%>%slice(n=1:218)
trans1<-comb_trans%>%slice(n=1:218)
dim(rev1);dim(trans1)
rev1%>%bind_cols(trans1)%>%select(review, translated)
comb1<-rev1%>%bind_cols(trans1)
# Segment 2
Reviews_bol_xxxx%>%select(row_number, review)%>%filter(row_number==407)
comb_trans%>%select(row_number,segment_id,translated)%>%filter(row_number==409)
rev2<-Reviews_bol_xxxx%>%slice(n=219:407)
trans2<-comb_trans%>%slice(n=220:408)
dim(rev2);dim(trans2)
rev2%>%bind_cols(trans2)%>%select(review, translated)
comb2<-rev2%>%bind_cols(trans2)
# Segment 3
Reviews_bol_xxxx%>%select(row_number, review)%>%filter(row_number==643)
comb_trans%>%select(row_number,segment_id,translated)%>%filter(row_number==646)
rev3<-Reviews_bol_xxxx%>%slice(n=408:643)
trans3<-comb_trans%>%slice(n=410:645)
dim(rev3);dim(trans3)
rev3%>%bind_cols(trans3)%>%select(review, translated)
comb3<-rev3%>%bind_cols(trans3)
............................
combined<-data.table::rbindlist(list(comb1,comb2,comb3,
comb4,comb5,comb6,
comb7,comb8,comb9,
comb10,comb11,comb12,
comb13,comb14,comb15))
combined<-combined%>%
as_tibble()%>%
select(-c(row_number...41, row_number...44))
write.csv(combined, "combined.csv")
combined%>%
select(review,
translated)%>%
filter(row_number()==2)
# A tibble: 1 x 2
review translated
<chr> <chr>
1 zeer mooie en toch handige afsluitende tas. Very nice and yet convenient locking bag.
Okay, so that was quite the endeavor. Besides from the obvious imperfections of my pipeline, I also struggled with the fact that the API does cut off reviews in the middle once the API-call limit is met. I should have thought about that early one, but I want to get as much out of the free subscription as possible. Next time I will do a more straightforward analysis which will save me a lot of time. However, in the end, as you can see from the last row, I did end up with sensible translations of my reviews. A total of almost 4200 translated reviews.
Lets see what I can do with these reviews.
So, first, I will need to unnest the review lines and start looking at separate words. Later on, I will revert back to connecting two or three words, but the majority of text analysis revolves around words and the emotions they contain. So lets see what’s exactly in those reviews.
tidy_text<-combined %>%
select(jaar, rating, translated)%>%
mutate(linenumber = row_number())%>%
unnest_tokens(word, translated)%>%
anti_join(stop_words)
tidy_text %>%
count(word, sort = TRUE)
# A tibble: 3,001 x 2
word n
<chr> <int>
1 nice 1629
2 bag 1232
3 bike 1219
4 sturdy 776
5 quality 713
6 basket 711
7 easy 626
8 convenient 372
9 bags 366
10 spacious 365
# ... with 2,991 more rows
tidy_text %>%
group_by(jaar, rating)%>%
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word)) +
geom_col() +
facet_wrap(~rating)+
labs(y = NULL)
I already mentioned that to create some kind of semantic analysis, a lexicon is needed. A lexicon contains emotional words, or sentiments, that will connect a certain word to a number. So, for instance, on a scale of -5 to 5, the wordt ‘happy’ will score a 4 in one lexicon, and perhaps a 5 in another lexicon. I am not entirely sure how they were created, but consider it a ranking of a word on the sentiment scale.
Here, I will show you which words are part of the ‘joy’ chapter of the NRC lexicon. Quite a lot of words are connected to ‘joy’.
get_sentiments("afinn")
get_sentiments("bing")
get_sentiments("nrc")
nrc <- get_sentiments("nrc")
nrc_joy <- get_sentiments("nrc") %>%
filter(sentiment == "joy")
nrc_joy
# A tibble: 687 x 2
word sentiment
<chr> <chr>
1 absolution joy
2 abundance joy
3 abundant joy
4 accolade joy
5 accompaniment joy
6 accomplish joy
7 accomplished joy
8 achieve joy
9 achievement joy
10 acrobat joy
# ... with 677 more rows
Since we have a lexicon now, we can also see how many times we have words that are connected to ‘joy’. And, if we find the ‘joy’ part to be to restricted, we can see how many words have any sentiment attached to it, whatsoever, using the NRC lexicon.
tidy_text %>%
inner_join(nrc_joy) %>%
count(word, sort = TRUE)
Joining, by = "word"
# A tibble: 101 x 2
word n
<chr> <int>
1 satisfied 139
2 happy 133
3 beautiful 123
4 shopping 100
5 perfect 81
6 electric 79
7 closure 66
8 money 52
9 excellent 39
10 excited 38
# ... with 91 more rows
tidy_text %>%
inner_join(nrc) %>%
count(word, sort = TRUE)
Joining, by = "word"
# A tibble: 621 x 2
word n
<chr> <int>
1 sturdy 776
2 fits 538
3 happy 532
4 shopping 500
5 convenient 372
6 spacious 365
7 perfect 324
8 money 312
9 rack 302
10 top 300
# ... with 611 more rows
Lets dig a bit deeper into sentiment analysis and use a different lexicon, bing. This lexicon can transform words into positive and negative words by from which you can actually get a sentiment number by subtracting the positive from the negative words. Although the words in themselves have much more meaning than just being a step-up towards a number, it is easier to do calculations now. So, my advice would be to look at the words themselves, their semantic meaning and the sentiments attached to them. But combine all three parts, and not just transform text to numbers because it is easier to calculate or make models.
tidy_text_sentiment <- tidy_text %>%
inner_join(get_sentiments("bing")) %>%
count(jaar, rating, index=linenumber, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)
ggplot(tidy_text_sentiment, aes(index, sentiment), col="black") +
geom_col(show.legend = FALSE, width=2)+
facet_grid(~jaar, scales="free", space="free")
What I am curious about is if the sentiment analysis can be connected to the ratings provided. I would expect that a higher star rating will also lead to a higher sentiment. There is no natural limit.
tidy_text_sentiment%>%
mutate(jaar = as.factor(jaar))%>%
group_by(rating, jaar)%>%
summarise(sentiment = median(sentiment))%>%
ggplot(.,
aes(x=rating, y=jaar, fill=sentiment, label=sentiment)) +
geom_tile()+
scale_fill_viridis_c(option="magma")+
geom_text()+
labs(x="Score",
y="Year",
fill="Sentiment (higher is better)",
title = "Sentiment per year and per score (median)")
Lets look at another way at the distribution of the sentiments. Since we are subtracting negative from positive words, we can create density plots but they can become quite fluid — meaning that the spikes you see have less meaning then the plot makes you believe.
tidy_text_sentiment%>%
mutate(jaar = as.factor(jaar))%>%
ggplot(., aes(x=sentiment, fill=jaar))+
geom_density(alpha=0.5)+
facet_grid(~rating, scales="free", space="free")+
labs(fill="Year")
Now I am curious if I can see some temporal development in the sentiments.
tidy_text_sentiment%>%
mutate(jaar = as.factor(jaar),
rating = as.factor(rating))%>%
ggplot(., aes(x=jaar, y=sentiment, col=rating))+
geom_point(alpha=0.5)+
geom_jitter()+
facet_grid(~rating, scales="free", space="free")+
labs(x="Year",
y="Sentiment (higher is more positive)",
title="Development of sentiment per year and per rating")
Lets move deeper into the words now we have seen that the overall sentiment score and the star-rating have some expected relationship, but do not even tell half the story. What I am going to do next is look at how many positive and negative words I can find for a particular product family for a particular brand. This should give me a better understanding of the ratio of positive vs negative words. And will also slowly take me away from the numbers.
combined %>%
select(jaar, product_family_xxxx, rating, translated)%>%
unnest_tokens(word, translated)%>%
anti_join(stop_words)%>%
inner_join(get_sentiments("bing")) %>%
filter(!product_family_xxxx%in%c("-"))%>%
count(product_family_xxxx, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)%>%
ggplot(.)+
geom_bar(aes(y=product_family_xxxx , x=positive, fill="positief"), stat="identity")+
geom_bar(aes(y=product_family_xxxx , x=negative, fill="negatief"), stat="identity")+
labs(x="Number of words",
y="Product family",
fill="Positive / negative word",
title="How many positive negative words per review per product family")
We can also look if there is a temporal effect, per year.
combined %>%
select(jaar, product_family_xxxx, rating, translated)%>%
unnest_tokens(word, translated)%>%
anti_join(stop_words)%>%
inner_join(get_sentiments("bing")) %>%
filter(!product_family_xxxx%in%c("-"))%>%
count(jaar, product_family_xxxx, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(sentiment = positive - negative)%>%
ggplot(.)+
geom_bar(aes(y=product_family_xxxx , x=positive, fill="positief"), stat="identity")+
geom_bar(aes(y=product_family_xxxx , x=negative, fill="negatief"), stat="identity")+
facet_grid(~jaar, space="free", scale="free")+
labs(x="Number of words",
y="Product family",
fill="Positive / negative word",
title="How many positive negative words per review per product family")
I mentioned before that there are different lexica that can be approached, and they are not the same. So, lets see how much they differ, and what kind of impact that could have on our findings.
afinn <- tidy_text %>%
inner_join(get_sentiments("afinn")) %>%
group_by(linenumber) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN",
index=linenumber)
bing_and_nrc <- bind_rows(
tidy_text %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
tidy_text %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative")))%>%
mutate(method = "NRC"))%>%
count(method, index=linenumber, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc) %>%
ggplot(aes(index, sentiment, fill = method)) +
geom_col(show.legend = FALSE) +
facet_wrap(~method, ncol = 1, scales = "free_y")
afinn <- tidy_text %>%
inner_join(get_sentiments("afinn")) %>%
group_by(jaar, rating, linenumber) %>%
summarise(sentiment = sum(value)) %>%
mutate(method = "AFINN",
index=linenumber)
bing_and_nrc <- bind_rows(
tidy_text %>%
inner_join(get_sentiments("bing")) %>%
mutate(method = "Bing et al."),
tidy_text %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative")))%>%
mutate(method = "NRC"))%>%
count(method, jaar, rating, index=linenumber, sentiment) %>%
tidyr::pivot_wider(names_from = sentiment,
values_from = n,
values_fill = 0) %>%
mutate(sentiment = positive - negative)
bind_rows(afinn,
bing_and_nrc) %>%
mutate(jaar = as.factor(jaar),
rating = factor(rating, levels = c("1","2","3","4","5")))%>%
ggplot(aes(jaar, sentiment, col = method)) +
geom_point() +
geom_jitter() +
geom_hline(yintercept=0, lty=2, col="black")+
facet_grid(~rating,scales = "free", space = "free")
Now, lets move even deeper into words by looking at the distribution of positive and negative words across reviews, and per review, using the Bing lexicon.
tidy_text %>%
inner_join(get_sentiments("bing")) %>%
count(rating, word, sentiment, sort = TRUE) %>%
ungroup()%>%
group_by(sentiment) %>%
slice_max(n, n = 50) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(n, word, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(~rating, scales = "free_x",ncol=5) +
labs(x = "Addition to sentiment",
y = "Word",
title="Distribution positive and negative words per review")
One of the most funny methods to apply to text analysis is the wordcloud, which will give you an idea of the words included in the reviews, how many times they are detected, and if they are positive or negative. Although the wordcloud does not connect words, the human brain will have no other choice but to scan across and look for patterns. And most of the time, some of them can be identified.
tidy_text %>%
anti_join(stop_words) %>%
count(word) %>%
with(wordcloud(word, n, max.words = 200))
tidy_text %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("gray20", "gray80"),
max.words = 500)
The number of times a word has been detected only really makes sense if you know how many words were used across all reviews. So, we want to look for the ratio of a particular word across all words used in that particular review, or even across all reviews. Just to get some context back.
review_words <- tidy_text %>%
count(rating, word, sort = TRUE)
total_words <- review_words %>%
group_by(rating) %>%
summarize(total = sum(n))
review_words <- left_join(review_words, total_words)
review_words%>%
rename(review=rating)%>%
mutate(rownumber = row_number())%>%
filter(rownumber<201)%>%
ggplot(., aes(y=word, x=n/total)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_wrap(~review, ncol = 5, scales = "free")
Up until now we have looked at ratings and single words, but it is now time to start connecting words looking for combinations which are most prevalent. I mentioned before that words need context and the meaning of word can change dramatically in the presence of the word before of after the word of interest. That is why a sentence has more meaning then the separate words that make up that sentence in the first place.
Below you can see me looking for the most prevalent two-word combinations. It does not take you look to realize that the majority of words are preceded by a word with actually very little meaning such as ‘a’, ‘on’, ‘the’, ‘it’ etc. So one can best remove theses words and start connecting to the next word in line that actually has some deeper meaning. If we do that, we will start to see some very interesting combinations. Which we can then of course also start connecting to the star ratings again, since there are still of interest. And if not because of the rating itself, it is very much interesting because of how it relates to the words within a rated review.
> bol_bigrams <- combined %>%
+ select(jaar, rating, translated)%>%
+ unnest_tokens(bigram, translated, token = "ngrams", n = 2) %>%
+ filter(!is.na(bigram))
> bol_bigrams %>%
+ count(bigram, sort = TRUE)
# A tibble: 24,796 x 2
bigram n
<chr> <int>
1 easy to 545
2 good quality 495
3 on the 379
4 nice and 329
5 the bike 306
6 it is 279
7 the bag 271
8 to use 266
9 to attach 262
10 a lot 210
# ... with 24,786 more rows
> bol_bigrams <- combined %>%
+ select(jaar, rating, translated)%>%
+ unnest_tokens(bigram, translated, token = "ngrams", n = 2) %>%
+ filter(!is.na(bigram))
> bol_bigrams %>%
+ count(bigram, sort = TRUE)
# A tibble: 24,796 x 2
bigram n
<chr> <int>
1 easy to 545
2 good quality 495
3 on the 379
4 nice and 329
5 the bike 306
6 it is 279
7 the bag 271
8 to use 266
9 to attach 262
10 a lot 210
# ... with 24,786 more rows
> bigrams_separated <- bol_bigrams %>%
+ tidyr::separate(bigram, c("word1", "word2"), sep = " ")
> bigrams_filtered <- bigrams_separated %>%
+ filter(!word1 %in% stop_words$word) %>%
+ filter(!word2 %in% stop_words$word)
> bigram_counts <- bigrams_filtered %>%
+ count(word1, word2, sort = TRUE)
> bigram_counts
# A tibble: 4,965 x 3
word1 word2 n
<chr> <chr> <int>
1 bike basket 144
2 nice design 124
3 bike bag 116
4 nice sturdy 110
5 meets expectations 105
6 bicycle bag 88
7 nice bag 73
8 electric bike 67
9 quality convenient 66
10 fits perfectly 63
# ... with 4,955 more rows
> bigrams_united <- bigrams_filtered %>%
+ tidyr::unite(bigram, word1, word2, sep = " ")
> bigrams_united
# A tibble: 9,996 x 3
jaar rating bigram
<dbl> <dbl> <chr>
1 2018 4 convenient locking
2 2018 4 locking bag
3 2016 5 bike bell
4 2016 5 bell meets
5 2016 5 meets expectations
6 2016 5 expectations fast
7 2016 5 fast delivery
8 2016 5 delivery nice
9 2016 5 nice classic
10 2016 5 classic design
# ... with 9,986 more rows
Lets take it up a notch. Lets find three-word combinations, and fixate the second word to ‘bag’. We can then see which two-word combinations are best connected to the word ‘bag’ and connect those combinations to the ratings.
combined %>%
select(jaar, rating, translated)%>%
unnest_tokens(trigram, translated, token = "ngrams", n = 3) %>%
filter(!is.na(trigram)) %>%
tidyr::separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
count(word1, word2, word3, sort = TRUE)
bigrams_filtered %>%
filter(word2 == "bag") %>%
count(jaar, rating, word1, sort = TRUE)
bigram_tf_idf <- bigrams_united %>%
count(rating, bigram) %>%
bind_tf_idf(bigram, rating, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf%>%
group_by(rating)%>%
arrange(desc(tf_idf), .by_group = TRUE)%>%
top_n(., 4)%>%
mutate(rating = factor(rating))%>%
ggplot(., aes(x=tf_idf, y=bigram, fill=rating))+
geom_bar(stat="identity")+
facet_wrap(~rating, ncol=5)
One of the best words to look for is the word ‘not’, because the word not is a 180 degree context changer. Let me show you what happens if I look for combinations with the word ‘not’.
bigrams_separated %>%
filter(word1 == "not") %>%
count(word1, word2, sort = TRUE)
AFINN <- get_sentiments("afinn");AFINN
not_words <- bigrams_separated %>%
group_by(rating)%>%
filter(word1 == "not") %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word2, value, sort = TRUE)
my_colors <- c("darkgreen", "darkred")
not_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(100) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = my_colors) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"not\"")+
facet_wrap(~rating, ncol=5)
Lets do that — lets look for both ‘not’ and ‘no’.
negation_words <- c("not", "no")
negated_words <- bigrams_separated %>%
filter(word1 %in% negation_words) %>%
inner_join(AFINN, by = c(word2 = "word")) %>%
count(word1, word2, value, sort = TRUE)
negated_words %>%
mutate(contribution = n * value) %>%
arrange(desc(abs(contribution))) %>%
head(200) %>%
mutate(word2 = reorder(word2, contribution)) %>%
ggplot(aes(n * value, word2, fill = n * value > 0)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = my_colors) +
labs(x = "Sentiment value * number of occurrences",
y = "Words preceded by \"negation term\"")+
facet_wrap(~word1, ncol=2, scales="free")
The last thing I will do is try and connect the words via correlation analysis, providing me with a network of words. From that network I should be able to get a glimpse of the value of the reviews as well hints at future endeavors. A single review in itself is really worthwhile and especially if it has detailed information about a product. Then again, the combinations across a review offer additional information and context on a product or product family. So its like slicing an onion and looking for patterns that are interesting across layers and within a layer.
Lets move towards the correlation analysis highlighting the combinations of words that are most frequently observed.
bigram_graph<-combined %>%
select(jaar, rating, translated)%>%
unnest_tokens(bigram, translated, token = "ngrams", n = 2) %>%
filter(!is.na(bigram))%>%
tidyr::separate(bigram, c("word1", "word2"), sep = " ")%>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)%>%
count(word1, word2, sort = TRUE)%>%
filter(n > 20) %>%
graph_from_data_frame()
ggraph(bigram_graph, layout = "fr") +
geom_edge_link() +
geom_node_point() +
geom_node_text(aes(label = name), vjust = 1, hjust = 1)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Like I did before with the words ‘not’ and ‘no’ (but using placement and not correlation), I can now look at words that are correlated most highly with certain other words. For this exercise, I will use the words ‘bag’, ‘bad’, ‘nice’ and ‘basket’.
bol_section_words <- combined %>%
select(jaar, rating, translated)%>%
group_by(rating)%>%
mutate(section = row_number() %/% 10) %>%
filter(section > 0) %>%
group_by(rating, section)%>%
unnest_tokens(word, translated) %>%
filter(!word %in% stop_words$word)
word_pairs <- bol_section_words %>%
ungroup()%>%
widyr::pairwise_count(word, section, sort = TRUE)
word_pairs
word_cors <- bol_section_words %>%
group_by(word) %>%
filter(n() >= 20) %>%
widyr::pairwise_cor(word, section, sort = TRUE)
word_cors
word_cors %>%
filter(item1 == "not")
word_cors %>%
filter(item1 %in% c("bag", "bad", "nice", "basket")) %>%
group_by(item1) %>%
slice_max(correlation, n = 50) %>%
ungroup() %>%
mutate(item2 = reorder(item2, correlation)) %>%
ggplot(aes(item2, correlation)) +
geom_bar(stat = "identity") +
facet_wrap(~ item1, scales = "free") +
coord_flip()
To filter through, lets look at all the correlations exceeding the theshold of 0.4 which is an arbitrary level I will readily admit.
word_cors %>%
filter(correlation > .4) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
Last, but not least, I want to focus on the temporal development of certain words. These words have a semantic meaning for sure, but they can also be words on which a company may or must focus. Because these are the words that matter. So, what I did here, was focus on the words ‘price’, ‘nice’ and ‘bad’. And see how many times we encounter them across reviews and time.
combined %>%
select(jaar, rating, translated)%>%
mutate(translated = gsub(x = translated,
pattern = "[0-9]+|[[:punct:]]|\\(.*\\)",
replacement = "")) %>%
mutate(linenumber = row_number())%>%
group_by(jaar)%>%
unnest_tokens(word, translated)%>%
anti_join(stop_words)%>%
count(word)%>%
group_by(jaar)%>%
bind_tf_idf(term=word, document=jaar, n) %>%
arrange(desc(tf_idf))%>%
mutate(year_total = sum(n),
jaar=zoo::as.Date(zoo::as.yearmon(jaar)))%>%
filter(word %in% c("nice", "bad", "price")) %>%
ggplot(aes(jaar, n / year_total)) +
geom_point(color="black") +
geom_smooth() +
facet_wrap(~ word, scales = "free_y") +
scale_y_continuous(labels = scales::percent_format()) +
labs(y = "% frequency of word in review")+
scale_x_date(date_breaks = "1 year")
Alright, we are the end of an introductory post on semantic analysis of text files coming from product reviews. There is so much more we can do, but I hope it is already clear that adding a semantic pipeline to your customer insights or marketing department is a must.
Please let me know if anything is amiss, wrong or if you have question!