Using Hugging Face Transformers for Sentiment Analysis in R

Daniel Tope Omole
Published in Heartbeat
13 min read · Jul 19, 2023


source: author

Introduction
Sentiment analysis is a rapidly growing field within the Natural Language Processing (NLP) domain, which deals with the automatic analysis and classification of emotions and opinions expressed in text. It has applications in a wide range of areas, including customer feedback analysis, brand reputation management, and political sentiment analysis.
One of the recent developments in the field of sentiment analysis is the use of Hugging Face Transformers, a library that provides a high-level API to work with the most recent and state-of-the-art NLP models. The library has been widely adopted in the NLP community and has shown remarkable performance on various NLP tasks, including sentiment analysis.

This article provides a comprehensive guide to implementing sentiment analysis with Hugging Face Transformers in the R programming language. It covers background on sentiment analysis, an overview of R and the required packages, and a step-by-step walkthrough of the implementation, including how to preprocess the data, load the pre-trained model, and evaluate the results of the sentiment analysis.

Prerequisites
To perform sentiment analysis using Hugging Face Transformers in R, the following packages must be installed:

  • tidyverse: This is a popular collection of packages for data manipulation, cleaning, and analysis in R.
  • transformers: This is the Hugging Face Transformers library, which provides access to a wide range of pre-trained NLP models. It is a Python library, but it can be used from R through reticulate.
  • tokenizers: This is an R package for tokenization, the process of splitting text into smaller units called tokens.
  • text: This package uses Hugging Face Transformers to convert text variables into word embeddings. Word embeddings can then be used to predict numerical variables, compute semantic similarity scores across texts, visualize statistically significant words across multiple dimensions, and much more.
  • dplyr: This is a package for data manipulation and cleaning in R.
  • reticulate: This is a package that lets you call Python from R; we use it to create a Python environment for the transformers library.

To install these packages, you can run the following code in the R console:
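A minimal version of that installation (all of these are CRAN packages; the Python transformers library itself is installed later through reticulate):

```r
# Install the required R packages from CRAN
install.packages(c("tidyverse", "dplyr", "tokenizers", "text", "reticulate"))
```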

Sentiment Analysis

Sentiment analysis is the automatic identification and classification of the opinions conveyed in a text, and it frequently relies on natural language processing (NLP) techniques. It involves analysing the emotional tone of the text to determine whether it is positive, negative, or neutral.

Sentiment analysis has several uses, including social media monitoring, market research, customer feedback analysis, and political polling. By analysing the sentiment of a large number of social media posts, for instance, a business can learn how customers perceive its products and services and use this knowledge to improve business decisions.

Overall, sentiment analysis has grown in popularity as a tool for businesses and organisations seeking to better understand their stakeholders and customers. By automating sentiment analysis, they can evaluate large amounts of text quickly and accurately, and use what they learn to improve their goods, services, and overall customer experience.

R, a robust and flexible programming language, is widely used for data analysis and visualisation. It provides a variety of packages and tools that make complex data-analysis tasks, such as sentiment analysis, straightforward.

The first step in performing sentiment analysis in R is to find the appropriate packages.

Hugging Face Transformers

Recently, the field of natural language processing has paid a lot of attention to a type of machine learning architecture called the transformer. Transformers are neural networks designed to process sequences of data, such as sentences or paragraphs, and to predict the next word or sentence. Hugging Face has developed a collection of pre-trained transformer models that can be applied to a range of natural language processing (NLP) tasks, including text classification, sentiment analysis, and question answering.

Transformers process input sequences using a method known as self-attention. Self-attention lets the model focus on different portions of the input sequence at different points, which makes it effective at capturing intricate relationships between words and sentences. Although self-attention is now standard across the field, Hugging Face's implementations use it to achieve remarkable results. You can read more about Hugging Face and other transformers here.

Hugging Face has created a library of pre-trained transformer models for a variety of NLP tasks. Because they have been pre-trained on massive datasets, these models can detect intricate correlations between words and phrases. The library includes models such as BERT, GPT-2, RoBERTa, and T5, each of which has its own strengths and weaknesses and can be tailored to specific NLP tasks. Transformers are not exclusive to NLP: they can also be applied to other data types, such as images and audio, for processing, identification, and classification tasks. The focus of this article, however, is their utility within NLP.

  • BERT (Bidirectional Encoder Representations from Transformers) is one of the most popular Hugging Face transformer models. BERT was trained on a massive corpus of text and can perform a variety of NLP tasks, such as sentiment analysis, question answering, and text classification.
  • GPT-2 (Generative Pre-trained Transformer 2) is another well-known transformer model from Hugging Face. It is a sizable model that can generate human-like text. Trained on a large dataset of web pages, GPT-2 is capable of producing content that is both coherent and entertaining.
  • RoBERTa is a BERT variant with a more robust pre-training procedure. It has been pre-trained on a sizable corpus of text and can be customised for a range of NLP tasks.
  • T5 (Text-to-Text Transfer Transformer) is a transformer model that frames every NLP problem as text-to-text transfer. It can carry out a range of tasks, such as text summarization, question answering, and language translation.

Another important feature of Hugging Face Transformers is that the models can be customised for particular NLP tasks. Fine-tuning is not exclusive to Hugging Face, or to transformers in general, but Hugging Face Transformers provide a user-friendly environment and a diverse selection of pre-trained models that can be customized effectively.

By leveraging the customizability feature, developers can optimize the performance of Hugging Face Transformers to meet the unique requirements of their particular use cases. Fine-tuning empowers users to incorporate domain-specific knowledge and enhances the model’s ability to handle specific tasks with higher precision and relevance.

While other frameworks also offer the ability to fine-tune models, the Hugging Face Transformers framework simplifies the process and provides extensive resources through its hub. This makes it a valuable tool for developers seeking to achieve superior performance in natural language processing applications by tailoring the models to their specific needs.

Implementing Hugging Face Transformers for Sentiment Analysis in R

The dataset we will be using has two columns titled Customer Reviews and Liked.

The Customer Reviews column contains customers' evaluations of the food at a restaurant, and the Liked column indicates whether or not they enjoyed it. The dataset is from Kaggle, contributed by Arsh Anwar, and you can access it here.

Setting up the transformer environment in R

The first step is to set up a transformer environment within R using the reticulate package and to install the transformers package into that environment.

The transformers package is a popular Python library for natural language processing (NLP), so this setup lets you use its functionality within R.

The second part of the setup uses the import() function from reticulate to load the transformers and torch Python libraries into R, which makes their functions and classes available in your R code.
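A minimal sketch of that setup; the virtual environment name "r-transformers" is illustrative, not prescribed:

```r
library(reticulate)

# Create a dedicated Python environment and install the required libraries
virtualenv_create("r-transformers")
virtualenv_install("r-transformers", packages = c("transformers", "torch"))
use_virtualenv("r-transformers", required = TRUE)

# Make the Python libraries available in the R session
transformers <- import("transformers")
torch <- import("torch")
```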

Loading the pre-trained model

The next step loads a pre-trained sentiment analysis model called "nlptown/bert-base-multilingual-uncased-sentiment". The model is based on the BERT (Bidirectional Encoder Representations from Transformers) architecture, a type of deep learning model that has achieved state-of-the-art results on various NLP tasks.

The “tokenizer” object is initialized using the “from_pretrained” method of the “BertTokenizer” class from the “transformers” package, which prepares the text data to be inputted into the BERT model.

The “model” object is initialized using the “from_pretrained” method of the “BertForSequenceClassification” class, which is a pre-trained BERT model that has been fine-tuned for sentiment analysis. The model takes a sequence of text as input and outputs a probability distribution over the different sentiment classes.
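Based on the classes named above, the loading step might look like this:

```r
# Load the tokenizer and the fine-tuned sentiment classification model
tokenizer <- transformers$BertTokenizer$from_pretrained(
  "nlptown/bert-base-multilingual-uncased-sentiment"
)
model <- transformers$BertForSequenceClassification$from_pretrained(
  "nlptown/bert-base-multilingual-uncased-sentiment"
)
```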

Loading and preprocessing the dataset

Preprocessing begins by converting all the text in the dataset to lowercase using the tolower() function. This standardizes the text and ensures that capitalization doesn't affect the analysis.

The next step removes all punctuation from the text using the removePunctuation() function (from the tm package). This is another standardization step that strips characters that could otherwise affect the analysis.

Finally, the labels object is created by selecting the 'Liked' column of the dataset data frame. This column contains the binary labels (0 or 1) that indicate whether each review is negative or positive, respectively.
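A sketch of this loading and preprocessing step; the file name and the review column name ("Review") are assumptions, so adjust them to match your copy of the dataset:

```r
library(tm)  # provides removePunctuation()

dataset <- read.csv("Restaurant_Reviews.csv", stringsAsFactors = FALSE)

# Standardize the text: lowercase, then strip punctuation
dataset$Review <- tolower(dataset$Review)
dataset$Review <- removePunctuation(dataset$Review)

# Binary labels: 1 = positive, 0 = negative
labels <- dataset$Liked
```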

Implementing the model

We now use the transformers library (imported through reticulate), which lets us run the pre-trained BERT model to analyze the sentiment of the restaurant reviews.

The tokenizer is used to encode the input data and create input tensors that can be passed to the pre-trained BERT model.

Next, we move on to the model inference step. The inputs created by the tokenizer are passed to the model object, which contains the pre-trained BERT model. Gradient calculation is turned off during inference, which speeds up the computation and reduces memory usage. The logits attribute of the model's output holds the raw scores that represent the sentiment of each review.

Finally, we move on to the prediction step, where the model's output is used to predict the sentiment of each review. The argmax() function finds the index of the maximum value in the output tensor, which corresponds to the predicted sentiment class, and the item() function extracts it as a scalar value. Because this model predicts a rating from one to five stars, adding one to the zero-based class index yields the review's star score.
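A sketch of the inference step under the assumptions above (per-review prediction, gradients disabled globally):

```r
torch$set_grad_enabled(FALSE)  # no gradients needed at inference time

predict_sentiment <- function(text) {
  # Tokenize the review and build a PyTorch input tensor
  inputs <- tokenizer$encode_plus(text, return_tensors = "pt",
                                  truncation = TRUE)
  outputs <- model(inputs$input_ids)
  # argmax over the five classes (indices 0-4); +1 gives a 1-5 star score
  as.integer(outputs$logits$argmax(dim = -1L)$item()) + 1L
}

scores <- vapply(dataset$Review, predict_sentiment, integer(1))
```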

Converting scores to binary scores

After predicting the sentiment scores of the reviews, it is common to convert these scores to binary scores to make the analysis more interpretable. In our code, the sentiment scores are converted to binary scores based on a threshold value of 2.5.

The ifelse() function is used to create a new column in the dataset data frame called binary_scores. For each row in the data frame, the function checks whether the predicted sentiment score is greater than or equal to the threshold value. If it is, the function assigns a binary score of 1 (positive sentiment) to that row. If not, the function assigns a binary score of 0 (negative sentiment) to that row.
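A minimal version of that conversion, assuming the predicted star scores are stored in scores as above:

```r
# 3 stars and above counts as positive (1); below the threshold is negative (0)
dataset$binary_scores <- ifelse(scores >= 2.5, 1, 0)
```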

Showing the sentiment result

The mutate() function creates a new column called sentiment, which is assigned a value of "Positive" if the binary score is greater than 0 and "Negative" if it is not. Next, the group_by() function groups the data by the sentiment column. Finally, the summarise() function counts the number of reviews in each sentiment category.
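Under the same assumptions, the summary might look like this:

```r
library(dplyr)

# Label each review, then count reviews per sentiment category
sentiment_counts <- dataset %>%
  mutate(sentiment = ifelse(binary_scores > 0, "Positive", "Negative")) %>%
  group_by(sentiment) %>%
  summarise(count = n())
```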

The second block of code creates a new column in the dataset data frame called sentimentresult. This column is similar to the sentiment column created in the first block of code, but it uses the ifelse() function to assign "Positive" or "Negative" values directly based on the binary scores.
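And the second block, sketched:

```r
# Assign sentiment labels directly from the binary scores
dataset$sentimentresult <- ifelse(dataset$binary_scores > 0,
                                  "Positive", "Negative")
```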

These lines of code are used to evaluate the performance of the sentiment analysis model that was trained on the restaurant review dataset.

The confusion_matrix variable stores a table that shows the number of correctly and incorrectly classified reviews for each sentiment category (positive and negative). This table can be used to calculate performance metrics such as accuracy, precision, and recall.

The accuracy variable calculates the overall accuracy of the model by dividing the sum of the correctly classified reviews by the total number of reviews in the dataset.

The precision variable calculates the precision of the model for each sentiment category by dividing the number of correctly classified reviews in that category by the total number of reviews that the model classified as belonging to that category.

The recall variable calculates the recall of the model for each sentiment category by dividing the number of correctly classified reviews in that category by the total number of reviews that actually belong to that category.
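A sketch of these computations, assuming the true labels are in labels and the predictions in dataset$binary_scores:

```r
# Rows are predictions, columns are the true Liked labels
confusion_matrix <- table(Predicted = dataset$binary_scores,
                          Actual = labels)

# Overall accuracy: correctly classified reviews over all reviews
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)

# Per-class precision: correct predictions / everything predicted as that class
precision <- diag(confusion_matrix) / rowSums(confusion_matrix)

# Per-class recall: correct predictions / everything truly in that class
recall <- diag(confusion_matrix) / colSums(confusion_matrix)
```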

Creating visuals for the results

We are analyzing the sentiment analysis results by visualizing them using a bar chart and word clouds. The bar chart shows the count of positive and negative reviews, and we use different colours to distinguish between them.

We also split the reviews into positive and negative categories and create word clouds of frequent words in each category. Word clouds are visual representations of words that appear more frequently in the text, and they give us an idea of what people are talking about in their reviews.
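One way to produce such visuals is with ggplot2 and the wordcloud package; this sketch assumes the objects created above:

```r
library(ggplot2)
library(wordcloud)

# Bar chart of positive vs. negative review counts
ggplot(sentiment_counts, aes(x = sentiment, y = count, fill = sentiment)) +
  geom_col() +
  labs(title = "Sentiment of Restaurant Reviews", x = NULL, y = "Reviews")

# Word clouds of frequent words in each sentiment category
positive_text <- dataset$Review[dataset$sentimentresult == "Positive"]
negative_text <- dataset$Review[dataset$sentimentresult == "Negative"]
wordcloud(positive_text, max.words = 100, colors = "darkgreen")
wordcloud(negative_text, max.words = 100, colors = "darkred")
```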

source: author
Positive words, source: author
Negative words, source: author

Using the text package

The text package in R is designed for natural language processing (NLP) and machine learning tasks with state-of-the-art transformer models from Hugging Face. The package has two main objectives: to provide a point solution for transforming text to word embeddings ready to use for downstream tasks, and to serve as an end-to-end solution that provides state-of-the-art AI techniques tailored for social and behavioral scientists.

The text package provides powerful functions tailored to testing research hypotheses in the social and behavioral sciences, for both relatively small and large datasets. It gives access to many language-analysis tasks, such as textClassify(), textGeneration(), and textTranslate(), as well as functions to analyze the word embeddings with well-tested machine learning algorithms and statistics. You can read more about the text package here.

The textClassify() function performs sentiment analysis: it predicts the label and probability of a text using a pre-trained classifier language model. The function takes the following arguments:

  • x: A character vector containing text to classify.
  • model: A character string indicating the name of the pre-trained classifier language model to use. Defaults to "distilbert-base-uncased-finetuned-sst-2-english".
  • device: A character string indicating whether to use CPU or GPU. Defaults to "cpu".
  • tokenizer_parallelism: A logical value indicating whether to use parallel processing for tokenization. Defaults to FALSE.
  • logging_level: A character string indicating the logging level to use. Defaults to "ERROR".

The textClassify() function uses the pre-trained classifier language models to predict the sentiment of the input text. It returns a data frame with two columns: "label" and "probability." The "label" column contains the predicted sentiment label, which is either "positive" or "negative." The "probability" column contains the probability of the predicted sentiment label.
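Based on the defaults documented above, a minimal illustrative call might look like this (it assumes the one-time textrpp setup described below has already been run; the example sentences are made up):

```r
library(text)

# Classify two short example sentences with the default model
result <- textClassify(c("The food was amazing!", "The service was terrible."))
result  # a data frame with "label" and "probability" columns
```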

textrpp_install() installs the required Python packages needed to run text classification with textrpp in a self-contained environment. This ensures that the required dependencies are installed and can be accessed by textrpp without any issues.

textrpp_initialize() is used to initialize the installed textrpp package so it can be called from R. This command sets up the necessary Python environment and links it to the R session.
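The one-time setup, as described:

```r
library(text)

# Install the Python dependencies in a self-contained environment
textrpp_install()

# Initialize them for this R session; save_profile = TRUE remembers the
# setup for future sessions
textrpp_initialize(save_profile = TRUE)
```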

The classification itself is performed with the pre-trained distilbert-base-uncased-finetuned-sst-2-english model, applied to the dataset variable:

```r
classification <- textClassify(
  dataset,
  model = "distilbert-base-uncased-finetuned-sst-2-english",
  set_seed = 1234,
  return_incorrect_results = TRUE,
  function_to_apply = "softmax"
)
```

The set_seed argument sets a random seed for reproducibility. The return_incorrect_results argument returns the incorrect classification results for analysis, and the function_to_apply argument applies a softmax function to the classification results. Printing classification outputs the results of the text classification.

Sentiment analysis has become an essential tool in the field of Natural Language Processing, with applications in areas such as customer feedback analysis, brand reputation management, and political sentiment analysis. With the development of the Hugging Face Transformers library, sentiment analysis has become even more efficient and accurate. This article provided a step-by-step guide to implementing sentiment analysis with Hugging Face Transformers in the R programming language, covering the required packages, how to preprocess data, load pre-trained models, and evaluate results. With this knowledge, readers can perform sentiment analysis tasks using Hugging Face Transformers in R and gain valuable insights from text data.

You can view the full code in the Git repo below:

Thanks for taking the time to read my blog ❤️. You can reach out to me on LinkedIn.

If this post was helpful, please click the clap 👏 button below a few times to show your support for the author. 👇

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.


A data scientist with a background in healthcare. My expertise is in data analysis and machine learning, using tools like Python, R, Stata, and SQL to deliver insights.