Natural Language Processing with R and Comet

Daniel Tope Omole
Published in Heartbeat
9 min read · Jul 18, 2023


Source: Author

Natural Language Processing (NLP) is a field of study focused on allowing computers to understand and process human language. There are many different NLP techniques and tools available, including the R programming language. In this article, we’ll show you how to use the R SDK for Comet to build a simple NLP project.

When performing ML and AI tasks, multiple models are typically developed as training proceeds, making them challenging to keep track of. ML and AI development also benefits greatly from team collaboration. These are the main issues most ML and AI teams encounter, and Comet is built to help with model tracking, reproducibility, and team collaboration.

Comet
Comet is a cutting-edge machine learning platform that offers a comprehensive set of tools and features for streamlining and optimizing the machine learning workflow. It’s a cloud-based platform that provides data visualization, collaboration tools, and advanced tracking and reporting (Comet-ML, 2023).

The platform’s goal is to make machine learning more accessible and efficient for researchers, data scientists, and machine learning practitioners. Comet provides a powerful and versatile platform for managing machine learning projects thanks to its user-friendly interface and intuitive design. Comet simplifies the machine learning process, allowing users to focus on what matters most: building and deploying powerful machine learning models (Comet-ML, 2023).

NLP with RandomForest
Random Forest is a widely used machine learning technique that employs an ensemble of decision trees to make predictions. This method involves creating multiple decision trees from a random selection of features and training each tree on a random sample of the data. The results of each tree are then combined to produce a final prediction, which is often more accurate than the output of a single decision tree. The random selection of features and data points helps to prevent overfitting and increase the overall accuracy of the model.

Random Forest can be used in NLP (Natural Language Processing) as a method for text classification, sentiment analysis, and entity recognition. In NLP, each document is represented as a set of features, such as word frequency, term frequency-inverse document frequency (TF-IDF), or word embeddings, and the task is to predict a label based on the input features.

The random forest algorithm can be trained on these features to make predictions about the label, such as positive or negative sentiment, or the type of entity mentioned in the document. The random selection of features and data points in each tree helps to reduce overfitting and improve the robustness of the model, making it well-suited for NLP tasks.

# Install the randomForest package
install.packages("randomForest")
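To make this concrete, here is a tiny self-contained sketch (using made-up toy features, not the restaurant dataset used below) of training a random forest on bag-of-words-style counts:

```r
# Toy sketch: Random Forest on a tiny, made-up bag-of-words feature matrix
library(randomForest)

set.seed(42)
features <- data.frame(
  good = c(2, 0, 1, 0, 3, 0),  # count of "good" in each toy document
  bad  = c(0, 2, 0, 1, 0, 2)   # count of "bad" in each toy document
)
labels <- factor(c(1, 0, 1, 0, 1, 0))  # 1 = positive sentiment, 0 = negative

clf <- randomForest(x = features, y = labels, ntree = 50)

# Classify a new toy document containing "good" twice and "bad" never
predict(clf, newdata = data.frame(good = 2, bad = 0))
```

The restaurant-review project below follows exactly this pattern, just with real features built from the review text.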

Now let’s get into our project to see how Comet can help with NLP model tracking, reproducibility, and team collaboration.

The dataset we will be using has two columns: Review and Liked.

The Review column contains customers’ evaluations of the food at a restaurant, and the Liked column indicates whether or not they enjoyed the food.

The dataset is from Kaggle, contributed by Arsh Anwar, and you can access it here.

Installing Comet on R
To install Comet in your RStudio, you should first create an account if you don’t have one. You can create an account here.

To set up Comet in your RStudio:

  • Install the “cometr” package:
install.packages("cometr")
  • After obtaining your Comet API key and completing the one-time setup, you are ready to dive into Machine Learning in R.

Create a project folder on your system and set it as the local working directory in your RStudio.

You must have defined your .comet.yml file, which should look something like this:

COMET_WORKSPACE: my-comet-user-name
COMET_PROJECT_NAME: ******
COMET_API_KEY: ******************
  • Read the official documentation for Comet here for the full installation guide.
  • Place the .comet.yml config file in your local working directory.

Let’s start with the NLP project:

  • Load your data: For this NLP project, we’ll need to load our text data into R. We can do this using the read.table function in R (with sep = "\t" for a TSV file). We need to make sure that our text data is in a single column with a header that represents the text.
  • Pre-process your data: Before we can start training our NLP model, we’ll need to pre-process our data. This typically involves cleaning and transforming our text data into a format that can be used by machine learning algorithms.
  • We will be using the “tm” package for preprocessing. The “tm” package provides a comprehensive framework for text mining and text analysis in R. It includes text filtering, stemming, and tokenization functions, among others.
  • We will also be using the “SnowballC” package. It is an R interface to the C ‘libstemmer’ library, which implements Porter’s word-stemming algorithm for collapsing words to a common root to aid vocabulary comparison. Read more about the package here.
#To install them, simply type into the R console:
install.packages("tm")
install.packages("SnowballC")
  • Split your data into train and test sets: Once our data has been pre-processed, we’ll need to split it into train and test sets. This is important to ensure that our model is trained and evaluated on different data.
# Load the comet package
library(cometr)

#Load the tm package
library(tm)

#Load the SnowballC package
library(SnowballC)


# Load the data
# Ensure that the dataset is saved in your local working directory
data <- read.table("Restaurant_Reviews.tsv", header = TRUE, sep = "\t")

# Pre-process the data
corpus = VCorpus(VectorSource(data$Review))
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, stopwords("english"))
corpus = tm_map(corpus, stemDocument)
corpus = tm_map(corpus, stripWhitespace)



# Creating the Bag of Words model
dtm = DocumentTermMatrix(corpus)
dtm = removeSparseTerms(dtm, 0.999)
dataset = as.data.frame(as.matrix(dtm))
dataset$Liked = data$Liked


# Encoding the target feature as factor
dataset$Liked = factor(dataset$Liked, levels = c(0, 1))
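If you want to see what the Bag of Words transformation actually produces, here is a small self-contained sketch on three made-up mini-reviews (swap in weighting = weightTfIdf in the control list to get TF-IDF weights instead of raw counts):

```r
library(tm)

# Three made-up mini-reviews, just to inspect the document-term matrix
mini <- VCorpus(VectorSource(c("good food", "bad food", "good service")))
mini_dtm <- DocumentTermMatrix(mini)

as.matrix(mini_dtm)  # rows = documents, columns = terms, cells = word counts
```

The full dataset above is this same structure, just much wider, with the Liked label appended as the final column.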

Create a project and experiment in Comet

  • You will need your user API key and a project that you have created in your workspace.
  • Make sure you have set up the config file in your working directory.
  • You can also create a new project from an R script.
#Create a new project and experiment in Comet
# Note that this method will work only if you have properly set up your credentials
# in the .comet.yml file in your local working directory
cometr::create_project(project_name = "nlp", project_description = "nlp")

experiment = create_experiment(
  project_name = "nlp",
  keep_active = TRUE,
  log_output = TRUE,
  log_error = FALSE,
  log_code = TRUE,
  log_system_details = TRUE
)

The code creates a new project in the Comet platform. The first line creates a new project with the name “nlp” and a description “nlp”. The next block creates a new experiment within the “nlp” project. The create_experiment function has several parameters:

  • project_name: specifies the name of the project to which the experiment belongs.
  • keep_active: determines whether to keep the experiment active or not.
  • log_output: determines whether to log the output.
  • log_error: determines whether to log error messages.
  • log_code: determines whether to log the code.
  • log_system_details: determines whether to log the system details.

Splitting the dataset

# Load the caTools package
library(caTools)

# Split the data into train and test sets
set.seed(123)

#log the hyperparameter to your comet-ml
experiment$log_parameter("seed", 123)

#splitting the dataset into training set with 80% and test set with 20%
split = sample.split(dataset$Liked, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
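A quick way to convince yourself that sample.split preserves both the split ratio and the class balance is a toy check (hypothetical balanced labels, not the review data):

```r
library(caTools)

# 100 toy labels, 50 per class
y <- rep(c(0, 1), each = 50)

set.seed(123)
split <- sample.split(y, SplitRatio = 0.8)

table(split)     # about 80 TRUE (train) and 20 FALSE (test)
table(y, split)  # the 80/20 ratio holds within each class
```

This stratification matters here because the Liked labels should appear in similar proportions in the training and test sets.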

Training the model with the randomForest model

# Fitting Random Forest Classification to the Training set
library(randomForest)
classifier = randomForest(x = training_set[-692],
                          y = training_set$Liked,
                          ntree = 10)

# Predicting the Test set results
y_pred = predict(classifier, newdata = test_set[-692])

Evaluating the Model Performance and Logging it to Your Comet

# Making the Confusion Matrix
cm = table(test_set[, 692], y_pred)

# Evaluation metrics
# With cm = table(actual, predicted): cm[1] = TN, cm[2] = FN, cm[3] = FP, cm[4] = TP
accuracy <- sum(cm[1], cm[4]) / sum(cm[1:4])
precision <- cm[4] / sum(cm[4], cm[3])
sensitivity <- cm[4] / sum(cm[4], cm[2])
fscore <- (2 * (sensitivity * precision)) / (sensitivity + precision)
specificity <- cm[1] / sum(cm[1], cm[3])



# Logging them to your comet-ml
experiment$log_metric("Accuracy",accuracy)
experiment$log_metric("Precision",precision)
experiment$log_metric("Sensitivity",sensitivity)
experiment$log_metric("F1-score",fscore)
experiment$log_metric("Specificity",specificity)


# Exporting the model rds to your comet-ml as a file
saveRDS(classifier, file = "model.rds")
experiment$upload_asset("model.rds", step=1)


# ending the experiment
experiment$stop()
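A collaborator who downloads model.rds from the Comet UI (or keeps the local copy) can restore it with readRDS. Here is a minimal self-contained round-trip sketch, using a stand-in model rather than the classifier above:

```r
# saveRDS/readRDS round trip: how an exported model is reloaded later
toy_model <- lm(dist ~ speed, data = cars)  # stand-in for the classifier
saveRDS(toy_model, file = "toy_model.rds")

restored <- readRDS("toy_model.rds")
identical(coef(toy_model), coef(restored))  # the model survives the round trip
```

This is what makes the uploaded asset useful for reproducibility: anyone on the team can rebuild predictions from the exact trained model.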

The Comet code is from the Comet R SDK.

The code creates a confusion matrix (cm) from the actual labels in the test set (test_set[, 692]) and the predicted labels (y_pred). The confusion matrix is a 2x2 table that compares the true class labels with the predicted class labels.

Source: Sarang Narkhede

The code calculates the accuracy, precision, sensitivity (also known as recall), F-score (harmonic mean of precision and recall), and specificity of the predictions.

  1. Accuracy: This metric measures the overall correct predictions out of all predictions made. It is calculated as the sum of true negatives (cm[1]) and true positives (cm[4]) divided by the total of all four cells (cm[1:4]).
  2. Precision: Precision measures the proportion of positive predictions that are actually correct. It is calculated as the true positives (cm[4]) divided by the sum of true positives and false positives (cm[4] + cm[3]).
  3. Sensitivity (also known as recall): Sensitivity measures the proportion of actual positive cases that are correctly identified. It is calculated as the true positives (cm[4]) divided by the sum of true positives and false negatives (cm[4] + cm[2]).
  4. F-Score: The F-score is the harmonic mean of precision and sensitivity. It is calculated as (2 * (sensitivity * precision)) divided by (sensitivity + precision).
  5. Specificity: Specificity measures the proportion of actual negative cases that are correctly identified. It is calculated as the true negatives (cm[1]) divided by the sum of true negatives and false positives (cm[1] + cm[3]).
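As a numeric sanity check, here is a self-contained worked example with a hypothetical confusion matrix (80 true negatives, 20 false positives, 15 false negatives, 85 true positives), using explicit [row, column] indexing with rows as actual and columns as predicted:

```r
# Hypothetical confusion matrix: rows = actual (0/1), columns = predicted (0/1)
cm <- matrix(c(80, 15, 20, 85), nrow = 2,
             dimnames = list(actual = c(0, 1), predicted = c(0, 1)))

accuracy    <- (cm[1, 1] + cm[2, 2]) / sum(cm)   # (80 + 85) / 200 = 0.825
precision   <- cm[2, 2] / (cm[2, 2] + cm[1, 2])  # 85 / (85 + 20) ~ 0.810
sensitivity <- cm[2, 2] / (cm[2, 2] + cm[2, 1])  # 85 / (85 + 15) = 0.850
specificity <- cm[1, 1] / (cm[1, 1] + cm[1, 2])  # 80 / (80 + 20) = 0.800
fscore      <- 2 * precision * sensitivity / (precision + sensitivity)  # ~ 0.829
```

Note that R fills and indexes matrices column-major, so linear indices like cm[4] (as in the code above) refer to the same cells: cm[4] is cm[2, 2], the true positives.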

Comet Platform
You can access your Comet account to view the metrics and parameters that were logged on the platform. You can also add your team members as collaborators so they can access the logged values.

The images below show an overview of the platform and the values that were logged with the code above.

Source: author (Comet-ml account)

This article has highlighted the importance of NLP in machine learning and how Random Forest applied to NLP can provide improved results. It also showcases the benefits of using Comet for ML and NLP models, providing a platform to track and manage experiments in a systematic and organized manner. The use of Random Forest for NLP along with Comet demonstrates the potential for improved model performance and efficient experimentation.

Here is a list of articles that I found helpful when writing this article:

  1. Sruthi E R. “Understanding Random Forest”, Analytics Vidhya
  2. Comet-ML. “Comet R SDK”, Comet-ML
  3. Salma Ghoneim. “Accuracy, Recall, Precision, F-Score & Specificity, which to optimize on?”, Towards Data Science

Thanks for taking the time to read my blog ❤️. You can reach out to me on LinkedIn.

If you have any thoughts on the topic, please share them in the comments; I’m always looking to learn more and improve.

If this post was helpful, please click the clap 👏 button below a few times to show your support for the author 👇

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.


A data scientist with a background in healthcare. My expertise is in data analysis and machine learning, using tools like Python, R, STATA, and SQL to deliver insights.