LDA Vs Watson NLP Topic Modeling

Abhilasha Mangal
IBM Data Science in Practice
5 min read · Nov 11, 2022

The volume of unstructured content keeps growing. Organizing this data into a structured form by hand is time-consuming and labor-intensive. With topic modeling, a machine can sift through large collections of unstructured content and group similar documents together, which saves businesses time and money. Naturally, teams want a topic modeling tool that they can apply directly to their unstructured data and that produces good results.

Topic Modeling

In this blog, we walk you through two approaches: the popular open-source Latent Dirichlet Allocation (LDA), a conventional topic modeling algorithm, and Watson NLP Topic Modeling. We introduce both models and compare their advantages and disadvantages in a practical use case.

1. Latent Dirichlet Allocation (LDA) Topic Modeling

LDA is a well-known unsupervised method for text analysis. It is a form of topic modeling in which each document is modeled as a mixture of topics, and each topic is represented as a probability distribution over words.

LDA describes each document with a parametrized probability distribution over topics. The algorithm proceeds through the steps listed below.

  1. Randomly assign each word to one of K topics, where K is the number of topics. K is chosen in advance and stays fixed.
  2. For each document d and each word w in it, compute the following for every topic t:

p(word w with topic t) = p(topic t | document d) * p(word w | topic t)

  3. Reassign word w to a topic t’, chosen with probability proportional to p(t’|d) * p(w|t’).

Steps 2 and 3 are repeated until the topic assignments, and therefore the keywords of each topic, no longer change.
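The three steps above can be sketched in plain Python. This is a toy illustration, not the Gensim implementation used later in this blog: it is essentially collapsed Gibbs sampling, the smoothing constants `alpha` and `beta` are assumed values, the function name `simple_lda` is hypothetical, and it runs for a fixed number of iterations instead of checking for convergence.

```python
import random
from collections import Counter

def simple_lda(docs, num_topics, iterations=50, seed=0):
    """Toy version of the three steps above; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    alpha, beta = 0.1, 0.01  # smoothing constants (assumed values)

    # Step 1: random initial topic for every word occurrence.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(z) for z in assignments]        # n(d, t)
    topic_word = [Counter() for _ in range(num_topics)]  # n(t, w)
    topic_total = [0] * num_topics                       # n(t)
    for doc, z in zip(docs, assignments):
        for w, t in zip(doc, z):
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the current assignment out of the counts.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                # Step 2: weight of each topic = p(t | d) * p(w | t).
                weights = []
                for t in range(num_topics):
                    p_t_given_d = (doc_topic[d][t] + alpha) / (
                        len(doc) - 1 + num_topics * alpha)
                    p_w_given_t = (topic_word[t][w] + beta) / (
                        topic_total[t] + vocab_size * beta)
                    weights.append(p_t_given_d * p_w_given_t)

                # Step 3: draw the new topic in proportion to the weights.
                t_new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] += 1
                topic_total[t_new] += 1

    return assignments, topic_word

# Made-up example documents, already tokenized.
docs = [
    ["loan", "mortgage", "payment"],
    ["credit", "card", "fee"],
    ["mortgage", "loan", "rate"],
]
assignments, topic_word = simple_lda(docs, num_topics=2)
for t, counts in enumerate(topic_word):
    # Unary plus drops words whose count fell back to zero.
    print(t, (+counts).most_common(3))
```

As in the description, the output identifies topics only by number; the keywords with the highest counts are what a reader would use to name each topic.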

2. Watson NLP Topic Modeling

Watson NLP Topic Modeling consists of two models. After a conversation structure is built from the input data, topic modeling starts with summarization (the Summary Model). N-gram-based representations are generated after the text pre-processing step is applied to the conversations. The topic model then applies a hierarchical clustering algorithm to the conversation vectors output by the summary model.

Topic Modeling

Summary Model: This model takes its input in the form of syntax data, which is generated by the Watson NLP Syntax Model.

Hierarchical Clustering Model: This model takes the output of the summary model as input, along with training parameters. Through the training parameters, you can also set the number of topics and the clustering level.

Hierarchical Clustering Model Architecture
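To make the idea of clustering conversation vectors more concrete, here is a minimal, generic sketch of agglomerative (bottom-up) hierarchical clustering over n-gram count vectors. It is not Watson NLP's implementation or API: the vectorization, the cosine similarity, the average-linkage rule, and every function name are assumptions chosen for illustration only.

```python
import math
from collections import Counter

def ngram_vector(tokens, n=2):
    """Count unigrams plus n-grams for one conversation."""
    grams = list(tokens)
    grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def agglomerative(vectors, num_topics):
    """Merge the two most similar clusters until num_topics remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > num_topics:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean similarity over all cross pairs.
                sims = [cosine(vectors[i], vectors[j])
                        for i in clusters[a] for j in clusters[b]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters

# Made-up conversations, already tokenized.
conversations = [
    "late fee on credit card".split(),
    "credit card late fee charged".split(),
    "mortgage payment not applied".split(),
    "mortgage payment applied late".split(),
]
vectors = [ngram_vector(c) for c in conversations]
clusters_out = agglomerative(vectors, num_topics=2)
print(clusters_out)
```

Here `num_topics` plays the role of the "number of topics" training parameter mentioned above. Stopping the merging earlier or later is one simple way to picture different clustering levels.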

To compare LDA and Watson NLP topic modeling, we extracted topics from the same data with both algorithms.

3. Collecting the dataset

We collected the data from the Consumer Complaint Database. Once you have the dataset, you can upload it to your Watson Studio instance from the Assets tab.

Collecting dataset

4. Text Pre-Processing

We used the Watson NLP library for data pre-processing, such as stop-word and common-pattern removal. The cleaned and processed data is then passed to both LDA and Watson NLP topic modeling.
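The pre-processing in this blog relies on the Watson NLP library. As a rough stand-in that uses only Python's standard library, the sketch below shows what stop-word and common-pattern removal can look like. The stop-word list is a tiny illustrative sample, and the masking pattern (runs of the letter X, assumed here to mark redacted details in complaint text) is an assumption, not something taken from the original notebook.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {
    "the", "a", "an", "and", "or", "to", "of", "in", "on", "for",
    "is", "was", "i", "my", "it", "that", "this", "with",
}
# Assumed pattern: runs of "x" used to mask personal details.
MASK_PATTERN = re.compile(r"\bx{2,}\b", re.IGNORECASE)
NON_LETTERS = re.compile(r"[^a-z\s]")

def clean_text(text):
    """Lowercase, drop masked tokens and punctuation, remove stop words."""
    text = MASK_PATTERN.sub(" ", text.lower())
    text = NON_LETTERS.sub(" ", text)
    return [tok for tok in text.split()
            if tok not in STOP_WORDS and len(tok) > 2]

tokens = clean_text("I was charged a late fee on XXXX for my credit card!")
print(tokens)
```

Whichever library does the cleaning, the important point is that the same cleaned tokens go to both topic models, so that the comparison is fair.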

5. LDA Topic Modeling

We used the Gensim LDA model and Dictionary to create topics from the cleaned text. From the trained model, we can extract each topic's keywords along with their probabilities.

LDA Model

We can extract all the topics along with their keywords and probabilities. The number of topics in the LDA model is fixed. The topics output by the LDA model are shown below.

LDA Model output

As the output shows, each topic is identified by a number and comes with keywords and their probabilities. LDA does not assign topic names, so a human has to name each topic based on its keywords. To learn more, you can view this notebook.

6. Watson NLP Topic Modeling

We passed the same cleaned and processed data to Watson NLP to create a topic model. The data is first sent to the summary model along with its training parameters. The output of the summary model is then passed, together with training parameters, to the hierarchical topic model as input. For the step-by-step process, you can download the code from GitHub. Watson NLP topic modeling returns output that includes the topic name, keywords, phrases, and more.

Topic Model output

If you want to learn more about Topic Modeling using the Watson NLP library, please read my previous blog. We can visualize the most important topics using the frequency plot, as shown below.

Top 15 Topics
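A frequency view like this can be built by counting how many documents fall under each topic. The sketch below uses only the standard library and prints a text bar chart; the topic names and the helper `top_topics` are hypothetical and are not output from the Watson NLP model.

```python
from collections import Counter

def top_topics(topic_per_document, n=15):
    """Return the n most frequent topics with their document counts."""
    return Counter(topic_per_document).most_common(n)

# Made-up topic names, one per document.
assigned = [
    "credit card", "mortgage", "credit card",
    "debt collection", "mortgage", "credit card",
]
ranking = top_topics(assigned, n=3)
for name, count in ranking:
    print(f"{name:<16} {'#' * count} {count}")
```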

In the Watson NLP output, topics have names rather than numbers, so no human intervention is needed to name them. We can also drill down into each topic based on its keywords and phrases.

7. Comparison

Based on the results from LDA and Watson NLP topic modeling, Watson NLP provides better and more accurate keywords, phrases, and topic names than LDA. The table below lists further key points:

Comparison Table

Conclusion

We have seen how easily topics can be identified in consumer financial datasets using Watson NLP. We also compared the two algorithms and summarized their key points.

You can start your AI journey by browsing and building AI models through a guided wizard here. The IBM Build Lab team is here to work with you on your AI journey. For more information, see the Embeddable AI webpage.

You can also browse the collection of Embeddable AI self-serve assets at Tech Zone and on GitHub.
