LDA Vs Watson NLP Topic Modeling
The volume of unstructured content keeps growing. Organizing this data into a structured form by hand is time-consuming and labor-intensive. With topic modeling, a machine can sift through large collections of unstructured content and group similar documents together, which saves businesses time and money. Naturally, everyone wants a topic modeling tool that works directly on their unstructured data and gives good results.
In this blog, we walk you through two approaches: Latent Dirichlet Allocation (LDA), a popular open-source, conventional topic modeling algorithm, and Watson NLP Topic Modeling. We give a short introduction to both models and compare their advantages and disadvantages in a practical use case.
1. Latent Dirichlet Allocation (LDA) Topic Modeling
LDA is a well-known unsupervised method for text analysis. It is a type of topic modeling in which each document is modeled as a mixture of topics, and each topic is represented as a distribution over words.
LDA uses parametrized probability distributions for each document. The algorithm proceeds through the steps listed below.
1. Randomly assign each word to one of the K topics, where K is the number of topics. K is chosen in advance and stays fixed.
2. For each document d and each word w in it, compute the following probability for every topic t:
p(word w with topic t) = p(topic t | document d) * p(word w | topic t)
3. Reassign word w to a new topic t' with probability p(t'|d) * p(w|t').
The last step is repeated many times until the topic assignments, and therefore the keywords of each topic, no longer change.
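The procedure above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Gensim implementation used later in this blog: the function name `toy_lda`, the smoothing constants `alpha` and `beta`, and the fixed iteration count are all introduced here for illustration. Because topics are drawn at random, the sketch runs a fixed number of passes instead of checking for convergence.

```python
import random
from collections import defaultdict


def toy_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0, top_n=3):
    """Toy sampler for the three steps above (alpha/beta are assumed smoothing values)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Step 1: randomly assign each word occurrence to one of the K topics.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # Count tables that back p(topic | document) and p(word | topic).
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [defaultdict(int) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the word out of its current topic before re-scoring it.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                # Step 2: score every topic as p(t | d) * p(w | t).
                weights = []
                for t in range(num_topics):
                    p_t_d = (doc_topic[d][t] + alpha) / (len(doc) - 1 + num_topics * alpha)
                    p_w_t = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p_t_d * p_w_t)

                # Step 3: draw the new topic in proportion to those scores.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    # Keywords per topic: the most frequently assigned words.
    top_words = []
    for t in range(num_topics):
        present = [w for w, count in topic_word[t].items() if count > 0]
        present.sort(key=lambda w: -topic_word[t][w])
        top_words.append(present[:top_n])
    return assignments, top_words


docs = [
    ["loan", "bank", "loan", "credit"],
    ["card", "credit", "card", "fraud"],
    ["loan", "mortgage", "bank"],
]
assignments, top_words = toy_lda(docs, num_topics=2, iterations=20)
print(top_words)
```

The smoothing constants keep every probability above zero, so a word can still move to a topic that currently holds none of its occurrences.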
2. Watson NLP Topic Modeling
Watson NLP Topic Modeling consists of two models. After a conversation structure is constructed from the input data, topic modeling starts with summarization (the Summary Model). N-grams are generated once text pre-processing has been applied to the conversations. The topic model then applies a hierarchical clustering algorithm to the conversation vectors output by the summary model.
Summary Model: This model takes its input as syntax data, which is generated by the Watson NLP Syntax Model.
Hierarchical Clustering Model: This model takes the output of the summary model as input, along with training parameters. Through the training parameters, you can also set the number of topics and the clustering level.
To understand LDA and Watson NLP topic modeling, we extracted topics with both algorithms.
3. Collecting the dataset
We collected the data from the Consumer Complaint Database. Once you have downloaded the dataset, you can upload it to your Watson Studio instance from the Assets tab.
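For readers who want to inspect the data locally first, the sketch below reads the complaint texts from a CSV export using only the Python standard library. The helper name `load_narratives` and the file path are hypothetical, and the default column name `Consumer complaint narrative` is an assumption about the export's header; adjust it to match your file.

```python
import csv


def load_narratives(path, text_column="Consumer complaint narrative"):
    """Return the non-empty complaint texts from a CSV export.

    The default column name is an assumption about the file's header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [
            row[text_column].strip()
            for row in reader
            if (row.get(text_column) or "").strip()  # skip rows without a narrative
        ]
```

A call might look like `load_narratives("complaints.csv")`, where `complaints.csv` stands in for wherever you saved the download.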
4. Text Pre-Processing
We used the Watson NLP library for data pre-processing, such as stop word and common pattern removal. The cleaned and processed data is then passed to both LDA and Watson NLP topic modeling.
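To give a feel for what stop word and pattern removal does, here is a plain-Python stand-in. It does not use the Watson NLP API; the stop word list, the regular expressions, and the function name `clean_text` are assumptions chosen for illustration (the mask pattern assumes redacted values appear as runs of "X", such as "XXXX" or "XX/XX/XXXX").

```python
import re

# Assumed, deliberately small stop word list; a real pipeline would use a fuller one.
STOP_WORDS = {
    "a", "an", "and", "the", "to", "of", "in", "on", "for",
    "is", "was", "i", "my", "it", "that", "this",
}

# Assumed common patterns: masked values like "XXXX" or "XX/XX/XXXX", and digits.
MASK_PATTERN = re.compile(r"\b[Xx]{2,}(?:/[Xx]{2,})*\b|\d+")
TOKEN_PATTERN = re.compile(r"[a-z]+")


def clean_text(text):
    """Remove masked values and digits, lowercase, tokenize, and drop stop words."""
    text = MASK_PATTERN.sub(" ", text)
    tokens = TOKEN_PATTERN.findall(text.lower())
    # Also drop very short tokens, which rarely carry topical meaning.
    return [tok for tok in tokens if tok not in STOP_WORDS and len(tok) > 2]


print(clean_text("On XX/XX/XXXX I paid my loan of $500 to the bank XXXX."))
```

The function returns a list of tokens per document, which is the input shape that bag-of-words topic models such as LDA typically expect.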
5. LDA Topic Modeling
We used the Gensim LDA model and Dictionary to create topics from the cleaned text. From the trained model we can extract every topic, identified by a number, together with its keywords and their probabilities.
The number of topics in the LDA model is fixed in advance. The topics output by the LDA model are shown below.
As the output shows, each topic is given as a number with its keywords and their probabilities. LDA does not assign topic names, so a human has to name each topic based on its keywords. If you want to learn more about this, you can view this notebook.
6. Watson NLP Topic Modeling
We passed the same cleaned and processed data to Watson NLP to create a topic model. The data is first sent to the summary model along with its training parameters. The output of the summary model is then passed to the hierarchical topic model as input, again with training parameters. To learn more and follow the step-by-step process, you can download the code from GitHub. Watson NLP topic modeling produces output that includes the topic name, keywords, and phrases.
If you want to learn more about topic modeling with the Watson NLP library, please read my previous blog. We can visualize the most important topics using a frequency plot, as shown below.
In the Watson NLP topic modeling output, we see topic names rather than numbers, so no human intervention is required to name the topics. We can also drill down into each topic based on its keywords and phrases.
7. Comparison
Based on the results from LDA and Watson NLP topic modeling, Watson NLP gave better and more accurate results than LDA in terms of keywords, phrases, and topic names. The table below lists further key points:
Conclusion
We have seen how easily topics can be identified in consumer financial datasets using Watson NLP. We also compared both algorithms and summarized their key points.
You can start your AI journey by browsing and building AI models through a guided wizard here. The IBM Build Lab team is here to work with you on your AI journey. For more information, see the Embeddable AI webpage.
You can also browse the collection of Embeddable AI self-serve assets at Tech Zone and on GitHub.