From Rulesets to Transformers: A Journey Through the Evolution of SOTA in NLP

Sre Chakra Yeddula
6 min read · Apr 5, 2023

Charting the evolution of SOTA (State-of-the-art) techniques in NLP (Natural Language Processing) over the years, highlighting the key algorithms, influential figures, and groundbreaking papers that have shaped the field.

Evolution of NLP Models

To understand the full impact of this evolutionary process, we must first understand an essential term: SOTA (state-of-the-art).

SOTA (state-of-the-art) in machine learning refers to the best performance achieved by a model or system on a given benchmark dataset or task at a specific point in time. It measures the current cutting-edge performance of a model or system in a particular field or task.

Before we go any further, let’s introduce NLP.

What is Natural Language Processing (NLP)?

Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that deals with interactions between computers and human languages. NLP algorithms help computers understand, interpret, and generate natural language.

The race for the SOTA model for NLP

Many models have held the title of being the best SOTA models in the field of NLP (Natural Language Processing). In this article, we’ll look at the evolution of these state-of-the-art (SOTA) models and algorithms, the ML techniques behind them, the people who envisioned them, and the papers that introduced them.

Evolution of SOTA models in NLP and factors affecting them

Here is the evolutionary map for this article. We will overlay this map with landmarks in human understanding, computing, eras, and use cases.

Three significant factors affected the evolution of these models.

Evolution of available training data:

The real take-off point for NLP model evolution came with the advent of the internet. Suddenly there were vast amounts of real-world text generated by real people, and the large training datasets these models needed were finally available.

Evolution of our understanding and applications:

The evolution of these techniques and models closely tracks our understanding of language and of computing. The earlier SOTA models for NLP mostly fell under traditional machine learning algorithms, such as Support Vector Machine (SVM) based models, which treated NLP as a statistical and mathematical problem. Over the years, we moved toward solving NLP use cases with neural network-based algorithms, loosely inspired by the structure and function of the human brain.

Evolution of computing:

The evolution of these algorithms also closely follows the evolution of computing. Advances in hardware, especially the advent of GPUs and parallelization, significantly changed how these NLP models could be trained and optimized, giving rise to more effective modeling techniques.

The Evolution of SOTA Models for NLP

1. Rule-Based Systems (1950s — 1960s)

The earliest work in NLP was based on rule-based systems: hand-crafted rules designed to process and translate language.
Use Cases: Translation services.
Significant papers/experiment: One of the earliest examples was the Georgetown–IBM experiment in 1954, which involved translating 60 Russian sentences into English using a set of hand-crafted rules.
Citation: Article from IBM archives
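
To make the idea concrete, here is a minimal sketch of what a rule-based approach looks like in Python; the toy dictionary and rules below are invented for illustration and are not the actual Georgetown–IBM rules.

```python
# A toy rule-based "translator": a hand-crafted dictionary plus simple rules.
# The vocabulary and rules are illustrative only, not the Georgetown-IBM rules.

LEXICON = {
    "privet": "hello",
    "mir": "world",
    "ya": "I",
    "lyublyu": "love",
    "nauku": "science",
}

def translate(sentence: str) -> str:
    words = sentence.lower().split()
    # Rule 1: word-for-word dictionary lookup; unknown words pass through unchanged.
    translated = [LEXICON.get(w, w) for w in words]
    # Rule 2: a hand-crafted ordering rule (source order is kept here for simplicity).
    return " ".join(translated)

print(translate("privet mir"))        # -> "hello world"
print(translate("ya lyublyu nauku"))  # -> "I love science"
```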

2. Statistical Models (1970s — 1980s)

In the 1970s and 1980s, statistical models and machine learning algorithms began to gain popularity in NLP. One early model was the Hidden Markov Model (HMM), widely used for speech recognition. Another was the n-gram model, used for language modeling and machine translation.

Use Cases: Speech Recognition and Named Entity Recognition.

Significant papers:
“A statistical approach to machine translation” by Brown et al. (1990)
“Speech recognition using hidden Markov models” by Rabiner and Juang (1986)

Significant people:
Frederick Jelinek
Leonard E. Baum
Biing-Hwang Juang
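
To make the n-gram idea concrete, here is a minimal sketch of a bigram language model estimated from raw counts; the tiny corpus is made up for illustration, and real systems add smoothing and train on far larger corpora.

```python
from collections import Counter, defaultdict

# A tiny invented corpus; real systems use millions of sentences plus smoothing.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev: str, word: str) -> float:
    """P(word | prev), estimated by maximum likelihood from the counts."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("the", "cat"))  # 2 of the 6 words following "the" are "cat" -> ~0.33
```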

3. Digital Data Explosion and Text Mining (1990s — 2000s)

The rise of the internet and the explosion of new digital data created new challenges and opportunities for NLP. We needed more robust models that could handle large amounts of data and index and search it efficiently. Companies such as Google, Yahoo, and later Facebook (now Meta) pioneered research in this field.

Use Cases: Web Search, Information Retrieval, Text Mining

Significant papers:
“Latent Dirichlet Allocation” by Blei et al. (2003)
“Support-vector networks” by Cortes and Vapnik (1995)

Significant people:
David Blei
Corinna Cortes
Vladimir Vapnik
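
As a rough illustration of this era's approach, here is a minimal sketch of text classification with TF-IDF features and a linear SVM using scikit-learn; the example texts and labels are invented placeholders.

```python
# Requires scikit-learn. A toy text-classification pipeline in the spirit of
# the SVM era: TF-IDF bag-of-words features fed into a linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented toy data for illustration only.
texts = [
    "stock markets rallied after the earnings report",
    "the central bank raised interest rates again",
    "the striker scored twice in the final match",
    "the team clinched the championship title",
]
labels = ["business", "business", "sports", "sports"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["rates and markets react to the bank decision"]))  # likely 'business'
```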

4. Deep Learning (Late 2000s — early 2010s)

As we needed to solve more complex, non-linear tasks, our understanding of how to build machine learning models evolved. Neural networks were born from an approach to problem solving loosely modeled on the human brain. With the rise of deep learning (neural networks with many layers), models such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) began to be used in NLP.
RNNs are a type of neural network that can handle sequential data by processing inputs one at a time while also maintaining an internal memory of previous inputs.
CNNs are a type of neural network that are particularly effective for processing images and other two-dimensional data by using convolutional layers to learn features and patterns in the data.
One SOTA example is the Long Short-Term Memory (LSTM) model, which is a type of RNN that is particularly effective for language modeling and machine translation. Another example is the Word2Vec model, which is used for word embedding and has been shown to be effective for tasks such as sentiment analysis and named entity recognition.
Word embedding is a way to represent words as numbers (dense vectors) in a neural network for language tasks. The network learns these numbers during training, and they capture the meaning and context of words in the training data, helping the network understand and process natural language better.
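Here is a minimal PyTorch sketch of these two ideas together, assuming arbitrary placeholder dimensions and word IDs: an embedding layer maps word IDs to dense vectors, and an LSTM consumes the resulting sequence step by step.

```python
# Requires PyTorch. Minimal sketch of an embedding layer feeding an LSTM;
# the vocabulary size, dimensions, and input IDs are arbitrary placeholders.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embedding = nn.Embedding(vocab_size, embed_dim)   # word IDs -> dense vectors
lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# A fake "sentence" of 5 word IDs in a batch of size 1.
word_ids = torch.tensor([[4, 21, 7, 303, 11]])

vectors = embedding(word_ids)         # shape: (1, 5, 64)
outputs, (h_n, c_n) = lstm(vectors)   # outputs: (1, 5, 128); h_n is the last hidden state

print(outputs.shape, h_n.shape)       # torch.Size([1, 5, 128]) torch.Size([1, 1, 128])
```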

Use Cases: Sentiment Analysis, Machine Translation, Named Entity Recognition

Significant papers:
“Learning word embeddings efficiently with noise-contrastive estimation” by Mnih and Kavukcuoglu (2013)
“Sequence to sequence learning with neural networks” by Sutskever et al. (2014)

Significant people:
Geoffrey Hinton
Yoshua Bengio
Ilya Sutskever

5. Transformer Models (Mid-2010s — present)

Traditional RNNs still processed inputs sequentially, which is slow and computationally expensive, and this became a major obstacle for neural networks. Thus was born the Transformer model, which improved upon existing RNNs. Transformers process the input text all at once, rather than one word at a time like older models such as RNNs. They use a self-attention mechanism to learn which words in the input text are most important for understanding the context, allowing them to capture longer-range dependencies between words. They also use an encoder and decoder mechanism to better set context for both input and output tasks, making these models a natural fit for generative text AI.
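
To illustrate the core of that mechanism, here is a minimal sketch of scaled dot-product self-attention in PyTorch; the dimensions and inputs are random placeholders, and real Transformers add multiple heads, masking, positional encodings, and stacked layers.

```python
# Requires PyTorch. Minimal scaled dot-product self-attention over a whole
# sequence at once; dimensions and inputs are random placeholders.
import math
import torch
import torch.nn.functional as F

seq_len, d_model = 5, 64
x = torch.randn(1, seq_len, d_model)          # one "sentence" of 5 token vectors

# Learned projections for queries, keys, and values (randomly initialized here).
W_q = torch.nn.Linear(d_model, d_model)
W_k = torch.nn.Linear(d_model, d_model)
W_v = torch.nn.Linear(d_model, d_model)

q, k, v = W_q(x), W_k(x), W_v(x)

# Every position attends to every other position in a single matrix multiply.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)   # (1, 5, 5)
weights = F.softmax(scores, dim=-1)                      # attention weights per position
attended = weights @ v                                   # (1, 5, 64) context-mixed vectors

print(weights.shape, attended.shape)
```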

Additionally, papers published by NVIDIA AI on efficiently pre-training large models have helped push the boundaries of efficiency and speed.

In recent years, transformer-based models have emerged as the SOTA models for NLP, and all of the latest SOTA models build on the Transformer architecture.
Popular examples include the Bidirectional Encoder Representations from Transformers (BERT) model and the Generative Pre-trained Transformer 3 (GPT-3) model.

Use Cases: Language Modeling, Question Answering, Text Generation

Significant papers:
“Attention is all you need” by Vaswani et al. (2017)
“BERT: Pre-training of deep bidirectional transformers for language understanding” by Devlin et al. (2018)
“Language models are few-shot learners” by Brown et al. (2020)
“GPT-4 Technical Report” by OpenAI (2023)
“Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM” by Narayanan et al. (2021)

Significant people:
Ashish Vaswani
Jacob Devlin
Alec Radford

The current leader of the pack

Recently, GPT-4 has taken the crown for the best SOTA performance.

GPT-4 considerably outperforms existing large language models, alongside most state-of-the-art (SOTA) models, which may include benchmark-specific crafting or additional training protocols.

What’s next?

With the advancement of computing and AI, it would not be surprising to see a completely different modeling technique come to the fore to capture the SOTA crown.


