Working of Encoder-Decoder model

Sequence-Sequence LSTM model

Chamanth mvs
11 min read · May 25, 2023

LSTM and GRU models are variants of the RNN. The plain RNN is much easier to understand than LSTM and GRU units, and the beauty of understanding the base concept is that the learnings carry over to LSTM and GRU units.

There are many types of RNN/LSTM models and the most popular among them is the Seq2Seq model.

Types of RNN models

There are two types of Seq2Seq models: Seq2Seq models where the input and output sequences have the same length, and Seq2Seq models where they have different lengths.

The architecture of the model depends on the type of use case to be solved. Sequence models have gained traction over the past five years and remain a very active area of research. Even the GPT models (on which ChatGPT is built) and BERT are based on the Transformer and its Self-Attention mechanism; Attention models, in turn, build on Encoder-Decoder sequence models.

Problems like Machine Translation, Question Answering, Chatbots and Text Summarization can also be solved using Encoder-Decoder sequence models.

Let's consider a Machine Translation problem (translating text from one language to another) to understand the architecture of the Encoder-Decoder model and how it is built. Typically, Neural Networks are used to perform Machine Translation, which is commonly called Neural Machine Translation.

The example considered in this article is translation from English to Telugu (a Machine Translation problem).

Encoder-Decoder Architecture

The high-level architecture of the Encoder-Decoder model is shown below.

Architecture of Encoder-Decoder network

Let's understand the architecture in a point-wise, summarized manner.

  1. The Encoder and Decoder models would typically be either LSTM or GRU cells.
  2. The Encoder reads the input sequence (sentences are typically treated as sequences, because a sentence is a sequence of words) and summarizes the information in its internal state vectors [in an LSTM these are called the hidden-state and cell-state vectors]. The outputs of the encoder network are discarded and only the internal-state vectors are preserved.
  3. The Decoder is also an LSTM cell, whose initial states are initialized to the final states of the Encoder's LSTM cell. Based on these final Encoder states, the Decoder starts generating the output sequence.
  4. The Decoder behaves differently in the training phase and in the testing (inference) phase. During training, teacher forcing is used to train the decoder faster. During inference, the input to the decoder at each timestep is the output from the previous timestep.
  5. As said above, the encoder summarizes the input sequence into state vectors, also called Thought vectors. These vectors are fed as input to the Decoder network, which starts generating the output sequence based on them, as sketched in the code just below this list.
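This wiring can be written down in a few lines of Keras. It is a minimal, illustrative sketch only; the vocabulary sizes and the number of LSTM units below are assumed values, not taken from this article.

```python
# Minimal Keras sketch of the Encoder-Decoder wiring described above.
# Vocabulary sizes and latent_dim are assumed, illustrative values.
from tensorflow.keras.layers import Input, LSTM, Dense, Embedding
from tensorflow.keras.models import Model

num_encoder_tokens = 8000   # assumed English vocabulary size
num_decoder_tokens = 9000   # assumed Telugu vocabulary size
latent_dim = 256            # number of LSTM units (size of h and c)

# Encoder: read the input sequence, keep only the internal states (h, c)
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
encoder_outputs, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]   # the "thought vectors"; encoder_outputs is discarded

# Decoder: initialized with the encoder's final states, trained with teacher forcing
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
decoder_lstm = LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
```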

Teacher Forcing

Teacher forcing is a strategy for training RNN/LSTM-based networks in which the ground truth is used as input, instead of the model's output from a prior timestep.

Teacher forcing works by using the actual output from the training dataset at the current timestep t, i.e. y(t), as the input at the next timestep t+1, rather than the output generated by the network at timestep t. To be more clear: let the output generated at timestep t be ŷ(t) and the actual output be y(t). When using the teacher forcing strategy, we feed y(t) into the next timestep instead of ŷ(t).
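A small illustrative sketch of what this means for the decoder data, in plain Python, using the running example of this article (with the Start_ / _End markers that are introduced later in the text): the decoder input is the ground-truth sequence, and the target is the same sequence shifted left by one timestep.

```python
# Teacher forcing: the decoder is fed the ground-truth word at each timestep,
# and is trained to predict the next ground-truth word.
target_text = "Start_ నేను డేటా సైన్స్ నేర్చుకుంటున్నాను _End"
tokens = target_text.split()

decoder_input_tokens  = tokens[:-1]   # everything except the final _End
decoder_target_tokens = tokens[1:]    # everything except the leading Start_

# At timestep t the decoder receives decoder_input_tokens[t] (ground truth),
# not its own prediction from timestep t-1.
for t, (inp, tgt) in enumerate(zip(decoder_input_tokens, decoder_target_tokens)):
    print(f"t={t}: input={inp!r} -> expected output={tgt!r}")
```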

All of the above five steps described in the architecture section will be understood more clearly and intuitively with an example of translating an English sentence (input sequence) into an equivalent Telugu sentence (output sequence).

Encoder LSTM

Brief about Encoder network

It is important to note that all components (x_ij, h_i, c_i and (y_ij)^) are vectors.

Translating an English sentence into its Telugu equivalent.

Let’s start solving a machine translation problem — translation from English to Telugu.

(Input sequence) English : I am learning data science.

(Output sequence) Telugu : నేను డేటా సైన్స్ నేర్చుకుంటున్నాను.

— — — Encoder phase — — —

Consider Input sequence (x_ij)

A sentence is a sequence of words, and a word is a sequence of characters.

Consider a sentence and break it down into sequence of words.

I am learning data science (sentence) — [‘I’, ‘am’, ‘learning’, ‘data’, ‘science’] (sequence of words)

Consider a sentence and break it down into sequence of characters.

I am learning data science (sentence) — [‘I’, ‘a’, ‘m’, ‘l’, ‘e’, ‘r’, ’n’, ‘i’, ‘g’, ‘d’, ‘t’, ‘s’, ‘c’, ‘’] (sequence of characters)

In the real world, machine translation applications are built by considering a sentence and breaking it into a sequence of words. So, this is termed a Word-level Neural Machine Translation problem.

The input sequence, I am learning data science, has 5 words, so it can be split into 5 timesteps: the LSTM model will read this sentence word by word over 5 timesteps.

(Input sequence) — I am learning data science — — x_i1 = ‘I’, x_i2 = ‘am’, x_i3 = ‘learning’, x_i4 = ‘data’ and x_i5 = ‘science’

How do we represent each word (x_ij) as a vector? This is typically done using word embedding algorithms like word2vec, sent2vec, etc.
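As a concrete illustration, here is a minimal sketch of obtaining word vectors with word2vec, assuming the gensim library; the toy corpus and the vector size are illustrative assumptions only.

```python
# Turning each word x_ij into a dense vector with word2vec (via gensim).
from gensim.models import Word2Vec

sentence = "I am learning data science"
tokens = sentence.lower().split()        # ['i', 'am', 'learning', 'data', 'science']

# A tiny toy corpus; a real model would be trained on a large corpus.
corpus = [tokens, ['i', 'am', 'learning', 'machine', 'translation']]

w2v = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, epochs=20)
print(w2v.wv['learning'].shape)          # (50,): each word is now a 50-dimensional vector
```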

Why use word embeddings?

Computers, coding scripts and machine-learning models can't read and understand text in any human sense; they can only work with a machine representation (ultimately, binary).

When the word "cat" is read, many different associations are invoked in our mind: it's a small furry animal, it eats fish, and so on. Such linguistic associations are the result of quite complex neurological computation in our mind, trained by our parents, teachers, our own experience and, in fact, genetics too. An ML model, however, starts from scratch with no pre-built understanding of word meaning.

Computers can handle numerical input in an efficient way.

So how should textual input be fed to the models? The numerical representation, and the values in it, should capture as much of the linguistic meaning of a word as possible. An informative input representation can have a massive impact on the overall performance of the model.

In any problem dealing with textual data, word embeddings are the saviours. From text classification to machine translation, word embeddings are used extensively.

Assume a dataset of 50k reviews containing 80,000 unique words; 80,000 becomes the vocabulary size of the dataset. If embeddings are not used, the data could be represented as one-hot encoded vectors, one per word, and that alone already feels disastrous.

Let's take a very simple example to understand:

Dataset = (‘This is a informative blog’, ‘This blog explains on creating data pipeline’).

Assume this is the dataset: it contains two sentences. Let's store the words of this dataset in a list, order them by occurrence, and give each one a respective index.

words_list =(‘This’,’is’,’a’,’informative’,’blog’,’explains’,’on’,’creating’,’data’,’pipeline’)

Vocabulary size is 10

ranked_index=(‘1’,’2',’3',’4',’5',’6',’7',’8',’9',’10')

Now, representing the dataset as one-hot encoded vectors:

Sentence 1, 'This is a informative blog':
({1,0,0,0,0,0,0,0,0,0}, {0,1,0,0,0,0,0,0,0,0}, {0,0,1,0,0,0,0,0,0,0}, {0,0,0,1,0,0,0,0,0,0}, {0,0,0,0,1,0,0,0,0,0})

Sentence 2, 'This blog explains on creating data pipeline':
({1,0,0,0,0,0,0,0,0,0}, {0,0,0,0,1,0,0,0,0,0}, {0,0,0,0,0,1,0,0,0,0}, {0,0,0,0,0,0,1,0,0,0}, {0,0,0,0,0,0,0,1,0,0}, {0,0,0,0,0,0,0,0,1,0}, {0,0,0,0,0,0,0,0,0,1})

Now imagine the 50k-review dataset, with its 80,000-word vocabulary, represented as one-hot encoded vectors. How sparse would that be?

The one-hot encoding process generates a very sparse (mostly zero) feature vector for each input word, but one-hot vectors are a quick and easy way to represent words as vectors of real-valued numbers. Many NLP projects have been implemented with this method and the results turned out mediocre, especially when the training dataset is small, because one-hot vectors are not an informative way of representing the data and the final trained weights (features) do not generalize well, since many weights tend to stay sparse. In typical ML models, L1 and L2 regularization techniques would be used to deal with this problem.

To deal with all such problems, word embeddings are used.
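A short sketch (assumed, not from the article) contrasting a one-hot vector with a dense embedding for the toy two-sentence dataset above:

```python
# One-hot versus dense embedding for the toy vocabulary above.
import numpy as np

words_list = ['This', 'is', 'a', 'informative', 'blog',
              'explains', 'on', 'creating', 'data', 'pipeline']
word_to_index = {w: i for i, w in enumerate(words_list)}
vocab_size = len(words_list)          # 10

def one_hot(word):
    vec = np.zeros(vocab_size)        # mostly zeros: sparse
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot('blog'))                # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]

# An embedding matrix maps each word to a short dense vector instead.
# Here it is random; in a real model these weights are learned during training.
embedding_dim = 4
embedding_matrix = np.random.randn(vocab_size, embedding_dim)
print(embedding_matrix[word_to_index['blog']])   # a dense 4-dimensional vector
```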

Back to our task

It is clear that we use word-embeddings to represent each x_ij in the form of vectors.

hi and ci

h_i is the hidden state and c_i is the cell state at each timestep. Put simply, these states are used to remember what the LSTM model has already learnt.

for example:

h0, c0: these states are initialized to zeros. At this point, the model hasn't started learning anything yet.

h3, c3: these states indicate that the network has learnt particular information up to timestep-3. They are the summary of the information the model has learnt till timestep-3.

h3,c3 are the states that the model has learnt this data — “I am learning”

h4,c4 are the states that the model has learnt this data — “I am learning data”

h5,c5 are the states after the model has learnt this data, "I am learning data science" (these h5,c5 states contain the summary of the entire input sequence). This is the state where the sentence ends. The states from the last timestep are also called "THOUGHT VECTORS" (they summarize the entire input sequence in vector form).

The size of the vectors h_i and c_i equals the number of LSTM units in the LSTM cell.
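A quick, assumed sanity check of this in Keras: with latent_dim LSTM units, both h and c come out as vectors of length latent_dim (the input sizes below are illustrative only).

```python
# Checking that h and c have one entry per LSTM unit.
import numpy as np
from tensorflow.keras.layers import Input, LSTM
from tensorflow.keras.models import Model

latent_dim = 256                       # assumed number of LSTM units
embedding_dim = 100                    # assumed word-vector size

inputs = Input(shape=(5, embedding_dim))   # 5 timesteps: "I am learning data science"
_, h, c = LSTM(latent_dim, return_state=True)(inputs)
probe = Model(inputs, [h, c])

h5, c5 = probe.predict(np.random.randn(1, 5, embedding_dim), verbose=0)
print(h5.shape, c5.shape)              # (1, 256) (1, 256): the thought vectors
```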

(y_ij)^ at each timestep

(y_ij)^ is the prediction of the model at each timestep. What type of vector is (y_ij)^? In word-level language models, (y_ij)^ is a probability distribution over the entire vocabulary, generated using a softmax activation.

So, each (y_ij)^ is a vector whose size is the vocabulary size, i.e. the total number of words in the vocabulary. (y_ij)^ can be used or discarded, depending on the problem.

In our case, a machine translation (encoder-decoder) problem from English to Telugu, the (y_ij)^ are discarded. That means we do not consider the outputs the network generates while reading the English sentence.

Summary of the Encoder Model

The input sequence (the English sentence) is read word by word, and the internal states of the LSTM network are preserved after each timestep until the last word (the last timestep, or end of the sequence): hn, cn, assuming there are n words in the sequence.

These vectors hn, cn are called the encoding of the input sequence, as they encode (summarize) the entire input in vector form.

In the encoder, the (y_ij)^ are discarded because the output of the encoder network is not used. After the encoder has read the entire input sequence, its final states are sent to the decoder network, and based on these states the decoder generates the output.

— — — Decoder Phase — — —

The Encoder network (LSTM) works the same way in both the training phase and the inference phase. But the Decoder network (LSTM) works differently in training and in inference.

(Input sequence) I am learning data science.

The goal of the training process is to train/teach the Decoder to output

(Output sequence) నేను డేటా సైన్స్ నేర్చుకుంటున్నాను.

In the Encoder model, the input was a sequence of words. Similarly, the Decoder model generates the output sequence word by word.

In the output sequence, we add Start_ at the start of the sequence, to indicate to the decoder that it should start generating the Telugu sequence (the next word will be the first word of the Telugu sentence), and _End at the end, after the last word of the Telugu sentence, so the decoder knows it is the stopping condition. During inference, _End denotes the end of the translated sentence, and the inference loop is stopped.

So, the Output Sequence would be:

Start_ నేను డేటా సైన్స్ నేర్చుకుంటున్నాను _End

Finally, the loss is calculated on the predicted outputs from each timestep and the errors are backpropagated through time (BPTT) to update the parameters of the network.
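As an illustration of what that per-timestep loss looks like, here is a small assumed sketch: the categorical cross-entropy between the softmax distribution predicted at one timestep and the one-hot target word (the numbers are made up for the example).

```python
# Per-timestep categorical cross-entropy, accumulated over timesteps before BPTT.
import numpy as np

def cross_entropy(y_true_onehot, y_pred_probs, eps=1e-9):
    return -np.sum(y_true_onehot * np.log(y_pred_probs + eps))

y_true = np.array([0, 0, 1, 0, 0])               # true word is index 2 in a 5-word vocabulary
y_pred = np.array([0.1, 0.1, 0.6, 0.1, 0.1])     # softmax output at this timestep
print(cross_entropy(y_true, y_pred))             # ≈ 0.51
```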

How do the Encoder and Decoder networks look together?

Inference means

Before understanding the inference phase of the Decoder, it is important to understand the word inference.

Understanding the word inference in data science terms, with an example: during inference, the developer or data scientist might give the trained ML model some photos of cars (which the model has never seen) to see what the trained model infers from those photos.

So, inference in ML means putting the trained model to work on real-world data to produce an actionable output. During this phase, the inference system accepts inputs from end users, processes the data, feeds it into the ML model and serves the output back to the users.

Similarly, inference in a neural network means applying the knowledge of a trained neural-network model to new data and reading off the result from that trained model.

Decoder — Inference phase

The goal of the Decoder is to predict the output sequence using the provided Thought vectors. Understanding this with an example makes the concept much clearer.

(Input sequence) I am learning data science.

(Output sequence) Start_ నేను డేటా సైన్స్ నేర్చుకుంటున్నాను _End.

Step-1 : Building THOUGHT VECTORS by encoding the input sequence

Step-2 : Output sequences will be generated in the loop — word by word

Process of Inference :

Let's put the above pictorial representation into words.

During inference, only one word is generated at a time. It works like a loop, which means only one timestep is processed in each iteration.

The initial states of the Decoder are set to the final states of the Encoder, and the initial input to the Decoder is always the Start_ token.

At every timestep, the predicted output is fed as input in the next timestep.

Break the loop when the Decoder predicts the _End token.
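A hedged sketch of this inference loop, following the common Keras pattern of separate encoder and decoder inference models; the model objects and token dictionaries passed in here are assumptions, not defined in this article.

```python
# Generating the Telugu sentence word by word from the thought vectors.
import numpy as np

def decode_sequence(input_seq, encoder_model, decoder_model,
                    target_token_index, reverse_target_index, max_len=50):
    # Step 1: encode the input sequence into the thought vectors [h, c]
    states = encoder_model.predict(input_seq, verbose=0)

    # Step 2: generate the output in a loop, starting from the Start_ token
    target_seq = np.array([[target_token_index['Start_']]])
    decoded_words = []
    for _ in range(max_len):
        output_tokens, h, c = decoder_model.predict([target_seq] + states, verbose=0)
        sampled_index = int(np.argmax(output_tokens[0, -1, :]))   # most probable word
        sampled_word = reverse_target_index[sampled_index]
        if sampled_word == '_End':            # stopping condition
            break
        decoded_words.append(sampled_word)
        # the predicted word becomes the decoder input at the next timestep
        target_seq = np.array([[sampled_index]])
        states = [h, c]
    return ' '.join(decoded_words)
```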
