SimVP: Simpler yet Better Video Prediction

This is going to be a short story, not a long one: the story of a simple yet efficient model architecture for video prediction.

Reza Yazdanfar
5 min read · Oct 23, 2023

I read every day: books, stories, and papers. Here you'll find the AI-related research that catches my interest. It doesn't include everything, but enough to stay in the know.

I am building nouswise🤗!! Check it out and get on the waitlist😉 If you want early access, hit me up on X (Twitter). Let me know your thoughts.

WHY DO WE NEED TO PREDICT A VIDEO'S FUTURE?

For a variety of tasks, it'd be awesome to have at least an intuition of what will happen in the future: the next bubble interests economists, the next fashion trend interests fashion designers, the next Google interests venture capitalists, and so on.

One way of forecasting the future is using video data, exactly like we humans do: we can speculate about a goal in soccer a few seconds, or even less, before it happens, and then there's that unbelievable, bright excitement in our eyes.

In the arena of Artificial Intelligence (computer vision, to be precise), researchers are developing algorithms with diverse architectures to predict the next few frames. It's a growing focus for researchers, companies, and investors alike.

There are numerous applications now, including simulating the behavior of dynamic systems such as proteins in biology or self-driving vehicles on the streets.

With the rise of attention-based Transformers, attention has been applied to a wide variety of tasks. Like other ANNs (Artificial Neural Networks), they generalize from data, but they are both more powerful and more energy-hungry to compute. Consequently, a lot of improved variants have been developed.

This article, simply, is an attempt to respond to one question:

Is there a simple method that can perform comparably well?

Well, the results say yes: SimVP.

HOW CAN WE DEVELOP A SIMPLE YET ROBUST MODEL?

First, we must look at what has been done; then we can figure out how to do the job.

What makes SimVP different is that it uses no attention mechanism in its architecture, only convolutional operators.

We can categorize key methods into four branches:

a) RNN-RNN-RNN

b) CNN-RNN-CNN

c) CNN-ViT-CNN

d) CNN-CNN-CNN

Fig 1. Red and blue lines help to learn the temporal evolution and spatial dependency. [source]

All four architecture families work well for video prediction, have gained a lot of attention, and have been built upon. The paper reviews the architectures developed in these four categories along with their limitations.

For now, it is enough to know that SimVP is a CNN-CNN-CNN architecture with less complexity than RNN- or ViT-based architectures.

At the time of the research, the results on various datasets were astonishing. And contrary to the "the more complicated, the better" mindset, here we aim to simplify.

The reason is obvious: more complicated models eat up a huge portion of time, energy, and resources. And when we deploy a model for real-time use, we want less time per prediction, i.e., higher throughput.

So now we know it uses convolutional operators. Let's dive into the architecture itself.

WHAT?

So now we know that SimVP is a CNN-CNN-CNN; what does the architecture look like?

SimVP in one view looks like Figure 2.

Fig 2. The architecture in one view [source]

In a nutshell, it has three parts:

  1. An encoder to understand and encode the spatial features of each frame.
  2. Then, a translator to learn the dynamical behavior (temporal evolution).
  3. And finally, a decoder to decode and blend both temporal and spatial features.

Yeah, there are preprocessing and postprocessing steps as well, but we don't consider them part of the architecture.

Fig 3. The architecture in detail. The encoder stacks Ns ConvNormReLU blocks to extract spatial features, i.e., convoluting C channels on (H, W). The translator employs Nt Inception modules to learn temporal evolution, i.e., convoluting T×C channels on (H, W). The decoder utilizes Ns unConvNormReLU blocks to reconstruct the ground truth frames, which convolutes C channels on (H, W). [source]
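To make that data flow concrete, here's a minimal PyTorch sketch of the forward pass, written against the shapes in the caption above. The module internals are stand-ins (the real blocks come next); the key idea is the reshaping: the encoder and decoder see one frame at a time, (B·T, C, H, W), while the translator sees all T frames stacked along the channel axis, (B, T·C_hid, H, W).

```python
import torch
import torch.nn as nn

class SimVPSketch(nn.Module):
    """Illustrative skeleton only; the encoder/translator/decoder passed in
    are assumed to match the channel counts described in Fig 3."""

    def __init__(self, encoder, translator, decoder):
        super().__init__()
        self.encoder = encoder        # per frame: C -> C_hid, shrinks (H, W)
        self.translator = translator  # T*C_hid -> T*C_hid on the reduced grid
        self.decoder = decoder        # per frame: C_hid -> C, upsamples back

    def forward(self, x):                                    # x: (B, T, C, H, W)
        B, T, C, H, W = x.shape
        z = self.encoder(x.reshape(B * T, C, H, W))          # spatial encoding
        _, C_hid, Hs, Ws = z.shape
        z = self.translator(z.reshape(B, T * C_hid, Hs, Ws)) # temporal evolution
        z = self.decoder(z.reshape(B * T, C_hid, Hs, Ws))    # frame reconstruction
        return z.reshape(B, T, C, H, W)
```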

The encoder is a stack of ConvNormReLU blocks (Conv2d + GroupNorm + LeakyReLU), and the decoder is a stack of unConvNormReLU blocks (ConvTranspose2d + GroupNorm + LeakyReLU): all well-established blocks and operations.
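For reference, here's how those two blocks might look in PyTorch. The kernel size, group count, and LeakyReLU slope are my illustrative assumptions, not values taken from the paper:

```python
import torch.nn as nn

def conv_norm_relu(c_in, c_out, stride=1):
    # ConvNormReLU: Conv2d + GroupNorm + LeakyReLU; stride=2 halves (H, W).
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride, padding=1),
        nn.GroupNorm(2, c_out),   # group count is an assumption; c_out must divide evenly
        nn.LeakyReLU(0.2, inplace=True),
    )

def unconv_norm_relu(c_in, c_out, stride=1):
    # unConvNormReLU: ConvTranspose2d + GroupNorm + LeakyReLU; stride=2 doubles (H, W).
    return nn.Sequential(
        nn.ConvTranspose2d(c_in, c_out, kernel_size=3, stride=stride,
                           padding=1, output_padding=stride - 1),
        nn.GroupNorm(2, c_out),
        nn.LeakyReLU(0.2, inplace=True),
    )
```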

For the translator, SimVP uses a stack of modules called Inception modules to learn the temporal features. Each Inception module is designed like Figure 4.

Fig 4. Inception [source]
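As I read Figure 4, each Inception module applies a 1×1 bottleneck convolution followed by several parallel grouped convolutions with different kernel sizes, whose outputs are combined. Here's a sketch under those assumptions; the specific kernel sizes and group count are my illustrative choices, not verbatim from the paper:

```python
import torch.nn as nn

class InceptionSketch(nn.Module):
    def __init__(self, c_in, c_hid, c_out, kernel_sizes=(3, 5, 7, 11), groups=8):
        super().__init__()
        # 1x1 bottleneck, then parallel grouped convs at several kernel sizes.
        # c_hid and c_out must be divisible by `groups`.
        self.bottleneck = nn.Conv2d(c_in, c_hid, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Conv2d(c_hid, c_out, kernel_size=k, padding=k // 2, groups=groups)
            for k in kernel_sizes
        )

    def forward(self, x):
        x = self.bottleneck(x)
        # Combine the multi-scale branch outputs by summation.
        return sum(branch(x) for branch in self.branches)
```

The different kernel sizes let the module mix information over several spatial scales at once, which is how the stacked T×C channels exchange temporal information without any recurrence or attention.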

RESULTS

I am not going to discuss the results at length; generally, SimVP achieved SOTA results on lightweight benchmarks.

Table 1. SimVP vs. SOTA. The optimal (or suboptimal) results are marked in bold (or underlined). [source]

The main research is SimVP: Simpler yet Better Video Prediction, and the code is publicly available on GitHub. The code is an easy-going one, contrary to other repos I've checked. 😉

This was a short story, and I believe it's sometimes more productive for me to know a problem and how it's been solved, and only if I face the issue would I start learning it inside out. What do you think? Let me know.

Where all are tired and usually skip:
I write more about AI models and their architectures; if you liked this, follow me on Medium.
I'm fascinated by business and markets, so just in case you might be interested, hit me up and/or subscribe to my Substack, where I write about industries and startups.
By the way, I love networking and getting to know people from various backgrounds, so why not? Don't hesitate to book a time on Calendly or contact me directly on LinkedIn.
