Getting Started with Multimodal Retrieval Augmented Generation

ODSC - Open Data Science
10 min read · Mar 8, 2024

Editor’s note: Valentina Alto is a speaker for ODSC East this April 23–25. Be sure to check out her talk, “The AI Paradigm Shift: Under the Hood of Large Language Models,” there!

An implementation with CLIP and GPT-4-vision

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities across various natural language understanding and generation tasks. These models, such as GPT-4, Llama-2, Mistral, and many others, have been trained on vast amounts of text data and can generate coherent and contextually relevant responses. However, they are not without limitations. Here are some key challenges faced by LLMs:

  • Outdated Knowledge: LLMs lack real-time updates and may provide outdated information. Their knowledge is static, even in dynamic domains where facts change rapidly.
  • Non-Transparent Reasoning: Understanding how LLMs arrive at their decisions remains challenging. Their reasoning processes are often opaque and difficult to trace.
  • Hallucination: LLMs sometimes generate plausible-sounding but factually incorrect information. They might invent details that do not exist in the training data, leading to unreliable outputs.

To address these limitations, researchers have turned to Retrieval-Augmented Generation (RAG) as a promising solution. Let’s explore why RAG is important and how it bridges the gap between LLMs and external knowledge.

What is RAG?

RAG is an architectural framework for LLM-powered applications which consists of two main steps:

  • Retrieval. In this stage, the system retrieves from the provided knowledge base the context that is most similar to the user’s query. This step relies on the concept of embedding.

Definition

Embedding is the process of transforming data into numerical vectors. It is used to represent text, images, audio, or other complex data types in a multi-dimensional space that preserves the semantic similarity and relevance of the original data. This means that, for example, the embeddings of two words or concepts that are semantically similar will be mathematically close within that multi-dimensional space.

Embeddings are essential for generative AI and RAG (retrieval-augmented generation) because they allow models to compare external knowledge sources with the user’s input and generate more accurate and reliable outputs.

The retrieval process involves embedding the user’s query and retrieving the words, sentences, or documents whose vectors are mathematically closest to the query’s vector.

  • Augmented Generation. Once the relevant set of words, sentences, or documents is retrieved, it becomes the context from which the LLM generates the response. The generation is “augmented” in the sense that the retrieved context is not simply copy-pasted and presented to the user as-is; rather, it is passed to the LLM as context and used to produce an AI-generated answer. A minimal code sketch of both phases is shown after the illustration below.

Below is an illustration of the two phases described above:

[Figure: the two phases of RAG (retrieval and augmented generation)]
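To make the two phases concrete, here is a minimal text-only sketch using the OpenAI Python SDK and NumPy. The model names, the toy knowledge base, and the `embed` helper are illustrative assumptions, not the exact setup used later in this article:

```python
# A minimal text-only RAG sketch (illustrative only).
# Assumes the `openai` Python SDK (v1.x) and an OPENAI_API_KEY in the environment;
# model names are placeholders and may differ in your deployment.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    response = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in response.data])

# 1. Retrieval: embed the knowledge base and the query, then rank by cosine similarity
knowledge_base = [
    "RAG retrieves relevant context from an external knowledge base.",
    "Embeddings map text to vectors that preserve semantic similarity.",
    "CLIP projects images and text into a shared embedding space.",
]
kb_vectors = embed(knowledge_base)
query = "How does RAG find relevant context?"
query_vector = embed([query])[0]

similarities = kb_vectors @ query_vector / (
    np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_context = [knowledge_base[i] for i in similarities.argsort()[::-1][:2]]

# 2. Augmented generation: pass the retrieved context to the LLM as grounding
answer = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "Answer using only this context:\n" + "\n".join(top_context)},
        {"role": "user", "content": query},
    ],
)
print(answer.choices[0].message.content)
```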

RAG brings a series of benefits, including:

  • Incorporating External Knowledge. RAG integrates information from external sources. By doing so, it enhances the accuracy and credibility of LLM-generated content. For knowledge-intensive tasks (e.g., medical diagnosis, legal advice, or scientific explanations), RAG ensures that the model leverages up-to-date, reliable information.
  • Reducing Factual Errors. RAG combines the strengths of both parametric (LLMs) and non-parametric (retrieval-based) approaches. The retrieval component retrieves relevant facts from external sources, reducing the risk of hallucination and factual inaccuracies. The generation component then produces fluent and contextually appropriate responses.
  • Interpretable Knowledge Integration. Unlike black-box models, RAG allows for more transparent reasoning. Researchers can analyze the retrieved information and understand how it influences the generated output. This interpretability is crucial for building trust and ensuring accountability.

The RAG framework has paved the way for powerful LLM-powered applications and a new search-engine paradigm. However, over the last year it was mainly applied to textual data, due to the “mono-modal” nature of the majority of LLMs available on the market (such as GPT-3.5-turbo or Llama 2). In other words, those models were only able to process text as input.

But the whole set of stimuli we receive as humans, as well as our way of communicating with each other, is not limited to text. This means that we need to extend our LLM-powered applications with multimodal capabilities, so that we can interact with them through multiple modalities.

Introducing Multimodality

Multimodality refers to the integration of information from different modalities (e.g., text, images, audio). Introducing multimodality in RAG further enriches the model’s capabilities:

  • Text-Image Fusion: Combining textual context with relevant images allows RAG to provide more comprehensive and context-aware responses. For instance, a medical diagnosis system could benefit from both textual descriptions and relevant medical images.
  • Cross-Modal Retrieval: RAG can retrieve information from diverse sources, including text, images, and videos. Cross-modal retrieval enables a deeper understanding of complex topics by leveraging multiple types of data.
  • Domain-Specific Knowledge: Multimodal RAG can integrate domain-specific information from various sources. For example, a travel recommendation system could consider both textual descriptions and user-generated photos to suggest personalized destinations.

Multimodal RAG (MM-RAG) follows the same pattern as the “monomodal” RAG described in the previous section, with the difference that we can interact with the model in multiple ways, plus the indexed knowledge base can also be in different data formats. The idea behind this pattern is to create a shared embedding space, where data in different modalities can be represented with vectors in the same multidimensional space. Also in this case, the idea is that similar data will be represented by vectors that are close to each other.

Once our knowledge base is properly embedded, we can store it in a multimodal VectorDB and use it to retrieve relevant context given the user’s query (which can be multimodal as well).

Now the question is: how do we generate a shared embedding space and make sure that our model is able to retrieve relevant data? We will cover different approaches to achieve that in the next section.

Building a MM-RAG application with CLIP and GPT-4-vision

Now let’s see how to practically build an MM-RAG application. The idea is to build a conversational application that can receive both text and images as input, as well as retrieve relevant information from a PDF that contains text and images. Hence, in this scenario, multimodality refers to “text + images” data. The goal is to create a shared embedding space where both images and text have their vector representation.

Let’s break down the architecture into its main steps.

Embedding the multimodal knowledge base

To embed our images, there are two main options we can follow:

1. Using an LMM such as GPT-4-vision to first generate a rich caption of the image. Then, use a text embedding model such as text-embedding-ada-002 to embed that caption (see the sketch after this list).

2. Using a model that is capable of directly embedding images without intermediate steps. For example, we can use a Vision Transformer for this purpose.
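As a minimal sketch of Option 1, the snippet below captions an image with a vision-capable model and then embeds the caption. The model names (gpt-4-vision-preview, text-embedding-ada-002) and the file name are assumptions that may need to be adapted to your account or Azure deployment:

```python
# Sketch of Option 1: caption an image with a vision-capable model, then embed the caption.
# Model names and the file name are illustrative placeholders.
import base64
from openai import OpenAI

client = OpenAI()

def caption_image(path: str) -> str:
    """Ask the LMM for a rich description of the image."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail for retrieval purposes."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def embed_caption(caption: str) -> list[float]:
    """Embed the caption with a text embedding model."""
    response = client.embeddings.create(model="text-embedding-ada-002", input=caption)
    return response.data[0].embedding

caption = caption_image("figure_1.png")   # hypothetical file name
vector = embed_caption(caption)
```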

Definition

The Vision Transformer (ViT) emerged as an alternative to Convolutional Neural Networks (CNNs). Like LLMs, ViT builds on the Transformer architecture; in ViT’s case, a Transformer encoder is applied to a sequence of image patches. The central mechanism is attention, which enables the model to selectively focus on specific parts of the input sequence when making predictions. By learning to attend to the relevant parts of the input while disregarding irrelevant ones, the model can tackle its tasks more effectively.

What sets attention in Transformers apart is its departure from traditional techniques like recurrence (commonly used in Recurrent Neural Networks or RNNs) and convolutions. Unlike previous models, the Transformer relies solely on attention to compute representations of both input and output. This unique approach allows the Transformer to capture a broader range of relationships between words in a sentence, resulting in a more nuanced representation of the input.
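For a rough intuition of what attention computes, here is a toy scaled dot-product self-attention in NumPy. Shapes and inputs are arbitrary; real ViTs add learned projections, multiple heads, and patch embeddings:

```python
# Toy scaled dot-product self-attention (illustrative only).
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                         # weighted sum of the values

# 4 "tokens" (e.g. image patches), each with an 8-dimensional representation
x = np.random.randn(4, 8)
output = scaled_dot_product_attention(x, x, x)                 # self-attention
print(output.shape)                                            # (4, 8)
```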

An example of this type of model is CLIP, a model developed by OpenAI that pairs a ViT image encoder with a text encoder and learns visual concepts from natural language supervision. It can perform various image classification tasks simply by being given the names of the visual categories in natural language, without any fine-tuning or labeled data. CLIP achieves this by learning a joint embedding space of images and texts, where images and texts that are semantically related are close to each other. The model was trained on a large-scale dataset of (image, text) pairs collected from the internet.
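Continuing with Option 2, a minimal sketch using the Hugging Face transformers implementation of CLIP could look like the following. The checkpoint is the public openai/clip-vit-base-patch32 model; the image file and the candidate texts are placeholders:

```python
# Sketch of Option 2: embedding images and text directly into CLIP's joint space.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("diagram.png")                          # hypothetical image from the PDF
texts = ["a diagram of a RAG architecture", "a photo of a cat"]

with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    text_inputs = processor(text=texts, return_tensors="pt", padding=True)
    image_emb = model.get_image_features(**image_inputs)   # shape: (1, 512)
    text_emb = model.get_text_features(**text_inputs)      # shape: (2, 512)

# Normalize and compare: both modalities now live in the same 512-dim space
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # cosine similarities between the image and each text
```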

Once you’ve got your embeddings, you will need to store them somewhere, typically a vector database.

Choosing your Vector DB

There are several vector databases available on the market, most of which are open-source. Below are some of them (all supporting multimodal indexing):

  • Qdrant. It is an open-source vector database designed for efficient similarity search. It allows you to index and search vectors from different modalities, including text and images. You can create composite indexes that combine embeddings from various sources. Qdrant is built for scalability, making it suitable for large-scale applications. Being open-source, Qdrant benefits from community contributions and improvements (see the sketch after this list).
  • Weaviate. It provides out-of-the-box multimodal models. Currently, the available module is multi2vec-clip, which projects images and text into a joint embedding space. You can perform nearVector or nearImage searches across these two modalities. Weaviate is designed to scale efficiently, accommodating growing data volumes.
  • Pinecone. It is a managed vector database service that excels in similarity search and recommendation systems. Pinecone enables you to handle both text and image embeddings so that you can create indexes that incorporate vectors from different modalities. Pinecone’s infrastructure ensures low-latency searches, making it ideal for real-time applications. It scales automatically based on demand, allowing you to focus on building applications without worrying about infrastructure management.
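As an illustration, here is a minimal sketch of indexing and searching multimodal embeddings with Qdrant’s Python client in in-memory mode. The collection name, the 512-dimension size (matching CLIP ViT-B/32), and the chunks and query_vector variables are assumptions carried over from the earlier sketches:

```python
# Minimal Qdrant sketch (in-memory mode, for illustration only).
# Assumes `chunks` holds the embeddings computed earlier (dicts with "vector",
# "type", and "source" keys) and `query_vector` is the embedded user query.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

qdrant = QdrantClient(":memory:")  # or QdrantClient(url="http://localhost:6333")

qdrant.create_collection(
    collection_name="mm_knowledge_base",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),  # 512 = CLIP ViT-B/32
)

qdrant.upsert(
    collection_name="mm_knowledge_base",
    points=[
        PointStruct(
            id=i,
            vector=chunk["vector"],
            payload={"type": chunk["type"], "source": chunk["source"]},
        )
        for i, chunk in enumerate(chunks)
    ],
)

# At query time: fetch the items closest to the (possibly multimodal) query
hits = qdrant.search(
    collection_name="mm_knowledge_base",
    query_vector=query_vector,
    limit=3,
)
```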

Among the proprietary vector databases, we can mention Azure AI Search (formerly called Azure Cognitive Search), Microsoft’s AI search engine, which has recently added vector search to its features. Azure AI Search also offers hybrid search functionality, combining the best of both worlds (traditional keyword search and vector search). Plus, thanks to its powerful ranking algorithms, this service has established itself as a powerful tool for RAG applications. Finally, it offers multimodal capabilities by first leveraging Azure AI Vision models to generate image captions, then using the text-embedding-ada-002 model to embed them into its vector store (following the approach described in Option 1 in the previous section).

Now that we have our embeddings stored in a vector database, we can leverage them to retrieve the relevant context.

Retrieving relevant context

For a fully multimodal experience, we want our application to be able not only to retrieve both text and images but also to receive both as input. This means that users can enjoy a multimodal experience, explaining concepts with both written and visual inputs.

To achieve this result, we will need an LMM to process the user’s input. The idea is that, given a user’s input (text + image), the LMM reasons over it and produces a description of the image that is consistent with the overall context provided by the user. Then, once the text and image descriptions are obtained, a text embedding model creates the vectors that are compared with those of the knowledge base.

Once the relevant context (text + images) has been gathered from the knowledge base, it is used as input for the LMM to reason over, in order to produce the generated answer. Note that the generated answer will contain references to both text and image sources.
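Putting the pieces together, a hedged end-to-end sketch of this retrieval and generation flow could look like the following. It reuses the hypothetical helpers (caption_image, embed_caption), the OpenAI client, and the Qdrant client from the previous snippets; the model name is again a placeholder:

```python
# End-to-end sketch of the multimodal retrieval + generation flow (illustrative only).
# Reuses `caption_image`, `embed_caption`, the OpenAI `client`, and the Qdrant
# `qdrant` client defined in the earlier sketches.
def answer_multimodal_query(user_text: str, user_image_path: str | None = None) -> str:
    # 1. Let the LMM describe the user's image so it can be embedded as text
    query_text = user_text
    if user_image_path is not None:
        query_text += "\n" + caption_image(user_image_path)

    # 2. Embed the combined query and retrieve the closest multimodal chunks
    query_vector = embed_caption(query_text)
    hits = qdrant.search(
        collection_name="mm_knowledge_base", query_vector=query_vector, limit=3
    )
    context = "\n".join(f"[{hit.payload['type']}] {hit.payload['source']}" for hit in hits)

    # 3. Augmented generation: answer using the retrieved text and image sources
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {"role": "system", "content": "Answer the question citing the provided sources:\n" + context},
            {"role": "user", "content": user_text},
        ],
    )
    return response.choices[0].message.content
```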

Conclusion

In the ever-expanding landscape of artificial intelligence, Multimodal Retrieval-Augmented Generation emerges as a beacon of promise. This fusion of Large Multimodal Models and external multimodal knowledge sources opens up exciting avenues for research, applications, and societal impact.

In the next few years, we anticipate MM-RAG to evolve into an indispensable tool for content creation, education, and problem-solving, just like “only-text” RAG has started to become indispensable over the last year. In other words, it will enable more effective communication between AI systems and humans.

If you are interested in learning more about MM-RAG and how to build multimodal applications with Python and AI orchestrators, join our upcoming talk at ODSC East 2024!


About the Author:

Valentina Alto is a Data Science MSc graduate and Cloud Specialist at Microsoft, focusing on Analytics and AI workloads within the manufacturing and pharmaceutical industry since 2022. She has been working on customers’ digital transformations, designing cloud architecture and modern data platforms, including IoT, real-time analytics, Machine Learning, and Generative AI. She is also a tech author, contributing articles on machine learning, AI, and statistics, and recently published a book on Generative AI and Large Language Models.

In her free time, she loves hiking and climbing around the beautiful Italian mountains, running, and enjoying a good book with a cup of coffee.

Originally posted on OpenDataScience.com

Read more data science articles on OpenDataScience.com, including tutorials and guides from beginner to advanced levels! Subscribe to our weekly newsletter here and receive the latest news every Thursday. You can also get data science training on-demand wherever you are with our Ai+ Training platform. Interested in attending an ODSC event? Learn more about our upcoming events here.
