Vector Databases — Long Term Memory for AI

Rajat Roy
6 min read · May 3, 2023

Introduction

In natural language processing (NLP), a vector is a mathematical representation of a word or text document. This representation is used to perform various NLP tasks such as sentiment analysis, text classification, and information retrieval.

In NLP, a vector can be generated using techniques such as word embedding or document embedding. Word embedding involves mapping words to high-dimensional vectors, where each dimension corresponds to a different aspect of the word’s meaning. Document embedding, on the other hand, involves generating a vector representation for an entire document, typically by averaging or pooling the word embeddings of its constituent words.
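
As a quick illustration, here's a toy sketch of mean pooling, using made-up three-dimensional vectors (real word embeddings are learned by a model and typically have hundreds of dimensions).

import numpy as np

# Made-up word vectors, purely for illustration
word_vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "sat": np.array([0.2, 0.8, 0.1]),
    "mat": np.array([0.7, 0.2, 0.1]),
}

# A document embedding built by averaging (mean pooling) its word vectors
doc = ["cat", "sat", "mat"]
doc_vector = np.mean([word_vectors[w] for w in doc], axis=0)
print(doc_vector)  # [0.6        0.36666667 0.06666667]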

Once a vector representation has been generated, various mathematical operations can be performed on the vectors to obtain insights into the text data. For example, the similarity between two words or documents can be computed using vector cosine similarity. This involves computing the cosine of the angle between the two vectors, where a higher cosine value indicates greater similarity.
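
Cosine similarity itself is a one-liner; here's a minimal NumPy sketch.

import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([0.9, 0.1, 0.0])
b = np.array([0.7, 0.2, 0.1])
print(cosine_similarity(a, b))  # values closer to 1 mean more similar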

Here's a simple example that converts text into a numerical representation using TF-IDF.

from sklearn.feature_extraction.text import TfidfVectorizer

# Define two sample texts
text1 = "The sun was setting over the horizon, casting a warm orange glow across the sky. The waves gently lapped at the shore as the seagulls flew overhead."
text2 = "As the night fell, the stars began to twinkle in the sky. The sound of crickets filled the air as the fireflies danced among the trees."

# Fit the vectorizer on both texts and print the dense TF-IDF matrix:
# one row per text, one column per vocabulary word
vectorizer = TfidfVectorizer()
print(vectorizer.fit_transform([text1, text2]).toarray())
[[0.16387439 0.         0.         0.11659798 0.16387439 0.
  0.16387439 0.         0.         0.         0.         0.
  0.16387439 0.16387439 0.16387439 0.16387439 0.         0.16387439
  0.         0.         0.16387439 0.16387439 0.16387439 0.16387439
  0.16387439 0.16387439 0.11659798 0.         0.         0.16387439
  0.69958785 0.         0.         0.         0.16387439 0.16387439
  0.16387439]
 [0.         0.15190417 0.15190417 0.21616214 0.         0.15190417
  0.         0.15190417 0.15190417 0.15190417 0.15190417 0.15190417
  0.         0.         0.         0.         0.15190417 0.
  0.15190417 0.15190417 0.         0.         0.         0.
  0.         0.         0.10808107 0.15190417 0.15190417 0.
  0.75656749 0.15190417 0.15190417 0.15190417 0.         0.
  0.        ]]

Best Vectorization Technique

The effectiveness of text vectorization techniques depends on the specific task and characteristics of the text data. For example, Bag-of-Words or TF-IDF may be effective for classifying text, while word embeddings or LDA may be better for identifying relationships between words or concepts. The size and complexity of the text data also play a role in determining the most effective technique.

Therefore, it’s important to consider the specific needs and characteristics of the text data and task when selecting a vectorization technique, and experimentation may be necessary to determine the most effective approach.

Transformer-based word embeddings, such as those used in the BERT model, have been shown to outperform traditional vectorization techniques such as Bag-of-Words or TF-IDF, thanks to their ability to capture the contextual meaning of words. This is achieved through a mechanism called self-attention, which allows the model to consider the entire sentence or document in which a word appears when determining its most relevant context.

This ability to capture context has been shown to improve the performance of NLP models on a range of tasks, including sentiment analysis, question answering, and natural language inference, and has led to BERT achieving state-of-the-art performance on several NLP benchmarks.
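
To make this concrete, here's a minimal sketch that extracts the contextual vector of the word "bank" in two different sentences. It assumes the transformers and torch packages are installed, and uses bert-base-uncased purely as an example checkpoint.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Run the sentence through BERT; self-attention lets every token
    # condition on the whole sentence
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    # Pick out the contextual vector of the token "bank"
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

a = bank_vector("he deposited cash at the bank")
b = bank_vector("she sat on the bank of the river")
# Same word, different contexts -> noticeably different vectors
print(torch.cosine_similarity(a, b, dim=0))

A static embedding like Word2Vec would assign "bank" the same vector in both sentences; BERT does not, which is exactly the contextual behaviour described above.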

Storing Word Embeddings

The size of transformer-based word embeddings varies depending on the specific model and the size of the vocabulary used. For example, the BERT model comes in two versions, BERT-base and BERT-large, with 110 million and 340 million parameters respectively.

The space required to store a transformer-based model can also vary widely depending on the model size and the vocabulary used. For example, the BERT-base checkpoint has a file size of approximately 440 MB, while BERT-large is approximately 1.3 GB.
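
And that's just the model checkpoint; the embeddings you generate with it take space too. A quick back-of-envelope sketch, assuming 768-dimensional float32 vectors (the output size of BERT-base):

# Rough storage estimate for one million document embeddings
num_docs = 1_000_000
dims = 768            # BERT-base hidden size
bytes_per_float = 4   # float32
total_gb = num_docs * dims * bytes_per_float / 1024**3
print(f"{total_gb:.1f} GB")  # ~2.9 GB, before any index overhead

At that scale, you need somewhere purpose-built to store and search the vectors.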

The Database

One such solution to this problem is Elasticsearch. It's a NoSQL database through which you can store any large corpus in a JSON-like structure, and it also allows you to store large embeddings, which makes it easier to apply analytics solutions on top of those large matrices. Such solutions include Question Answering, Semantic Search, Recommendation, Text & Image Generation, and Anomaly Detection.
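
For a flavour of what that looks like, here's a minimal sketch of creating an index that stores text alongside its embedding. The index name, field names, and URL are illustrative, 1536 is the dimension of OpenAI's text embeddings, and the keyword-argument style assumes the 8.x Python client.

from elasticsearch import Elasticsearch

es = Elasticsearch("https://user:pass@localhost:9243")  # hypothetical URL
es.indices.create(
    index="documents",
    mappings={
        "properties": {
            "text":      {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": 1536},
        }
    },
)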

Code Example

Here's an example of a similarity search built using LangChain and Elasticsearch. LangChain is another powerful library for building applications on top of Large Language Models (LLMs), but more about it in a different article.

The data used in this article is present here. It contains the text of a State of the Union speech given by the US President. We'll use LangChain, which leverages OpenAI embeddings, to convert this large text into a numerical representation and store it in Elasticsearch. Finally, we'll run a search query on the stored data to retrieve the documents that match.

Get started by installing the required packages.

!pip install elasticsearch langchain openai tiktoken

Since we'll be using OpenAI embeddings, we need to store the OpenAI API key as an environment variable. Here's how to do it.

import os
os.environ['OPENAI_API_KEY'] = "PASTE OPEN AI API KEY HERE"

To acquire an API key, log in to the OpenAI platform and get your API key from your profile section.

Initialize the Elasticsearch credentials. Elasticsearch is open source, so you can run an ES server in a self-hosted environment, or you can opt for a free trial of Elastic Cloud.

elastic_host = "HOST"
username = "USERNAME"
password = "PASSWORD"
elasticsearch_url = f"https://{username}:{password}@{elastic_host}:9243"
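
Optionally, you can sanity-check the connection before indexing anything (the host and credentials above are placeholders).

from elasticsearch import Elasticsearch

# Returns True if the cluster responds
es = Elasticsearch(elasticsearch_url)
print(es.ping())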

Import packages.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import ElasticVectorSearch
from langchain.document_loaders import TextLoader

Load the text data using LangChain's TextLoader.

loader = TextLoader('./sample_data/state_of_the_union.txt')
documents = loader.load()

Split the text data into documents and initialize the embeddings.

# Split into chunks of ~1000 characters with no overlap
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# OpenAI's embedding model, wrapped by LangChain
embeddings = OpenAIEmbeddings()

Index the documents in Elasticsearch, then perform a similarity search to retrieve the relevant documents.

# Embed the chunks and index them in Elasticsearch
db = ElasticVectorSearch.from_documents(docs, embeddings, elasticsearch_url=elasticsearch_url)

query = "What did the president say about Ketanji Brown Jackson"
docs = db.similarity_search(query)

for idx, doc in enumerate(docs):
    print(f"Match {idx+1}")
    print("=="*100)
    print(doc.page_content)
    print("=="*100)
    print("\n\n")

Here is the result.

Match 1
========================================================================================================================================================================================================
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
========================================================================================================================================================================================================

......

Conclusion

So easy, right? I hope this has at least helped you understand the concept of vector databases. Elasticsearch is not the only solution that can be used as a vector database; Pinecone is another such platform, and there are many others.

Final Note for Readers

Are you a programming, AI, or machine learning enthusiast? Then you’ll love my blog on Medium! I regularly post about these topics and share my insights on the latest trends and tools in data science. If you find my content helpful, please like and follow my blog. And if you want to show some extra support, you can give a tip by clicking the button below. Thanks for your time and support!
