How To Use A Vector Database

AI is popular. Artificial Intelligence (AI) and the Machine Learning (ML) functions that drive it are now so popular that we have started to look into the core mechanics of AI to understand how its internal component parts work. With some additional understanding of what makes AI smartness smart, the hope (for many of us) is that we can eradicate AI bias and hallucination, drive AI to do the things we really want it to do… and of course stop the robots taking over the planet (joke).

Central to building the new breed of generative AI functions are Large Language Models (LLMs), complex algorithms trained on huge amounts of data of one type or another so that they can recognize the structures, forms and patterns of human language. In terms of where they live, LLM workloads can theoretically run anywhere, but the vector embeddings they produce and consume live best in vector databases.

As we have noted before, vector databases (VectorDB) are crucial for working with Large Language Models (LLMs) due to their ability to handle the intricate, high-dimensional vector space these models generate. When we say high-dimensional here, we are referring to data that goes beyond a simple ‘how much’ or ‘what’ value; it can also express size, place in time and its relationship to other values.

What is high-dimensional data?

As explained nicely here on GitHub, “High-dimensional data are defined as data in which the number of features (variables observed), p, are close to or larger than the number of observations (or data points), n.”

In other words, high-dimensional data has more features (different ways to denote and define data) than actual data records. As Statology.org reminds us, “A dataset could have 10,000 features, but if it has 100,000 observations then it’s not high dimensional.”
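Under that definition, deciding whether a dataset counts as high-dimensional is a simple comparison of the two counts. A minimal sketch in Python (the function name is our own, for illustration only):

```python
def is_high_dimensional(n_observations: int, n_features: int) -> bool:
    """A dataset is 'high-dimensional' when the number of features (p)
    is close to or larger than the number of observations (n)."""
    return n_features >= n_observations

# Statology's example: 10,000 features but 100,000 observations is NOT high-dimensional.
print(is_high_dimensional(n_observations=100_000, n_features=10_000))  # False

# A genomics-style dataset: 500 samples, 20,000 measured genes IS high-dimensional.
print(is_high_dimensional(n_observations=500, n_features=20_000))  # True
```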

“A vector database organizes and stores vectors - numerical representations of words or phrases generated by LLMs - in a structured way,” explained Mark Nijmeijer, senior director of product at Hycu, Inc., a company known for its multi-cloud and SaaS data protection as-a-service technology.

Just as a librarian can quickly retrieve a book based on its category, a vector database can swiftly fetch the relevant vectors. Nijmeijer says that this is essential for LLMs as they constantly process vast amounts of textual data, converting them into vectors for various tasks like understanding context, generating responses, or finding patterns. Without a vector database, managing and retrieving these vectors would be as cumbersome as finding a specific book in the world’s largest library, drastically slowing down the AI's performance. Common vector databases include DataStax, Pinecone, KX, Chroma, Weaviate, Faiss, Qdrant and others.
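The librarian analogy can be made concrete with a toy sketch of what a vector database does at its core: store vectors under identifiers and fetch the ones most similar to a query. This is an in-memory stand-in, not a real product's API; real systems add indexing structures so the search scales far beyond a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# A toy 'vector database': document ids mapped to embedding vectors.
store = {
    "doc_cats":   [0.9, 0.1, 0.0],
    "doc_dogs":   [0.8, 0.2, 0.1],
    "doc_stocks": [0.0, 0.1, 0.9],
}

def nearest(query, k=1):
    """Return the k stored ids most similar to the query vector."""
    ranked = sorted(store,
                    key=lambda doc_id: cosine_similarity(query, store[doc_id]),
                    reverse=True)
    return ranked[:k]

print(nearest([0.85, 0.15, 0.05], k=2))  # ['doc_cats', 'doc_dogs']
```

A production vector database replaces the linear `sorted` scan with approximate nearest-neighbour indexes, which is what makes retrieval fast at scale.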

Looking at the second player in our list for a moment, vector databases like Pinecone are purpose-built to provide optimized storage and querying capabilities for vector embeddings (i.e. numerical representations that capture relationships between values). They are designed to provide efficient analysis and retrieval of complex vector data for applications like Retrieval Augmented Generation (RAG - the infusion of external verified data into LLMs to increase AI accuracy), chatbots, AI agents and recommendation systems, as well as similarity search for text and images, fraud detection and more.
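The RAG loop described above can be sketched at its simplest: retrieve the stored passages most relevant to a question, then prepend them to the prompt sent to an LLM. All names below are illustrative, and word overlap stands in for vector similarity; a real system would embed the question and query a vector database instead.

```python
# Minimal RAG-shaped sketch. In production, retrieval would be an
# embedding-based similarity search against a vector database; here
# simple word overlap stands in for that step.
documents = [
    "The refund window is 30 days from purchase.",
    "Support is available Monday to Friday, 9am to 5pm.",
    "Shipping to Europe takes 5 to 7 business days.",
]

def retrieve(question: str, k: int = 1) -> list:
    """Rank stored passages by words shared with the question."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question: str) -> str:
    """Augment the prompt with retrieved context before calling an LLM."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("How long is the refund window?"))
```

The point of the pattern is visible even in this toy: the LLM answers from supplied, verifiable context rather than from its training data alone.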

Now that we know why AI needs LLMs and why LLMs need vector databases, how do we choose which vector database to use?

“Before an organization goes on a mission to use a vector database, it must make sure it’s fit for purpose and the use case,” cautioned Nijmeijer. “For structured data with straightforward applications, users should stick to traditional relational databases or NoSQL databases. A vector database becomes essential when handling unstructured data requiring complex algorithmic work, such as high-dimensional data, similarity searches, real-time AI applications, or when scaling ML operations.”

Keen to lay down some central parameters governing the use of vector databases today, the Hycu team present the following core selection criteria.

Seven vector selectors

In terms of performance, it is important to assess the database's response times, throughput and scalability, especially if an application involves extensive image recognition or language processing tasks. Looking at data ingestion, IT teams need to ensure compatibility with current data pipelines and formats. For example, integrating a vector database with a Customer Relationship Management (CRM) system to analyze customer interactions mandates optimization for efficient data ingestion.

Pinecone says that when evaluating vector databases, companies should consider factors such as scalability for handling large vector datasets, support for real-time updates, metadata storage and filtering, query performance at scale, system observability, ecosystem of integrations and even ‘developer popularity’... which might sound like a flaky metric but is really important.

“When it comes to query capabilities, extensive query functions (such as nearest neighbour search, range queries and similarity assessments) are crucial. A content recommendation system, for example, relies heavily on complex similarity searches for personalized user experiences,” noted Hycu’s Nijmeijer. “Then we come to indexing: organizations must choose a database that offers efficient storage and quick retrieval of vector data, especially for uses such as image recognition applications where quick indexing and retrieval of images based on visual similarity are vital.”
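The two query types Nijmeijer names can be contrasted in a few lines. In this toy sketch (Euclidean distance, hand-picked 2-D points, names our own), a nearest-neighbour search answers “give me the k closest items”, while a range query answers “give me everything within this radius”:

```python
import math

points = {"a": (0.0, 0.0), "b": (1.0, 1.0), "c": (4.0, 3.0)}

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def knn(query, k):
    """Nearest-neighbour search: the k closest points, nearest first."""
    return sorted(points, key=lambda name: dist(points[name], query))[:k]

def range_query(query, radius):
    """Range query: every point within `radius` of the query."""
    return [name for name, p in points.items() if dist(p, query) <= radius]

print(knn((0.2, 0.2), k=2))                  # ['a', 'b']
print(range_query((0.2, 0.2), radius=0.5))   # ['a']
```

A recommendation system typically wants the k-NN form (“the ten most similar items”), while fraud detection may prefer the range form (“everything suspiciously close to a known-bad pattern”).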

The buck obviously often stops with cost, so there is a need to consider the vector database’s licensing costs and the total cost of ownership. Smaller organizations are often advised to focus on cost optimization when developing internal chatbots that use vector technologies alongside LLMs.

Data protection & security

An analysis of vector database options would be incomplete without a mention of data protection: a business needs to plan for contingencies such as deletions, misconfigurations or cyber-attacks by selecting a vendor with robust recovery capabilities or support from third-party protection and recovery solutions. This is especially critical for retail or e-commerce, where downtime is not an option.

“Finally (although we could make this list longer quite easily if needed) we come to security,” said Nijmeijer. “For organizations with stringent security and compliance requirements, choosing a vector database that satisfies data sovereignty needs is imperative. Identify and align with your organization's data residency, access and retention policies. Healthcare-related applications, for example, must place security at the forefront of their vector database selection criteria.”

While we have covered the fundamentals here, it’s worth remembering that the vector marketplace is moving as fast as the LLM and AI space in general. As an additional caveat, we might suggest that organizations should opt for vector technology choices that not only fit their technical requirements but also align with their organizational needs and constraints.

Is vector-washing a thing?

It’s also worth considering that any company with a database may now be quite happy to style itself as a vector database company, so look at where vendors have come from and at what point they evolved. Vector-washing (like greenwashing, open source washing and the rest) is not a thing yet, but it could be, so stay alert and consider every vector of choice (apologies for the deliberately cheesy callback) in this equation.

“When traditional database companies see the explosive growth and demand for vector databases as part of AI workloads, it's hard to fault them for trying to get in on the action,” said Greg Kogan, VP of marketing & growth at Pinecone. “But history has shown time and again that the best applications are built on databases purpose-built for the data types and query patterns those applications rely on. With AI applications the dominant data types are vectors and the query patterns are vector search. The difference may be hard to spot with very small use cases, but the difference in performance, cost efficiency and reliability is very clear when working with anything past a sample application."

Kogan reminds us that software developers are racing to build differentiated and commercially viable AI applications; this means they need more than just access to LLMs and vector search to achieve that. The key (as the team at Pinecone says it has found) is to give AI applications on-demand access to the company’s data, i.e. the more data an application can search through semantically to find the right context, the better it performs in terms of answer quality.

The takeaways here appear to centre on the need to use the right tool for the job, a maxim which in this scenario translates to ‘use a database that is designed from the ground up to be fit for AI-centric purpose’ and to provision for scale (AI is big and getting bigger, just in case you hadn’t noticed) even if a business is still in experimentation mode with AI.