AI and Graph Databases: Enhancing Data Retrieval

Jayesh Minigi 05 Apr, 2024 • 15 min read

Introduction

In the field of modern data management, two innovative technologies have emerged as game-changers: AI language models and graph databases. AI language models, exemplified by products like OpenAI’s GPT series, have reshaped the landscape of natural language processing. These models possess an unparalleled ability to understand, generate, and analyze human language, making them indispensable tools for a wide array of applications, from chatbots to content generation.

Simultaneously, graph databases have emerged as a distinctive way of storing and querying data, prioritizing the complex relationships between data points over traditional tabular formats. Graph databases such as Neo4j and Amazon Neptune allow organizations to represent and navigate complex networks of interconnected data with remarkable flexibility and efficiency.

In an era where data is increasingly interconnected and multidimensional, the importance of effective data retrieval cannot be overstated. From e-commerce platforms seeking to provide personalized recommendations to healthcare systems analyzing patient data for insights, the ability to quickly and accurately retrieve related information is essential. Within this context, integrating AI language models with graph databases emerges as a compelling way to enhance data retrieval, using the natural language understanding of AI models to navigate the rich network of relationships encoded in graph databases.


Learning Objectives

  • Understand the role of AI language models in improving data retrieval processes in graph databases.
  • Learn the basic principles and operational characteristics of graph databases compared to traditional relational models.
  • Gain practical knowledge of integrating AI language models with graph databases, including setting up environments, importing datasets, and using query languages like Cypher for better data retrieval and analysis.
  • Learn the importance of Retrieval-Augmented Generation (RAG) systems in improving data analysis capabilities when integrated with graph databases.
  • Gain insight into the process of extracting and transforming data from unstructured sources using AI language models for input into graph databases.
  • Explore the advantages of graph databases over vector similarity searches in handling complex, multi-hop queries.

This article was published as a part of the Data Science Blogathon.

Understanding Graph Databases

Graph databases introduce an innovative approach to data management, departing from the limitations of traditional database models to embrace the rich complexity of interconnected data. Unlike their counterparts, which depend on fixed tabular structures or unstructured formats, graph databases use the principles of graph theory to organize data into nodes and edges. Nodes represent entities or objects, while edges define the relationships between them, forming a dynamic and interconnected network. This section explores the basic ideas and workings of graph databases, highlighting their unique architecture and operational principles. By contrasting this approach with that of traditional databases, we gain insight into the particular strengths and weaknesses of graph databases, paving the way for a deeper understanding of their role in modern data management and analysis.


Graph Databases vs. Traditional Models

While traditional databases, such as relational (SQL) databases, organize data into tables and need complex joins to access related information, graph databases take a different approach tailored to interconnected data. Traditional databases often face computational challenges and feel unnatural when navigating highly interconnected datasets, necessitating complex queries and compromising performance. In contrast, graph databases excel at representing relationships alongside the data itself, providing a natural and intuitive framework for managing interconnected datasets. This inherent capability makes graph databases particularly well suited to scenarios where relationships play a key role, allowing faster and more seamless data retrieval without the overhead of complex joins, as the short comparison below illustrates.
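As a rough illustration, consider a “friends of friends” lookup over a hypothetical social schema (the table, label, and relationship names are invented for this example and are not part of the dataset used later). The relational version needs the friendship table joined to itself, while the graph version is a single two-hop pattern.

# Hypothetical schema, for illustration only.

# Relational approach: a friend-of-friend lookup joins the friendships table to itself.
sql_friend_of_friend = """
SELECT DISTINCT f2.friend_id
FROM friendships AS f1
JOIN friendships AS f2 ON f1.friend_id = f2.person_id
WHERE f1.person_id = 42;
"""

# Graph approach: the same question is a single two-hop pattern match in Cypher.
cypher_friend_of_friend = """
MATCH (:Person {id: 42})-[:FRIEND_OF]->()-[:FRIEND_OF]->(fof:Person)
RETURN DISTINCT fof.id
"""

Each additional hop adds another join on the relational side, while the Cypher pattern grows by just one more relationship.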


Comparison with Other Databases

In the vast landscape of database technologies, graph databases stand out as a specialized tool with unique strengths and applications. Unlike classic relational databases, which organize data into tables and need complex joins for relationship management, graph databases embrace the inherent interconnectedness of data through nodes and edges. This fundamental difference lets graph databases excel in scenarios where relationships are as important as the data itself. While relational databases thrive in structured environments with predefined schemas, graph databases offer flexibility and scalability, making them well suited for dynamic and evolving datasets. By understanding the nuanced differences between graph databases and other models, such as document-oriented or key-value stores, stakeholders can make informed decisions when selecting the most suitable database solution for their specific use case.

Relational Databases (SQL)

Relational databases, often treated as synonymous with SQL databases, structure data into tables interconnected through relationships. These databases excel at managing well-defined, tabular data with high efficiency. However, their performance may suffer as data complexity and interconnectedness increase. This degradation arises from the need to execute multiple table joins and complex queries to retrieve related information. While relational databases provide robust solutions for structured data, their limitations become clear in scenarios requiring flexible data modeling and complex relationship management.

Document Databases (NoSQL)

Document databases, classified under the NoSQL umbrella, take a flexible approach to data storage, using document-like structures such as JSON. This design gives them scalability and versatility, particularly for managing unstructured data. However, document databases struggle to handle complex inter-document relationships easily. Unlike graph databases, which natively represent and traverse relationships, document databases often require additional processing to infer and manage these connections. While document databases offer valuable solutions for storing and retrieving semi-structured data, their limitations become evident when faced with highly interconnected datasets requiring fine-grained relationship management.

Graph Databases vs. SQL and NoSQL

| Aspect | Graph Databases | SQL and NoSQL Databases |
| --- | --- | --- |
| Connectivity focus | Natively designed to prioritize relationships; ideal for interconnected data. | Focus varies; relational databases center on structured data, while NoSQL focus depends on the model (document, key-value, etc.). |
| Efficient pathfinding | Provide efficient pathfinding and traversal capabilities. | Pathfinding may require complex queries or additional tooling. |
| Performance advantage | Outperform SQL and NoSQL alternatives on complex, interconnected datasets. | Performance varies with database design, indexing, and query complexity. |
| Consideration of overhead | Overhead may not be justified for simpler, less connected datasets. | Overhead may be lower for simpler datasets. |
| Data nature determines choice | Selection depends heavily on the nature of the data and specific requirements. | Choice also depends on the data’s nature but may not prioritize relationships and interconnectedness. |
| Strengths | Handling complex networks and relationships. | Handling structured or semi-structured data efficiently. |
| Practical consideration | Evaluate against the complexity of the data landscape. | Evaluate against data structure, query patterns, scalability, and consistency requirements. |

Implementation Example (Neo4j)

Step 1: Neo4j Environment Setup

To follow the examples in this blog post, it’s recommended to set up a Neo4j 5.11 or later instance. The simplest way is to create a free instance on Neo4j Aura, which provides cloud-hosted Neo4j databases. Alternatively, you can set up a local instance of the Neo4j database by downloading the Neo4j Desktop application and configuring a local database instance.

from langchain.graphs import Neo4jGraph

# Connection details for a Neo4j Aura (or local) instance; replace with your own credentials.
url = "neo4j+s://databases.neo4j.io"
username = "neo4j"
password = ""

graph = Neo4jGraph(
    url=url,
    username=username,
    password=password
)

Step 2: Working with the Dataset

Knowledge graphs excel at integrating information from multiple data sources. When building a DevOps RAG (Retrieval-Augmented Generation) application, you can pull data from different sources, including cloud services, task management tools, and more.


As the microservice and task information used in this example isn’t publicly available, a synthetic dataset was generated. Using ChatGPT, a small dataset comprising 100 nodes was created specifically for this purpose.

The following code snippet imports the sample graph into Neo4j.

import requests

url = "https://gist.githubusercontent.com/tomasonjo/08dc8ba0e19d592c4c3cde40dd6abcc3/raw/da8882249af3e819a80debf3160ebbb3513ee962/microservices.json"

# The gist contains a Cypher statement under the 'query' key; running it loads the sample graph.
import_query = requests.get(url).json()['query']
graph.query(
    import_query
)

If you check the graph in the Neo4j Browser, you should get a similar visualization.

Graph visualization of the imported microservices dataset

Blue nodes in our graph represent microservices, which may depend on one another. These dependencies indicate that the functionality or outcome of a particular microservice may rely on the operation of another. Brown nodes, on the other hand, represent tasks linked to these microservices. In addition to showing the microservice structure and associated tasks, the graph also records the teams responsible for each part.

Step 3: Calculate Neo4j Vector Index

The tasks are already in our knowledge graph. However, we must calculate the embedding values and create the vector index. This can be achieved with the from_existing_graph method.

import os
from langchain.vectorstores.neo4j_vector import Neo4jVector
from langchain.embeddings.openai import OpenAIEmbeddings

# Replace with your actual OpenAI API key
os.environ['OPENAI_API_KEY'] = "OPENAI_API_KEY"

vector_index = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(),
    url=url,
    username=username,
    password=password,
    index_name='tasks',
    node_label="Task",
    text_node_properties=['name', 'description', 'status'],
    embedding_node_property='embedding',
)

In this example, we used the following graph-specific parameters for the from_existing_graph method.

  • index_name: Name of the vector index.
  • node_label: Node label of the relevant nodes.
  • text_node_properties: Properties used for calculating embeddings and retrieved from the vector index.
  • embedding_node_property: Property in which the embedding values are stored.

Now that the vector index has been created, we can use it as any other vector index in LangChain.

response = vector_index.similarity_search(
    "How will RecommendationService be updated?"
)
print(response[0].page_content)
# name: BugFix
# description: Add a new feature to RecommendationService to provide ...
# status: In Progress

Notice that the response is a map- or dictionary-like string containing the properties specified in the text_node_properties parameter.

Now we can easily build a chatbot response by wrapping the vector index in a RetrievalQA module.

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

vector_qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=vector_index.as_retriever()
)
vector_qa.run(
    "How will recommendation service be updated?"
)

One general disadvantage of vector indexes is their inability to aggregate information the way a structured query language like Cypher can. Consider the following example:

vector_qa.run(
    "How many open tickets are there?"
)
# There are 4 open tickets.

The response appears valid, and the language model presents it as correct. However, the problem is that the answer is directly tied to the number of documents retrieved from the vector index, which defaults to four. The vector index retrieves four open tickets, leading the language model to assume that these are all of the open tickets. In reality the situation is different, and we can verify this with a Cypher statement.

graph.query(
    "MATCH (t:Task {status:'Open'}) RETURN count(*)"
)
# [{'count(*)': 5}]
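For completeness, the retrieval cap itself is configurable. The sketch below assumes LangChain’s standard as_retriever options and simply widens the number of documents passed to the model.

# A partial mitigation, not a fix: retrieve more documents so the model sees a larger sample.
# The default of four is a retriever setting, not a property of the underlying data.
broader_qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type="stuff",
    retriever=vector_index.as_retriever(search_kwargs={"k": 20}),
)
broader_qa.run("How many open tickets are there?")

Even with a larger k, the model is still counting whatever happened to be retrieved rather than querying the data, so aggregation remains unreliable.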

While vector similarity search excels at surfacing relevant information from unstructured text, it lacks the ability to analyze and aggregate structured information. In our toy graph there are five open tasks, not four. To address this limitation, Neo4j offers Cypher, a structured query language designed specifically for graph databases. Using Cypher, we can analyze and aggregate structured information within the graph database, providing a complete view of the data that vector similarity search alone cannot achieve.

Cypher, a query language built for graph databases, offers a visual approach to matching patterns and relationships within the data. It uses an ASCII-art-style syntax, allowing users to express complex queries in a clear and straightforward manner.

Example Cypher Query:

(:Person {name:"Tomaz"})-[:LIVES_IN]->(:Country {name:"Slovenia"})

This pattern describes a node with the label Person and the name property Tomaz that has a LIVES_IN relationship to the Country node of Slovenia.
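Such patterns can be run directly through the graph object created earlier. The Person and Country nodes below are purely illustrative and do not exist in the microservices dataset; the snippet only shows the shape of a complete MATCH ... RETURN statement.

# Illustrative only: these labels are not part of the imported DevOps graph.
graph.query(
    """
    MATCH (p:Person {name: 'Tomaz'})-[:LIVES_IN]->(c:Country)
    RETURN c.name AS country
    """
)
# Would return something like [{'country': 'Slovenia'}] if such nodes existed.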

Automated Cypher Generation with GraphCypherQAChain

One advantage of LangChain is its GraphCypherQAChain module, which automates the generation of Cypher queries. This means you don’t need to learn Cypher syntax to get information from a graph database such as Neo4j.

Refreshing Schema and Creating Cypher Chain

The code snippet below shows how to refresh the graph schema and create the Cypher chain.

from langchain.chains import GraphCypherQAChain

graph.refresh_schema()

cypher_chain = GraphCypherQAChain.from_llm(
    cypher_llm = ChatOpenAI(temperature=0, model_name='gpt-4'),
    qa_llm = ChatOpenAI(temperature=0), graph=graph, verbose=True,
)

Generating Accurate Cypher Statements

Generating valid Cypher statements can be a challenging task, which is why it’s recommended to use state-of-the-art Language Models (LLMs) like GPT-4 for this purpose. Meanwhile, for generating answers using the database context, you can rely on LLMs such as GPT-3.5-turbo. This approach ensures that the Cypher statements are accurate and syntactically correct, while also using the contextual understanding of the database for generating exact responses.

Query Examples

Now, you can ask the same question about how many tickets are open.

cypher_chain.run(
    "How many open tickets are there?"
)

Output: 

Query Examples output

You can also instruct the chain to aggregate the data using different grouping keys, as shown in the following example.

cypher_chain.run(
    "Which team has the most open tasks?"
)

Output:

Query Examples output

While these aggregations aren’t especially graph-centric operations, we can also perform more graph-oriented tasks, such as traversing the dependency graph of microservices.

cypher_chain.run(
    "Which services depend on Database directly?"
)

Output:

Query Examples output

Of course, you can also instruct the chain to generate variable-length path traversals by asking questions such as:

cypher_chain.run(
    "Which services depend on Database indirectly?"
)

Output: 

Query Examples output

Some of the services appear in the answers to both the direct-dependency question and the variable-length traversal. This overlap is due to the shape of the dependency graph, not to any problem with the validity of the generated Cypher statement.
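For reference, the Cypher that the chain generates for the indirect-dependency question looks roughly like the sketch below. The Microservice label and DEPENDS_ON relationship type are assumptions about the sample dataset; inspect graph.schema after refresh_schema() for the actual names.

# Variable-length traversal: *1.. follows one or more DEPENDS_ON hops.
graph.query(
    """
    MATCH (m:Microservice)-[:DEPENDS_ON*1..]->(db:Microservice {name: 'Database'})
    RETURN DISTINCT m.name AS service
    """
)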

Enhancing Data Retrieval with RAG Systems

Introduction to RAG

Retrieval-Augmented Generation (RAG) systems blend retrieval-based and generative AI models, changing how information is retrieved and presented. These systems take advantage of the strengths of both approaches to increase the accuracy and relevance of the returned information. In short, a RAG system first employs a retrieval component to pull relevant data or documents from large databases. This retrieved information then serves as grounding context for the generative component, which synthesizes and presents it in a clear, coherent manner.


Significance of RAG Systems in Data Analysis

Integrating RAG systems brings a powerful boost to data analysis capabilities. With them, the scope and depth of analysis expand significantly. When addressing complex queries in particular, RAG systems deliver nuanced and complete responses by drawing on a wider range of information sources. This combination of retrieval and generation offers a more dynamic and flexible approach to data analysis, especially in scenarios requiring insights from multiple datasets or involving abstract concepts. One practical way to combine the two retrieval styles from the earlier example is sketched below.
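Here the vector chain and the Cypher chain built earlier are exposed to a single agent that picks the right tool for each question. This is a minimal sketch assuming a LangChain version that still ships initialize_agent and the OpenAI-functions agent type; the tool names and descriptions are illustrative.

from langchain.agents import initialize_agent, Tool, AgentType
from langchain.chat_models import ChatOpenAI

tools = [
    Tool(
        name="Tasks",
        func=vector_qa.run,
        description="Useful for questions about task descriptions and their content.",
    ),
    Tool(
        name="Graph",
        func=cypher_chain.run,
        description="Useful for counting, aggregating, and traversing microservice dependencies.",
    ),
]

# The agent routes each question to the unstructured (vector) or structured (Cypher) tool.
rag_agent = initialize_agent(
    tools,
    ChatOpenAI(temperature=0, model_name="gpt-4"),
    agent=AgentType.OPENAI_FUNCTIONS,
    verbose=True,
)

rag_agent.run("Which team has the most open tasks, and what are those tasks about?")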

Maximizing Synergy between AI and Graph Databases

Synergistic Integration of AI-Language Models with Graph Databases

Combining AI language models with graph databases yields a balanced pairing of technologies, with each amplifying the strengths of the other. AI language models, known for their ability to understand and generate human-like text, can greatly improve how graph databases are queried. These databases, structured to trace relationships and connections among data points, often pose challenges when queried with traditional search methods. AI language models, equipped with advanced natural language processing capabilities, excel at interpreting complex questions and translating them into graph-database-friendly queries.


Facilitating Natural Interaction with Graph Databases

Moreover, this pairing enables more natural interaction with graph databases. Users can express queries in natural language, which the AI model interprets and transforms into a format the graph database understands. This simplified interaction significantly lowers the barrier to entry for users who may lack familiarity with the technical query languages commonly associated with graph databases.

Dynamic Data Updating and Maintenance

Similarly, AI language models can play a key role in dynamically updating and maintaining graph databases. As these models process new information, they can identify potential new nodes and relationships and suggest corresponding updates to the database. This iterative process keeps the graph database up to date and reflective of the latest data trends and patterns.
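Applying a suggested update is straightforward in Cypher. The service names here are invented, and the snippet assumes that the graph.query helper accepts a parameters dictionary, as current LangChain versions do; MERGE creates a node or relationship only if it does not already exist, so repeated suggestions stay idempotent.

# Hypothetical update suggested by a language model: PaymentService now depends on AuthService.
graph.query(
    """
    MERGE (s:Microservice {name: $service})
    MERGE (d:Microservice {name: $dependency})
    MERGE (s)-[:DEPENDS_ON]->(d)
    """,
    params={"service": "PaymentService", "dependency": "AuthService"},
)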

Extracting Data from Unstructured Sources

Unlocking value from unstructured data sources like PDFs and markdown files is a crucial aspect of modern data management. This process, made possible by AI language models, allows for efficient extraction of entities and relationships. By transforming this data into inputs for graph databases, organizations can greatly enhance database integrity and navigability. This collaboration between AI and graph databases marks a significant advancement in data analysis, offering users more powerful and user-friendly tools for complex queries and insights.

Unlocking Value from Unstructured Data

A major challenge in data management and analysis lies in extracting meaningful insight from unstructured data sources such as PDFs, markdown files, and other non-standardized formats. Here, AI language models serve as key enablers, with the ability to process and interpret these sources effectively. Using advanced natural language processing techniques, these models identify entities, relationships, and key information embedded within unstructured data.

Transforming Unstructured Data into Graph Database Inputs

This capability changes how unstructured data is used. Instead of remaining unwieldy and often ignored, unstructured data takes on new importance as a valuable input for graph databases. AI models extract entities and their relationships from unstructured text, converting them into nodes and edges ready to be merged directly into a graph database, as the sketch below illustrates. This process not only expands the scope of data available within the database but also deepens its connectivity.
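The sketch below shows one minimal way to do that conversion. The prompt, the pipe-separated output format, and the generic Entity/RELATES_TO schema are assumptions chosen for brevity (and the same graph.query parameter-passing assumption as above applies); a production pipeline would use a more robust extraction prompt or a dedicated extraction chain.

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0)

text = "The PaymentService was migrated by the Platform team and now depends on AuthService."

# Ask the model for head|RELATION|tail triples, one per line (the format is our own convention).
prompt = (
    "Extract knowledge-graph triples from the text below, one per line, "
    "formatted as head|RELATION|tail.\n\n" + text
)
triples = [line.split("|") for line in llm.predict(prompt).splitlines() if line.count("|") == 2]

for head, rel, tail in triples:
    # MERGE keeps the import idempotent; Entity and RELATES_TO are placeholder names.
    graph.query(
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        "MERGE (h)-[:RELATES_TO {type: $rel}]->(t)",
        params={"head": head.strip(), "tail": tail.strip(), "rel": rel.strip()},
    )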

Improving Database Integrity and Navigability

Moreover, the extraction facilitated by AI models includes sorting and tagging the collected information, which is key to maintaining the integrity and navigability of the graph database. The database thus evolves into a potent tool for complex data analysis, revealing insights previously obscured by the data’s unstructured nature.

In summary, combining AI language models with graph databases marks a paradigm shift in data retrieval and analysis. RAG systems bridge the gap between retrieval and generation, offering precise responses to complex queries and enhancing the accessibility and functionality of graph databases. This cooperation equips users with more user-friendly and powerful analytical tools. Finally, the ability of AI models to extract and categorize data from unstructured sources transforms data utilization, increasing the graph database’s value as a comprehensive data analysis tool.

Benefits of Using Graph Databases in RAG Applications

Graph databases offer significant advantages in RAG (Retrieval-Augmented Generation) applications. They facilitate efficient data storage and retrieval, handle intricate relationships, and boost performance for tasks like question answering. Some of these advantages are discussed below.

Advantages Over Vector Similarity Searches

Vector similarity searches are a cornerstone of data retrieval, offering a reliable means of finding relevant information in vast datasets. Yet these searches often hit constraints, especially with intricate queries where the relationships between data points are crucial.

In contrast, graph databases offer a more nuanced approach, using their inherent structure to deliver richer capabilities. In a graph database, data exists as interconnected nodes (entities) and edges (relationships), providing a holistic view of the data. This structural benefit is crucial in scenarios where grasping the connections between entities is as vital as understanding the entities themselves.


One important disadvantage of vector similarity searches lies in their inefficiency at handling queries involving multiple, interconnected entities. In recommendation systems, for instance, users want items similar to their own choices as well as items favored by similar users. Vector similarity searches usually fall short on such complex queries, focusing mainly on surface-level similarities.

Graph databases, on the other hand, excel in this domain. They navigate relationships between nodes with ease, allowing the discovery of intricate networks of connections. This capability extends beyond direct associations to entire webs of relationships, enabling comprehensive and contextually aware information retrieval.

Multi-hop Searches and Complex Queries

Multi-hop searches are another area where graph databases clearly outperform classic vector-based systems. Multi-hop searches are queries that require multiple steps to reach a conclusion or locate a piece of information. In a graph database this corresponds to traversing multiple nodes and edges. For instance, linking two seemingly unrelated pieces of information could require hopping through a chain of connected nodes in the graph.

Graph databases are naturally designed for this type of query. Backed by them, RAG systems can explore connections over multiple hops, enabling answers to complex questions. This is crucial in fields like research and journalism, where revealing links between diverse pieces of information matters.

In addition to multi-hop capabilities, graph databases are well suited to complex queries that involve aggregating information from multiple documents. Unlike vector similarity searches, which typically consider documents in isolation, graph databases can take into account how different data points connect. This is important for applications like knowledge graphs and semantic search engines, where understanding the relationships between different pieces of information is key.


For example, in a medical research setting, a query might involve finding connections between symptoms, drugs, and diseases. A graph database can traverse the interconnected entities, providing insights that are not readily apparent through simple keyword searches or vector similarity checks. A purely illustrative Cypher pattern for such a question is shown below.
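The labels and relationship types in the following pattern (Symptom, Disease, Drug, HAS_SYMPTOM, TREATS) are invented for the example; the point is only the shape of the traversal.

# Hypothetical medical schema: drugs that treat any disease presenting a given symptom.
graph.query(
    """
    MATCH (s:Symptom {name: 'Fatigue'})<-[:HAS_SYMPTOM]-(d:Disease)<-[:TREATS]-(drug:Drug)
    RETURN drug.name AS drug, collect(DISTINCT d.name) AS diseases
    """
)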

Graph databases also handle dynamically changing data well. In real-time applications, such as social media analysis or fraud detection, data relationships can change quickly. Graph databases are adept at updating and managing these evolving connections, providing up-to-date and relevant results for complex queries.

Conclusion

Combining AI language models with graph databases represents important progress in data retrieval and analysis. By pairing the natural language understanding capabilities of AI models, exemplified by OpenAI’s GPT series, with the dynamic and interconnected structure of graph databases, organizations can improve their ability to uncover insights from complex datasets.

Graph databases offer a refined approach to data management that prioritizes relationships between data points, while AI language models enable more natural interaction and query processing. Together, they allow more accurate and efficient data retrieval, particularly for systems involving multi-faceted queries and unstructured data sources. This combination not only improves the accessibility and functionality of data analysis tools but also unlocks further insights from interconnected data.

Key Takeaways

  • Integration of AI language models with graph databases improves data retrieval by combining natural language understanding with complex relationship mapping.
  • Graph databases offer a natural approach to managing interconnected data compared to traditional models like SQL, improving performance for complex queries.
  • Cypher, a structured query language for graph databases, simplifies data retrieval and analysis. It allows users to express complex queries in a clear and straightforward manner.
  • Retrieval-Augmented Generation (RAG) systems combine retrieval-based and generative AI models, giving more accurate responses to complex queries by drawing on a broader range of information sources.
  • Graph databases outperform vector similarity searches at multi-hop queries and dynamically changing data, making them well suited to applications requiring depth and context in data retrieval.

Frequently Asked Questions

Q1. What are the main advantages of using graph databases over traditional relational databases?

A. Graph databases excel at managing interconnected datasets by representing relationships alongside data. This natural framework is ideal for handling complex networks of data. Unlike traditional relational databases, which rely on fixed tabular structures and complex joins, graph databases offer flexibility and scalability.

Q2. How do AI-language models improve data retrieval processes when combined with graph databases?

A. AI-language models, such as OpenAI’s GPT series, improve data retrieval processes by using their natural language understanding capabilities. These models enable more natural interaction and query processing, allowing users to express queries in natural language. This simplifies the querying process and improves the accuracy and efficiency of data retrieval from graph databases.

Q3. What role do Retrieval-Augmented Generation (RAG) systems play in data analysis, particularly when combined with graph databases?

A. RAG systems enhance graph database functionality by offering precise responses to complex queries. By merging retrieval and generation capabilities from diverse sources, they enhance data analysis, beneficial for scenarios needing insights from multiple datasets.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

