Integrating Unstructured and Graph Knowledge with Neo4j and LangChain for Enhanced Question Answering

Neo4j is a leading native graph database that stores data as nodes and relationships, making it well suited to managing interconnected data. It supports ACID transactions, uses the Cypher query language, and scales to very large graphs, which makes it a strong fit for applications such as social networks, fraud detection, and route optimization.

What is Neo4j Database and How It Works

What is Neo4j?

Neo4j is a native graph database designed to store and manage data as a graph, with nodes representing entities and relationships representing the connections between them. It provides a flexible and efficient way to handle highly interconnected data, making it ideal for applications requiring rich data relationships.

Key Features

  • Native Graph Storage: Stores data as a graph down to the storage level, avoiding the need for graph abstraction layers.

  • Property Graph Model: Uses nodes, relationships, and properties to organize data.

  • ACID Transactions: Ensures reliable and consistent data processing.

  • Cypher Query Language: A powerful and easy-to-use language optimized for graph operations.

  • High Scalability: Supports billions of nodes and relationships with efficient data traversal.

  • Flexible Schema: Allows dynamic changes to the data model without performance loss.

  • Comprehensive Language Support: Offers drivers for multiple programming languages like Java, JavaScript, .NET, and Python.

  • Enterprise Features: Includes clustering, backup, and failover support in the Enterprise Edition.

How It Works

  1. Nodes: Represent entities (e.g., Person, Product) and can have labels and properties.

  2. Relationships: Connect nodes, have a direction (e.g., Person LOVES Person), and can also have properties.

  3. Properties: Key-value pairs that store information about nodes and relationships.

  4. Cypher Query Language: Allows users to query the graph by specifying patterns to find nodes and relationships, making it easy to perform complex queries.

  5. Constant Time Traversals: Efficiently navigates large graphs, enabling quick access to connected data.

  6. Deployment Options: Available as a managed cloud service (AuraDB) or self-hosted (Community and Enterprise Editions).
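The property graph model described above can be sketched with plain Python structures. This is only a toy illustration of nodes, directed relationships, and key-value properties, not Neo4j's actual storage engine; the `Node` and `Relationship` classes here are invented for the example:

```python
# Toy property graph: an illustration of nodes, directed relationships,
# and key-value properties -- not how Neo4j stores data internally.

class Node:
    def __init__(self, label, **properties):
        self.label = label
        self.properties = properties

class Relationship:
    def __init__(self, start, rel_type, end, **properties):
        self.start = start      # relationships are directed: start -> end
        self.type = rel_type
        self.end = end
        self.properties = properties

alice = Node("Person", name="Alice")
bob = Node("Person", name="Bob")
rels = [Relationship(alice, "LOVES", bob, since=2020)]

# A Cypher pattern like  MATCH (a:Person)-[:LOVES]->(b:Person) RETURN b.name
# reduces, in this toy model, to filtering relationships by type and start node:
loved = [r.end.properties["name"] for r in rels
         if r.type == "LOVES" and r.start is alice]
print(loved)  # -> ['Bob']
```

In Neo4j itself, index-free adjacency means each node holds direct references to its relationships, so a traversal step costs the same regardless of graph size, rather than requiring the linear scan shown here.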

Use Cases

  • Social Networks: Mapping and analyzing user interactions.

  • Payment Networks: Fraud detection and transaction analysis.

  • Road Networks: Route optimization and traffic management.

  • Enterprise Applications: Enhancing business insights by uncovering hidden connections in data.

Neo4j's ability to efficiently handle complex and dynamic relationships makes it a powerful tool for modern data-driven applications.

# Installing Dependencies

!pip install -qU transformers datasets langchain openai wikipedia tiktoken neo4j python-dotenv

# Importing Packages

import os

from langchain.vectorstores.neo4j_vector import Neo4jVector
from langchain.document_loaders import WikipediaLoader, PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.graphs import Neo4jGraph
from langchain.chains import GraphCypherQAChain
from langchain.chat_models import ChatOpenAI
from transformers import AutoTokenizer
from dotenv import load_dotenv

# Load environment variables

load_dotenv()

os.environ["NEO4J_URI"] = 'bolt://localhost:7687'

os.environ["NEO4J_USERNAME"] = 'neo4j'

os.environ["NEO4J_PASSWORD"] = 'docdb@123'

# Define the tokenizer using "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Function to calculate the number of tokens in a text

def bert_len(text):
    tokens = tokenizer.encode(text)
    return len(tokens)

# Example usage

input_text = "This is a sample sentence for tokenization."

num_tokens = bert_len(input_text)

print(f"Number of tokens: {num_tokens}")

# Load and preprocess data

loader = PyPDFLoader("./docs/YouCanHaveAnAmazingMemoryLearn.pdf")

pages = loader.load_and_split()

# Define a text splitter with specific parameters

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=bert_len,
    separators=['\n\n', '\n', ' ', ''],
)

# Split the content of the fifth loaded PDF page into smaller documents

documents = text_splitter.create_documents([pages[4].page_content])
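The splitter above tries each separator in turn (paragraph breaks, then line breaks, then spaces) and lets neighbouring chunks overlap so context is not cut mid-thought. A simplified, stdlib-only sketch of the sliding-window idea (ignoring the recursive separator fallback) might look like this:

```python
def split_with_overlap(text, chunk_size=40, chunk_overlap=10):
    """Greedy fixed-size splitter with overlap -- a simplified stand-in for
    LangChain's RecursiveCharacterTextSplitter, which additionally prefers
    to break on separators like '\n\n' and '\n' before falling back to
    raw character positions."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - chunk_overlap  # step back so adjacent chunks share context
    return chunks

sample = "a" * 100
chunks = split_with_overlap(sample, chunk_size=40, chunk_overlap=10)
print(len(chunks))     # stride of 30 over 100 chars -> 3 chunks
print(len(chunks[0]))  # 40
```

The real splitter measures chunk length with the `length_function` passed in (here, BERT token count), not raw characters.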

# Instantiate Neo4j vector from documents

neo4j_vector = Neo4jVector.from_documents(
    documents,
    OpenAIEmbeddings(),
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
)

# Define the query.

query = "What is the introduction of the book?"

# Execute the query, get top 2 results.

vector_results = neo4j_vector.similarity_search(query, k=2)

# Print search results with separation.

for i, res in enumerate(vector_results):
    print(res.page_content)
    if i != len(vector_results) - 1:
        print()

# Store the content of the most similar result.

vector_result = vector_results[0].page_content
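Under the hood, `similarity_search` embeds the query with the same embedding model and ranks the stored chunks by vector similarity (typically cosine similarity). A toy stdlib version of that ranking, with made-up 3-dimensional "embeddings" standing in for OpenAI's output:

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Made-up embeddings standing in for OpenAIEmbeddings output.
chunk_vectors = {
    "intro chapter": [0.9, 0.1, 0.0],
    "memory techniques": [0.2, 0.8, 0.1],
}
query_vector = [0.85, 0.15, 0.05]

ranked = sorted(chunk_vectors,
                key=lambda k: cosine(query_vector, chunk_vectors[k]),
                reverse=True)
print(ranked[0])  # the chunk whose vector points closest to the query's
```

Neo4j's vector index performs the same kind of ranking at scale, so `k=2` above returns the two chunks whose embeddings lie closest to the query embedding.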

# Create a Neo4jGraph object by connecting to a Neo4j database.

graph = Neo4jGraph(
    url="bolt://localhost:7687", username="neo4j", password="docdb@123"
)

# Print the schema of the Neo4j graph.

print(graph.schema)

# Create a question-answering chain over the Neo4j graph using an OpenAI chat model
# (ChatOpenAI defaults to gpt-3.5-turbo), with verbose mode enabled.

chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0.9), graph=graph, verbose=True
)

# Use the question-answering chain to query the Neo4j graph.

graph_result = chain.run("What is the book about?")

print(graph_result)
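Conceptually, GraphCypherQAChain runs in two LLM steps: it prompts the model with the graph schema to turn the question into Cypher, executes that Cypher against the graph, then prompts the model again to phrase the returned rows as an answer. A stub sketch of that control flow, with a hard-coded "LLM" and an in-memory graph standing in for ChatOpenAI and Neo4j purely for illustration:

```python
# Stub of GraphCypherQAChain's control flow; the fake LLM and fake graph
# below replace ChatOpenAI and Neo4j purely for illustration.

def fake_llm_to_cypher(question, schema):
    # A real chain prompts the LLM with the schema; here one query is hard-coded.
    return "MATCH (c:Chunk) RETURN c.text"

def fake_graph_query(cypher):
    # Stand-in for graph.query(); returns rows the second LLM call would summarize.
    store = {"MATCH (c:Chunk) RETURN c.text": [{"c.text": "A book about memory techniques."}]}
    return store.get(cypher, [])

def fake_llm_to_answer(question, rows):
    return f"Based on {len(rows)} result(s): {rows[0]['c.text']}" if rows else "I don't know."

def qa(question, schema=""):
    cypher = fake_llm_to_cypher(question, schema)   # step 1: question -> Cypher
    rows = fake_graph_query(cypher)                 # step 2: run Cypher on the graph
    return fake_llm_to_answer(question, rows)       # step 3: rows -> natural-language answer

print(qa("What is the book about?"))
```

This is why the chain needs `graph.schema`: without accurate labels and relationship types in the prompt, the generated Cypher would not match the actual graph.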