
source code: Add Multimodal RAG with Elasticsearch Gotham City tutorial #390


Merged
merged 25 commits into from Feb 28, 2025
Changes from 6 commits

Commits (25)
a843f13
Add Multimodal RAG with Elasticsearch Gotham City tutorial
salgado Feb 8, 2025
e47c5e7
Add Multimodal RAG with Elasticsearch Gotham City tutorial
salgado Feb 8, 2025
1557fb2
docs: add OpenAI API key setup instructions
salgado Feb 10, 2025
39674b2
docs: exclude licence
salgado Feb 10, 2025
d2b1b19
fix: fixed comments
salgado Feb 10, 2025
47d6240
docs: added env template
salgado Feb 10, 2025
3748cef
issues fixed 1st review
salgado Feb 14, 2025
fc0f06a
foo
codefromthecrypt Feb 25, 2025
4182f10
foo
codefromthecrypt Feb 25, 2025
76475fa
polish-and-docker
codefromthecrypt Feb 26, 2025
3244b2a
env-example
codefromthecrypt Feb 26, 2025
1217ef6
env-example
codefromthecrypt Feb 26, 2025
55904f1
fix glitch
codefromthecrypt Feb 26, 2025
e24ca5b
remove spurios log
codefromthecrypt Feb 26, 2025
fc2b80d
Add Jupyter notebook implementation of Multimodal RAG
salgado Feb 27, 2025
e381c57
Add Jupyter notebook implementation of Multimodal RAG
salgado Feb 27, 2025
312baa4
Add Jupyter notebook implementation of Multimodal RAG
salgado Feb 27, 2025
c90ba3d
Update documentation with simpler README and Docker setup guide
salgado Feb 27, 2025
36cd475
adding changes from review
JessicaGarson Feb 27, 2025
6866d48
Update 01-mmrag-blog-quick-start.ipynb
JessicaGarson Feb 27, 2025
d34ab33
remove wrong folder
salgado Feb 27, 2025
112c8fa
Remove Docker configuration files
salgado Feb 27, 2025
d7f2472
remove coker references
salgado Feb 27, 2025
a3abfb2
fixing first line notebook to test branch
salgado Feb 28, 2025
be1a03b
fixing first line notebook to main repo
salgado Feb 28, 2025
@@ -0,0 +1,13 @@
# Elasticsearch Configuration
ELASTIC_API_KEY=your_api_key_here
ELASTICSEARCH_ENDPOINT=your_elastic_endpoint

# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here

# Model Configuration
MODEL_PATH=~/.cache/torch/checkpoints/imagebind_huge.pth

# Optional Configuration
#LOG_LEVEL=INFO
#DEBUG=False
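
The scripts read these values at runtime with python-dotenv (already listed in `requirements.txt`). A minimal sketch of consuming the template above, assuming `.env` sits in the working directory:

```python
# Minimal sketch: load the template values above with python-dotenv
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

elastic_endpoint = os.getenv("ELASTICSEARCH_ENDPOINT")
elastic_api_key = os.getenv("ELASTIC_API_KEY")
openai_api_key = os.getenv("OPENAI_API_KEY")

# MODEL_PATH contains "~", so expand it before handing it to torch
model_path = os.path.expanduser(
    os.getenv("MODEL_PATH", "~/.cache/torch/checkpoints/imagebind_huge.pth")
)
```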
@@ -0,0 +1,93 @@
# Building a Multimodal RAG Pipeline with Elasticsearch: The Story of Gotham City

This repository contains the code for implementing a Multimodal Retrieval-Augmented Generation (RAG) system using Elasticsearch. The system processes and analyzes different types of evidence (images, audio, text, and depth maps) to solve a crime in Gotham City.

## Overview

The pipeline demonstrates how to:
- Generate unified embeddings for multiple modalities using ImageBind
- Store and search vectors efficiently in Elasticsearch
- Analyze evidence using GPT-4 to generate forensic reports

## Prerequisites

- Python 3.10+
- Elasticsearch cluster (cloud or local)
- OpenAI API key - Set up an OpenAI account and create a [secret key](https://platform.openai.com/docs/quickstart)
- 8GB+ RAM
- GPU (optional but recommended)

## Quick Start

1. **Setup Environment**
```bash
# Create and activate virtual environment
python -m venv env_mmrag
source env_mmrag/bin/activate  # Unix/macOS
# or
.\env_mmrag\Scripts\activate # Windows

# Install dependencies
pip install -r requirements.txt
```

2. **Configure Credentials**
Create a `.env` file:
```env
ELASTICSEARCH_ENDPOINT="your-elasticsearch-endpoint"
ELASTIC_API_KEY="your-elastic-api-key"
OPENAI_API_KEY="your-openai-api-key"
```

3. **Run the Demo**
```bash
# Verify file structure
python stages/01-stage/files_check.py

# Generate embeddings
python stages/02-stage/test_embedding_generation.py

# Index content
python stages/03-stage/index_all_modalities.py

# Search and analyze
python stages/04-stage/rag_crime_analyze.py
```

## Project Structure

```
├── README.md
├── requirements.txt
├── src/
│   ├── embedding_generator.py   # ImageBind wrapper
│   ├── elastic_manager.py       # Elasticsearch operations
│   └── llm_analyzer.py          # GPT-4 integration
├── stages/
│   ├── 01-stage/                # File organization
│   ├── 02-stage/                # Embedding generation
│   ├── 03-stage/                # Elasticsearch indexing/search
│   └── 04-stage/                # Evidence analysis
└── data/                        # Sample data
    ├── images/
    ├── audios/
    ├── texts/
    └── depths/
```

## Sample Data

The repository includes sample evidence files:
- Images: Crime scene photos and security camera footage
- Audio: Suspicious sound recordings
- Text: Mysterious notes and riddles
- Depth Maps: 3D scene captures

## How It Works

1. **Evidence Collection**: Files are organized by modality in the `data/` directory
2. **Embedding Generation**: ImageBind converts each piece of evidence into a 1024-dimensional vector
3. **Vector Storage**: Elasticsearch stores embeddings with metadata for efficient retrieval
4. **Similarity Search**: New evidence is compared against the database using k-NN search
5. **Analysis**: GPT-4 analyzes the connections between evidence to identify suspects
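
Putting the steps together, here is a minimal, hypothetical end-to-end sketch using the classes from `src/` (the image path is illustrative; it assumes you run from the repo root with `.env` configured):

```python
# Hypothetical end-to-end run of the pipeline described above
from src.embedding_generator import EmbeddingGenerator
from src.elastic_manager import ElasticsearchManager

generator = EmbeddingGenerator(device="cpu")
es_manager = ElasticsearchManager()

# Steps 1-3: embed one piece of evidence and index it (path is illustrative)
image_emb = generator.generate_embedding(["data/images/crime_scene1.jpg"], "vision")
es_manager.index_content(image_emb, modality="vision",
                         description="Crime scene photo",
                         content_path="data/images/crime_scene1.jpg")

# Step 4: embed a textual clue and run a k-NN search across all modalities
query_emb = generator.generate_embedding(["Why so serious?"], "text")
for hit in es_manager.search_similar(query_emb, k=5):
    print(hit["modality"], round(hit["score"], 3), hit.get("description"))
```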

Binary files not shown (sample data assets such as images, audio, and depth maps cannot be displayed in the diff view).
@@ -0,0 +1,8 @@
Why so serious?

The show has just begun and you're already running
While clowns are dancing and the city's stunning
In the abandoned theater, a surprise awaits
Come play with me before it's too late!

HAHAHAHAHA!
@@ -0,0 +1,15 @@
PRELIMINARY REPORT - GCPD
Date: 01/28/2025
Time: 22:30

Incident: Break-in and Vandalism
Location: Gotham Central Bank
Evidence Found:
- Playing cards scattered
- Smile graffiti on walls
- Suspicious audio recording
- Witnesses report maniacal laughter

Status: Under Investigation
Priority Level: MAXIMUM
Primary Suspect: Unknown (possible Joker involvement)
@@ -0,0 +1,16 @@
HAHAHA!

Dear Detective,

In a city of endless night, a new game unfolds
Where chaos reigns and fear takes hold
I left a gift at Gotham Central Bank
Time's ticking, your mind goes blank

The clues are there, scattered with care
Each laugh echoes everywhere
Midnight strikes, you won't catch me
In Gotham's heart, chaos runs free!

With a smile,
?
@@ -0,0 +1,5 @@
Incident Log:
1. Gotham Central Bank - 22:15 - Alarm triggered
2. Monarch Theater - 22:45 - Suspicious laughter reported
3. Abandoned Amusement Park - 23:00 - Strange lights
4. Ace Chemical Plant - 23:30 - Suspicious movement
@@ -0,0 +1,12 @@
elasticsearch>=8.11.0
torch>=2.0.0
torchvision>=0.15.0
torchaudio>=2.0.0
imagebind @ git+https://github.com/facebookresearch/ImageBind.git
openai>=1.0.0
python-dotenv>=1.0.0
numpy>=1.24.0
pillow>=10.0.0
opencv-python>=4.8.0
librosa>=0.10.0
matplotlib>=3.7.0
@@ -0,0 +1,87 @@
from elasticsearch import Elasticsearch
import base64
import os
from dotenv import load_dotenv

class ElasticsearchManager:
    """Manages multimodal operations in Elasticsearch"""

    def __init__(self):
        load_dotenv()  # Load variables from .env
        self.es = self._connect_elastic()
        self.index_name = "multimodal_content"
        self._setup_index()

    def _connect_elastic(self):
        """Connects to Elasticsearch"""
        return Elasticsearch(
            os.getenv("ELASTICSEARCH_ENDPOINT"),  # Elasticsearch endpoint
            api_key=os.getenv("ELASTIC_API_KEY")
        )

    def _setup_index(self):
        """Sets up the index if it doesn't exist"""
        if not self.es.indices.exists(index=self.index_name):
            mappings = {
                "properties": {
                    "embedding": {
                        "type": "dense_vector",
                        "dims": 1024,
                        "index": True,
                        "similarity": "cosine"
                    },
                    "modality": {"type": "keyword"},
                    "content": {"type": "binary"},
                    "description": {"type": "text"},
                    "metadata": {"type": "object"},
                    "content_path": {"type": "text"}
                }
            }
            self.es.indices.create(index=self.index_name, mappings=mappings)

    def index_content(self, embedding, modality, content=None, description="", metadata=None, content_path=None):
        """Indexes multimodal content"""
        doc = {
            "embedding": embedding.tolist(),
            "modality": modality,
            "description": description,
            "metadata": metadata or {},
            "content_path": content_path
        }

        if content:
            doc["content"] = base64.b64encode(content).decode() if isinstance(content, bytes) else content

        return self.es.index(index=self.index_name, document=doc)

    def search_similar(self, query_embedding, modality=None, k=5):
        """Searches for similar content using approximate kNN"""
        knn = {
            "field": "embedding",
            "query_vector": query_embedding.tolist(),
            "k": k,
            "num_candidates": 100,
            "filter": [{"term": {"modality": modality}}] if modality else []
        }

        try:
            # kNN is passed as a top-level search option in Elasticsearch 8.x
            response = self.es.search(
                index=self.index_name,
                knn=knn,
                size=k
            )

            # Return both source data and score for each hit
            return [{
                **hit["_source"],
                "score": hit["_score"]
            } for hit in response["hits"]["hits"]]

        except Exception as e:
            print(f"Error processing similarity search: {str(e)}")
            return []
@@ -0,0 +1,120 @@
import os
import logging

import torch
from PIL import Image
from torch.hub import download_url_to_file
from torchvision import transforms

from imagebind import data
from imagebind.models import imagebind_model


logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class EmbeddingGenerator:
    """Generates multimodal embeddings using ImageBind"""

    def __init__(self, device="cpu"):
        self.device = device
        self.model = self._load_model()

    def _load_model(self):
        """Initialize and test the ImageBind model."""
        checkpoint_path = os.path.expanduser("~/.cache/torch/checkpoints/imagebind_huge.pth")
        os.makedirs(os.path.dirname(checkpoint_path), exist_ok=True)

        if not os.path.exists(checkpoint_path):
            logger.info("Downloading ImageBind weights...")
            download_url_to_file(
                "https://dl.fbaipublicfiles.com/imagebind/imagebind_huge.pth",
                checkpoint_path
            )

        try:
            model = imagebind_model.imagebind_huge(pretrained=False)
            model.load_state_dict(torch.load(checkpoint_path, map_location=self.device))
            model.eval().to(self.device)

            # Quick test with empty text input
            logger.info("Testing model with sample input...")
            test_input = data.load_and_transform_text([""], self.device)
            with torch.no_grad():
                _ = model({"text": test_input})

            logger.info("🤖 ImageBind model initialized successfully")
            return model
        except Exception as e:
            logger.error(f"🚨 Model initialization failed: {str(e)}")
            raise

    def generate_embedding(self, input_data, modality):
        """Generates an embedding for the given modality"""
        processors = {
            "vision": lambda x: data.load_and_transform_vision_data(x, self.device),
            "audio": lambda x: data.load_and_transform_audio_data(x, self.device),
            "text": lambda x: data.load_and_transform_text(x, self.device),
            "depth": self.process_depth
        }

        try:
            # Input type verification
            if not isinstance(input_data, list):
                raise ValueError(f"Input data must be a list. Received: {type(input_data)}")

            # Convert input data to a tensor format that the model can process
            # For images: [batch_size, channels, height, width]
            # For audio:  [batch_size, channels, time]
            # For text:   [batch_size, sequence_length]
            inputs = {modality: processors[modality](input_data)}
            with torch.no_grad():
                embedding = self.model(inputs)[modality]
            return embedding.squeeze(0).cpu().numpy()
        except Exception as e:
            logger.error(f"Error generating {modality} embedding: {str(e)}", exc_info=True)
            raise


    def process_vision(self, image_path):
        """Processes an image"""
        return data.load_and_transform_vision_data([image_path], self.device)

    def process_audio(self, audio_path):
        """Processes audio"""
        return data.load_and_transform_audio_data([audio_path], self.device)

    def process_text(self, text):
        """Processes text"""
        return data.load_and_transform_text([text], self.device)

    def process_depth(self, depth_paths, device="cpu"):
        """Custom processing for depth maps"""
        try:
            # Check file existence
            for path in depth_paths:
                if not os.path.exists(path):
                    raise FileNotFoundError(f"Depth map file not found: {path}")

            # Load each depth map as a single-channel (grayscale) image
            depth_images = [Image.open(path).convert("L") for path in depth_paths]

            transform = transforms.Compose([
                transforms.Resize((224, 224)),
                transforms.ToTensor(),
            ])

            return torch.stack([transform(img) for img in depth_images]).to(device)

        except Exception as e:
            logger.error(f"🚨 Error processing depth map: {str(e)}")
            raise
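
A short usage sketch (the image path is illustrative): every modality lands in the same 1024-dimensional space, which is what makes cross-modal search against Elasticsearch possible:

```python
# Hypothetical usage: embed a text clue and an image into the shared space
generator = EmbeddingGenerator(device="cpu")

text_emb = generator.generate_embedding(["HAHAHA! Why so serious?"], "text")
print(text_emb.shape)  # (1024,) - one vector in the shared space

image_emb = generator.generate_embedding(["data/images/joker_card.png"], "vision")
print(image_emb.shape)  # (1024,) - directly comparable to the text vector
```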