feat: vector index refactor #528

Merged
merged 11 commits into from
Feb 17, 2025
98 changes: 60 additions & 38 deletions docs/building-with-codegen/semantic-code-search.mdx
@@ -5,88 +5,110 @@ icon: "magnifying-glass"
iconType: "solid"
---

Codegen's `VectorIndex` enables semantic code search capabilities using embeddings. This allows you to search codebases using natural language queries and find semantically related code, even when the exact terms aren't present.
Codegen provides semantic code search capabilities using embeddings. This allows you to search codebases using natural language queries and find semantically related code, even when the exact terms aren't present.

<Warning>This is under active development. Interested in an application? [Reach out to the team!](/introduction/about.tsx)</Warning>

## Basic Usage

Create and save a vector index for your codebase:
Here's how to create and use a semantic code search index:

```python
from codegen.extensions import VectorIndex
# Parse a codebase
codebase = Codebase.from_repo('fastapi/fastapi', language='python')

# Initialize with your codebase
index = VectorIndex(codebase)
# Create index
index = FileIndex(codebase)
index.create() # computes per-file embeddings

# Create embeddings for all files
index.create()
# Save index to .pkl
index.save('index.pkl')

# Load index into memory
index.load('index.pkl')

# Save to disk (defaults to .codegen/vector_index.pkl)
index.save()
# Update index after changes
codebase.files[0].edit('# 🌈 Replacing File Content 🌈')
codebase.commit()
index.update() # re-computes 1 embedding
```

Later, load the index and perform semantic searches:

```python
# Create a codebase
codebase = Codebase.from_repo('fastapi/fastapi')
## Searching Code

# Load a previously created index
index = VectorIndex(codebase)
index.load()
Once you have an index, you can perform semantic searches:

```python
# Search with natural language
results = index.similarity_search(
    "How does FastAPI handle dependency injection?",
    k=5  # number of results
)

# Print results with previews
for filepath, score in results:
    print(f"\nScore: {score:.3f} | File: {filepath}")
    file = codebase.get_file(filepath)
# Print results
for file, score in results:
    print(f"\nScore: {score:.3f} | File: {file.filepath}")
    print(f"Preview: {file.content[:200]}...")
```
<Tip>The `FileIndex` returns tuples of ([File](/api-reference/core/SourceFile), `score`)</Tip>

<Note>
The search uses cosine similarity between embeddings to find the most semantically related files, regardless of exact keyword matches.
</Note>

## Getting Embeddings
## Available Indices

You can also get embeddings for arbitrary text using the same model:
Codegen provides two types of semantic indices:

### FileIndex

The `FileIndex` operates at the file level:
- Indexes entire files, splitting large files into chunks
- Best for finding relevant files or modules
- Simpler and faster to create/update

```python
# Get embeddings for a list of texts
texts = [
"Some code or text to embed",
"Another piece of text"
]
embeddings = index.get_embeddings(texts) # shape: (n_texts, embedding_dim)
from codegen import FileIndex

index = FileIndex(codebase)
index.create()
```

### SymbolIndex (Experimental)

The `SymbolIndex` operates at the symbol level:
- Indexes individual functions, classes, and methods
- Better for finding specific code elements
- More granular search results

```python
from codegen import SymbolIndex

index = SymbolIndex(codebase)
index.create()
```

## How It Works

The `VectorIndex` class:
1. Processes each file in your codebase
2. Splits large files into chunks that fit within token limits
3. Uses OpenAI's text-embedding-3-small model to create embeddings
4. Stores embeddings in a numpy array for efficient similarity search
5. Saves the index to disk for reuse
The semantic indices:
1. Process code at either file or symbol level
2. Split large content into chunks that fit within token limits
3. Use OpenAI's text-embedding-3-small model to create embeddings
4. Store embeddings efficiently for similarity search
5. Support incremental updates when code changes
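
As a rough illustration of step 2 (splitting large content into chunks that fit within token limits), here is a minimal sketch. It is not the chunking code from this PR, which is not shown in the diff. The function name `split_into_chunks`, the 8000-token budget, and the four-characters-per-token heuristic are all assumptions made for the example.

```python
def split_into_chunks(text: str, max_tokens: int = 8000, chars_per_token: int = 4) -> list[str]:
    """Split text on line boundaries so each chunk stays under an approximate token budget.

    Hypothetical helper: token counts are approximated as len(text) / chars_per_token.
    """
    max_chars = max_tokens * chars_per_token
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for line in text.splitlines(keepends=True):
        # A single oversized line is hard-split so that no chunk exceeds the budget.
        while len(line) > max_chars:
            if current:
                chunks.append("".join(current))
                current, current_len = [], 0
            chunks.append(line[:max_chars])
            line = line[max_chars:]

        # Flush the current chunk if adding this line would exceed the budget.
        if current and current_len + len(line) > max_chars:
            chunks.append("".join(current))
            current, current_len = [], 0

        current.append(line)
        current_len += len(line)

    if current:
        chunks.append("".join(current))
    return chunks
```

Joining the returned chunks reproduces the input text, so each chunk can be embedded separately without losing content. A real implementation would more likely count tokens with the embedding model's tokenizer rather than a character heuristic.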

When searching:
1. Your query is converted to an embedding using the same model
2. Cosine similarity is computed between the query and all file embeddings
3. The most similar files are returned, along with their similarity scores
1. Your query is converted to an embedding
2. Cosine similarity is computed with all stored embeddings
3. The most similar items are returned with their scores

<Warning>
Creating embeddings requires an OpenAI API key with access to the embeddings endpoint.
</Warning>
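
The three search steps listed above can be sketched in plain Python. This is only an illustration under stated assumptions: the names `cosine_similarity` and `top_k` are hypothetical, the embeddings are tiny hand-written vectors rather than model output, and the PR's own implementation uses NumPy instead of pure Python.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_embedding: list[float],
    embeddings: list[list[float]],
    items: list[str],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Return the k items whose embeddings are most similar to the query, best first."""
    scored = [
        (item, cosine_similarity(query_embedding, embedding))
        for item, embedding in zip(items, embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy data standing in for stored file embeddings (illustrative values only).
items = ["a.py", "b.py", "c.py"]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_k([1.0, 0.0], embeddings, items, k=2)
```

Because cosine similarity depends only on direction, two vectors of different lengths that point the same way still score 1.0.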

## Example Searches

Here are some example semantic searches that demonstrate the power of the system:
Here are some example semantic searches:

```python
# Find authentication-related code
6 changes: 5 additions & 1 deletion src/codegen/__init__.py
@@ -1,5 +1,9 @@
from codegen.cli.sdk.decorator import function
from codegen.cli.sdk.functions import Function

# from codegen.extensions.index.file_index import FileIndex
# from codegen.extensions.langchain.agent import create_agent_with_tools, create_codebase_agent
from codegen.sdk.core.codebase import Codebase
from codegen.shared.enums.programming_language import ProgrammingLanguage

__all__ = ["Codebase", "Function", "function"]
__all__ = ["Codebase", "Function", "ProgrammingLanguage", "function"]
5 changes: 3 additions & 2 deletions src/codegen/extensions/__init__.py
@@ -1,5 +1,6 @@
"""Extensions for the codegen package."""

from codegen.extensions.vector_index import VectorIndex
from codegen.extensions.index.code_index import CodeIndex
from codegen.extensions.index.file_index import FileIndex

__all__ = ["VectorIndex"]
__all__ = ["CodeIndex", "FileIndex"]
Empty file.
225 changes: 225 additions & 0 deletions src/codegen/extensions/index/code_index.py
@@ -0,0 +1,225 @@
"""Abstract base class for code indexing implementations."""

from abc import ABC, abstractmethod
from pathlib import Path
from typing import Optional, TypeVar

import numpy as np

from codegen import Codebase

T = TypeVar("T") # Type of the items being indexed (e.g., File, Symbol)


class CodeIndex(ABC):
    """Abstract base class for semantic code search indices.

    This class defines the interface for different code indexing implementations.
    Implementations can index at different granularities (files, symbols, etc.)
    and use different embedding strategies.

    Attributes:
        codebase (Codebase): The codebase being indexed
        E (Optional[np.ndarray]): The embeddings matrix
        items (Optional[np.ndarray]): Array of items corresponding to embeddings
        commit_hash (Optional[str]): Git commit hash when index was last updated
    """

    DEFAULT_SAVE_DIR = ".codegen"

    def __init__(self, codebase: Codebase):
        """Initialize the code index.

        Args:
            codebase: The codebase to index
        """
        self.codebase = codebase
        self.E: Optional[np.ndarray] = None
        self.items: Optional[np.ndarray] = None
        self.commit_hash: Optional[str] = None

    @property
    @abstractmethod
    def save_file_name(self) -> str:
        """The filename template for saving the index."""
        pass

    @abstractmethod
    def _get_embeddings(self, items: list[T]) -> list[list[float]]:
        """Get embeddings for a list of items.

        Args:
            items: List of items to get embeddings for

        Returns:
            List of embedding vectors
        """
        pass

    @abstractmethod
    def _get_items_to_index(self) -> list[tuple[T, str]]:
        """Get all items that should be indexed and their content.

        Returns:
            List of tuples (item, content_to_embed)
        """
        pass

    @abstractmethod
    def _get_changed_items(self) -> set[T]:
        """Get set of items that have changed since last index update.

        Returns:
            Set of changed items
        """
        pass

    def _get_current_commit(self) -> str:
        """Get the current git commit hash."""
        current = self.codebase.current_commit
        if current is None:
            msg = "No current commit found. Repository may be empty or in a detached HEAD state."
            raise ValueError(msg)
        return current.hexsha

    def _get_default_save_path(self) -> Path:
        """Get the default save path for the index."""
        save_dir = Path(self.codebase.repo_path) / self.DEFAULT_SAVE_DIR
        save_dir.mkdir(exist_ok=True)

        if self.commit_hash is None:
            self.commit_hash = self._get_current_commit()

        filename = self.save_file_name.format(commit=self.commit_hash[:8])
        return save_dir / filename

    def create(self) -> None:
        """Create embeddings for all indexed items."""
        self.commit_hash = self._get_current_commit()

        # Get items and their content
        items_with_content = self._get_items_to_index()
        # CI check failure (GitHub Actions / mypy), line 101: error: Need type annotation for "items_with_content" [var-annotated]
        if not items_with_content:
            self.E = np.array([])
            self.items = np.array([])
            return

        # Split into separate lists
        items, contents = zip(*items_with_content)

        # Get embeddings
        embeddings = self._get_embeddings(contents)
        # CI check failure (GitHub Actions / mypy), line 111: error: Argument 1 to "_get_embeddings" of "CodeIndex" has incompatible type "tuple[Any, ...]"; expected "list[Never]" [arg-type]

        # Store embeddings and item identifiers
        self.E = np.array(embeddings)
        self.items = np.array([str(item) for item in items])  # Store string identifiers

    def update(self) -> None:
        """Update embeddings for changed items only."""
        if self.E is None or self.items is None or self.commit_hash is None:
            msg = "No index to update. Call create() or load() first."
            raise ValueError(msg)

        # Get changed items
        changed_items = self._get_changed_items()
        # CI check failure (GitHub Actions / mypy), line 124: error: Need type annotation for "changed_items" (hint: "changed_items: set[<type>] = ...") [var-annotated]
        if not changed_items:
            return

        # Get content for changed items
        items_with_content = [(item, content) for item, content in self._get_items_to_index() if item in changed_items]
        # CI check failure (GitHub Actions / mypy), line 129: error: Need type annotation for "item" [var-annotated]

        if not items_with_content:
            return

        items, contents = zip(*items_with_content)
        new_embeddings = self._get_embeddings(contents)
        # CI check failure (GitHub Actions / mypy), line 135: error: Argument 1 to "_get_embeddings" of "CodeIndex" has incompatible type "tuple[Any, ...]"; expected "list[Never]" [arg-type]

        # Create mapping of items to their indices
        item_to_idx = {str(item): idx for idx, item in enumerate(self.items)}

        # Update embeddings
        for item, embedding in zip(items, new_embeddings):
            item_key = str(item)
            if item_key in item_to_idx:
                # Update existing embedding
                self.E[item_to_idx[item_key]] = embedding
            else:
                # Add new embedding
                self.E = np.vstack([self.E, embedding])
                self.items = np.append(self.items, item)

        # Update commit hash
        self.commit_hash = self._get_current_commit()

    def save(self, save_path: Optional[str] = None) -> None:
        """Save the index to disk."""
        if self.E is None or self.items is None:
            msg = "No embeddings to save. Call create() first."
            raise ValueError(msg)

        save_path = Path(save_path) if save_path else self._get_default_save_path()
        # CI check failure (GitHub Actions / mypy), line 160: error: Incompatible types in assignment (expression has type "Path", variable has type "str | None") [assignment]
        save_path.parent.mkdir(parents=True, exist_ok=True)
        # CI check failure (GitHub Actions / mypy), line 161: error: Item "str" of "str | None" has no attribute "parent" [union-attr]
        # CI check failure (GitHub Actions / mypy), line 161: error: Item "None" of "str | None" has no attribute "parent" [union-attr]

        self._save_index(save_path)
        # CI check failure (GitHub Actions / mypy), line 163: error: Argument 1 to "_save_index" of "CodeIndex" has incompatible type "str | None"; expected "Path" [arg-type]

    def load(self, load_path: Optional[str] = None) -> None:
        """Load the index from disk."""
        load_path = Path(load_path) if load_path else self._get_default_save_path()
        # CI check failure (GitHub Actions / mypy), line 167: error: Incompatible types in assignment (expression has type "Path", variable has type "str | None") [assignment]

        if not load_path.exists():
            msg = f"No index found at {load_path}"
            raise FileNotFoundError(msg)

        self._load_index(load_path)

    @abstractmethod
    def _save_index(self, path: Path) -> None:
        """Save index data to disk."""
        pass

    @abstractmethod
    def _load_index(self, path: Path) -> None:
        """Load index data from disk."""
        pass

    def _similarity_search_raw(self, query: str, k: int = 5) -> list[tuple[str, float]]:
        """Internal method to find the k most similar items by their string identifiers.

        Args:
            query: The text to search for
            k: Number of results to return

        Returns:
            List of tuples (item_identifier, similarity_score) sorted by similarity
        """
        if self.E is None or self.items is None:
            msg = "No embeddings available. Call create() or load() first."
            raise ValueError(msg)

        # Get query embedding
        query_embeddings = self._get_embeddings([query])
        query_embedding = query_embeddings[0]

        # Compute cosine similarity
        query_norm = query_embedding / np.linalg.norm(query_embedding)
        E_norm = self.E / np.linalg.norm(self.E, axis=1)[:, np.newaxis]
        similarities = np.dot(E_norm, query_norm)

        # Get top k indices
        top_indices = np.argsort(similarities)[-k:][::-1]

        # Return items and similarity scores
        return [(str(self.items[idx]), float(similarities[idx])) for idx in top_indices]

    @abstractmethod
    def similarity_search(self, query: str, k: int = 5) -> list[tuple[T, float]]:
        """Find the k most similar items to a query.

        Args:
            query: The text to search for
            k: Number of results to return

        Returns:
            List of tuples (item, similarity_score) sorted by similarity
        """
        pass
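
The merge step inside `update()` above (overwrite the embedding of an item that is already indexed, append one for an item that is new) can be illustrated in isolation. The following is a minimal sketch, not code from this PR: `apply_updates` is a hypothetical helper, it uses plain lists instead of NumPy arrays, and it assumes every item is identified by a string key. That last assumption differs slightly from `update()`, which appends the raw item for new entries while `create()` stores `str(item)`.

```python
def apply_updates(
    items: list[str],
    embeddings: list[list[float]],
    changed: dict[str, list[float]],
) -> tuple[list[str], list[list[float]]]:
    """Merge re-computed embeddings into an existing index without mutating the inputs.

    Hypothetical helper: `changed` maps an item's string identifier to its new embedding.
    """
    new_items = list(items)
    new_embeddings = [list(embedding) for embedding in embeddings]
    item_to_idx = {item: idx for idx, item in enumerate(new_items)}

    for key, embedding in changed.items():
        if key in item_to_idx:
            # Item already indexed: overwrite its row in place.
            new_embeddings[item_to_idx[key]] = list(embedding)
        else:
            # New item: append a row and remember its position.
            item_to_idx[key] = len(new_items)
            new_items.append(key)
            new_embeddings.append(list(embedding))

    return new_items, new_embeddings
```

Keeping the identifiers as strings throughout means a lookup by `str(item)` finds the same key regardless of whether the entry was added at creation time or during a later update.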