Commit 5b48892

jayhack authored and rushilpatel0 committed
feat: vector index refactor (#528)
1 parent 5d8f3fb commit 5b48892

10 files changed: +810 -269 lines changed
docs/building-with-codegen/semantic-code-search.mdx

Lines changed: 60 additions & 38 deletions
@@ -5,88 +5,110 @@ icon: "magnifying-glass"
 iconType: "solid"
 ---
 
-Codegen's `VectorIndex` enables semantic code search capabilities using embeddings. This allows you to search codebases using natural language queries and find semantically related code, even when the exact terms aren't present.
+Codegen provides semantic code search capabilities using embeddings. This allows you to search codebases using natural language queries and find semantically related code, even when the exact terms aren't present.
 
 <Warning>This is under active development. Interested in an application? [Reach out to the team!](/introduction/about.tsx)</Warning>
 
 ## Basic Usage
 
-Create and save a vector index for your codebase:
+Here's how to create and use a semantic code search index:
 
 ```python
-from codegen.extensions import VectorIndex
+# Parse a codebase
+codebase = Codebase.from_repo('fastapi/fastapi', language='python')
 
-# Initialize with your codebase
-index = VectorIndex(codebase)
+# Create index
+index = FileIndex(codebase)
+index.create() # computes per-file embeddings
 
-# Create embeddings for all files
-index.create()
+# Save index to .pkl
+index.save('index.pkl')
+
+# Load index into memory
+index.load('index.pkl')
 
-# Save to disk (defaults to .codegen/vector_index.pkl)
-index.save()
+# Update index after changes
+codebase.files[0].edit('# 🌈 Replacing File Content 🌈')
+codebase.commit()
+index.update() # re-computes 1 embedding
 ```
 
-Later, load the index and perform semantic searches:
 
-```python
-# Create a codebase
-codebase = Codebase.from_repo('fastapi/fastapi')
+## Searching Code
 
-# Load a previously created index
-index = VectorIndex(codebase)
-index.load()
+Once you have an index, you can perform semantic searches:
 
+```python
 # Search with natural language
 results = index.similarity_search(
     "How does FastAPI handle dependency injection?",
     k=5 # number of results
 )
 
-# Print results with previews
-for filepath, score in results:
-    print(f"\nScore: {score:.3f} | File: {filepath}")
-    file = codebase.get_file(filepath)
+# Print results
+for file, score in results:
+    print(f"\nScore: {score:.3f} | File: {file.filepath}")
     print(f"Preview: {file.content[:200]}...")
 ```
+<Tip>The `FileIndex` returns tuples of ([File](/api-reference/core/SourceFile), `score`)</Tip>
 
 <Note>
 The search uses cosine similarity between embeddings to find the most semantically related files, regardless of exact keyword matches.
 </Note>
 
-## Getting Embeddings
+## Available Indices
 
-You can also get embeddings for arbitrary text using the same model:
+Codegen provides two types of semantic indices:
+
+### FileIndex
+
+The `FileIndex` operates at the file level:
+- Indexes entire files, splitting large files into chunks
+- Best for finding relevant files or modules
+- Simpler and faster to create/update
 
 ```python
-# Get embeddings for a list of texts
-texts = [
-    "Some code or text to embed",
-    "Another piece of text"
-]
-embeddings = index.get_embeddings(texts) # shape: (n_texts, embedding_dim)
+from codegen import FileIndex
+
+index = FileIndex(codebase)
+index.create()
+```
+
+### SymbolIndex (Experimental)
+
+The `SymbolIndex` operates at the symbol level:
+- Indexes individual functions, classes, and methods
+- Better for finding specific code elements
+- More granular search results
+
+```python
+from codegen import SymbolIndex
+
+index = SymbolIndex(codebase)
+index.create()
 ```
 
 ## How It Works
 
-The `VectorIndex` class:
-1. Processes each file in your codebase
-2. Splits large files into chunks that fit within token limits
-3. Uses OpenAI's text-embedding-3-small model to create embeddings
-4. Stores embeddings in a numpy array for efficient similarity search
-5. Saves the index to disk for reuse
+The semantic indices:
+1. Process code at either file or symbol level
+2. Split large content into chunks that fit within token limits
+3. Use OpenAI's text-embedding-3-small model to create embeddings
+4. Store embeddings efficiently for similarity search
+5. Support incremental updates when code changes
 
 When searching:
-1. Your query is converted to an embedding using the same model
-2. Cosine similarity is computed between the query and all file embeddings
-3. The most similar files are returned, along with their similarity scores
+1. Your query is converted to an embedding
+2. Cosine similarity is computed with all stored embeddings
+3. The most similar items are returned with their scores
 
 <Warning>
 Creating embeddings requires an OpenAI API key with access to the embeddings endpoint.
 </Warning>
 
 ## Example Searches
 
-Here are some example semantic searches that demonstrate the power of the system:
+Here are some example semantic searches:
 
 ```python
 # Find authentication-related code

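The updated docs say the indices "split large content into chunks that fit within token limits", but the chunking code itself is not part of the diff shown here. The following is only a minimal sketch of one way such a step could look. The helper name `chunk_lines` and the use of a character budget in place of a real token count are assumptions for illustration, not the commit's implementation.

```python
def chunk_lines(text: str, max_chars: int) -> list[str]:
    """Group whole lines into chunks of at most max_chars characters.

    Hypothetical helper: a character budget stands in for a token limit.
    A single line longer than max_chars is kept as its own oversized chunk.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the budget
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


source = "aa\nbb\ncc\n"
chunks = chunk_lines(source, max_chars=6)
print(chunks)  # ['aa\nbb\n', 'cc\n']
```

Splitting on line boundaries keeps each chunk readable as code, and joining the chunks reproduces the original text. A production version would presumably count tokens with the embedding model's tokenizer instead of characters.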
src/codegen/__init__.py

Lines changed: 5 additions & 1 deletion
@@ -1,5 +1,9 @@
 from codegen.cli.sdk.decorator import function
 from codegen.cli.sdk.functions import Function
+
+# from codegen.extensions.index.file_index import FileIndex
+# from codegen.extensions.langchain.agent import create_agent_with_tools, create_codebase_agent
 from codegen.sdk.core.codebase import Codebase
+from codegen.shared.enums.programming_language import ProgrammingLanguage
 
-__all__ = ["Codebase", "Function", "function"]
+__all__ = ["Codebase", "Function", "ProgrammingLanguage", "function"]

src/codegen/extensions/__init__.py

Lines changed: 3 additions & 2 deletions
@@ -1,5 +1,6 @@
 """Extensions for the codegen package."""
 
-from codegen.extensions.vector_index import VectorIndex
+from codegen.extensions.index.code_index import CodeIndex
+from codegen.extensions.index.file_index import FileIndex
 
-__all__ = ["VectorIndex"]
+__all__ = ["CodeIndex", "FileIndex"]

src/codegen/extensions/index/__init__.py

Whitespace-only changes.
Lines changed: 225 additions & 0 deletions
@@ -0,0 +1,225 @@
+"""Abstract base class for code indexing implementations."""
+
+from abc import ABC, abstractmethod
+from pathlib import Path
+from typing import Optional, TypeVar
+
+import numpy as np
+
+from codegen import Codebase
+
+T = TypeVar("T")  # Type of the items being indexed (e.g., File, Symbol)
+
+
+class CodeIndex(ABC):
+    """Abstract base class for semantic code search indices.
+
+    This class defines the interface for different code indexing implementations.
+    Implementations can index at different granularities (files, symbols, etc.)
+    and use different embedding strategies.
+
+    Attributes:
+        codebase (Codebase): The codebase being indexed
+        E (Optional[np.ndarray]): The embeddings matrix
+        items (Optional[np.ndarray]): Array of items corresponding to embeddings
+        commit_hash (Optional[str]): Git commit hash when index was last updated
+    """
+
+    DEFAULT_SAVE_DIR = ".codegen"
+
+    def __init__(self, codebase: Codebase):
+        """Initialize the code index.
+
+        Args:
+            codebase: The codebase to index
+        """
+        self.codebase = codebase
+        self.E: Optional[np.ndarray] = None
+        self.items: Optional[np.ndarray] = None
+        self.commit_hash: Optional[str] = None
+
+    @property
+    @abstractmethod
+    def save_file_name(self) -> str:
+        """The filename template for saving the index."""
+        pass
+
+    @abstractmethod
+    def _get_embeddings(self, items: list[T]) -> list[list[float]]:
+        """Get embeddings for a list of items.
+
+        Args:
+            items: List of items to get embeddings for
+
+        Returns:
+            List of embedding vectors
+        """
+        pass
+
+    @abstractmethod
+    def _get_items_to_index(self) -> list[tuple[T, str]]:
+        """Get all items that should be indexed and their content.
+
+        Returns:
+            List of tuples (item, content_to_embed)
+        """
+        pass
+
+    @abstractmethod
+    def _get_changed_items(self) -> set[T]:
+        """Get set of items that have changed since last index update.
+
+        Returns:
+            Set of changed items
+        """
+        pass
+
+    def _get_current_commit(self) -> str:
+        """Get the current git commit hash."""
+        current = self.codebase.current_commit
+        if current is None:
+            msg = "No current commit found. Repository may be empty or in a detached HEAD state."
+            raise ValueError(msg)
+        return current.hexsha
+
+    def _get_default_save_path(self) -> Path:
+        """Get the default save path for the index."""
+        save_dir = Path(self.codebase.repo_path) / self.DEFAULT_SAVE_DIR
+        save_dir.mkdir(exist_ok=True)
+
+        if self.commit_hash is None:
+            self.commit_hash = self._get_current_commit()
+
+        filename = self.save_file_name.format(commit=self.commit_hash[:8])
+        return save_dir / filename
+
+    def create(self) -> None:
+        """Create embeddings for all indexed items."""
+        self.commit_hash = self._get_current_commit()
+
+        # Get items and their content
+        items_with_content = self._get_items_to_index()
+        if not items_with_content:
+            self.E = np.array([])
+            self.items = np.array([])
+            return
+
+        # Split into separate lists
+        items, contents = zip(*items_with_content)
+
+        # Get embeddings
+        embeddings = self._get_embeddings(contents)
+
+        # Store embeddings and item identifiers
+        self.E = np.array(embeddings)
+        self.items = np.array([str(item) for item in items])  # Store string identifiers
+
+    def update(self) -> None:
+        """Update embeddings for changed items only."""
+        if self.E is None or self.items is None or self.commit_hash is None:
+            msg = "No index to update. Call create() or load() first."
+            raise ValueError(msg)
+
+        # Get changed items
+        changed_items = self._get_changed_items()
+        if not changed_items:
+            return
+
+        # Get content for changed items
+        items_with_content = [(item, content) for item, content in self._get_items_to_index() if item in changed_items]
+
+        if not items_with_content:
+            return
+
+        items, contents = zip(*items_with_content)
+        new_embeddings = self._get_embeddings(contents)
+
+        # Create mapping of items to their indices
+        item_to_idx = {str(item): idx for idx, item in enumerate(self.items)}
+
+        # Update embeddings
+        for item, embedding in zip(items, new_embeddings):
+            item_key = str(item)
+            if item_key in item_to_idx:
+                # Update existing embedding
+                self.E[item_to_idx[item_key]] = embedding
+            else:
+                # Add new embedding
+                self.E = np.vstack([self.E, embedding])
+                self.items = np.append(self.items, item)
+
+        # Update commit hash
+        self.commit_hash = self._get_current_commit()
+
+    def save(self, save_path: Optional[str] = None) -> None:
+        """Save the index to disk."""
+        if self.E is None or self.items is None:
+            msg = "No embeddings to save. Call create() first."
+            raise ValueError(msg)
+
+        save_path = Path(save_path) if save_path else self._get_default_save_path()
+        save_path.parent.mkdir(parents=True, exist_ok=True)
+
+        self._save_index(save_path)
+
+    def load(self, load_path: Optional[str] = None) -> None:
+        """Load the index from disk."""
+        load_path = Path(load_path) if load_path else self._get_default_save_path()
+
+        if not load_path.exists():
+            msg = f"No index found at {load_path}"
+            raise FileNotFoundError(msg)
+
+        self._load_index(load_path)
+
+    @abstractmethod
+    def _save_index(self, path: Path) -> None:
+        """Save index data to disk."""
+        pass
+
+    @abstractmethod
+    def _load_index(self, path: Path) -> None:
+        """Load index data from disk."""
+        pass
+
+    def _similarity_search_raw(self, query: str, k: int = 5) -> list[tuple[str, float]]:
+        """Internal method to find the k most similar items by their string identifiers.
+
+        Args:
+            query: The text to search for
+            k: Number of results to return
+
+        Returns:
+            List of tuples (item_identifier, similarity_score) sorted by similarity
+        """
+        if self.E is None or self.items is None:
+            msg = "No embeddings available. Call create() or load() first."
+            raise ValueError(msg)
+
+        # Get query embedding
+        query_embeddings = self._get_embeddings([query])
+        query_embedding = query_embeddings[0]
+
+        # Compute cosine similarity
+        query_norm = query_embedding / np.linalg.norm(query_embedding)
+        E_norm = self.E / np.linalg.norm(self.E, axis=1)[:, np.newaxis]
+        similarities = np.dot(E_norm, query_norm)
+
+        # Get top k indices
+        top_indices = np.argsort(similarities)[-k:][::-1]
+
+        # Return items and similarity scores
+        return [(str(self.items[idx]), float(similarities[idx])) for idx in top_indices]
+
+    @abstractmethod
+    def similarity_search(self, query: str, k: int = 5) -> list[tuple[T, float]]:
+        """Find the k most similar items to a query.
+
+        Args:
+            query: The text to search for
+            k: Number of results to return
+
+        Returns:
+            List of tuples (item, similarity_score) sorted by similarity
+        """
+        pass

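The `update()` method in the diff above overwrites the embedding of an item that is already indexed and appends a row for an item it has not seen before. As a rough illustration of that overwrite-or-append pattern, here is a small standalone sketch using plain lists instead of numpy arrays. The function name `apply_updates` and the sample data are hypothetical and not part of the commit.

```python
def apply_updates(
    keys: list[str],
    vectors: list[list[float]],
    changed: dict[str, list[float]],
) -> tuple[list[str], list[list[float]]]:
    """Overwrite embeddings for known keys and append rows for new ones.

    Hypothetical helper: keys[i] identifies the item whose embedding is vectors[i].
    Deleted items are not handled here.
    """
    position = {key: idx for idx, key in enumerate(keys)}
    for key, vector in changed.items():
        if key in position:
            # Known item: replace its embedding in place
            vectors[position[key]] = vector
        else:
            # New item: append and remember where it landed
            position[key] = len(keys)
            keys.append(key)
            vectors.append(vector)
    return keys, vectors


keys = ["a.py", "b.py"]
vectors = [[1.0, 0.0], [0.0, 1.0]]
changed = {"b.py": [0.5, 0.5], "c.py": [1.0, 1.0]}

keys, vectors = apply_updates(keys, vectors, changed)
print(keys)     # ['a.py', 'b.py', 'c.py']
print(vectors)  # [[1.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```

Only the changed items need new embeddings, so the cost of an update scales with the size of the change rather than the size of the codebase. Unlike the method in the diff, this sketch also records the position of each appended key, which would only matter if the same key appeared more than once in a batch.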