Commit e2b2da2

feat: adds VectorIndex extension (#378)
1 parent 7aa4d45 commit e2b2da2

8 files changed: +459 -0 lines changed

Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
+---
+title: "Semantic Code Search"
+sidebarTitle: "Semantic Code Search"
+icon: "magnifying-glass"
+iconType: "solid"
+---
+
+Codegen's `VectorIndex` enables semantic code search using embeddings. You can search a codebase with natural language queries and find semantically related code, even when the exact terms aren't present.
+
+<Warning>This feature is under active development. Interested in using it? [Reach out to the team!](/introduction/about.tsx)</Warning>
+
+## Basic Usage
+
+Create and save a vector index for your codebase:
+
+```python
+from codegen.extensions import VectorIndex
+
+# Initialize with your codebase
+index = VectorIndex(codebase)
+
+# Create embeddings for all files
+index.create()
+
+# Save to disk (defaults to .codegen/vector_index.pkl)
+index.save()
+```
+
+Later, load the index and perform semantic searches:
+
+```python
+from codegen import Codebase
+from codegen.extensions import VectorIndex
+
+# Create a codebase
+codebase = Codebase.from_repo('fastapi/fastapi')
+
+# Load a previously created index
+index = VectorIndex(codebase)
+index.load()
+
+# Search with natural language
+results = index.similarity_search(
+    "How does FastAPI handle dependency injection?",
+    k=5,  # number of results
+)
+
+# Print results with previews
+for filepath, score in results:
+    print(f"\nScore: {score:.3f} | File: {filepath}")
+    file = codebase.get_file(filepath)
+    print(f"Preview: {file.content[:200]}...")
+```
+
+<Note>
+The search uses cosine similarity between embeddings to find the most semantically related files, regardless of exact keyword matches.
+</Note>
+
+## Getting Embeddings
+
+You can also get embeddings for arbitrary text using the same model:
+
+```python
+# Get embeddings for a list of texts
+texts = [
+    "Some code or text to embed",
+    "Another piece of text",
+]
+embeddings = index.get_embeddings(texts)  # shape: (n_texts, embedding_dim)
+```
+
+## How It Works
+
+The `VectorIndex` class:
+1. Processes each file in your codebase
+2. Splits large files into chunks that fit within token limits
+3. Uses OpenAI's text-embedding-3-small model to create embeddings
+4. Stores embeddings in a numpy array for efficient similarity search
+5. Saves the index to disk for reuse
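
Step 2 in the list above (splitting large files into chunks) might look roughly like the sketch below. It is an illustration under stated assumptions, not the code in this commit: it approximates a token limit with a character budget and breaks on line boundaries, and `chunk_text` and `max_chars` are hypothetical names.

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text into chunks of at most max_chars characters.

    Hypothetical helper: a character budget stands in for a real token limit.
    Chunks break on line boundaries where possible; a single line longer than
    the budget is hard-split so that no chunk exceeds max_chars.
    """
    chunks: list[str] = []
    current = ""
    for line in text.splitlines(keepends=True):
        # Cut an oversized line into budget-sized pieces
        pieces = [line[i:i + max_chars] for i in range(0, len(line), max_chars)]
        for piece in pieces:
            # Start a new chunk when adding this piece would exceed the budget
            if current and len(current) + len(piece) > max_chars:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks
```

Joining the chunks reproduces the original text, so no content is lost. A production splitter would more likely count tokens with the embedding model's tokenizer than count characters.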
+
+When searching:
+1. Your query is converted to an embedding using the same model
+2. Cosine similarity is computed between the query and all file embeddings
+3. The most similar files are returned, along with their similarity scores
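
The search steps above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the `VectorIndex` implementation: the docs say embeddings are stored in a numpy array, whereas this sketch uses plain lists, and `cosine_similarity`, `top_k`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float],
    file_embeddings: dict[str, list[float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Return the k (filepath, score) pairs most similar to the query embedding."""
    scored = [
        (path, cosine_similarity(query, embedding))
        for path, embedding in file_embeddings.items()
    ]
    # Highest similarity first
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings" (real ones have many more dimensions)
toy_embeddings = {
    "auth.py": [1.0, 0.0],
    "config.py": [0.0, 1.0],
    "errors.py": [1.0, 1.0],
}
toy_results = top_k([1.0, 0.0], toy_embeddings, k=2)
```

The `(filepath, score)` tuples mirror what the docs show `similarity_search` returning. With numpy, the same ranking would typically be one matrix-vector product over normalized rows.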
+
+<Warning>
+Creating embeddings requires an OpenAI API key with access to the embeddings endpoint.
+</Warning>
+
+## Example Searches
+
+Here are some example semantic searches:
+
+```python
+# Find authentication-related code
+results = index.similarity_search(
+    "How is user authentication implemented?",
+    k=3,
+)
+
+# Find error handling patterns
+results = index.similarity_search(
+    "Show me examples of error handling and custom exceptions",
+    k=3,
+)
+
+# Find configuration management
+results = index.similarity_search(
+    "Where is the application configuration and settings handled?",
+    k=3,
+)
+```
+
+Because matching is based on embeddings rather than keywords, these searches can return relevant results even when the exact terms aren't present in the code.

docs/mint.json

Lines changed: 1 addition & 0 deletions
@@ -134,6 +134,7 @@
     "building-with-codegen/codebase-visualization",
     "building-with-codegen/flagging-symbols",
     "building-with-codegen/calling-out-to-llms",
+    "building-with-codegen/semantic-code-search",
     "building-with-codegen/reducing-conditions"
   ]
 },

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -65,6 +65,7 @@ dependencies = [
     "langchain[openai]",
     "langchain_core",
     "langchain_openai",
+    "numpy>=2.2.2",
 ]

 license = { text = "Apache-2.0" }

src/codegen/extensions/__init__.py

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+"""Extensions for the codegen package."""
+
+from codegen.extensions.vector_index import VectorIndex
+
+__all__ = ["VectorIndex"]

src/codegen/extensions/langchain/tools.py

Lines changed: 25 additions & 0 deletions
@@ -19,6 +19,7 @@
     reveal_symbol,
     search,
     semantic_edit,
+    semantic_search,
     view_file,
 )

@@ -317,3 +318,27 @@ def _run(
             include_dependencies=include_dependencies,
         )
         return json.dumps(result, indent=2)
+
+
+class SemanticSearchInput(BaseModel):
+    """Input for semantic search."""
+
+    query: str = Field(..., description="The natural language search query")
+    k: int = Field(default=5, description="Number of results to return")
+    preview_length: int = Field(default=200, description="Length of content preview in characters")
+
+
+class SemanticSearchTool(BaseTool):
+    """Tool for semantic code search."""
+
+    name: ClassVar[str] = "semantic_search"
+    description: ClassVar[str] = "Search the codebase using natural language queries and semantic similarity"
+    args_schema: ClassVar[type[BaseModel]] = SemanticSearchInput
+    codebase: Codebase = Field(exclude=True)
+
+    def __init__(self, codebase: Codebase) -> None:
+        super().__init__(codebase=codebase)
+
+    def _run(self, query: str, k: int = 5, preview_length: int = 200) -> str:
+        result = semantic_search(self.codebase, query, k=k, preview_length=preview_length)
+        return json.dumps(result, indent=2)
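
Because the tool returns a JSON string, a caller usually has to parse it and check for the error shape before reading results. The sketch below shows one way to do that, assuming the payload follows the format documented in the `semantic_search` docstring in this commit. `summarize_tool_output` and the sample payloads are hypothetical and not part of the commit.

```python
import json


def summarize_tool_output(raw: str) -> list[str]:
    """Turn the tool's JSON string into short, human-readable lines.

    Hypothetical helper. Expects either {"error": str} or
    {"status": "success", "query": str, "results": [...]}.
    """
    payload = json.loads(raw)
    if "error" in payload:
        return [f"error: {payload['error']}"]
    return [
        f"{item['score']:.3f} {item['filepath']}"
        for item in payload.get("results", [])
    ]


# Sample payloads shaped like the documented return values
sample_success = json.dumps(
    {
        "status": "success",
        "query": "auth",
        "results": [{"filepath": "app/auth.py", "score": 0.5, "preview": "def login(): ..."}],
    },
    indent=2,
)
sample_error = json.dumps({"error": "Failed to perform semantic search: boom"})
```

Note that the error payload carries no `"status"` key, so checking for `"error"` is the more reliable test.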

src/codegen/extensions/tools/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,7 @@
 from .reveal_symbol import reveal_symbol
 from .search import search
 from .semantic_edit import semantic_edit
+from .semantic_search import semantic_search

 __all__ = [
     "commit",
@@ -29,5 +30,6 @@
     "search",
     # Semantic edit
     "semantic_edit",
+    "semantic_search",
     "view_file",
 ]
Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
+"""Semantic search over codebase files."""
+
+from typing import Any, Optional
+
+from codegen import Codebase
+from codegen.extensions.vector_index import VectorIndex
+
+
+def semantic_search(
+    codebase: Codebase,
+    query: str,
+    k: int = 5,
+    preview_length: int = 200,
+    index_path: Optional[str] = None,
+) -> dict[str, Any]:
+    """Search the codebase using semantic similarity.
+
+    This function provides semantic search over a codebase by using OpenAI's embeddings.
+    Currently, it loads/saves the index from disk each time, but could be optimized to
+    maintain embeddings in memory for frequently accessed codebases.
+
+    TODO(CG-XXXX): Add support for maintaining embeddings in memory across searches,
+    potentially with an LRU cache or similar mechanism to avoid recomputing embeddings
+    for frequently searched codebases.
+
+    Args:
+        codebase: The codebase to search
+        query: The search query in natural language
+        k: Number of results to return (default: 5)
+        preview_length: Length of content preview in characters (default: 200)
+        index_path: Optional path to a saved vector index
+
+    Returns:
+        Dict containing search results or error information. Format:
+        {
+            "status": "success",
+            "query": str,
+            "results": [
+                {
+                    "filepath": str,
+                    "score": float,
+                    "preview": str
+                },
+                ...
+            ]
+        }
+        Or on error:
+        {
+            "error": str
+        }
+    """
+    try:
+        # Initialize vector index
+        index = VectorIndex(codebase)
+
+        # Try to load existing index
+        try:
+            if index_path:
+                index.load(index_path)
+            else:
+                index.load()
+        except FileNotFoundError:
+            # Create new index if none exists
+            index.create()
+            index.save(index_path)
+
+        # Perform search
+        results = index.similarity_search(query, k=k)
+
+        # Format results with previews
+        formatted_results = []
+        for filepath, score in results:
+            try:
+                file = codebase.get_file(filepath)
+                preview = file.content[:preview_length].replace("\n", " ").strip()
+                if len(file.content) > preview_length:
+                    preview += "..."
+
+                formatted_results.append({"filepath": filepath, "score": float(score), "preview": preview})
+            except Exception as e:
+                # Skip files that can't be read
+                print(f"Warning: Could not read file {filepath}: {e}")
+                continue
+
+        return {"status": "success", "query": query, "results": formatted_results}
+
+    except Exception as e:
+        return {"error": f"Failed to perform semantic search: {e!s}"}
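
The TODO in the docstring mentions keeping indexes in memory, possibly with an LRU cache. One possible shape for that idea uses `functools.lru_cache` keyed by index path, as sketched below. This is only an illustration of the caching pattern: `StubIndex` and `get_cached_index` are hypothetical stand-ins, and the sketch does not call `VectorIndex` at all.

```python
from functools import lru_cache


class StubIndex:
    """Hypothetical stand-in for a loaded vector index."""

    def __init__(self, index_path: str) -> None:
        self.index_path = index_path


@lru_cache(maxsize=4)
def get_cached_index(index_path: str) -> StubIndex:
    """Load an index once per path and keep up to 4 recent ones in memory.

    In a real integration, the expensive load from disk would happen here;
    repeated calls with the same path return the cached object instead.
    """
    return StubIndex(index_path)
```

A cache like this trades freshness for speed. If the codebase changes, the cached index goes stale, so a real version would need an invalidation rule, for example calling `get_cached_index.cache_clear()` after re-indexing.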
