codegen-sh · jayhack · Feb 17, 2025
@@ -5,88 +5,110 @@ icon: "magnifying-glass"
 iconType: "solid"
 ---
 
-Codegen's `VectorIndex` enables semantic code search capabilities using embeddings. This allows you to search codebases using natural language queries and find semantically related code, even when the exact terms aren't present.
+Codegen provides semantic code search capabilities using embeddings. This allows you to search codebases using natural language queries and find semantically related code, even when the exact terms aren't present.
 
 <Warning>This is under active development. Interested in an application? [Reach out to the team!](/introduction/about.tsx)</Warning>
 
 ## Basic Usage
 
-Create and save a vector index for your codebase:
+Here's how to create and use a semantic code search index:
 
 ```python
-from codegen.extensions import VectorIndex
+from codegen import Codebase
+from codegen.extensions import FileIndex
+
+# Parse a codebase
+codebase = Codebase.from_repo('fastapi/fastapi', language='python')
 
-# Initialize with your codebase
-index = VectorIndex(codebase)
+# Create index
+index = FileIndex(codebase)
+index.create() # computes per-file embeddings
 
-# Create embeddings for all files
-index.create()
+# Save index to .pkl
+index.save('index.pkl')
+
+# Load index into memory
+index.load('index.pkl')
 
-# Save to disk (defaults to .codegen/vector_index.pkl)
-index.save()
+# Update index after changes
+codebase.files[0].edit('# 🌈 Replacing File Content 🌈')
+codebase.commit()
+index.update() # re-computes 1 embedding
 ```
 
-Later, load the index and perform semantic searches:
 
-```python
-# Create a codebase
-codebase = Codebase.from_repo('fastapi/fastapi')
+## Searching Code
 
-# Load a previously created index
-index = VectorIndex(codebase)
-index.load()
+Once you have an index, you can perform semantic searches:
 
+```python
 # Search with natural language
 results = index.similarity_search(
     "How does FastAPI handle dependency injection?",
     k=5  # number of results
 )
 
-# Print results with previews
-for filepath, score in results:
-    print(f"\nScore: {score:.3f} | File: {filepath}")
-    file = codebase.get_file(filepath)
+# Print results
+for file, score in results:
+    print(f"\nScore: {score:.3f} | File: {file.filepath}")
     print(f"Preview: {file.content[:200]}...")
 ```
+<Tip>`FileIndex.similarity_search` returns a list of ([File](/api-reference/core/SourceFile), `score`) tuples.</Tip>
 
 <Note>
 The search uses cosine similarity between embeddings to find the most semantically related files, regardless of exact keyword matches.
 </Note>
 
-## Getting Embeddings
+## Available Indices
 
-You can also get embeddings for arbitrary text using the same model:
+Codegen provides two types of semantic indices:
+
+### FileIndex
+
+The `FileIndex` operates at the file level:
+- Indexes entire files, splitting large files into chunks
+- Best for finding relevant files or modules
+- Simpler and faster to create/update
 
 ```python
-# Get embeddings for a list of texts
-texts = [
-    "Some code or text to embed",
-    "Another piece of text"
-]
-embeddings = index.get_embeddings(texts)  # shape: (n_texts, embedding_dim)
+from codegen.extensions import FileIndex
+
+index = FileIndex(codebase)
+index.create()
+```
+
+### SymbolIndex (Experimental)
+
+The `SymbolIndex` operates at the symbol level:
+- Indexes individual functions, classes, and methods
+- Better for finding specific code elements
+- More granular search results
+
+```python
+from codegen import SymbolIndex
+
+index = SymbolIndex(codebase)
+index.create()
 ```
 
 ## How It Works
 
-The `VectorIndex` class:
-1. Processes each file in your codebase
-2. Splits large files into chunks that fit within token limits
-3. Uses OpenAI's text-embedding-3-small model to create embeddings
-4. Stores embeddings in a numpy array for efficient similarity search
-5. Saves the index to disk for reuse
+The semantic indices:
+1. Process code at either file or symbol level
+2. Split large content into chunks that fit within token limits
+3. Use OpenAI's text-embedding-3-small model to create embeddings
+4. Store embeddings efficiently for similarity search
+5. Support incremental updates when code changes
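
The incremental update in step 5 can be pictured as a merge over two parallel lists: item identifiers and their embeddings. What follows is a minimal pure-Python sketch under that assumption, not the library's implementation; `apply_updates` and the toy vectors are hypothetical and exist only for illustration.

```python
def apply_updates(
    items: list[str],
    embeddings: list[list[float]],
    changed: dict[str, list[float]],
) -> None:
    """Merge re-computed embeddings into the index in place.

    `items` and `embeddings` are parallel lists; `changed` maps an item's
    string identifier to its freshly computed embedding (hypothetical data).
    """
    # Map each known identifier to its row so lookups are O(1)
    item_to_idx = {key: idx for idx, key in enumerate(items)}

    for key, vector in changed.items():
        if key in item_to_idx:
            # Existing item: overwrite its row
            embeddings[item_to_idx[key]] = vector
        else:
            # New item: append identifier and row together so they stay aligned
            item_to_idx[key] = len(items)
            items.append(key)
            embeddings.append(vector)


items = ["a.py", "b.py"]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
apply_updates(items, embeddings, {"b.py": [0.5, 0.5], "c.py": [1.0, 1.0]})
```

Only the changed rows are touched, which is why an update after a single-file edit needs one new embedding rather than a full rebuild.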
 
 When searching:
-1. Your query is converted to an embedding using the same model
-2. Cosine similarity is computed between the query and all file embeddings
-3. The most similar files are returned, along with their similarity scores
+1. Your query is converted to an embedding
+2. Cosine similarity is computed with all stored embeddings
+3. The most similar items are returned with their scores
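
The three search steps above can be sketched with the standard library alone. This is an illustrative sketch rather than the code in this change: the helper names `cosine_similarity` and `top_k`, the file names, and the 2-D vectors are all assumptions (real embeddings come from the embedding model and have many more dimensions).

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float],
    stored: dict[str, list[float]],
    k: int = 5,
) -> list[tuple[str, float]]:
    """Return the k (identifier, score) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
    # nlargest returns results sorted from highest to lowest score
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy "embeddings" keyed by hypothetical file names
stored = {"auth.py": [1.0, 0.0], "db.py": [0.0, 1.0], "routes.py": [1.0, 1.0]}
results = top_k([1.0, 0.1], stored, k=2)
```

Because cosine similarity depends only on direction, scaling a vector does not change its score; only the angle to the query matters.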
 
 <Warning>
 Creating embeddings requires an OpenAI API key with access to the embeddings endpoint.
 </Warning>
 
 ## Example Searches
 
-Here are some example semantic searches that demonstrate the power of the system:
+Here are some example semantic searches:
 
 ```python
 # Find authentication-related code

@@ -1,5 +1,9 @@
 from codegen.cli.sdk.decorator import function
 from codegen.cli.sdk.functions import Function
+
+# from codegen.extensions.index.file_index import FileIndex
+# from codegen.extensions.langchain.agent import create_agent_with_tools, create_codebase_agent
 from codegen.sdk.core.codebase import Codebase
+from codegen.shared.enums.programming_language import ProgrammingLanguage
 
-__all__ = ["Codebase", "Function", "function"]
+__all__ = ["Codebase", "Function", "ProgrammingLanguage", "function"]
@@ -1,5 +1,6 @@
 """Extensions for the codegen package."""
 
-from codegen.extensions.vector_index import VectorIndex
+from codegen.extensions.index.code_index import CodeIndex
+from codegen.extensions.index.file_index import FileIndex
 
-__all__ = ["VectorIndex"]
+__all__ = ["CodeIndex", "FileIndex"]
@@ -0,0 +1,225 @@
+"""Abstract base class for code indexing implementations."""
+
+from abc import ABC, abstractmethod
+from pathlib import Path
+from typing import Optional, TypeVar
+
+import numpy as np
+
+from codegen import Codebase
+
+T = TypeVar("T")  # Type of the items being indexed (e.g., File, Symbol)
+
+
+class CodeIndex(ABC):
+    """Abstract base class for semantic code search indices.
+
+    This class defines the interface for different code indexing implementations.
+    Implementations can index at different granularities (files, symbols, etc.)
+    and use different embedding strategies.
+
+    Attributes:
+        codebase (Codebase): The codebase being indexed
+        E (Optional[np.ndarray]): The embeddings matrix
+        items (Optional[np.ndarray]): Array of string identifiers for the indexed items, row-aligned with E
+        commit_hash (Optional[str]): Git commit hash when index was last updated
+    """
+
+    DEFAULT_SAVE_DIR = ".codegen"
+
+    def __init__(self, codebase: Codebase):
+        """Initialize the code index.
+
+        Args:
+            codebase: The codebase to index
+        """
+        self.codebase = codebase
+        self.E: Optional[np.ndarray] = None
+        self.items: Optional[np.ndarray] = None
+        self.commit_hash: Optional[str] = None
+
+    @property
+    @abstractmethod
+    def save_file_name(self) -> str:
+        """The filename template for saving the index."""
+        pass
+
+    @abstractmethod
+    def _get_embeddings(self, items: list[str]) -> list[list[float]]:
+        """Get embeddings for a list of texts.
+
+        Args:
+            items: List of text strings to embed (item contents or search queries)
+
+        Returns:
+            List of embedding vectors
+        """
+        pass
+
+    @abstractmethod
+    def _get_items_to_index(self) -> list[tuple[T, str]]:
+        """Get all items that should be indexed and their content.
+
+        Returns:
+            List of tuples (item, content_to_embed)
+        """
+        pass
+
+    @abstractmethod
+    def _get_changed_items(self) -> set[T]:
+        """Get set of items that have changed since last index update.
+
+        Returns:
+            Set of changed items
+        """
+        pass
+
+    def _get_current_commit(self) -> str:
+        """Get the current git commit hash."""
+        current = self.codebase.current_commit
+        if current is None:
+            msg = "No current commit found. Repository may be empty or in a detached HEAD state."
+            raise ValueError(msg)
+        return current.hexsha
+
+    def _get_default_save_path(self) -> Path:
+        """Get the default save path for the index."""
+        save_dir = Path(self.codebase.repo_path) / self.DEFAULT_SAVE_DIR
+        save_dir.mkdir(exist_ok=True)
+
+        if self.commit_hash is None:
+            self.commit_hash = self._get_current_commit()
+
+        filename = self.save_file_name.format(commit=self.commit_hash[:8])
+        return save_dir / filename
+
+    def create(self) -> None:
+        """Create embeddings for all indexed items."""
+        self.commit_hash = self._get_current_commit()
+
+        # Get items and their content
+        items_with_content = self._get_items_to_index()
+        if not items_with_content:
+            self.E = np.array([])
+            self.items = np.array([])
+            return
+
+        # Split into separate lists
+        items, contents = zip(*items_with_content)
+
+        # Get embeddings
+        embeddings = self._get_embeddings(list(contents))  # zip() yields a tuple; pass a list as annotated
+
+        # Store embeddings and item identifiers
+        self.E = np.array(embeddings)
+        self.items = np.array([str(item) for item in items])  # Store string identifiers
+
+    def update(self) -> None:
+        """Update embeddings for changed items only."""
+        if self.E is None or self.items is None or self.commit_hash is None:
+            msg = "No index to update. Call create() or load() first."
+            raise ValueError(msg)
+
+        # Get changed items
+        changed_items = self._get_changed_items()
+        if not changed_items:
+            return
+
+        # Get content for changed items
+        items_with_content = [(item, content) for item, content in self._get_items_to_index() if item in changed_items]
+
+        if not items_with_content:
+            return
+
+        items, contents = zip(*items_with_content)
+        new_embeddings = self._get_embeddings(list(contents))  # zip() yields a tuple; pass a list as annotated
+
+        # Create mapping of items to their indices
+        item_to_idx = {str(item): idx for idx, item in enumerate(self.items)}
+
+        # Update embeddings
+        for item, embedding in zip(items, new_embeddings):
+            item_key = str(item)
+            if item_key in item_to_idx:
+                # Update existing embedding
+                self.E[item_to_idx[item_key]] = embedding
+            else:
+                # Add new embedding
+                self.E = np.vstack([self.E, embedding])
+                self.items = np.append(self.items, item_key)  # store the string identifier, as create() does
+
+        # Update commit hash
+        self.commit_hash = self._get_current_commit()
+
+    def save(self, save_path: Optional[str] = None) -> None:
+        """Save the index to disk."""
+        if self.E is None or self.items is None:
+            msg = "No embeddings to save. Call create() first."
+            raise ValueError(msg)
+
+        save_path = Path(save_path) if save_path else self._get_default_save_path()
+        save_path.parent.mkdir(parents=True, exist_ok=True)
+
+        self._save_index(save_path)
+
+    def load(self, load_path: Optional[str] = None) -> None:
+        """Load the index from disk."""
+        load_path = Path(load_path) if load_path else self._get_default_save_path()
+
+        if not load_path.exists():
+            msg = f"No index found at {load_path}"
+            raise FileNotFoundError(msg)
+
+        self._load_index(load_path)
+
+    @abstractmethod
+    def _save_index(self, path: Path) -> None:
+        """Save index data to disk."""
+        pass
+
+    @abstractmethod
+    def _load_index(self, path: Path) -> None:
+        """Load index data from disk."""
+        pass
+
+    def _similarity_search_raw(self, query: str, k: int = 5) -> list[tuple[str, float]]:
+        """Internal method to find the k most similar items by their string identifiers.
+
+        Args:
+            query: The text to search for
+            k: Number of results to return
+
+        Returns:
+            List of tuples (item_identifier, similarity_score) sorted by similarity
+        """
+        if self.E is None or self.items is None:
+            msg = "No embeddings available. Call create() or load() first."
+            raise ValueError(msg)
+
+        # Get query embedding
+        query_embeddings = self._get_embeddings([query])
+        query_embedding = query_embeddings[0]
+
+        # Compute cosine similarity
+        query_norm = query_embedding / np.linalg.norm(query_embedding)
+        E_norm = self.E / np.linalg.norm(self.E, axis=1)[:, np.newaxis]
+        similarities = np.dot(E_norm, query_norm)
+
+        # Get top k indices
+        top_indices = np.argsort(similarities)[-k:][::-1]
+
+        # Return items and similarity scores
+        return [(str(self.items[idx]), float(similarities[idx])) for idx in top_indices]
+
+    @abstractmethod
+    def similarity_search(self, query: str, k: int = 5) -> list[tuple[T, float]]:
+        """Find the k most similar items to a query.
+
+        Args:
+            query: The text to search for
+            k: Number of results to return
+
+        Returns:
+            List of tuples (item, similarity_score) sorted by similarity
+        """
+        pass
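
The path logic in `_get_default_save_path` fills a `{commit}` placeholder in the filename template with the first eight characters of the commit hash. Here is a small standalone sketch of that idea; `default_save_path` is a hypothetical helper, the template `file_index_{commit}.pkl` is an assumed example of what a subclass might return from `save_file_name`, and the directory creation step is left out so the function has no side effects.

```python
from pathlib import Path


def default_save_path(
    repo_path: str,
    template: str,
    commit_hash: str,
    save_dir: str = ".codegen",
) -> Path:
    """Build <repo_path>/<save_dir>/<template with {commit} filled in>."""
    # Use a short hash so filenames stay readable
    filename = template.format(commit=commit_hash[:8])
    return Path(repo_path) / save_dir / filename


# Hypothetical repository path, template, and commit hash
path = default_save_path("/tmp/repo", "file_index_{commit}.pkl", "0123456789abcdef")
```

Tying the filename to the commit means an index saved at one commit is not picked up by default after the repository moves to another.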