Commit f4af64b

Edward-Codegen and codegen-bot committed
New Codebase Init Flow (#139)
# Motivation
<!-- Why is this change necessary? -->

# Content
<!-- Please include a summary of the change -->

# Testing
<!-- How was the change tested? -->

# Please check the following before marking your PR as ready for review
- [ ] I have added tests for my changes
- [ ] I have updated the documentation or added new documentation as needed
- [ ] I have read and agree to the [Contributor License Agreement](../CLA.md)

---------

Co-authored-by: codegen-bot <[email protected]>
1 parent e055880 commit f4af64b

4 files changed: +126 -26 lines changed

docs/building-with-codegen/parsing-codebases.mdx

Lines changed: 80 additions & 7 deletions
@@ -9,21 +9,29 @@ The primary entrypoint to programs leveraging Codegen is the [Codebase](/api-ref
 
 ## Local Codebases
 
-Construct a Codebase by passing in a path to a local `git` repository.
+Construct a Codebase by passing in a path to a local `git` repository or any subfolder within it. The path must be within a git repository (i.e., somewhere in the parent directory tree must contain a `.git` folder).
 
 ```python
 from codegen import Codebase
+from codegen.sdk.enums import ProgrammingLanguage
 
-# Parse from a local directory
+# Parse from a git repository root
 codebase = Codebase("path/to/repository")
 
-# Parse from current directory
+# Parse from a subfolder within a git repository
+codebase = Codebase("path/to/repository/src/subfolder")
+
+# Parse from current directory (must be within a git repo)
 codebase = Codebase("./")
+
+# Specify programming language (instead of inferring from file extensions)
+codebase = Codebase("./", programming_language=ProgrammingLanguage.TYPESCRIPT)
 ```
 
 <Note>
-This will automatically infer the programming language of the codebase and
-parse all files in the codebase.
+By default, Codegen will automatically infer the programming language of the codebase and
+parse all files in the codebase. You can override this by passing the `programming_language` parameter
+with a value from the `ProgrammingLanguage` enum.
 </Note>
 
 <Tip>
@@ -38,16 +46,18 @@ To fetch and parse a repository directly from GitHub, use the `from_repo` functi
 
 ```python
 import codegen
+from codegen.sdk.enums import ProgrammingLanguage
 
 # Fetch and parse a repository (defaults to /tmp/codegen/{repo_name})
 codebase = codegen.from_repo('fastapi/fastapi')
 
-# Customize temp directory, clone depth, or specific commit
+# Customize temp directory, clone depth, specific commit, or programming language
 codebase = codegen.from_repo(
     'fastapi/fastapi',
     tmp_dir='/custom/temp/dir',  # Optional: custom temp directory
-    commit='786a8ada7ed0c7f9d8b04d49f24596865e4b7901',
+    commit='786a8ada7ed0c7f9d8b04d49f24596865e4b7901',  # Optional: specific commit
     shallow=False,  # Optional: full clone instead of shallow
+    programming_language=ProgrammingLanguage.PYTHON  # Optional: override language detection
 )
 ```
 
@@ -56,6 +66,69 @@ codebase = codegen.from_repo(
 default. The clone is shallow by default for better performance.
 </Note>
 
+## Configuration Options
+
+You can customize the behavior of your Codebase instance by passing a `CodebaseConfig` object. This allows you to configure secrets (like API keys) and toggle specific features:
+
+```python
+from codegen import Codebase
+from codegen.sdk.codebase.config import CodebaseConfig, GSFeatureFlags, Secrets
+
+codebase = Codebase(
+    "path/to/repository",
+    config=CodebaseConfig(
+        secrets=Secrets(
+            openai_key="your-openai-key"  # For AI-powered features
+        ),
+        feature_flags=GSFeatureFlags(
+            sync_enabled=True,  # Enable graph synchronization
+            ...  # Add other feature flags as needed
+        )
+    )
+)
+```
+
+The `CodebaseConfig` allows you to configure:
+- `secrets`: API keys and other sensitive information needed by the codebase
+- `feature_flags`: Toggle specific features like language engines, dependency management, and graph synchronization
+
+For a complete list of available feature flags and configuration options, see the [source code on GitHub](https://github.com/codegen-sh/codegen-sdk/blob/develop/src/codegen/sdk/codebase/config.py).
+
+## Advanced Initialization
+
+For more complex scenarios, Codegen supports an advanced initialization mode using `ProjectConfig`. This allows for fine-grained control over:
+
+- Repository configuration
+- Base path and subdirectory filtering
+- Multiple project configurations
+
+Here's an example:
+
+```python
+from codegen import Codebase
+from codegen.git.repo_operator.local_repo_operator import LocalRepoOperator
+from codegen.git.schemas.repo_config import BaseRepoConfig
+from codegen.sdk.codebase.config import ProjectConfig
+from codegen.sdk.enums import ProgrammingLanguage
+
+codebase = Codebase(
+    projects = [
+        ProjectConfig(
+            repo_operator=LocalRepoOperator(
+                repo_path="/tmp/codegen-sdk",
+                repo_config=BaseRepoConfig(),
+                bot_commit=True
+            ),
+            programming_language=ProgrammingLanguage.TYPESCRIPT,
+            base_path="src/codegen/sdk/typescript",
+            subdirectories=["src/codegen/sdk/typescript"]
+        )
+    ]
+)
+```
+
+For more details on advanced configuration options, see the [source code on GitHub](https://github.com/codegen-sh/codegen-sdk/blob/develop/src/codegen/sdk/core/codebase.py).
+
 ## Supported Languages
 
 Codegen currently supports:

src/codegen/git/repo_operator/local_repo_operator.py

Lines changed: 2 additions & 1 deletion
@@ -32,14 +32,15 @@ class LocalRepoOperator(RepoOperator):
 
     def __init__(
         self,
-        repo_config: BaseRepoConfig,
         repo_path: str,  # full path to the repo
+        repo_config: BaseRepoConfig | None = None,
         bot_commit: bool = True,
     ) -> None:
         self._repo_path = repo_path
         self._repo_name = os.path.basename(repo_path)
         os.makedirs(self.repo_path, exist_ok=True)
         GitCLI.init(self.repo_path)
+        repo_config = repo_config or BaseRepoConfig()
         super().__init__(repo_config, self.repo_path, bot_commit)
 
     ####################################################################################################################
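
The signature change above makes `repo_config` optional and moves `repo_path` to the first position. A minimal sketch of the same `None`-default pattern follows. It uses hypothetical stand-in classes (`RepoConfigStub`, `OperatorSketch`) rather than the real `BaseRepoConfig` and `LocalRepoOperator`, so it runs without Codegen installed and should not be read as the library's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RepoConfigStub:
    """Hypothetical stand-in for BaseRepoConfig."""
    labels: List[str] = field(default_factory=list)


class OperatorSketch:
    """Hypothetical stand-in for LocalRepoOperator; only mirrors the argument handling."""

    def __init__(
        self,
        repo_path: str,
        repo_config: Optional[RepoConfigStub] = None,
        bot_commit: bool = True,
    ) -> None:
        self.repo_path = repo_path
        # Build a fresh default per instance when the caller passes nothing.
        self.repo_config = repo_config or RepoConfigStub()
        self.bot_commit = bot_commit


# Each operator created without a config gets its own default object.
a = OperatorSketch("/tmp/a")
b = OperatorSketch("/tmp/b")
print(a.repo_config is b.repo_config)
```

Because the parameter order changed, a caller that previously passed `repo_config` as the first positional argument would now bind it to `repo_path`. Passing both by keyword, as the docs example in this commit does, avoids that.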

src/codegen/sdk/codebase/config.py

Lines changed: 27 additions & 0 deletions
@@ -1,8 +1,13 @@
+import os
+from typing import Self
+
 from pydantic import BaseModel, ConfigDict, Field
 
+from codegen.git.repo_operator.local_repo_operator import LocalRepoOperator
 from codegen.git.repo_operator.repo_operator import RepoOperator
 from codegen.sdk.enums import ProgrammingLanguage
 from codegen.sdk.secrets import Secrets
+from codegen.sdk.utils import determine_project_language, split_git_path
 
 HARD_MAX_AI_LIMIT = 500  # Global limit for AI requests
 
@@ -55,6 +60,28 @@ class ProjectConfig(BaseModel):
     subdirectories: list[str] | None = None
     programming_language: ProgrammingLanguage = ProgrammingLanguage.PYTHON
 
+    @classmethod
+    def from_path(cls, path: str, programming_language: ProgrammingLanguage | None = None) -> Self:
+        # Split repo_path into (git_root, base_path)
+        repo_path = os.path.abspath(path)
+        git_root, base_path = split_git_path(repo_path)
+        # Create main project
+        return cls(
+            repo_operator=LocalRepoOperator(repo_path=git_root),
+            programming_language=programming_language or determine_project_language(repo_path),
+            base_path=base_path,
+            subdirectories=[base_path] if base_path else None,
+        )
+
+    @classmethod
+    def from_repo_operator(cls, repo_operator: RepoOperator, programming_language: ProgrammingLanguage | None = None, base_path: str | None = None) -> Self:
+        return cls(
+            repo_operator=repo_operator,
+            programming_language=programming_language or determine_project_language(repo_operator.repo_path),
+            base_path=base_path,
+            subdirectories=[base_path] if base_path else None,
+        )
+
 
 class CodebaseConfig(BaseModel):
     """Configuration for a Codebase. There can be 1 -> many codebases in a single repo
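
`ProjectConfig.from_path` above depends on `split_git_path` from `codegen.sdk.utils`, whose body is not part of this diff. The call site only implies a contract: a path goes in, and a `(git_root, base_path)` pair comes out, where a falsy `base_path` means the path is the repository root. The sketch below shows one way such a helper could behave. The name `split_git_path_sketch` and the walk-up-the-parents logic are assumptions made for illustration, not the library's code.

```python
import os
import tempfile
from typing import Tuple


def split_git_path_sketch(path: str) -> Tuple[str, str]:
    """Hypothetical stand-in: return (git_root, path relative to git_root)."""
    path = os.path.abspath(path)
    current = path
    while True:
        # `.git` is usually a directory, but can be a file (worktrees, submodules).
        if os.path.exists(os.path.join(current, ".git")):
            rel = os.path.relpath(path, current)
            return current, "" if rel == "." else rel
        parent = os.path.dirname(current)
        if parent == current:
            # Reached the filesystem root without finding a repository.
            raise ValueError(f"{path} is not inside a git repository")
        current = parent


# Usage: build a throwaway directory tree with a fake `.git` folder.
with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)
    os.makedirs(os.path.join(root, ".git"))
    os.makedirs(os.path.join(root, "src", "pkg"))
    print(split_git_path_sketch(os.path.join(root, "src", "pkg")))
```

With a helper of this shape, `Codebase("path/to/repository/src/subfolder")` resolves the repository root for the operator while keeping the subfolder as `base_path`, which matches the `subdirectories=[base_path] if base_path else None` line in the diff.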

src/codegen/sdk/core/codebase.py

Lines changed: 17 additions & 18 deletions
@@ -23,7 +23,6 @@
 from codegen.git.repo_operator.remote_repo_operator import RemoteRepoOperator
 from codegen.git.repo_operator.repo_operator import RepoOperator
 from codegen.git.schemas.enums import CheckoutResult
-from codegen.git.schemas.repo_config import BaseRepoConfig
 from codegen.sdk._proxy import proxy_property
 from codegen.sdk.ai.helpers import AbstractAIHelper, MultiProviderAIHelper
 from codegen.sdk.codebase.codebase_ai import generate_system_prompt, generate_tools
@@ -74,7 +73,6 @@
 from codegen.sdk.typescript.statements.import_statement import TSImportStatement
 from codegen.sdk.typescript.symbol import TSSymbol
 from codegen.sdk.typescript.type_alias import TSTypeAlias
-from codegen.sdk.utils import determine_project_language, split_git_path
 from codegen.shared.decorators.docs import apidoc, noapidoc, py_noapidoc
 from codegen.shared.exceptions.control_flow import MaxAIRequestsError
 from codegen.shared.performance.stopwatch_utils import stopwatch
@@ -119,7 +117,8 @@ def __init__(
         self,
         repo_path: None = None,
         *,
-        projects: list[ProjectConfig],
+        programming_language: None = None,
+        projects: list[ProjectConfig] | ProjectConfig,
         config: CodebaseConfig = DefaultConfig,
     ) -> None: ...
 
@@ -128,6 +127,7 @@ def __init__(
         self,
         repo_path: str,
         *,
+        programming_language: ProgrammingLanguage,
         projects: None = None,
         config: CodebaseConfig = DefaultConfig,
     ) -> None: ...
@@ -136,7 +136,8 @@ def __init__(
         self,
         repo_path: str | None = None,
         *,
-        projects: list[ProjectConfig] | None = None,
+        programming_language: ProgrammingLanguage | None = None,
+        projects: list[ProjectConfig] | ProjectConfig | None = None,
         config: CodebaseConfig = DefaultConfig,
     ) -> None:
         # Sanity check inputs
@@ -146,19 +147,16 @@ def __init__(
         if repo_path is None and projects is None:
             raise ValueError("Must specify either repo_path or projects")
 
+        if projects is not None and programming_language is not None:
+            raise ValueError("Cannot specify both projects and programming_language. Use ProjectConfig.from_path() to create projects with a custom programming_language.")
+
+        # If projects is a single ProjectConfig, convert it to a list
+        if isinstance(projects, ProjectConfig):
+            projects = [projects]
+
         # Initialize project with repo_path if projects is None
         if repo_path is not None:
-            # Split repo_path into (git_root, base_path)
-            repo_path = os.path.abspath(repo_path)
-            git_root, base_path = split_git_path(repo_path)
-            # Create repo_config
-            repo_config = BaseRepoConfig()
-            # Create main project
-            main_project = ProjectConfig(
-                repo_operator=LocalRepoOperator(repo_config=repo_config, repo_path=git_root),
-                programming_language=determine_project_language(repo_path),
-                base_path=base_path,
-            )
+            main_project = ProjectConfig.from_path(repo_path, programming_language=programming_language)
             projects = [main_project]
         else:
             main_project = projects[0]
@@ -1137,14 +1135,16 @@ def set_session_options(self, **kwargs: Unpack[SessionOptions]) -> None:
         self.G.transaction_manager.reset_stopwatch(self.G.session_options.max_seconds)
 
     @classmethod
-    def from_repo(cls, repo_name: str, *, tmp_dir: str | None = None, commit: str | None = None, shallow: bool = True) -> "Codebase":
+    def from_repo(cls, repo_name: str, *, tmp_dir: str | None = None, commit: str | None = None, shallow: bool = True, programming_language: ProgrammingLanguage | None = None) -> "Codebase":
         """Fetches a codebase from GitHub and returns a Codebase instance.
 
         Args:
             repo_name (str): The name of the repository in format "owner/repo"
             tmp_dir (Optional[str]): The directory to clone the repo into. Defaults to /tmp/codegen
             commit (Optional[str]): The specific commit hash to clone. Defaults to HEAD
             shallow (bool): Whether to do a shallow clone. Defaults to True
+            programming_language (ProgrammingLanguage | None): The programming language of the repo. Defaults to None.
+
         Returns:
             Codebase: A Codebase instance initialized with the cloned repository
         """
@@ -1175,15 +1175,14 @@ def from_repo(cls, repo_name: str, *, tmp_dir: str | None = None, commit: str |
         # Ensure the operator can handle remote operations
         repo_operator = LocalRepoOperator.create_from_commit(
             repo_path=repo_path,
-            default_branch="main",  # We'll get the actual default branch after clone
             commit=commit,
             url=repo_url,
         )
         logger.info("Clone completed successfully")
 
         # Initialize and return codebase with proper context
         logger.info("Initializing Codebase...")
-        project = ProjectConfig(repo_operator=repo_operator, programming_language=determine_project_language(repo_path))
+        project = ProjectConfig.from_repo_operator(repo_operator=repo_operator, programming_language=programming_language)
         codebase = Codebase(projects=[project], config=DefaultConfig)
         logger.info("Codebase initialization complete")
         return codebase
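
The input handling added to `Codebase.__init__` in this file can be read as a small normalization step: reject `projects` combined with `programming_language`, wrap a single `ProjectConfig` in a list, and otherwise build one project from `repo_path`. The sketch below reproduces only the checks visible in the diff. `ProjectConfigStub` and `resolve_projects` are hypothetical names, and the stub stands in for `ProjectConfig.from_path`, so treat this as an illustration of the control flow rather than the real constructor.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class ProjectConfigStub:
    """Hypothetical stand-in for ProjectConfig; carries only a label."""
    name: str


def resolve_projects(
    repo_path: Optional[str] = None,
    *,
    programming_language: Optional[str] = None,
    projects: Union[List[ProjectConfigStub], ProjectConfigStub, None] = None,
) -> List[ProjectConfigStub]:
    if repo_path is None and projects is None:
        raise ValueError("Must specify either repo_path or projects")
    if projects is not None and programming_language is not None:
        raise ValueError("Cannot specify both projects and programming_language")
    # A single project is accepted and wrapped in a list.
    if isinstance(projects, ProjectConfigStub):
        projects = [projects]
    if repo_path is not None:
        # Stand-in for ProjectConfig.from_path(repo_path, programming_language=...).
        language = programming_language or "inferred"
        projects = [ProjectConfigStub(name=f"{repo_path}:{language}")]
    return projects


print(resolve_projects("./"))
print(resolve_projects(projects=ProjectConfigStub("ts-project")))
```

The mutual-exclusion check keeps the language in one place: either the caller supplies it alongside `repo_path`, or it lives on each `ProjectConfig`.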
