Architecture docs v0 #225

Merged
merged 9 commits into from
Feb 3, 2025
5 changes: 5 additions & 0 deletions .github/pull_request_template.md
@@ -1,10 +1,15 @@
# Motivation

<!-- Why is this change necessary? -->

# Content

<!-- Please include a summary of the change -->

# Testing

<!-- How was the change tested? -->

# Please check the following before marking your PR as ready for review

- [ ] I have added tests for my changes
10 changes: 10 additions & 0 deletions .pre-commit-config.yaml
@@ -103,3 +103,13 @@ repos:
        language: system
        pass_filenames: false
        always_run: true
  - repo: https://github.com/hukkin/mdformat
    rev: 0.7.22 # Use the ref you want to point at
    hooks:
      - id: mdformat
        # Optionally add plugins
        additional_dependencies:
          - mdformat-gfm
          - mdformat-ruff
          - mdformat-config
          - mdformat-pyproject
55 changes: 30 additions & 25 deletions CLA.md
@@ -7,44 +7,49 @@
**Project Owner/Organization:** Codegen, Inc.

1. **Definitions**
1. **“You”** or **“Contributor”** means the individual or entity (and its Affiliates) that Submits a Contribution.
2. **“Contribution”** means any work of authorship (including any modifications or additions) that is intentionally Submitted by You for inclusion in the Project, in any form (including but not limited to source code, documentation, or other materials).
3. **“Submit”** or **“Submitted”** means any act of transferring a Contribution to Codegen, Inc. via pull request, email, or any other method of communication for the purpose of inclusion in the Project.
2. **Grant of Copyright License**

Subject to the terms and conditions of this CLA, You hereby grant to Codegen, Inc. and to recipients of software distributed by Codegen, Inc.:
1. **“You”** or **“Contributor”** means the individual or entity (and its Affiliates) that Submits a Contribution.
1. **“Contribution”** means any work of authorship (including any modifications or additions) that is intentionally Submitted by You for inclusion in the Project, in any form (including but not limited to source code, documentation, or other materials).
1. **“Submit”** or **“Submitted”** means any act of transferring a Contribution to Codegen, Inc. via pull request, email, or any other method of communication for the purpose of inclusion in the Project.

- A perpetual, worldwide, non-exclusive, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works.
3. **Grant of Patent License**
1. **Grant of Copyright License**

Subject to the terms and conditions of this CLA, You hereby grant to Codegen, Inc. and to recipients of software distributed by Codegen, Inc. a perpetual, worldwide, non-exclusive, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer Your Contribution, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution alone or by combination of Your Contribution with the Project to which You Submitted it.
Subject to the terms and conditions of this CLA, You hereby grant to Codegen, Inc. and to recipients of software distributed by Codegen, Inc.:

If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that Your Contribution, or the Project to which You have contributed, directly or indirectly infringes any patent, then any patent licenses granted to that entity under this CLA for that Contribution or Project shall terminate as of the date such litigation is filed.
- A perpetual, worldwide, non-exclusive, royalty-free, irrevocable copyright license to reproduce, prepare derivative works of, publicly display, publicly perform, sublicense, and distribute Your Contributions and such derivative works.

4. **Representations and Warranties**
1. **Original Work**. You represent that each of Your Contributions is an original work of authorship and that You have the necessary rights to grant the licenses under this CLA.
2. **Third-Party Rights**. If Your employer(s) or any third party has rights to intellectual property that You create, You represent that You have received permission to make Contributions on behalf of that employer or third party (or that such employer or third party has waived those rights for Your Contributions).
3. **No Other Agreements**. You represent that You are not aware of any other agreement or obligation that is inconsistent with the rights granted under this CLA.
5. **Disclaimer of Warranty**
1. **Grant of Patent License**

UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING, YOU PROVIDE YOUR CONTRIBUTIONS ON AN **“AS IS”** BASIS, **WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND**, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
Subject to the terms and conditions of this CLA, You hereby grant to Codegen, Inc. and to recipients of software distributed by Codegen, Inc. a perpetual, worldwide, non-exclusive, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer Your Contribution, where such license applies only to those patent claims licensable by You that are necessarily infringed by Your Contribution alone or by combination of Your Contribution with the Project to which You Submitted it.

6. **Limitation of Liability**
If any entity institutes patent litigation against You or any other entity (including a cross-claim or counterclaim in a lawsuit) alleging that Your Contribution, or the Project to which You have contributed, directly or indirectly infringes any patent, then any patent licenses granted to that entity under this CLA for that Contribution or Project shall terminate as of the date such litigation is filed.

IN NO EVENT SHALL CODEGEN, INC. OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE), ARISING IN ANY WAY OUT OF OR IN CONNECTION WITH THIS AGREEMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
1. **Representations and Warranties**

7. **Subsequent Contributions and Updates**
1. **Original Work**. You represent that each of Your Contributions is an original work of authorship and that You have the necessary rights to grant the licenses under this CLA.
1. **Third-Party Rights**. If Your employer(s) or any third party has rights to intellectual property that You create, You represent that You have received permission to make Contributions on behalf of that employer or third party (or that such employer or third party has waived those rights for Your Contributions).
1. **No Other Agreements**. You represent that You are not aware of any other agreement or obligation that is inconsistent with the rights granted under this CLA.

You agree that all current and future Contributions to the Project Submitted by You shall be subject to the terms of this CLA. Codegen, Inc. may publish updates to this CLA from time to time; in such case, You may need to agree to new terms before any subsequent Contributions.
1. **Disclaimer of Warranty**

8. **License Modification Rights**
UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING, YOU PROVIDE YOUR CONTRIBUTIONS ON AN **“AS IS”** BASIS, **WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND**, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OR CONDITIONS OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.

You agree that Codegen, Inc. may change the license(s) applicable to the open source project(s) to which Your Contributions relate at Codegen, Inc.’s sole discretion, including without limitation by re-licensing the project(s) and Your Contributions under any other open source or “free” software license, or a commercial or proprietary license of Codegen, Inc.’s choosing.
1. **Limitation of Liability**

9. **Governing Law**
IN NO EVENT SHALL CODEGEN, INC. OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE), ARISING IN ANY WAY OUT OF OR IN CONNECTION WITH THIS AGREEMENT, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

This CLA shall be governed by and construed in accordance with the laws of the State of Delaware, without regard to its conflicts of laws provisions.
1. **Subsequent Contributions and Updates**

10. **Signature / Electronic Consent**
You agree that all current and future Contributions to the Project Submitted by You shall be subject to the terms of this CLA. Codegen, Inc. may publish updates to this CLA from time to time; in such case, You may need to agree to new terms before any subsequent Contributions.

By signing or otherwise indicating Your acceptance of this CLA, You acknowledge that You have read and agree to be bound by its terms. If You are signing on behalf of an entity, You represent and warrant that You have the authority to do so.
1. **License Modification Rights**

You agree that Codegen, Inc. may change the license(s) applicable to the open source project(s) to which Your Contributions relate at Codegen, Inc.’s sole discretion, including without limitation by re-licensing the project(s) and Your Contributions under any other open source or “free” software license, or a commercial or proprietary license of Codegen, Inc.’s choosing.

1. **Governing Law**

This CLA shall be governed by and construed in accordance with the laws of the State of Delaware, without regard to its conflicts of laws provisions.

1. **Signature / Electronic Consent**

By signing or otherwise indicating Your acceptance of this CLA, You acknowledge that You have read and agree to be bound by its terms. If You are signing on behalf of an entity, You represent and warrant that You have the authority to do so.
15 changes: 9 additions & 6 deletions CONTRIBUTING.md
@@ -7,8 +7,8 @@ Thank you for your interest in contributing to Codegen! This document outlines t
By contributing to Codegen, you agree that:

1. Your contributions will be licensed under the project's license.
2. You have the right to license your contribution under the project's license.
3. You grant Codegen a perpetual, worldwide, non-exclusive, royalty-free license to use your contribution.
1. You have the right to license your contribution under the project's license.
1. You grant Codegen a perpetual, worldwide, non-exclusive, royalty-free license to use your contribution.

See our [CLA](CLA.md) for more details.

@@ -19,6 +19,7 @@ See our [CLA](CLA.md) for more details.
UV is a fast Python package installer and resolver. To install:

**macOS**:

```bash
brew install uv
```
@@ -28,13 +29,15 @@ For other platforms, see the [UV installation docs](https://github.com/astral-sh
### Setting Up the Development Environment

After installing UV, set up your development environment:

```bash
uv venv
source .venv/bin/activate
uv sync --dev
```

> [!TIP]
>
> - If sync fails with `missing field 'version'`, you may need to delete lockfile and rerun `rm uv.lock && uv sync --dev`.
> - If sync fails with failed compilation, you may need to install clang and rerun `uv sync --dev`.

@@ -51,10 +54,10 @@ uv run pytest tests/integration/codemod/test_codemods.py -n auto
## Pull Request Process

1. Fork the repository and create your branch from `develop`.
2. Ensure your code passes all tests.
3. Update documentation as needed.
4. Submit a pull request to the `develop` branch.
5. Include a clear description of your changes in the PR.
1. Ensure your code passes all tests.
1. Update documentation as needed.
1. Submit a pull request to the `develop` branch.
1. Include a clear description of your changes in the PR.

## Release Process

6 changes: 3 additions & 3 deletions README.md
@@ -24,7 +24,6 @@

[Codegen](https://docs.codegen.com) is a Python library for manipulating codebases.


```python
from codegen import Codebase

@@ -37,11 +36,13 @@ for function in codebase.functions:
# Comprehensive static analysis for references, dependencies, etc.
if not function.usages:
# Auto-handles references and imports to maintain correctness
function.move_to_file('deprecated.py')
function.move_to_file("deprecated.py")
```

Write code that transforms code. Codegen combines the parsing power of [Tree-sitter](https://tree-sitter.github.io/tree-sitter/) with the graph algorithms of [rustworkx](https://github.com/Qiskit/rustworkx) to enable scriptable, multi-language code manipulation at scale.

## Installation and Usage

We support

- Running Codegen in Python 3.12 – 3.13
@@ -50,7 +51,6 @@ We support
- Windows is not supported
- Python, TypeScript, JavaScript and React codebases


```
# Install inside existing project
uv pip install codegen
19 changes: 19 additions & 0 deletions architecture/1. plumbing/file-discovery.md
@@ -0,0 +1,19 @@
# File Discovery

The file discovery process is responsible for identifying and organizing all relevant files in a project that need to be processed by the SDK.

## Initialization

- We take in either a list of projects or a path to a filesystem.
- If we get a path, we detect the programming language, initialize the git client based on the path, and get a Project.

## File discovery

- We discover files using the git client so that gitignored files are excluded
- We then filter files based on the language and the project configuration
- If specified, we filter by subdirectories
- We also filter by file extensions
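The discovery and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: it assumes a plain `git ls-files` subprocess call in place of the SDK's git client, and the names `EXTENSIONS_BY_LANGUAGE`, `list_git_files`, and `filter_files` are hypothetical.

```python
import subprocess
from pathlib import PurePosixPath

# Hypothetical extension table; the SDK's real per-language configuration is not shown here.
EXTENSIONS_BY_LANGUAGE = {
    "python": {".py"},
    "typescript": {".ts", ".tsx", ".js", ".jsx"},
}


def list_git_files(repo_path):
    """List tracked and untracked files, skipping anything gitignored.

    Assumption: shells out to git instead of using the SDK's git client.
    """
    result = subprocess.run(
        ["git", "ls-files", "--cached", "--others", "--exclude-standard"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def filter_files(paths, language, subdirectories=None):
    """Keep paths with a matching extension that sit under one of the subdirectories (if given)."""
    extensions = EXTENSIONS_BY_LANGUAGE[language]
    kept = []
    for raw in paths:
        path = PurePosixPath(raw)
        if path.suffix not in extensions:
            continue
        if subdirectories and not any(path.is_relative_to(d) for d in subdirectories):
            continue
        kept.append(raw)
    return kept
```

In this sketch, `filter_files(list_git_files("."), "python", ["src"])` would return only the non-ignored `.py` files under `src`.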

## Next Step

After file discovery is complete, the files are passed to the [Tree-sitter Parsing](../2.%20parsing/A.%20Tree%20Sitter.md) phase, where each file is parsed into a concrete syntax tree.
33 changes: 33 additions & 0 deletions architecture/2. parsing/A. Tree Sitter.md
@@ -0,0 +1,33 @@
# Tree-sitter Parsing

Tree-sitter is used as the primary parsing engine for converting source code into concrete syntax trees. It supports two modes of operation, described after the following example:

```python
def my_function():
    pass
```

Tree-sitter parses this into the following tree:

```
module [0, 0] - [3, 0]
  function_definition [0, 0] - [1, 8]
    name: identifier [0, 4] - [0, 15]
    parameters: parameters [0, 15] - [0, 17]
    body: block [1, 4] - [1, 8]
      pass_statement [1, 4] - [1, 8]
```

- A CST mode, which includes syntax nodes (for example, the `def` keyword, the colon, or parentheses). These syntax nodes are "anonymous" and don't have any semantic meaning.
  - You don't see these nodes in the tree-sitter output above, but they are there.
- An AST mode, where we only focus on the semantic nodes (for example, the `my_function` identifier and the `pass` statement). These are "named nodes" and have semantic meaning.
  - Named nodes are different from field names (like `body`). A field name says nothing about the node itself; it indicates what role the child node (`block`) plays in the parent node (`function_definition`).
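To make the distinction between anonymous nodes, named nodes, and field names concrete, here is a small sketch. The `Node` class is a hypothetical stand-in for a tree-sitter node, built by hand for `def my_function(): pass`; it is not the tree-sitter API itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a tree-sitter node (illustration only)."""

    type: str
    is_named: bool
    field_name: Optional[str] = None  # role of this node within its parent
    children: List["Node"] = field(default_factory=list)

    @property
    def named_children(self):
        # The "AST view": drop anonymous syntax nodes.
        return [child for child in self.children if child.is_named]


function = Node(
    "function_definition",
    True,
    children=[
        Node("def", False),  # anonymous keyword node
        Node("identifier", True, field_name="name"),
        Node("parameters", True, field_name="parameters"),
        Node(":", False),  # anonymous punctuation node
        Node("block", True, field_name="body"),
    ],
)

cst_view = [child.type for child in function.children]
ast_view = [child.type for child in function.named_children]
```

Here `cst_view` contains all five children, while `ast_view` keeps only `identifier`, `parameters`, and `block`. The `field_name` values describe the role each child plays in `function_definition`, independently of whether the child is named.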

## Implementation Details

- We construct a mapping between file type and the tree-sitter grammar
- For each file given to us (via git), we parse it using the appropriate grammar
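A rough sketch of the file-type-to-grammar mapping, assuming it is keyed by file extension. The table contents and the names `GRAMMAR_BY_EXTENSION`, `grammar_for`, and `group_by_grammar` are hypothetical; the actual mapping lives in the SDK and is not shown in this document.

```python
from pathlib import PurePosixPath

# Hypothetical table: grammar names keyed by file extension.
GRAMMAR_BY_EXTENSION = {
    ".py": "python",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "tsx",
}


def grammar_for(path):
    """Return the grammar name for a file, or None if no grammar is registered."""
    return GRAMMAR_BY_EXTENSION.get(PurePosixPath(path).suffix)


def group_by_grammar(paths):
    """Group files by grammar so each group can be parsed with a single parser."""
    groups = {}
    for path in paths:
        grammar = grammar_for(path)
        if grammar is not None:
            groups.setdefault(grammar, []).append(path)
    return groups
```

Files without a registered grammar are skipped in this sketch; whether the SDK skips them or reports them is not stated here.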

## Next Step

Once the concrete syntax trees are built, they are transformed into our abstract syntax tree representation in the [AST Construction](./B.%20AST%20Construction.md) phase.
77 changes: 77 additions & 0 deletions architecture/2. parsing/B. AST Construction.md
@@ -0,0 +1,77 @@
# AST Construction

The tree-sitter CST/AST is powerful, but it is oriented toward syntax highlighting rather than semantic meaning.
For example, take decorators:

```python
@decorator
def my_function():
    pass
```

```
module [0, 0] - [3, 0]
  decorated_definition [0, 0] - [2, 8]
    decorator [0, 0] - [0, 10]
      identifier [0, 1] - [0, 10]
    definition: function_definition [1, 0] - [2, 8]
      name: identifier [1, 4] - [1, 15]
      parameters: parameters [1, 15] - [1, 17]
      body: block [2, 4] - [2, 8]
        pass_statement [2, 4] - [2, 8]
```

You can see that the `decorated_definition` node has a decorator and a definition. This makes sense for syntax highlighting - the decorator is highlighted separately from the function definition.

However, this is not useful for semantic analysis. We need to know that the decorator is decorating the function definition - there is a single function definition, which may have multiple decorators.
The mismatch becomes even more visible when we consider function call chains:

```python
a().b().c().d()
```

```
module [0, 0] - [2, 0]
  expression_statement [0, 0] - [0, 15]
    call [0, 0] - [0, 15]
      function: attribute [0, 0] - [0, 13]
        object: call [0, 0] - [0, 11]
          function: attribute [0, 0] - [0, 9]
            object: call [0, 0] - [0, 7]
              function: attribute [0, 0] - [0, 5]
                object: call [0, 0] - [0, 3]
                  function: identifier [0, 0] - [0, 1]
                  arguments: argument_list [0, 1] - [0, 3]
                attribute: identifier [0, 4] - [0, 5]
              arguments: argument_list [0, 5] - [0, 7]
            attribute: identifier [0, 8] - [0, 9]
          arguments: argument_list [0, 9] - [0, 11]
        attribute: identifier [0, 12] - [0, 13]
      arguments: argument_list [0, 13] - [0, 15]
```

You can see that the chain of calls is represented as a deeply nested structure. This is not useful for semantic analysis or performing edits on these nodes. Therefore, when parsing we need to build an AST that is more useful for semantic analysis.
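One way to picture the more useful shape is to flatten the nested calls into an ordered list. The sketch below is a simplified illustration under stated assumptions: the `Identifier`, `Attribute`, and `Call` classes are hypothetical stand-ins that mirror only the `function`, `object`, and `attribute` fields from the tree above, and `flatten_call_chain` is not the SDK's actual function.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Identifier:
    name: str


@dataclass
class Attribute:
    object: "Call"  # simplified: real trees allow any expression here
    attribute: str


@dataclass
class Call:
    function: Union[Identifier, Attribute]


def flatten_call_chain(call: Call) -> List[str]:
    """Walk from the outermost call inward and return callee names in source order."""
    names = []
    node = call
    while True:
        callee = node.function
        if isinstance(callee, Identifier):
            names.append(callee.name)
            break
        names.append(callee.attribute)
        node = callee.object
    names.reverse()
    return names


# Hand-built equivalent of the tree for a().b().c().d()
chain = Call(Attribute(Call(Attribute(Call(Attribute(Call(Identifier("a")), "b")), "c")), "d"))
flat = flatten_call_chain(chain)
```

A flat list like this makes operations such as "find the call before `c`" or "rename the second call" an index lookup rather than a tree walk.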

## Implementation

- For each file, we parse a file-specific AST
- We offer two modes of parsing:
  - Pattern based parsing: It maps a particular node type to a semantic node type. For example, we broadly map all identifiers to the `Name` node type.
  - Custom parsing: It takes a CST and builds a custom node type. For example, we can turn a `decorated_definition` node into a `function_definition` node with decorators. This involves careful arranging of the CST nodes into a new structure.

## Pattern based parsing

Pattern based parsing relies on a mapping between the tree-sitter node types and our semantic node types. These mappings are language specific and stored in `node_classes`; `parser.py` processes them at runtime. We access them through helper functions such as `child_by_field_name` and `_parse_expression`, which wrap the corresponding tree-sitter methods and parse the resulting tree-sitter node into our semantic node.
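As a rough illustration of the idea, a pattern based parser can be reduced to a table lookup. Everything in this sketch is hypothetical: `NODE_CLASSES`, `parse_expression`, `String`, and `Unparsed` are illustrative names and do not reflect the real contents of `node_classes` or the signature of `_parse_expression`.

```python
from dataclasses import dataclass


@dataclass
class Name:
    source: str


@dataclass
class String:
    source: str


@dataclass
class Unparsed:
    """Fallback for tree-sitter node types that have no mapping (assumption)."""

    ts_type: str
    source: str


# Hypothetical per-language table: tree-sitter node type -> semantic node class.
NODE_CLASSES = {
    "identifier": Name,
    "string": String,
}


def parse_expression(ts_type, source):
    """Look up the semantic class for a tree-sitter node type and instantiate it."""
    node_class = NODE_CLASSES.get(ts_type)
    if node_class is None:
        return Unparsed(ts_type, source)
    return node_class(source)
```

For example, `parse_expression("identifier", "my_function")` yields a `Name`, matching the statement that identifiers are broadly mapped to the `Name` node type.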

## Custom parsing

Custom parsers are more complex and require more work, because each one rearranges CST nodes into a new structure by hand. Most symbols (classes, functions, etc.), imports, exports, and other complex constructs are parsed using custom parsing.
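The decorator case can be sketched as a small custom parser that rearranges a `decorated_definition` into a single function with decorators attached. `TSNode`, `Function`, and `parse_decorated_definition` are hypothetical names for illustration; the SDK's own classes and parsing logic are not shown in this document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TSNode:
    """Hypothetical, simplified stand-in for a tree-sitter node."""

    type: str
    text: str = ""
    children: List["TSNode"] = field(default_factory=list)


@dataclass
class Function:
    """Hypothetical semantic node: one function that owns its decorators."""

    name: str
    decorators: List[str] = field(default_factory=list)


def parse_decorated_definition(node: TSNode) -> Function:
    """Rearrange decorator and definition siblings into a single Function."""
    decorators = [child.text for child in node.children if child.type == "decorator"]
    definition = next(child for child in node.children if child.type == "function_definition")
    name = next(child.text for child in definition.children if child.type == "identifier")
    return Function(name=name, decorators=decorators)


cst = TSNode(
    "decorated_definition",
    children=[
        TSNode("decorator", "@decorator"),
        TSNode("function_definition", children=[TSNode("identifier", "my_function")]),
    ],
)
function = parse_decorated_definition(cst)
```

The result is a `Function` named `my_function` whose `decorators` list holds `@decorator`, so later analysis deals with one function node instead of two sibling nodes.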

## Statement parsing

Statements have another layer of complexity. They are essentially pattern based, but the mapping and logic are defined directly in `parser.py`.

## Next Step

After the AST is constructed, the system moves on to [Import Resolution](../3.%20imports-exports/A.%20Imports.md) to analyze module dependencies and resolve symbols across files.
7 changes: 7 additions & 0 deletions architecture/3. imports-exports/A. Imports.md
@@ -0,0 +1,7 @@
# Import Resolution

TODO

## Next Step

After import resolution, the system performs [Export Analysis](./B.%20Exports.md) and handles [TSConfig Support](./C.%20TSConfig.md) for TypeScript projects. This is followed by comprehensive [Type Analysis](../4.%20type-analysis/A.%20Type%20Analysis.md).
7 changes: 7 additions & 0 deletions architecture/3. imports-exports/B. Exports.md
@@ -0,0 +1,7 @@
# Export Analysis

TODO

## Next Step

After export analysis is complete, the system handles [TSConfig Support](./C.%20TSConfig.md) for TypeScript projects. It then moves on to [Type Analysis](../4.%20type-analysis/A.%20Type%20Analysis.md) to build a complete understanding of types and symbols.