Commit af4bc3f

Initial commit
fbshipit-source-id: 16e0587f82d1775d328127e2da4a676be9230052
0 parents  commit af4bc3f

435 files changed (+116367, -0 lines)

.flake8

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[flake8]
max-line-length = 256
extend-ignore = E302, G004, SIM105, G201, SIM115, SIM904

.github/workflows/test.yml

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
name: Build monarch

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main

concurrency:
  group: ${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
  cancel-in-progress: true

jobs:
  test:
    name: cuda12.6-py3.10-4xlarge
    strategy:
      fail-fast: true
      matrix:
        include:
          - name: 4xlarge
            runs-on: linux.g5.4xlarge.nvidia.gpu
            torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/cu126'
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.6"
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      timeout: 60
      runner: ${{ matrix.runs-on }}
      gpu-arch-type: ${{ matrix.gpu-arch-type }}
      gpu-arch-version: ${{ matrix.gpu-arch-version }}
      submodules: recursive
      script: |
        conda create -n venv python=3.10 -y
        conda activate venv
        export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
        python -m pip install --upgrade pip

        # Install native dependencies
        dnf update -y
        dnf install clang-devel libunwind libunwind-devel -y

        # Install rust and setup nightly toolchain
        curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
        source $HOME/.cargo/env
        rustup toolchain install nightly
        rustup default nightly

        # Install torch
        pip install torch

        # Install Python dependencies
        pip install setuptools-rust
        pip install pyzmq requests numpy pyre-extensions

        # Test dependencies
        pip install pytest cloudpickle

        # Build and install monarch
        python setup.py install

        # Run tests
        pytest python/tests/ -s -v -m "not oss_skip"
        python python/tests/test_mock_cuda.py
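
Note on the `concurrency.group` expression in the workflow above: it relies on the short-circuit `&&` / `||` operators of GitHub Actions expressions. The following is only a plain-Python model of that expression, not something the workflow runs; `concurrency_group` and its parameters are hypothetical names chosen for illustration. Under this reading, runs on `main` are keyed by their unique run number and so never cancel each other, while all runs for any other ref share one group, so a newer run cancels the one in progress.

```python
def concurrency_group(workflow: str, ref: str, run_number: int) -> str:
    """Model of the workflow's `concurrency.group` expression (illustrative only)."""
    # `a && b || c` yields b when a is true and b is truthy, otherwise c.
    # Run numbers start at 1, so they are always truthy here.
    key = (ref == "refs/heads/main" and run_number) or ref
    return f"{workflow}-{key}"


# Two pushes to main land in two distinct groups; two runs of the same
# pull request ref land in a single shared group.
main_groups = {concurrency_group("Build monarch", "refs/heads/main", n) for n in (41, 42)}
pr_groups = {concurrency_group("Build monarch", "refs/pull/7/merge", n) for n in (43, 44)}
print(len(main_groups), len(pr_groups))
```

With `cancel-in-progress: true`, only runs that share a group are candidates for cancellation, which is why the group key matters.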

.gitignore

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
syntax: glob

python/**/*.so
python/**/*.json
python/**/*.html
python/**/*.pkl
python/monarch.egg-info/*
*.egg
build/*
dist/*
monarch.egg-info/*
python/monarch/monarch_controller

.ipynb_checkpoints

# Rust stuff
target/

CODE_OF_CONDUCT.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <[email protected]>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq

CONTRIBUTING.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Contributing to Meta Open Source Projects

We want to make contributing to this project as easy and transparent as
possible.

## Pull Requests
We actively welcome your pull requests.

Note: pull requests are not imported into the GitHub directory in the usual way. There is an internal Meta repository that is the "source of truth" for the project. The GitHub repository is generated *from* the internal Meta repository. So we don't merge GitHub PRs directly to the GitHub repository -- they must first be imported into the internal Meta repository. When Meta employees look at the GitHub PR, there is a special button visible only to them that executes that import. The changes are then automatically reflected from the internal Meta repository back to GitHub. This is why you won't see your PR as having been directly merged, but you will still see your changes in the repository once it reflects the imported changes.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## License
By contributing to this project, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.

Cargo.toml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
[workspace]

members = [
    "controller",
    "hyper",
    "hyperactor",
    "hyperactor_macros",
    "hyperactor_multiprocess",
    "hyperactor_mesh",
    "hyperactor_mesh_macros",
    "ndslice",
    "monarch_extension",
    "monarch_worker",
    "nccl-sys",
    "torch-sys",
]

LICENSE

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
BSD 3-Clause License

Copyright (c) Meta Platforms, Inc. and affiliates.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
# Monarch

**Monarch** is a distributed execution engine for PyTorch.

> ⚠️ **Early Development Warning**
> Monarch is currently in an experimental stage. You should expect bugs, incomplete features, and APIs that may change in future versions. The project welcomes bugfixes, but to make sure things are well coordinated you should discuss any significant change before starting the work. It's recommended that you signal your intention to contribute in the issue tracker, either by filing a new issue or by claiming an existing one.

## Installation

```sh
# Create and activate the conda environment
conda create -n monarchenv python=3.10 -y
conda activate monarchenv

# Install nightly rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

rustup toolchain install nightly
rustup default nightly

# Install non-python dependencies
conda install python=3.10
conda install libunwind

# needs cuda-toolkit-12-0 as that is the version that matches the /usr/local/cuda/ on devservers
sudo dnf install cuda-toolkit-12-0 cuda-12-0 libnccl-devel clang-devel
# install build dependencies
pip install setuptools-rust
# install torch, can use conda or build it yourself or whatever
pip install torch
# install other deps, see pyproject.toml for latest
pip install pyzmq requests numpy pyre-extensions pytest-timeout cloudpickle

# install the package
python setup.py install
# or setup for development
python setup.py develop
```

## Running examples

TODO

## Debugging

If everything is hanging, set the environment variable
`CONTROLLER_PYSPY_REPORT_INTERVAL=10` to get a py-spy dump of the controller and
its subprocesses every 10 seconds.

Calling `pdb.set_trace()` inside a worker remote function will cause pdb to
attach to the controller process to debug the worker. Keep in mind that if there
are multiple workers, this will create multiple sequential debug sessions, one
for each worker.

For the Rust-based setup you can adjust the log level with
`RUST_LOG=<log level>` (e.g. `RUST_LOG=debug`).
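
As a minimal sketch of how the two variables above might be applied from a launcher script: the helper `debug_env` is a hypothetical name, and the child command is a stand-in that only echoes the variables. It assumes Monarch reads both variables from the process environment at startup, so they must be set before the job is launched.

```python
import os
import subprocess
import sys


def debug_env(pyspy_interval_s: int = 10, rust_log: str = "debug") -> dict:
    """Return a copy of the current environment with the debugging variables set."""
    env = dict(os.environ)
    env["CONTROLLER_PYSPY_REPORT_INTERVAL"] = str(pyspy_interval_s)
    env["RUST_LOG"] = rust_log
    return env


# Stand-in child process that only prints the two variables; replace the
# command with whatever script starts your Monarch job.
child = subprocess.run(
    [
        sys.executable,
        "-c",
        "import os; print(os.environ['CONTROLLER_PYSPY_REPORT_INTERVAL'], os.environ['RUST_LOG'])",
    ],
    env=debug_env(),
    capture_output=True,
    text=True,
    check=True,
)
print(child.stdout.strip())
```

Building a copy of the environment keeps the settings scoped to the launched job rather than to the launching shell.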

## Profiling

The `monarch.profiler` module provides functionality similar to
[PyTorch's Profiler](https://pytorch.org/docs/stable/profiler.html) for model
profiling. It includes `profile` and `record_function` methods. The usage is
generally the same as `torch.profiler.profile` and
`torch.profiler.record_function`, with a few modifications specific to
`monarch.profiler.profile`:

1. `monarch.profiler.profile` exclusively accepts `monarch.profiler.Schedule`, a
   dataclass that mimics `torch.profiler.schedule`.
2. The `on_trace_ready` argument in `monarch.profiler.profile` must be a string
   that specifies the directory where the worker should save the trace files.

Below is an example demonstrating how to use `monarch.profiler`:

```py
import torch
from monarch.profiler import profile, record_function, Schedule

# `model` and `batch` are assumed to be defined elsewhere.
with profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    on_trace_ready="./traces/",
    schedule=Schedule(wait=1, warmup=1, active=2, repeat=1),
    record_shapes=True,
) as prof:
    with record_function("forward"):
        loss = model(batch)

    prof.step()
```
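
To make the `wait`/`warmup`/`active`/`repeat` arguments concrete, here is a rough pure-Python sketch of the step-to-action mapping documented for `torch.profiler.schedule`. `ScheduleSketch` and `Action` are hypothetical names used only for illustration; this is not the `monarch.profiler.Schedule` implementation, and it is an assumption that Monarch's dataclass follows the same phase semantics.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = 0
    WARMUP = 1
    RECORD = 2
    RECORD_AND_SAVE = 3


@dataclass(frozen=True)
class ScheduleSketch:
    wait: int
    warmup: int
    active: int
    repeat: int = 0      # 0 means "keep cycling indefinitely"
    skip_first: int = 0  # steps ignored before the first cycle

    def action(self, step: int) -> Action:
        """Map a zero-based profiler step to the action taken at that step."""
        if step < self.skip_first:
            return Action.NONE
        step -= self.skip_first
        cycle = self.wait + self.warmup + self.active
        # After `repeat` full cycles the profiler stays idle.
        if self.repeat > 0 and step // cycle >= self.repeat:
            return Action.NONE
        pos = step % cycle
        if pos < self.wait:
            return Action.NONE
        if pos < self.wait + self.warmup:
            return Action.WARMUP
        # The last active step of a cycle also triggers saving the trace.
        if pos == cycle - 1:
            return Action.RECORD_AND_SAVE
        return Action.RECORD


sched = ScheduleSketch(wait=1, warmup=1, active=2, repeat=1)
print([sched.action(step).name for step in range(5)])
# ['NONE', 'WARMUP', 'RECORD', 'RECORD_AND_SAVE', 'NONE']
```

Under these semantics, the schedule in the example above would idle for one step, warm up for one, record two, and then stop, so `prof.step()` needs to be called at least four times for a trace to be written.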

## Memory Viewer

The `monarch.memory` module provides functionality similar to
[PyTorch's Memory Snapshot and Viewer](https://pytorch.org/docs/stable/torch_cuda_memory.html)
for visualizing and analyzing memory usage in PyTorch models. It includes
`monarch.memory.dump_memory_snapshot` and `monarch.memory.record_memory_history`
methods:

1. `monarch.memory.dump_memory_snapshot`: This function wraps
   `torch.cuda.memory._dump_snapshot()` to dump a memory snapshot remotely. It can
   be used to save a snapshot of the current memory usage to a file.
2. `monarch.memory.record_memory_history`: This function wraps
   `torch.cuda.memory._record_memory_history()` to allow recording memory history
   remotely. It can be used to track memory allocation and deallocation over
   time.

Both functions use `remote` to execute the corresponding remote functions
`_memory_controller_record` and `_memory_controller_dump` on the specified
device mesh.

Below is an example demonstrating how to use `monarch.memory`:

```py
...
monarch.memory.record_memory_history()
for step in range(2):
    batch = torch.randn((8, DIM))
    loss = net(batch)
...
monarch.memory.dump_memory_snapshot(dir_snapshots="./snapshots/")
```
