Commit c1172ad

Initial commit
fbshipit-source-id: c961163e1a606ca29416bebbf1f20e02c3d582bf
0 parents  commit c1172ad

422 files changed: +117035 -0 lines changed

.flake8

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
[flake8]
max-line-length = 256
extend-ignore = E302, G004, SIM105, G201, SIM115, SIM904

.github/workflows/test.yml

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
name: Build monarch

on:
  push:
    branches:
      - main
  pull_request:
    branches:
      - main
      - gh/**

concurrency:
  group: ${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
  cancel-in-progress: true

jobs:
  test:
    name: cuda12.6-py3.10-4xlarge
    strategy:
      fail-fast: true
      matrix:
        include:
          - name: 4xlarge
            runs-on: linux.g5.4xlarge.nvidia.gpu
            torch-spec: '--pre torch --index-url https://download.pytorch.org/whl/nightly/cu126'
            gpu-arch-type: "cuda"
            gpu-arch-version: "12.6"
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      timeout: 60
      runner: ${{ matrix.runs-on }}
      gpu-arch-type: ${{ matrix.gpu-arch-type }}
      gpu-arch-version: ${{ matrix.gpu-arch-version }}
      submodules: recursive
      script: |
        conda create -n venv python=3.10 -y
        conda activate venv
        export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
        python -m pip install --upgrade pip

        # Install native dependencies
        dnf update -y
        dnf install clang-devel libunwind libunwind-devel -y

        # Install rust and setup nightly toolchain
        curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
        source $HOME/.cargo/env
        rustup toolchain install nightly
        rustup default nightly

        # Install torch
        pip install torch

        # Install Python dependencies
        pip install setuptools-rust
        pip install pyzmq requests numpy pyre-extensions

        # Test dependencies
        pip install pytest cloudpickle pytest-timeout pytest-asyncio

        # Build and install monarch
        python setup.py install

        # Run tests
        LC_ALL=C pytest python/tests/ -s -v -m "not oss_skip"
        python python/tests/test_mock_cuda.py
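
Note on the `concurrency.group` expression in this workflow: `cond && a || b` behaves like a ternary in GitHub Actions expressions as long as `a` is truthy. The sketch below mirrors that logic in plain Python to show which runs end up sharing a group. The function name and the sample refs and run numbers are made up for illustration; only the expression itself comes from the workflow.

```python
def concurrency_group(workflow: str, ref: str, run_number: int) -> str:
    """Mirror the workflow's `group:` expression in plain Python."""
    # Run numbers start at 1, so the `&& ... ||` form never falls through
    # to the ref on the main branch.
    suffix = str(run_number) if ref == "refs/heads/main" else ref
    return f"{workflow}-{suffix}"


# Two pushes to main land in different groups, so neither cancels the other.
print(concurrency_group("Build monarch", "refs/heads/main", 41))
print(concurrency_group("Build monarch", "refs/heads/main", 42))

# Two runs for the same pull request share a group, so with
# `cancel-in-progress: true` the newer run cancels the older one.
print(concurrency_group("Build monarch", "refs/pull/7/merge", 43))
print(concurrency_group("Build monarch", "refs/pull/7/merge", 44))
```

The effect is that every commit on `main` gets a full CI run, while superseded pull request runs are cancelled to free the GPU runner.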

.gitignore

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
syntax: glob

python/**/*.so
python/**/*.json
python/**/*.html
python/**/*.pkl
python/monarch.egg-info/*
*.egg
build/*
dist/*
monarch.egg-info/*
python/monarch/monarch_controller

.ipynb_checkpoints

# Rust stuff
target/

CODE_OF_CONDUCT.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <[email protected]>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq

CONTRIBUTING.md

Lines changed: 34 additions & 0 deletions
@@ -0,0 +1,34 @@
# Contributing to Meta Open Source Projects

We want to make contributing to this project as easy and transparent as
possible.

## Pull Requests
We actively welcome your pull requests.

Note: pull requests are not imported into the GitHub repository in the usual way. There is an internal Meta repository that is the "source of truth" for the project. The GitHub repository is generated *from* the internal Meta repository. So we don't merge GitHub PRs directly to the GitHub repository -- they must first be imported into the internal Meta repository. When Meta employees look at the GitHub PR, there is a special button visible only to them that executes that import. The changes are then automatically reflected from the internal Meta repository back to GitHub. This is why you won't see your PR marked as merged directly, but you will still see your changes in the repository once it reflects the imported changes.

1. Fork the repo and create your branch from `main`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## License
By contributing to this project, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.

Cargo.toml

Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
[workspace]

members = [
    "controller",
    "hyper",
    "hyperactor",
    "hyperactor_macros",
    "hyperactor_multiprocess",
    "hyperactor_mesh",
    "hyperactor_mesh_macros",
    "ndslice",
    "monarch_extension",
    "monarch_worker",
    "nccl-sys",
    "torch-sys",
]

LICENSE

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
BSD 3-Clause License

Copyright (c) Meta Platforms, Inc. and affiliates.
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
   list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
   this list of conditions and the following disclaimer in the documentation
   and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
   contributors may be used to endorse or promote products derived from
   this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

README.md

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
# Monarch

**Monarch** is a distributed execution engine for PyTorch.

> ⚠️ **Early Development Warning**
> Monarch is currently in an experimental stage. You should expect bugs, incomplete features, and APIs that may change in future versions. The project welcomes bugfixes, but to make sure things are well coordinated you should discuss any significant change before starting the work. It's recommended that you signal your intention to contribute in the issue tracker, either by filing a new issue or by claiming an existing one.

## Installation

```sh

# Create and activate the conda environment
conda create -n monarchenv python=3.10 -y
conda activate monarchenv

# Install nightly rust toolchain
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Make rustup and cargo available in the current shell
source "$HOME/.cargo/env"

rustup toolchain install nightly
rustup default nightly

# Install non-python dependencies
conda install python=3.10
conda install libunwind

# Needs cuda-toolkit-12-0, the version that matches /usr/local/cuda/ on devservers
sudo dnf install cuda-toolkit-12-0 cuda-12-0 libnccl-devel clang-devel
# Install build dependencies
pip install setuptools-rust
# Install torch (a conda package or a source build also works)
pip install torch
# Install core deps, see pyproject.toml for latest
pip install pyzmq requests numpy pyre-extensions cloudpickle
# Install test dependencies
pip install pytest pytest-timeout pytest-asyncio

# Install the package
python setup.py install
# or set up for development
python setup.py develop

# Run unit tests. Consider -s for more verbose output
pytest python/tests/ -v -m "not oss_skip"
```

## Running examples

TODO

## Debugging

If everything is hanging, set the environment variable
`CONTROLLER_PYSPY_REPORT_INTERVAL=10` to get a py-spy dump of the controller and
its subprocesses every 10 seconds.

Calling `pdb.set_trace()` inside a worker remote function will cause pdb to
attach to the controller process to debug the worker. Keep in mind that if there
are multiple workers, this will create multiple sequential debug sessions, one
for each worker.

For the Rust-based setup you can adjust the log level with
`RUST_LOG=<log level>` (e.g. `RUST_LOG=debug`).
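
As a minimal sketch, the two variables above can be set for a single child process from Python rather than exported in the shell. The `debug_env` helper and the stand-in child command are hypothetical and only echo the variables back; in practice the command would be whatever launches your Monarch program.

```python
import os
import subprocess
import sys


def debug_env(pyspy_interval_s: int = 10, rust_log: str = "debug") -> dict[str, str]:
    """Return a copy of the current environment with the debugging knobs set."""
    env = dict(os.environ)
    env["CONTROLLER_PYSPY_REPORT_INTERVAL"] = str(pyspy_interval_s)
    env["RUST_LOG"] = rust_log
    return env


# Stand-in child process (hypothetical): it only prints the two variables.
child = "import os; print(os.environ['CONTROLLER_PYSPY_REPORT_INTERVAL'], os.environ['RUST_LOG'])"
result = subprocess.run(
    [sys.executable, "-c", child],
    env=debug_env(),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # prints "10 debug"
```

Passing `env=` keeps the settings scoped to that one process, so other shells and jobs are unaffected.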

## Profiling

The `monarch.profiler` module provides functionality similar to
[PyTorch's Profiler](https://pytorch.org/docs/stable/profiler.html) for model
profiling. It includes `profile` and `record_function` methods. The usage is
generally the same as `torch.profiler.profile` and
`torch.profiler.record_function`, with a few modifications specific to
`monarch.profiler.profile`:

1. `monarch.profiler.profile` exclusively accepts `monarch.profiler.Schedule`, a
   dataclass that mimics `torch.profiler.schedule`.
2. The `on_trace_ready` argument in `monarch.profiler.profile` must be a string
   that specifies the directory where the worker should save the trace files.

Below is an example demonstrating how to use `monarch.profiler`:

```py
import torch

from monarch.profiler import Schedule, profile, record_function

# `model` and `batch` are assumed to be defined earlier in the program.
with profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    on_trace_ready="./traces/",  # directory where workers save trace files
    schedule=Schedule(wait=1, warmup=1, active=2, repeat=1),
    record_shapes=True,
) as prof:
    with record_function("forward"):
        loss = model(batch)

    prof.step()
```
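
To make the `wait`, `warmup`, `active`, and `repeat` arguments concrete, here is a rough sketch of how such a schedule maps step numbers to profiler phases. It follows the cycle described in the `torch.profiler.schedule` documentation. `SketchSchedule` and its `phase` method are hypothetical and are not the actual `monarch.profiler.Schedule` implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchSchedule:
    """Hypothetical stand-in; not the actual monarch.profiler.Schedule."""

    wait: int
    warmup: int
    active: int
    repeat: int = 0  # 0 means "keep cycling", as in torch.profiler.schedule

    def __post_init__(self) -> None:
        if self.active <= 0:
            raise ValueError("active must be positive")

    def phase(self, step: int) -> str:
        cycle = self.wait + self.warmup + self.active
        if self.repeat > 0 and step >= cycle * self.repeat:
            return "idle"  # all requested cycles are done
        pos = step % cycle
        if pos < self.wait:
            return "idle"
        if pos < self.wait + self.warmup:
            return "warmup"
        return "record"


sched = SketchSchedule(wait=1, warmup=1, active=2, repeat=1)
print([sched.phase(step) for step in range(6)])
# prints ['idle', 'warmup', 'record', 'record', 'idle', 'idle']
```

With `wait=1, warmup=1, active=2, repeat=1`, only steps 2 and 3 are recorded, so the training loop needs to call `prof.step()` at least four times for a trace to be produced.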

## Memory Viewer

The `monarch.memory` module provides functionality similar to
[PyTorch's Memory Snapshot and Viewer](https://pytorch.org/docs/stable/torch_cuda_memory.html)
for visualizing and analyzing memory usage in PyTorch models. It includes
`monarch.memory.dump_memory_snapshot` and `monarch.memory.record_memory_history`
methods:

1. `monarch.memory.dump_memory_snapshot`: This function wraps
   `torch.cuda.memory._dump_snapshot()` to dump a memory snapshot remotely. It can
   be used to save a snapshot of the current memory usage to a file.
2. `monarch.memory.record_memory_history`: This function wraps
   `torch.cuda.memory._record_memory_history()` to allow recording memory history
   remotely. It can be used to track memory allocation and deallocation over
   time.

Both functions use `remote` to execute the corresponding remote functions
`_memory_controller_record` and `_memory_controller_dump` on the specified
device mesh.

Below is an example demonstrating how to use `monarch.memory`:

```py
...
monarch.memory.record_memory_history()
for step in range(2):
    batch = torch.randn((8, DIM))
    loss = net(batch)
    ...
monarch.memory.dump_memory_snapshot(dir_snapshots="./snapshots/")
```

## License
Monarch is BSD-3 licensed, as found in the [LICENSE](LICENSE) file.

clippy.toml

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
disallowed-methods = [
    { path = "tokio::time::sleep", reason = "use `hyperactor::clock::Clock::sleep` instead." },
    { path = "std::thread::sleep", reason = "use `hyperactor::clock::Clock::sleep` instead." },
    { path = "tokio::time::Instant::now", reason = "use `hyperactor::clock::Clock::now` instead." },
    { path = "std::time::SystemTime::now", reason = "use `hyperactor::clock::Clock::system_time_now` instead." },
]
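
Note on the lint rules above: they route every sleep and time read through `hyperactor::clock::Clock`. The commit does not state the motivation, but a common reason for this pattern is that a clock abstraction can be swapped for a simulated one in tests. The sketch below is a Python analogy of that idea, not the Rust API: `Clock`, `RealClock`, `SimClock`, and `retry_with_backoff` are hypothetical names.

```python
import time
from typing import Protocol


class Clock(Protocol):
    def now(self) -> float: ...
    def sleep(self, seconds: float) -> None: ...


class RealClock:
    """Delegates to the operating system clock."""

    def now(self) -> float:
        return time.monotonic()

    def sleep(self, seconds: float) -> None:
        time.sleep(seconds)


class SimClock:
    """Simulated clock: sleeping advances time instantly."""

    def __init__(self) -> None:
        self._now = 0.0

    def now(self) -> float:
        return self._now

    def sleep(self, seconds: float) -> None:
        self._now += seconds


def retry_with_backoff(clock: Clock, attempts: int, base_delay: float) -> float:
    """Sleep between attempts with doubling delays; return the elapsed time."""
    start = clock.now()
    for attempt in range(attempts):
        clock.sleep(base_delay * 2**attempt)
    return clock.now() - start


# With the simulated clock this returns immediately yet reports 7.0 seconds.
print(retry_with_backoff(SimClock(), attempts=3, base_delay=1.0))
```

Code that calls the system sleep directly cannot be tested this way, which is the kind of call the `disallowed-methods` entries reject at lint time.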
