migrate existing set_worker_random_seed callsites #28

amirafzali · 2025-05-22T16:57:04Z

Summary:
Diff Purpose & Changes

We can migrate all instances of worker.set_random_seed_impl to now use builtins.random.set_manual_seed.
Wonder if some of these cases would benefit from random_seed() instead. Maybe that can be checked out in the future.

Reviewed By: suo

Differential Revision: D72945165

fbshipit-source-id: 16e0587f82d1775d328127e2da4a676be9230052

Summary: This should be the right method since it simulate the comm actor part too. Reviewed By: shayne-fletcher Differential Revision: D75099703 fbshipit-source-id: f318503663ea3cb19bc38387db381b415dcdcd9b

Summary: fix ```lang=text,counterexample warning: using `.deref()` on a double reference, which returns `&ProcMesh` instead of dereferencing the inner type --> hyperactor_mesh/src/actor_mesh.rs:60:35 | 60 | Self::Borrowed(p) => p.deref(), | ^^^^^^^^ | = note: `#[warn(suspicious_double_ref_op)]` on by default ``` Reviewed By: dulinriley Differential Revision: D75151288 fbshipit-source-id: 788bdfd2587af70de453cc1f2fb8a9eeb86937e5

…alled with the wheel Summary: Makes it such that `pip install monarch` installs the monarch CLI. Reviewed By: colin2328 Differential Revision: D74915462 fbshipit-source-id: 24c75430ec9a510d96cb4412763882a583e4eefe

Summary: Pull Request resolved: pytorch-labs/monarch#20 Pull Request resolved: pytorch-labs/monarch#19 should have been in D74971945 looks like Reviewed By: benjipelletier, mariusae Differential Revision: D75147364 fbshipit-source-id: e33270720e820f7e01297b24042a7726c29206c7

Summary: We weren't exiting the device mesh properly when invoking the local mesh. (I may have to remove exit() in the testing code will wait for the unit tests to run). Reviewed By: andrewjcg, moonli Differential Revision: D75143906 fbshipit-source-id: 2ebca9055e59d696d8b7243bcc80698c439bca3e

Summary: Pull Request resolved: pytorch-labs/monarch#24 Make CI run for ghstacked commits, maybe? ghstack-source-id: 285475797 Reviewed By: colin2328 Differential Revision: D75176799 fbshipit-source-id: 242462e52581d5f31cc74f54ceca07139ebed5b9

Summary: Pull Request resolved: pytorch-labs/monarch#22 1. OSS workflow runs into a seg fault when executing tests. Core dump (P1818378198) seems to imply this is from a localization assignment issue. Tentatively, we can fix this with LC_ALL=C Reviewed By: dcci Differential Revision: D75159472 fbshipit-source-id: a27d4cfea5125cd01bef6d91cd7f236419693e6d

…da hyperactor components to meta subdir and create a hyperactor component for oss. Summary: Either remove or refactor meta-isms out of `monarch.tools.components,cli,commands`. NOTE: TorchX (as originally designed) should hide away most of the differences in terms of job definitions between OSS and Meta. However the `mast_conda` plugin to TorchX leaks abstraction like crazy to the point that the AppDef (job def) looks completely different and cannot be reused. To deal with this I've created `monarch.tools.components.meta.hyperactor` and `monarch.tools.components.meta.hyperactor_conda` and we switch between them based on the target scheduler (e.g. `-s mast` vs `-s mast_conda`) What this diff does: 1. In `monarch.tools.components`: 1. `conda.hyperactor` --> `meta.hyperactor.proc_mesh_conda` 1. `base.hyperactor` --> `meta.hyperactor.proc_mesh` 1. (new) `hyperactor.proc_mesh` (for oss) 1. Moves configs from `monarch.tools.commands` to: 1. `monarch.tools.config.defaults` (oss defaults) 1. `monarch.tools.config.meta.defaults` (internal defaults) 1. Takes out all the meta-isms from `monarch.tools.commands` and `cli` 1. Moves test cases specific to meta into `monarch/tools/**/meta/**_test.py` 1. Adds a bunch of oss schedulers to the oss defaults (not yet tested except for dryruns) Next Steps: 1. [6/n] Have `kd_monarch` use the default component (the custom `mast.py` is no longer needed). Update the README with updated instructions. 1. [7/n] Remove rust CLI in favor of all-python (we delegate to torchx for most things anyways) 1. [8/n] Add E2E unittest using the `local_cwd` scheduler (actually run a mini-trainer actor) 1. [9/n] Write an oss `hyperactor mesh-worker` entrypoint binary 1. [10/n] Author a Dockerfile that sets up the environment (much like fbpkgs do it for internal runs) 1. [11/n] Author a `TorchXAllocator` Reviewed By: colin2328 Differential Revision: D74913314 fbshipit-source-id: d1e87e8355592de87821cbfec3128baabe0ac511

Summary: remove direct `nom` dependency from `monarch_hyperactor` • eliminate direct `nom` usage in `monarch_hyperactor` by centralizing selection parsing through `ndslice::selection::parse::parse` • update `Selection::parse` Python binding and internal test code to use new helper • remove `nom` from dependencies and build targets this reduces build complexity and ensures all parsing goes through a single interface. Reviewed By: granulae Differential Revision: D75155319 fbshipit-source-id: 7e7a85193fafb0536bbbbe3e8a4589f71aa1c339

…tion impl (#14) Summary: Pull Request resolved: pytorch-labs/monarch#14 This diff fixes two issues related to the remote function implementation inside `StreamActor`: 1. There were two calls to `tracing::debug!` that use the function inputs and outputs as arguments. Even if the log level was set such that the message wasn't printed to the console, this still incurred the overhead of serialization (and therefore, when tensors were involved, synchronizing the host with the cuda stream). 2. The code to extract an `RValue` from a `PyObject` worked by attempting to do a conversion in C++, and if that failed, handling the exception via rust and trying the next `RValue` variant. Each time the conversion failed, the code created a `PyValueError` that serialized the `PyObject` as part of its message. Importantly, the case for tensors was only handled after the cases for `ScalarType`, `Layout`, and `MemoryFormat`. So if the `RValue` was a tensor, we would incur 3 C++ exceptions, 3 cuda stream syncs, and 3 tensor serializations. And this would happen for every tensor in the output of a remote function. The fixes in this diff are: 1. Remove the `tracing::debug!` calls. 2. Instead of trying an unchecked type conversion and handling the exception, actually check the type before trying to do the conversion. Reviewed By: colin2328 Differential Revision: D75035191 fbshipit-source-id: 9acf397c0ea9a9af0ffef215f39f2ee5c50b7395

Summary: Pull Request resolved: pytorch-labs/monarch#16 As titled. This involved fixing two issues with pipes (allow pipe processes to create tensors from wire values and explicitly flush the pipe buffer during pipe send) and fixing an issue with tensor serialization. Reviewed By: dulinriley Differential Revision: D74854607 fbshipit-source-id: b457a51feb18741a0b4535c405e1ca9c46a8f24e

Summary: to be compliant Reviewed By: colin2328 Differential Revision: D75192678 fbshipit-source-id: 7ac7346365d6be95412bcd96bb35bc4378c44420

Summary: This diff generizes the conecept of `RemoteProcessAlloc` such that any other multi-host `Allocator` can use the same `Alloc`, `MastAllocator` being the first. This has mutiple advantages: * Unified health checks across all allocators. * Not relying on MAST control plane for health. Now we run our own heartbeats. * Reusable componoent for future `Allocator` such as kubernetes. Reviewed By: mariusae Differential Revision: D74943143 fbshipit-source-id: 75322b03773c1a6afad05549b92ff72585d94ba5

Summary: Pull Request resolved: pytorch-labs/monarch#25 Original commit changeset: c26e2dc498f5 Original Phabricator Diff: D74718068 Reviewed By: dcci Differential Revision: D75181087 fbshipit-source-id: e2fa074e33cae3ee6a4c5694496ffcada3884466

Summary: Pull Request resolved: pytorch-labs/monarch#26 add some necessary testing libraries to the worker script and README Reviewed By: dcci Differential Revision: D75028810 fbshipit-source-id: 343ae098929f26122da6723316309325261a2b26

facebook-github-bot · 2025-05-22T16:57:36Z

This pull request was exported from Phabricator. Differential Revision: D72945165

…ss-allocator and monarch CLI component Summary: Updates `kd_monarch` with the capability to dynamically query for information needed to create and use the `MastAllocator` in `controller.py`. Specifically it makes it possible to get these fields from the provided `ALLOCATOR_JOB_NAME` 1. `(hosts, gpus)` ProcMesh dimensions 2. RemoteProcessAllocator `port` (no need to hard code or have the user provide) 3. Task group name Also changes the remote-process-allocator (aka `hyperactor mesh-worker`) entrypoint's binary name in `kd_monarch_pkg` to be `monarch_bootstrap` to be consistent with its equivalent in OSS (hence makes the TorchX AppDef portable) Next: ~~[6/n] Have kd_monarch use the default component (the custom mast.py is no longer needed). Update the README with updated instructions.~~ [7/n] Remove rust CLI in favor of all-python (we delegate to torchx for most things anyways) [8/n] Add E2E unittest using the local_cwd scheduler (actually run a mini-trainer actor) [9/n] Write an oss hyperactor mesh-worker entrypoint binary [10/n] Author a Dockerfile that sets up the environment (much like fbpkgs do it for internal runs) [11/n] Author a TorchXAllocator Reviewed By: vidhyav Differential Revision: D75162358 fbshipit-source-id: d2756ffb529b3069baf94cd9dfdde00b8ef2ede4

Summary: All the working subcommands were falling back to Python anyways. Moved the (currently unimplemented) subcommand stubs: `bounce` and `stop` to Python. **Note:** couple of reasons why a Rust CLI for monarch isn't ideal: 1. Uses TorchX under the hood. TorchX is a Python library. 2. Due to #1 we have to run a Python CLI fallback anyways and the mechanics of this is meta specific (won't work for OSS). 3. Reverse pyo3 binding TorchX (call Python from Rust) doesn't work internally due to the way we package Python (hermetic PAR). 4. Any material benefits (e.g. performance?) of implementing the CLI in Rust would be negated by the effort to fix/deal-with #1-3. **Next:** ~~[6/n] Have kd_monarch use the default component (the custom mast.py is no longer needed). Update the README with updated instructions.~~ ~~[7/n] Remove rust CLI in favor of all-python (we delegate to torchx for most things anyways)~~ [8/n] Add E2E unittest using the local_cwd scheduler (actually run a mini-trainer actor) [9/n] Write an oss hyperactor mesh-worker entrypoint binary [10/n] Author a Dockerfile that sets up the environment (much like fbpkgs do it for internal runs) [11/n] Author a TorchXAllocator Reviewed By: vidhyav, suo Differential Revision: D75176535 fbshipit-source-id: 29020f4032bd642af26b393ade74f40b868df973

Summary: Pull Request resolved: pytorch-labs/monarch#29 Previous implementation was converting `Slice(offset, sizes=[size, 1, 1, ...], strides=[stride, ...])` into a selection with `Range(offset, offset + size, stride)`, which is wrong. It should be `Range(offset, offset + size * stride, stride)`. Reviewed By: shayne-fletcher, andrewjcg Differential Revision: D75102147 fbshipit-source-id: 979c3712e65a49cd4e04de71921b82baead721bc

Summary: Pull Request resolved: pytorch-labs/monarch#30 Fix warnings related to mesh slicing in `test_controller.py`, and update the file to use `TestingContext` in order to deal with device mesh shutdown warnings. Reviewed By: colin2328 Differential Revision: D75106997 fbshipit-source-id: 05fd5c85b612c6356560a4df26854931d2d90c64

Summary: applying code changes to be compatible with both 2021 and 2024 rust edition Reviewed By: mariusae, dtolnay Differential Revision: D75194983 fbshipit-source-id: 7cfa9cea68a63bb86f5c816edce493ff7defb3e2

Summary: `payload` is a tuple of `(rank, value)` so unpack it before raising. Before: ``` bugging__/debugging-inplace#link-tree/monarch/service.py", line 431, in _process raise payload TypeError: exceptions must derive from BaseException ``` After: the exception is raised properly Reviewed By: colin2328 Differential Revision: D75264392 fbshipit-source-id: cb78b2cde58c223441830788b572551db44fc5b4

Summary: As title. Reviewed By: mariusae Differential Revision: D74482409 fbshipit-source-id: 6957f3c803deed802d4ac90be1995520006e0dcb

Summary: When reducer is none, the split port's lambda is doing a simply forward. This diff separate this scenario into its own lambda so the code is easier to read and maintain. Reviewed By: mariusae Differential Revision: D75090318 fbshipit-source-id: 532182da291eebaa733e0c11f05816fcc0720a67

Summary: context would make it easier to track where the error is from. Reviewed By: dulinriley Differential Revision: D75244043 fbshipit-source-id: b05eda65896d0cdd750c0fa24247783f2d6739cc

Summary: Required by new versions of Cargo. Reviewed By: diliop Differential Revision: D75295569 fbshipit-source-id: 04295493397f2305f6e619565a1b3c4f031f80a9

Summary: Pull Request resolved: pytorch-labs/monarch#31 D75106997 appears to have broken github CI, and this partial revert doesn't seem to reintroduce any warnings, so hopefully this fixes things. Reviewed By: colin2328 Differential Revision: D75314370 fbshipit-source-id: 16caf12df35e53a20b547d4d9b5c290ea67a4641

Summary: event::read()? is blocking so we won't redraw the UI until an input event has been received. What we want instead of make it not blocking is to wrap it in a crossterm::poll Reviewed By: kaiyuan-li Differential Revision: D75296531 fbshipit-source-id: c6da46b6e7e58086232cca763e42d70384a0eb59

Summary: Fetching data in the UI loop results in a brief jank while we await newly fetched results Reviewed By: kaiyuan-li Differential Revision: D75296533 fbshipit-source-id: c269e392130c8bbb8bd8a45e6e4d616d93aed392

Summary: - Make table header green - Resize table column widths appropriately given lengths of real actor ids and world names from mast job - Sort actors by actor id after sorting by messages Reviewed By: kaiyuan-li Differential Revision: D75296532 fbshipit-source-id: 1959b510b4bd6d9be0b1bc68c57b743bce92f396

Summary: the `messages_received` and `messages_sent` key in monarch_metrics allow us to filter for ingress and egress events Reviewed By: kaiyuan-li Differential Revision: D75296530 fbshipit-source-id: e4735be37daf065c8e87417128324ed61acc4b00

Reviewed By: pablorfb-meta Differential Revision: D75297949 fbshipit-source-id: 55ea59b07801b7d5a42ebffd09f80bdb3589dd9a

Summary: Adds the OSS equivalent of `hyperactor_meta mesh-worker` called `process_allocator`. Also created a meta version of it (but haven't switched over to it in the job launcher). Reviewed By: colin2328 Differential Revision: D75299505 fbshipit-source-id: 1b4a89ddd2da424ce8f8ded9343d134979ebabb5

Summary: Required by new versions of Cargo. Reviewed By: bigfootjon Differential Revision: D75343369 fbshipit-source-id: 6af89d222a6de541c9c924a466ffd3165e707266

Summary: Pull Request resolved: pytorch-labs/monarch#32 This is the second try after D74718068 got backed out. Turns out `bootstrap_main` from `_rust_bindings` needs a reference to a running event loop, so it can't simply be passed into `asyncio.run()`: https://github.com/awestlake87/pyo3-asyncio#a-note-about-asynciorun We wrap it into a trivial coroutine instead. Reviewed By: amirafzali Differential Revision: D75284480 fbshipit-source-id: 24352a212d0490d2f1fe4eb6245212c1abfc1bfa

Summary: fix broken monarch_hyperactor build on master Reviewed By: amirafzali Differential Revision: D75454230 fbshipit-source-id: 3f70c4eba7eed4ece9a754888bbbc1a2cdc91fbc

Summary: fix various 2024 edition lints Reviewed By: benjipelletier Differential Revision: D75454728 fbshipit-source-id: 981b6a599ca51b8862159903cc13e032bf7eb322

Reviewed By: colin2328 Differential Revision: D75330629 fbshipit-source-id: b843fc690050b75ead957049a17da7c91ee26462

Summary: side-note: looks like `monarch.bootstrap_main` used to be `monarch._monarch.hyperactor.bootstrap_main`. Adds `monarch_bootstrap` as a console script so that it is installed with the monarch wheel. This makes it simpler to pass it as the the `program` (`cmd`) to the process-allocator. Reviewed By: LucasLLC, colin2328 Differential Revision: D75325523 fbshipit-source-id: af9cb48c0dbeb5b45bdd919c23acc9f05b147ddc

Summary: **Diff Purpose & Changes** 1. Creating the random/RNG remote functions for our new builtins library. torch.manual_seed(seed) torch.initial_seed() torch.get_rng_state() torch.set_rng_state(state) torch.cuda.get_rng_state_all() torch.cuda.set_rng_state_all(states) These two appear to be the same function. we can consider removing one or the other in the library. torch.seed() torch.random.seed() Reviewed By: vidhyav, colin2328 Differential Revision: D72944566

Summary: **Diff Purpose & Changes** - We can migrate all instances of worker.set_random_seed_impl to now use builtins.random.set_manual_seed. - Wonder if some of these cases would benefit from random_seed() instead. Maybe that can be checked out in the future. Reviewed By: suo Differential Revision: D72945165

facebook-github-bot · 2025-05-27T19:39:20Z

This pull request was exported from Phabricator. Differential Revision: D72945165

Summary: Pull Request resolved: #28 **Diff Purpose & Changes** - We can migrate all instances of worker.set_random_seed_impl to now use builtins.random.set_manual_seed. - Wonder if some of these cases would benefit from random_seed() instead. Maybe that can be checked out in the future. Reviewed By: suo Differential Revision: D72945165 fbshipit-source-id: b2f9cc48b34b021c41a6032be7bdd7ca31580823

facebook-github-bot and others added 16 commits May 21, 2025 09:09

Initial commit

af4bc3f

fbshipit-source-id: 16e0587f82d1775d328127e2da4a676be9230052

Use collect_commactor_routing_tree

4143c07

Summary: This should be the right method since it simulate the comm actor part too. Reviewed By: shayne-fletcher Differential Revision: D75099703 fbshipit-source-id: f318503663ea3cb19bc38387db381b415dcdcd9b

OSS CI for stacked commits (#24)

3f35ada

Summary: Pull Request resolved: pytorch-labs/monarch#24 Make CI run for ghstacked commits, maybe? ghstack-source-id: 285475797 Reviewed By: colin2328 Differential Revision: D75176799 fbshipit-source-id: 242462e52581d5f31cc74f54ceca07139ebed5b9

Mention the license at the end of the readme

ab69546

Summary: to be compliant Reviewed By: colin2328 Differential Revision: D75192678 fbshipit-source-id: 7ac7346365d6be95412bcd96bb35bc4378c44420

include pytest-asyncio and timeout (#26)

a1acccd

Summary: Pull Request resolved: pytorch-labs/monarch#26 add some necessary testing libraries to the worker script and README Reviewed By: dcci Differential Revision: D75028810 fbshipit-source-id: 343ae098929f26122da6723316309325261a2b26

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 22, 2025

facebook-github-bot added the fb-exported label May 22, 2025

kiukchung and others added 11 commits May 22, 2025 09:57

applying rustfix to fbcode

a1ba864

Summary: applying code changes to be compatible with both 2021 and 2024 rust edition Reviewed By: mariusae, dtolnay Differential Revision: D75194983 fbshipit-source-id: 7cfa9cea68a63bb86f5c816edce493ff7defb3e2

Introduce CommReducer and use it in fn split_port_id

d3c37fb

Summary: As title. Reviewed By: mariusae Differential Revision: D74482409 fbshipit-source-id: 6957f3c803deed802d4ac90be1995520006e0dcb

Add error context for debugging

74b86fd

Summary: context would make it easier to track where the error is from. Reviewed By: dulinriley Differential Revision: D75244043 fbshipit-source-id: b05eda65896d0cdd750c0fa24247783f2d6739cc

Update tracing-subscriber from 0.3.18 to 0.3.19

2e1022b

Summary: Required by new versions of Cargo. Reviewed By: diliop Differential Revision: D75295569 fbshipit-source-id: 04295493397f2305f6e619565a1b3c4f031f80a9

Thomas Wang and others added 14 commits May 23, 2025 15:04

Fetch data in separate loop

d91fda5

Summary: Fetching data in the UI loop results in a brief jank while we await newly fetched results Reviewed By: kaiyuan-li Differential Revision: D75296533 fbshipit-source-id: c269e392130c8bbb8bd8a45e6e4d616d93aed392

Show message ingress/egress

e2f78ea

Summary: the `messages_received` and `messages_sent` key in monarch_metrics allow us to filter for ingress and egress events Reviewed By: kaiyuan-li Differential Revision: D75296530 fbshipit-source-id: e4735be37daf065c8e87417128324ed61acc4b00

Refactor AllocConstraints::none() to default()

13d7be6

Reviewed By: pablorfb-meta Differential Revision: D75297949 fbshipit-source-id: 55ea59b07801b7d5a42ebffd09f80bdb3589dd9a

Update thiserror from 2.0.7 to 2.0.12

f46d0b5

Summary: Required by new versions of Cargo. Reviewed By: bigfootjon Differential Revision: D75343369 fbshipit-source-id: 6af89d222a6de541c9c924a466ffd3165e707266

AllocConstraints::none() has gone away

eaf40ce

Summary: fix broken monarch_hyperactor build on master Reviewed By: amirafzali Differential Revision: D75454230 fbshipit-source-id: 3f70c4eba7eed4ece9a754888bbbc1a2cdc91fbc

fix various 2024 lints

ed831f7

Summary: fix various 2024 edition lints Reviewed By: benjipelletier Differential Revision: D75454728 fbshipit-source-id: 981b6a599ca51b8862159903cc13e032bf7eb322

Remove spuriously committed file.

86770c5

Reviewed By: colin2328 Differential Revision: D75330629 fbshipit-source-id: b843fc690050b75ead957049a17da7c91ee26462

amirafzali force-pushed the export-D72945165 branch from 7226b38 to 86f09ca Compare May 27, 2025 19:39

facebook-github-bot closed this May 28, 2025

facebook-github-bot force-pushed the main branch from ed1d5d2 to c1172ad Compare May 28, 2025 00:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

migrate existing set_worker_random_seed callsites #28

migrate existing set_worker_random_seed callsites #28

Uh oh!

amirafzali commented May 22, 2025

Uh oh!

facebook-github-bot commented May 22, 2025

Uh oh!

facebook-github-bot commented May 27, 2025

Uh oh!

Uh oh!

migrate existing set_worker_random_seed callsites #28

migrate existing set_worker_random_seed callsites #28

Uh oh!

Conversation

amirafzali commented May 22, 2025

Uh oh!

facebook-github-bot commented May 22, 2025

Uh oh!

facebook-github-bot commented May 27, 2025

Uh oh!

Uh oh!