-
Notifications
You must be signed in to change notification settings - Fork 35
migrate existing set_worker_random_seed callsites #28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
fbshipit-source-id: 16e0587f82d1775d328127e2da4a676be9230052
Summary: This should be the right method since it simulate the comm actor part too. Reviewed By: shayne-fletcher Differential Revision: D75099703 fbshipit-source-id: f318503663ea3cb19bc38387db381b415dcdcd9b
Summary: fix ```lang=text,counterexample warning: using `.deref()` on a double reference, which returns `&ProcMesh` instead of dereferencing the inner type --> hyperactor_mesh/src/actor_mesh.rs:60:35 | 60 | Self::Borrowed(p) => p.deref(), | ^^^^^^^^ | = note: `#[warn(suspicious_double_ref_op)]` on by default ``` Reviewed By: dulinriley Differential Revision: D75151288 fbshipit-source-id: 788bdfd2587af70de453cc1f2fb8a9eeb86937e5
…alled with the wheel Summary: Makes it such that `pip install monarch` installs the monarch CLI. Reviewed By: colin2328 Differential Revision: D74915462 fbshipit-source-id: 24c75430ec9a510d96cb4412763882a583e4eefe
Summary: Pull Request resolved: pytorch-labs/monarch#20 Pull Request resolved: pytorch-labs/monarch#19 should have been in D74971945 looks like Reviewed By: benjipelletier, mariusae Differential Revision: D75147364 fbshipit-source-id: e33270720e820f7e01297b24042a7726c29206c7
Summary: We weren't exiting the device mesh properly when invoking the local mesh. (I may have to remove exit() in the testing code will wait for the unit tests to run). Reviewed By: andrewjcg, moonli Differential Revision: D75143906 fbshipit-source-id: 2ebca9055e59d696d8b7243bcc80698c439bca3e
Summary: Pull Request resolved: pytorch-labs/monarch#24 Make CI run for ghstacked commits, maybe? ghstack-source-id: 285475797 Reviewed By: colin2328 Differential Revision: D75176799 fbshipit-source-id: 242462e52581d5f31cc74f54ceca07139ebed5b9
Summary: Pull Request resolved: pytorch-labs/monarch#22 1. OSS workflow runs into a seg fault when executing tests. Core dump (P1818378198) seems to imply this is from a localization assignment issue. Tentatively, we can fix this with LC_ALL=C Reviewed By: dcci Differential Revision: D75159472 fbshipit-source-id: a27d4cfea5125cd01bef6d91cd7f236419693e6d
…da hyperactor components to meta subdir and create a hyperactor component for oss. Summary: Either remove or refactor meta-isms out of `monarch.tools.components,cli,commands`. NOTE: TorchX (as originally designed) should hide away most of the differences in terms of job definitions between OSS and Meta. However the `mast_conda` plugin to TorchX leaks abstraction like crazy to the point that the AppDef (job def) looks completely different and cannot be reused. To deal with this I've created `monarch.tools.components.meta.hyperactor` and `monarch.tools.components.meta.hyperactor_conda` and we switch between them based on the target scheduler (e.g. `-s mast` vs `-s mast_conda`) What this diff does: 1. In `monarch.tools.components`: 1. `conda.hyperactor` --> `meta.hyperactor.proc_mesh_conda` 1. `base.hyperactor` --> `meta.hyperactor.proc_mesh` 1. (new) `hyperactor.proc_mesh` (for oss) 1. Moves configs from `monarch.tools.commands` to: 1. `monarch.tools.config.defaults` (oss defaults) 1. `monarch.tools.config.meta.defaults` (internal defaults) 1. Takes out all the meta-isms from `monarch.tools.commands` and `cli` 1. Moves test cases specific to meta into `monarch/tools/**/meta/**_test.py` 1. Adds a bunch of oss schedulers to the oss defaults (not yet tested except for dryruns) Next Steps: 1. [6/n] Have `kd_monarch` use the default component (the custom `mast.py` is no longer needed). Update the README with updated instructions. 1. [7/n] Remove rust CLI in favor of all-python (we delegate to torchx for most things anyways) 1. [8/n] Add E2E unittest using the `local_cwd` scheduler (actually run a mini-trainer actor) 1. [9/n] Write an oss `hyperactor mesh-worker` entrypoint binary 1. [10/n] Author a Dockerfile that sets up the environment (much like fbpkgs do it for internal runs) 1. [11/n] Author a `TorchXAllocator` Reviewed By: colin2328 Differential Revision: D74913314 fbshipit-source-id: d1e87e8355592de87821cbfec3128baabe0ac511
Summary: remove direct `nom` dependency from `monarch_hyperactor` • eliminate direct `nom` usage in `monarch_hyperactor` by centralizing selection parsing through `ndslice::selection::parse::parse` • update `Selection::parse` Python binding and internal test code to use new helper • remove `nom` from dependencies and build targets this reduces build complexity and ensures all parsing goes through a single interface. Reviewed By: granulae Differential Revision: D75155319 fbshipit-source-id: 7e7a85193fafb0536bbbbe3e8a4589f71aa1c339
…tion impl (#14) Summary: Pull Request resolved: pytorch-labs/monarch#14 This diff fixes two issues related to the remote function implementation inside `StreamActor`: 1. There were two calls to `tracing::debug!` that use the function inputs and outputs as arguments. Even if the log level was set such that the message wasn't printed to the console, this still incurred the overhead of serialization (and therefore, when tensors were involved, synchronizing the host with the cuda stream). 2. The code to extract an `RValue` from a `PyObject` worked by attempting to do a conversion in C++, and if that failed, handling the exception via rust and trying the next `RValue` variant. Each time the conversion failed, the code created a `PyValueError` that serialized the `PyObject` as part of its message. Importantly, the case for tensors was only handled after the cases for `ScalarType`, `Layout`, and `MemoryFormat`. So if the `RValue` was a tensor, we would incur 3 C++ exceptions, 3 cuda stream syncs, and 3 tensor serializations. And this would happen for every tensor in the output of a remote function. The fixes in this diff are: 1. Remove the `tracing::debug!` calls. 2. Instead of trying an unchecked type conversion and handling the exception, actually check the type before trying to do the conversion. Reviewed By: colin2328 Differential Revision: D75035191 fbshipit-source-id: 9acf397c0ea9a9af0ffef215f39f2ee5c50b7395
Summary: Pull Request resolved: pytorch-labs/monarch#16 As titled. This involved fixing two issues with pipes (allow pipe processes to create tensors from wire values and explicitly flush the pipe buffer during pipe send) and fixing an issue with tensor serialization. Reviewed By: dulinriley Differential Revision: D74854607 fbshipit-source-id: b457a51feb18741a0b4535c405e1ca9c46a8f24e
Summary: to be compliant Reviewed By: colin2328 Differential Revision: D75192678 fbshipit-source-id: 7ac7346365d6be95412bcd96bb35bc4378c44420
Summary: This diff generizes the conecept of `RemoteProcessAlloc` such that any other multi-host `Allocator` can use the same `Alloc`, `MastAllocator` being the first. This has mutiple advantages: * Unified health checks across all allocators. * Not relying on MAST control plane for health. Now we run our own heartbeats. * Reusable componoent for future `Allocator` such as kubernetes. Reviewed By: mariusae Differential Revision: D74943143 fbshipit-source-id: 75322b03773c1a6afad05549b92ff72585d94ba5
Summary: Pull Request resolved: pytorch-labs/monarch#25 Original commit changeset: c26e2dc498f5 Original Phabricator Diff: D74718068 Reviewed By: dcci Differential Revision: D75181087 fbshipit-source-id: e2fa074e33cae3ee6a4c5694496ffcada3884466
Summary: Pull Request resolved: pytorch-labs/monarch#26 add some necessary testing libraries to the worker script and README Reviewed By: dcci Differential Revision: D75028810 fbshipit-source-id: 343ae098929f26122da6723316309325261a2b26
This pull request was exported from Phabricator. Differential Revision: D72945165 |
…ss-allocator and monarch CLI component Summary: Updates `kd_monarch` with the capability to dynamically query for information needed to create and use the `MastAllocator` in `controller.py`. Specifically it makes it possible to get these fields from the provided `ALLOCATOR_JOB_NAME` 1. `(hosts, gpus)` ProcMesh dimensions 2. RemoteProcessAllocator `port` (no need to hard code or have the user provide) 3. Task group name Also changes the remote-process-allocator (aka `hyperactor mesh-worker`) entrypoint's binary name in `kd_monarch_pkg` to be `monarch_bootstrap` to be consistent with its equivalent in OSS (hence makes the TorchX AppDef portable) Next: ~~[6/n] Have kd_monarch use the default component (the custom mast.py is no longer needed). Update the README with updated instructions.~~ [7/n] Remove rust CLI in favor of all-python (we delegate to torchx for most things anyways) [8/n] Add E2E unittest using the local_cwd scheduler (actually run a mini-trainer actor) [9/n] Write an oss hyperactor mesh-worker entrypoint binary [10/n] Author a Dockerfile that sets up the environment (much like fbpkgs do it for internal runs) [11/n] Author a TorchXAllocator Reviewed By: vidhyav Differential Revision: D75162358 fbshipit-source-id: d2756ffb529b3069baf94cd9dfdde00b8ef2ede4
Summary: All the working subcommands were falling back to Python anyways. Moved the (currently unimplemented) subcommand stubs: `bounce` and `stop` to Python. **Note:** couple of reasons why a Rust CLI for monarch isn't ideal: 1. Uses TorchX under the hood. TorchX is a Python library. 2. Due to #1 we have to run a Python CLI fallback anyways and the mechanics of this is meta specific (won't work for OSS). 3. Reverse pyo3 binding TorchX (call Python from Rust) doesn't work internally due to the way we package Python (hermetic PAR). 4. Any material benefits (e.g. performance?) of implementing the CLI in Rust would be negated by the effort to fix/deal-with #1-3. **Next:** ~~[6/n] Have kd_monarch use the default component (the custom mast.py is no longer needed). Update the README with updated instructions.~~ ~~[7/n] Remove rust CLI in favor of all-python (we delegate to torchx for most things anyways)~~ [8/n] Add E2E unittest using the local_cwd scheduler (actually run a mini-trainer actor) [9/n] Write an oss hyperactor mesh-worker entrypoint binary [10/n] Author a Dockerfile that sets up the environment (much like fbpkgs do it for internal runs) [11/n] Author a TorchXAllocator Reviewed By: vidhyav, suo Differential Revision: D75176535 fbshipit-source-id: 29020f4032bd642af26b393ade74f40b868df973
Summary: Pull Request resolved: pytorch-labs/monarch#29 Previous implementation was converting `Slice(offset, sizes=[size, 1, 1, ...], strides=[stride, ...])` into a selection with `Range(offset, offset + size, stride)`, which is wrong. It should be `Range(offset, offset + size * stride, stride)`. Reviewed By: shayne-fletcher, andrewjcg Differential Revision: D75102147 fbshipit-source-id: 979c3712e65a49cd4e04de71921b82baead721bc
Summary: Pull Request resolved: pytorch-labs/monarch#30 Fix warnings related to mesh slicing in `test_controller.py`, and update the file to use `TestingContext` in order to deal with device mesh shutdown warnings. Reviewed By: colin2328 Differential Revision: D75106997 fbshipit-source-id: 05fd5c85b612c6356560a4df26854931d2d90c64
Summary: applying code changes to be compatible with both 2021 and 2024 rust edition Reviewed By: mariusae, dtolnay Differential Revision: D75194983 fbshipit-source-id: 7cfa9cea68a63bb86f5c816edce493ff7defb3e2
Summary: `payload` is a tuple of `(rank, value)` so unpack it before raising. Before: ``` bugging__/debugging-inplace#link-tree/monarch/service.py", line 431, in _process raise payload TypeError: exceptions must derive from BaseException ``` After: the exception is raised properly Reviewed By: colin2328 Differential Revision: D75264392 fbshipit-source-id: cb78b2cde58c223441830788b572551db44fc5b4
Summary: As title. Reviewed By: mariusae Differential Revision: D74482409 fbshipit-source-id: 6957f3c803deed802d4ac90be1995520006e0dcb
Summary: When reducer is none, the split port's lambda is doing a simply forward. This diff separate this scenario into its own lambda so the code is easier to read and maintain. Reviewed By: mariusae Differential Revision: D75090318 fbshipit-source-id: 532182da291eebaa733e0c11f05816fcc0720a67
Summary: context would make it easier to track where the error is from. Reviewed By: dulinriley Differential Revision: D75244043 fbshipit-source-id: b05eda65896d0cdd750c0fa24247783f2d6739cc
Summary: Required by new versions of Cargo. Reviewed By: diliop Differential Revision: D75295569 fbshipit-source-id: 04295493397f2305f6e619565a1b3c4f031f80a9
Summary: Pull Request resolved: pytorch-labs/monarch#31 D75106997 appears to have broken github CI, and this partial revert doesn't seem to reintroduce any warnings, so hopefully this fixes things. Reviewed By: colin2328 Differential Revision: D75314370 fbshipit-source-id: 16caf12df35e53a20b547d4d9b5c290ea67a4641
Summary: event::read()? is blocking so we won't redraw the UI until an input event has been received. What we want instead of make it not blocking is to wrap it in a crossterm::poll Reviewed By: kaiyuan-li Differential Revision: D75296531 fbshipit-source-id: c6da46b6e7e58086232cca763e42d70384a0eb59
Summary: Fetching data in the UI loop results in a brief jank while we await newly fetched results Reviewed By: kaiyuan-li Differential Revision: D75296533 fbshipit-source-id: c269e392130c8bbb8bd8a45e6e4d616d93aed392
Summary: - Make table header green - Resize table column widths appropriately given lengths of real actor ids and world names from mast job - Sort actors by actor id after sorting by messages Reviewed By: kaiyuan-li Differential Revision: D75296532 fbshipit-source-id: 1959b510b4bd6d9be0b1bc68c57b743bce92f396
Summary: the `messages_received` and `messages_sent` key in monarch_metrics allow us to filter for ingress and egress events Reviewed By: kaiyuan-li Differential Revision: D75296530 fbshipit-source-id: e4735be37daf065c8e87417128324ed61acc4b00
Reviewed By: pablorfb-meta Differential Revision: D75297949 fbshipit-source-id: 55ea59b07801b7d5a42ebffd09f80bdb3589dd9a
Summary: Adds the OSS equivalent of `hyperactor_meta mesh-worker` called `process_allocator`. Also created a meta version of it (but haven't switched over to it in the job launcher). Reviewed By: colin2328 Differential Revision: D75299505 fbshipit-source-id: 1b4a89ddd2da424ce8f8ded9343d134979ebabb5
Summary: Required by new versions of Cargo. Reviewed By: bigfootjon Differential Revision: D75343369 fbshipit-source-id: 6af89d222a6de541c9c924a466ffd3165e707266
Summary: Pull Request resolved: pytorch-labs/monarch#32 This is the second try after D74718068 got backed out. Turns out `bootstrap_main` from `_rust_bindings` needs a reference to a running event loop, so it can't simply be passed into `asyncio.run()`: https://github.com/awestlake87/pyo3-asyncio#a-note-about-asynciorun We wrap it into a trivial coroutine instead. Reviewed By: amirafzali Differential Revision: D75284480 fbshipit-source-id: 24352a212d0490d2f1fe4eb6245212c1abfc1bfa
Summary: fix broken monarch_hyperactor build on master Reviewed By: amirafzali Differential Revision: D75454230 fbshipit-source-id: 3f70c4eba7eed4ece9a754888bbbc1a2cdc91fbc
Summary: fix various 2024 edition lints Reviewed By: benjipelletier Differential Revision: D75454728 fbshipit-source-id: 981b6a599ca51b8862159903cc13e032bf7eb322
Reviewed By: colin2328 Differential Revision: D75330629 fbshipit-source-id: b843fc690050b75ead957049a17da7c91ee26462
Summary: side-note: looks like `monarch.bootstrap_main` used to be `monarch._monarch.hyperactor.bootstrap_main`. Adds `monarch_bootstrap` as a console script so that it is installed with the monarch wheel. This makes it simpler to pass it as the the `program` (`cmd`) to the process-allocator. Reviewed By: LucasLLC, colin2328 Differential Revision: D75325523 fbshipit-source-id: af9cb48c0dbeb5b45bdd919c23acc9f05b147ddc
Summary: **Diff Purpose & Changes** 1. Creating the random/RNG remote functions for our new builtins library. torch.manual_seed(seed) torch.initial_seed() torch.get_rng_state() torch.set_rng_state(state) torch.cuda.get_rng_state_all() torch.cuda.set_rng_state_all(states) These two appear to be the same function. we can consider removing one or the other in the library. torch.seed() torch.random.seed() Reviewed By: vidhyav, colin2328 Differential Revision: D72944566
Summary: **Diff Purpose & Changes** - We can migrate all instances of worker.set_random_seed_impl to now use builtins.random.set_manual_seed. - Wonder if some of these cases would benefit from random_seed() instead. Maybe that can be checked out in the future. Reviewed By: suo Differential Revision: D72945165
7226b38
to
86f09ca
Compare
This pull request was exported from Phabricator. Differential Revision: D72945165 |
facebook-github-bot
pushed a commit
that referenced
this pull request
May 28, 2025
Summary: Pull Request resolved: #28 **Diff Purpose & Changes** - We can migrate all instances of worker.set_random_seed_impl to now use builtins.random.set_manual_seed. - Wonder if some of these cases would benefit from random_seed() instead. Maybe that can be checked out in the future. Reviewed By: suo Differential Revision: D72945165 fbshipit-source-id: b2f9cc48b34b021c41a6032be7bdd7ca31580823
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary:
Diff Purpose & Changes
Reviewed By: suo
Differential Revision: D72945165