-
Notifications
You must be signed in to change notification settings - Fork 35
: rust bindings: fix expose Selection to python. missed a bit #19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
shayne-fletcher
wants to merge
24
commits into
pytorch-labs:main
from
shayne-fletcher:export-D75147364
Closed
: rust bindings: fix expose Selection to python. missed a bit #19
shayne-fletcher
wants to merge
24
commits into
pytorch-labs:main
from
shayne-fletcher:export-D75147364
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
fbshipit-source-id: 3e19879d1fc444bef4596e58157014628792c364
Summary: `aws-lc-rs` is painful to build inside of Buck because the build script is complicated (see aws/aws-lc-rs#800 for context). Note that while we expect the `aws-lc-rs` build to become less complicated, that'll likely take more than a quarter. Reviewed By: capickett Differential Revision: D74884051 fbshipit-source-id: 9c77973f51e6887665e8694ad85adb3a9fe62e09
…ate's tests Summary: Used in D74651293 Reviewed By: shayne-fletcher Differential Revision: D74919216 fbshipit-source-id: 6df4e256de1df31cd06004cb0fdb45a285180c9e
Summary: As title. Reviewed By: mariusae Differential Revision: D74651293 fbshipit-source-id: 5c82ba9fe0929d3b9bd7384f4345dcaa891c959f
Summary: **Diff Purpose & Changes** 1. Library creation for 'builtins', where our new UDFs will lie. 2. Implementation of the logging related remote functions. Reviewed By: colin2328 Differential Revision: D72941478 fbshipit-source-id: dcea47ab363997b80c9de5e33007734eebbf495f
Summary: **Diff Purpose & Changes** Adding some unit tests for the logging builtins. Reviewed By: colin2328 Differential Revision: D72941477 fbshipit-source-id: 1d41ca38f5b5eb6dbc7881daf281782ef69ca9e3
Summary: **Diff Purpose & Changes** Now that we have the library log function, let's migrate existing usages over. Reviewed By: colin2328 Differential Revision: D72941480 fbshipit-source-id: a39e1dfad8e9f6dfd0d2c8ee429b04ed04065bb1
Summary: **Diff Purpose & Changes** Same as D72941480, but for set_worker_logging_level Reviewed By: colin2328 Differential Revision: D72942837 fbshipit-source-id: 83115a6915ec353fb295f72baf6285dcc45fa24a
Summary: Verified by Shayne Reviewed By: amirafzali Differential Revision: D75013277 fbshipit-source-id: 350b6d876b80a5f9a68be5d66315beafc6f48de1
Summary: this diff provides a Python binding for the `Selection` type. includes a class-method constructor `Selection.from_string(…)`. integrates into `monarch_extension` and adds a basic test. Reviewed By: dulinriley Differential Revision: D74971945 fbshipit-source-id: f63d6bdd9a9db23cef8f099f72a139bf046d079f
Summary: this diff introduces property-based testing for `collect_commactor_routing_tree` to validate path determinism under CommActor routing semantics. the main test asserts that if two selections `S` and `T` both deliver to a rank `n`, then the logical multicast path to `n` must be the same: ```lang=text, ∀ S, T, n. n ∈ eval(S, slice) ∧ n ∈ eval(T, slice) ⇒ route(n, S) == route(n, T) ``` this complements `collect_routed_path_determinism` by checking the same invariant in the context of CommActor routing—i.e., the actual mechanism used in production. it confirms that message delivery paths are structurally deterministic, even when selections differ syntactically. Reviewed By: moonli Differential Revision: D74997294 fbshipit-source-id: 9a90427f640f98a993763b9ece55e65f19dba132
Summary: This diff adds e2e support for `monarch.compile` with rust meshes. It adds logic inside `WorkerActor` to forward recording messages to the relevant streams, and then dispatch a subsequent `CallRecording` message to those streams. It also adds unit tests for the rust backend inside `test_coalescing`. Some very simple testing shows that **`monarch.compile` reduces the client-side dispatch overhead of the nanoGPT training loop by ~300x** . There were several gotchas that made this harder than expected: - Maximum frame size in `tokio` channels - We set a maximum frame size of 8 MiB, but when testing nanoGPT with `monarch.compile`, the `DefineRecording` message for the training step was ~17 MB, which caused a crash. Since the discussion around message compression was left unresolved, I had to implement manual chunking, where a recording is defined across several `DefineRecording` messages such that each `DefineRecording` message contains at most 1000 subcommands. The `WorkerActor` then needs to combine these together before forwarding the recording to the streams. - IValue shenanigans - Depending on the contents of the recording, building the `DefineRecording` pyo3 rust struct may require cloning a torch IValue tensor, which breaks when our custom torch dispatch mode is active and the cloning happens outside `__torch_dispatch__` (see D74422995). As a workaround, if a `CallFunction` message is part of a recording, its python parameters are passed into rust via `DefineRecording.append_call_function(...)`, which internally constructs the rust pyo3 `CallFunction`. This is as opposed to previously, where the rust pyo3 `CallFunction` was passed directly to `DefineRecording` and then cloned. - Some torch ops (for example, I ran into this issue with scaled dot product attention backward for nanoGPT) can accept undefined tensors if `None` is passed to the python function but the underlying C++ function schema requires a reference to a tensor. So we end up with IValues that contain undefined tensors. Attempting to clone these causes a crash independent of the `__torch_dispatch__` mode issue described above. As a workaround, our FFI implementation of `ivalue_deepcopy` now explicitly checks for an undefined tensor and handles that case separately. Reviewed By: colin2328 Differential Revision: D74590123 fbshipit-source-id: 3eb021dca973f14ad9b9942ebc3036d12b3e2e30
Summary: Clarify expectations around early development and reporting bugs. Reviewed By: jspisak Differential Revision: D75032118 fbshipit-source-id: 9644503d4474475d016967a72efc7bb4b4096919
Summary: Pull Request resolved: #10 apparently you can't gave `#[pymethods]` twice. not clear why this is detected in the OSS build but passes locally :-/ Reviewed By: moonli Differential Revision: D75027440 fbshipit-source-id: 03719ae8a5a51fc75c530d6963c658e8bb0f956b
Summary: Bootstrap entire backend through pyo3 rust bindings instead of spawning a binary to keep everything in a single process Reviewed By: pablorfb-meta Differential Revision: D73666832 fbshipit-source-id: 61def53812f91c73c2f9aaed4c5fed7cf4356c0b
Summary: This is to set us up for the next 2 diffs where we will remove source/destination semantics from SimAddress by passing in the source address when we dial an address The important parts of this are in `channel/mod.rs` and `channel/sim.rs`. Everything else is just passing in `None` as the dialer for now. In the next diff we will make ProcActors and SystemActors pass in their listen addresses for this arg. Full Context: One of the biggest pain points with the simulator is that there is a concept of directionality in our addresses. This adds cognitive complexity as we have to reason about pointing the channel in the correct direction and making sure to always have the correct source. One of the kinks caused by this is for example, we introduced the capability for an actor to talk to another actor directly instead of going through the system after it learns its address, but the system would have to make sure to modify this address to use the right source each time instead of only being concerned with the address of the destination. The reason why we need directionality for simulation is so that we can figure out how much latency should occur when a message is sent. The only piece that needs to know this is the transmitter, and for a transmitter, it does make sense semantically have a concept of directionality (I am sending to an address, from an address). What we will do instead of source-to-destination addresses for sim channels, is we will store the source and destination information in the transmitter, and initialize these values when we dial a channel. dial() will now take in both the address we want to dial, as well as and optional address of the dialer. The most common use case is when a proc dials the system, it will give the address it is being served on as the dialer. The system will also need to bind it's own id to the address is it being served on so it can provide this when dialing other procs. Reviewed By: moonli, pablorfb-meta Differential Revision: D74854480 fbshipit-source-id: 3d522453b46029c68bbc2096d48f90b6741dcae6
Summary: this diff enhances the `collect_routed_paths` simulation to track additional structural information about the routing traversal: • introduces a new `RoutedPathTree` struct capturing: • `delivered`: map from destination ranks to delivery paths (previous behavior). • `predecessors`: map from rank to set of direct predecessor ranks. • updates `collect_routed_paths` to return `RoutedPathTree`, recording parent → child edges at each hop. • updates the `collect_routed_path_determinism` test to access the new delivered field explicitly. Reviewed By: moonli Differential Revision: D75002480 fbshipit-source-id: f97479d7b43b8bfcd5cfed870445ceff1d1e8b5f
Summary: this adds a property-based test for the “Unique Predecessor Theorem” on the routing structure produced by `collect_routed_paths`. the theorem asserts that for any reachable node `x`, there is at most one predecessor `y ≠ x` responsible for routing a message to `x`. Any additional edges to `x` must either be self-loops (`x → x`) or structurally duplicated hops from the same `y` (e.g., due to unions). the test inspects the predecessor map and asserts that all non-self predecessors per node are unique. this provides structural validation of the routing semantics and helps detect unintended duplication or routing ambiguity. Reviewed By: moonli Differential Revision: D75010254 fbshipit-source-id: 73cd244511d1be13249b48ae05f41dc51de1a076
…ponent_helpers to further reduce component complexity in monarch Summary: Takes out `enable-ttls` and `run-as-root` logic from `monarch.tools.components` into `torchx.specs.fb.component_helpers` to reduce the amount of meta-isms in `monarch.tools`. Next up: [5/n] Create oss and meta default configs and push meta-isms from `components.hyperactor` to `components.meta.hyperactor` Reviewed By: highker Differential Revision: D74748885 fbshipit-source-id: 49a0fafee4824b16cdb5a0d545cd88a5ac866405
Summary: this test captures a key structural guarantee of the routing system: that overdelivery is prevented not by explicit guards, but by the semantics of the traversal combined with frame deduplication. the comment explains that when multiple delivery frames reach the same destination, they must have the same routing lineage and state — making them structurally equivalent. as a result, deduplication ensures only one of them is retained. by disabling deduplication, the test demonstrates that overdelivery becomes possible, highlighting the value of this mechanism. i believe this is helpful both as regression protection and as executable documentation of a subtle invariant. Reviewed By: moonli Differential Revision: D75027898 fbshipit-source-id: 2a9eed9056e03e6c7d2c3c99db6344def4d9a0dc
…ting_tree Summary: adds a property-based test validating the “Unique Predecessor Theorem” under `collect_commactor_routing_tree`. this test asserts that in the CommActor-style multicast routing simulation: - each destination is forwarded to by at most one distinct peer (i.e., no fan-in), - and no node forwards to itself (`x → x`), a stricter property than general breadth-first routing. we reconstruct the predecessor relation from `tree.forwards`, checking for violations across 256 randomly generated selections and mesh topologies. this complements the earlier test for `collect_routed_paths` and reinforces structural routing invariants. Reviewed By: moonli Differential Revision: D75016108 fbshipit-source-id: 75db6eb8de91b928a771252d4a8284a5e5cf3e17
Summary: Moves the monarch_hyperactor python bindings into modules that match where they are defined in python. This completes the move of binding code. Reviewed By: colin2328 Differential Revision: D74975746 fbshipit-source-id: 5ec7d62c54ee3e7ad3f689463d410c8626b30df8
Summary: get rid of the unneeded redirects, fix more cases where the rust module name didn't match the python name Reviewed By: colin2328 Differential Revision: D75016551 fbshipit-source-id: 7eb37cd399764c91f597ad046f86c259119afddb
Summary: should have been in D74971945 looks like Reviewed By: benjipelletier, mariusae Differential Revision: D75147364
This pull request was exported from Phabricator. Differential Revision: D75147364 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary: should have been in D74971945 looks like
Reviewed By: benjipelletier, mariusae
Differential Revision: D75147364