Removed `module` arg from `fsdp_pre_all_gather` #217

Conversation
    @@ -232,7 +232,7 @@ def cast_w_to_float8(
                self.fp8_scale_w,
                scale_fn_name,
                torch.float8_e4m3fn,
    -           is_amax_initialized,
    +           self.is_amax_initialized,
We have to push this `is_amax_initialized` into the cast function since the `Float8LinearWeightTensor` subclass cannot keep a reference to the bool.
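A rough sketch of the shape of this change, under assumptions: `_cast_to_fp8` and the `Float8Linear` attributes below are illustrative stand-ins, not the library's actual API. The point is that the flag lives on the module and is passed into the cast explicitly, since the weight tensor subclass cannot carry it.

```python
import torch
import torch.nn as nn

def _cast_to_fp8(x, scale, float8_dtype, is_amax_initialized):
    # Hypothetical cast helper: the flag arrives as an explicit argument
    # instead of being captured from surrounding state.
    if not is_amax_initialized:
        scale = torch.finfo(float8_dtype).max / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(float8_dtype)

class Float8Linear(nn.Linear):
    def __init__(self, in_features, out_features, bias=True):
        super().__init__(in_features, out_features, bias)
        self.is_amax_initialized = False
        self.register_buffer("fp8_scale_w", torch.tensor(1.0))

    def cast_w_to_float8(self):
        # The flag is read off the module and threaded into the cast call,
        # mirroring the `self.is_amax_initialized` change above.
        # (Autograd handling is omitted in this sketch.)
        return _cast_to_fp8(
            self.weight.detach(),
            self.fp8_scale_w,
            torch.float8_e4m3fn,
            self.is_amax_initialized,
        )
```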
                return o

            with torch._C.DisableTorchFunctionSubclass():
                if isinstance(args[0], cls):
The heuristic here is just to propagate the `Float8LinearWeightTensor` as long as it is the first argument.
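A minimal sketch of that propagation heuristic, assuming a plain `__torch_function__`-based `torch.Tensor` subclass (the class name matches the discussion, but the body is illustrative, not the exact PR code):

```python
import torch

class Float8LinearWeightTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Run the op on plain tensors to avoid recursing back into this override.
        with torch._C.DisableTorchFunctionSubclass():
            out = func(*args, **kwargs)
            # Heuristic: keep the subclass type only when the subclass is the
            # first positional argument (e.g. w.t(), w.to(dtype)).
            if args and isinstance(args[0], cls) and isinstance(out, torch.Tensor):
                out = out.as_subclass(cls)
            return out

# Usage sketch: view/cast ops on the weight keep the subclass type.
w = torch.randn(4, 4).as_subclass(Float8LinearWeightTensor)
assert isinstance(w.t(), Float8LinearWeightTensor)
```

With this heuristic, ops where the weight is the first argument stay wrapped in the subclass, while everything else falls back to plain tensors.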
    @@ -142,12 +147,34 @@ def from_float(


    class Float8DynamicLinearWeightTensor(torch.Tensor):
        # TODO: Remove `module` arg, save state on subclass, and propagate it.
        def fsdp_pre_all_gather(
            self, module: nn.Module
The problem statement is that in order to implement the pre-all-gather transform, the subclass needs some state (mainly `emulate`, but preferably the cast function as well). In the first prototype, I took a shortcut by passing `module` into `fsdp_pre_all_gather()` so that the transform could read state off the module instead of storing it on the subclass itself. However, the morally right thing (from a design perspective) is to put that state on the subclass. I wonder, though, how this kind of `__torch_function__` subclass interacts with `torch.compile` today. cc: @bdhirsh
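A minimal sketch of the "state on the subclass" direction, under assumptions: the simplified dynamic cast, the `emulate` attribute, and the exact signature/return contract of `fsdp_pre_all_gather` are illustrative here, not the library's or FSDP's actual extension API.

```python
import torch

def _dynamic_cast_to_fp8(t: torch.Tensor) -> torch.Tensor:
    # Simplified dynamic cast: scale by amax, then convert to float8_e4m3fn.
    amax = t.abs().max().clamp(min=1e-12)
    scale = torch.finfo(torch.float8_e4m3fn).max / amax
    return (t * scale).to(torch.float8_e4m3fn)

class Float8DynamicLinearWeightTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, data: torch.Tensor, emulate: bool = False):
        return data.as_subclass(cls)

    def __init__(self, data: torch.Tensor, emulate: bool = False):
        # State the pre-all-gather transform needs lives on the subclass
        # itself rather than being read off the owning nn.Module.
        self.emulate = emulate

    def fsdp_pre_all_gather(self):
        # No `module` arg: everything the cast needs is on `self`.
        # (The real FSDP extension point may also expect metadata for the
        # post-all-gather step; this sketch only shows where the state lives.)
        return _dynamic_cast_to_fp8(self.detach())
```

Keeping `emulate` (and eventually the cast configuration) on the tensor lets the FSDP hook stay module-agnostic, which is the design direction argued for above.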
Stack from ghstack (oldest at bottom):

- Removed `module` arg from `fsdp_pre_all_gather` #217
- `use_activation_hooks: bool` to swap #214
- `amax_and_scale_synced` unconditionally #220

On the discussion of whether fp8 all-gather is viable _without_ compiling the pre-all-gather cast to fp8, this PR would add _more_ CPU overhead due to the `__torch_function__` override, making it less viable.