This repository was archived by the owner on Aug 7, 2024. It is now read-only.

[POC] Added fp8 all-gather extensions #216

Closed
wants to merge 6 commits

Conversation

@awgu commented Feb 14, 2024

Stack from ghstack (oldest at bottom):

### Overview

This PR shows the prototype for enabling fp8 all-gather for `Float8Linear.weight` and `Float8DynamicLinear.weight`. This requires changes from pytorch/pytorch#119378.

The approach is to change the `weight` tensor into a tensor subclass that defines two methods, `fsdp_pre_all_gather()` and `fsdp_post_all_gather()`. We currently prefer this approach since subclasses are the blessed approach for extending at the tensor level. However, we are evaluating the implications on both eager performance and compile.

See #201 for some more notes on per-parameter FSDP and fp8.
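
For concreteness, here is a minimal sketch of what such a subclass could look like. The class name, the scaling math, and the exact `fsdp_pre_all_gather()` / `fsdp_post_all_gather()` signatures and return values are illustrative assumptions for this note, not the code in this PR or the final extension contract from pytorch/pytorch#119378.

```python
import torch


class Float8AllGatherWeight(torch.Tensor):
    """Sketch of a weight subclass that all-gathers in fp8 instead of bf16/fp32."""

    @staticmethod
    def __new__(cls, data: torch.Tensor, scale: torch.Tensor):
        return torch.Tensor._make_subclass(cls, data, require_grad=data.requires_grad)

    def __init__(self, data: torch.Tensor, scale: torch.Tensor):
        # The fp8 scale is assumed to be maintained elsewhere (delayed or dynamic scaling).
        self._scale = scale

    def fsdp_pre_all_gather(self, mesh):
        # Cast the local shard to fp8 before the collective so the all-gather moves
        # 1 byte per element instead of 2 (bf16) or 4 (fp32).
        fp8_shard = (self.to(torch.float32) * self._scale).to(torch.float8_e4m3fn)
        # Return the tensor(s) to all-gather plus metadata needed after the gather.
        return (fp8_shard,), (self._scale,)

    def fsdp_post_all_gather(self, all_gather_outputs, metadata, param_dtype, *, out=None):
        (fp8_weight,) = all_gather_outputs
        (scale,) = metadata
        # The real prototype would wrap the gathered fp8 data and its scale into a
        # Float8Tensor so float8_mm can consume it; this sketch just hands back the
        # fp8 tensor and the buffers that FSDP may later free.
        return fp8_weight, all_gather_outputs
```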

### `torch.compile` w/o fp8 all-gather

**TL;DR** only `transformer_block.forward = torch.compile(transformer_block.forward)` works today.

|                                     | Delayed Scaling | Dynamic Scaling |
|-------------------------------------|-----------------|-----------------|
| Compile Transformer Block `forward` | 🙁 requires disabling amax init <br> 🙁 requires disabling pre/post-forward <br> ✅ one graph per transformer block | ✅ one graph per transformer block <br> ❌ error in float8_mm if compiling output projection |
| Compile Transformer                 | 🙁 requires disabling amax init <br> ❌ unexpected graph breaks | ❌ error in float8_mm compile |
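
For reference, the block-level compile in the table above is just the following pattern; `ToyBlock` is a hypothetical stand-in for the repo's transformer block, not a class from this codebase.

```python
import torch
import torch.nn as nn


class ToyBlock(nn.Module):
    """Hypothetical stand-in for a transformer block."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.relu(self.linear(x))


blocks = nn.ModuleList([ToyBlock() for _ in range(2)])
for transformer_block in blocks:
    # Compile each block's forward() in place; per the table above, compiling the
    # whole transformer instead currently hits graph breaks / float8_mm errors.
    transformer_block.forward = torch.compile(transformer_block.forward)

out = blocks[0](torch.randn(4, 16))
```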

### `torch.compile` w/ fp8 all-gather

Context: Per-parameter FSDP runs all-gather in a pre-forward hook and frees parameters in a post-forward hook. Using `transformer_block.forward = torch.compile(transformer_block.forward)` does not compile the hooks, so we distinguish between two cases when doing this block-level compile: including and not including the hooks.

One way to emulate including the hooks is to change per-parameter FSDP to override `forward()` instead of using hooks, in which case `transformer_block.forward` would include FSDP's pre/post-forward logic directly. We have not investigated this yet.
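
To make the "override `forward()`" idea concrete, one hypothetical shape for it is sketched below; `_unshard()` and `_reshard()` are placeholders for FSDP's real pre-forward all-gather and post-forward free, which are not reproduced here.

```python
import torch.nn as nn


class FSDPForwardOverride(nn.Module):
    """Hypothetical wrapper that puts FSDP-style pre/post logic inside forward()
    so that compiling forward() also captures the unshard/reshard steps."""

    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, *args, **kwargs):
        self._unshard()   # placeholder for the pre-forward all-gather
        out = self.block(*args, **kwargs)
        self._reshard()   # placeholder for the post-forward parameter free
        return out

    def _unshard(self):
        pass

    def _reshard(self):
        pass
```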

|                           | Delayed Scaling | Dynamic Scaling |
|---------------------------|-----------------|-----------------|
| Compile Transformer Block `forward` w/o Forward Hooks | 🙁 requires disabling amax init <br> 🙁 requires disabling pre/post-forward <br> ✅ one graph per transformer block | ✅ one graph per transformer block <br> ❌ error in float8_mm if compiling output projection |
| Compile Transformer       | 🙁 requires disabling amax init <br> ❌ error in pre-all-gather compile | ❌ error in float8_mm compile |
| Compile Transformer Block incl. Forward Hooks | | |

awgu pushed a commit that referenced this pull request Feb 14, 2024
ghstack-source-id: ff0433d
Pull Request resolved: #216
@facebook-github-bot added the CLA Signed label Feb 14, 2024
Andrew Gu added 3 commits February 16, 2024 07:36
@awgu closed this Feb 27, 2024