This repository was archived by the owner on Aug 7, 2024. It is now read-only.

static scaling support for training #306

Closed
wants to merge 6 commits

Conversation

@vkuzo (Contributor) commented Jul 5, 2024

Stack from ghstack (oldest at bottom):

Summary:

Some activations, such as sigmoid, have a bounded range. This PR adds support for using such a known bound to set a static scale in training.
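To illustrate the idea (a minimal sketch, not code from this repo; the tensor shapes and the cast sequence are my assumptions), the static scale comes from the known bound instead of a per-tensor amax:

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3

# dynamic scaling: the scale depends on the observed amax of each tensor
x = torch.sigmoid(torch.randn(16, 32))
dynamic_scale = FP8_MAX / x.abs().max()

# static scaling: sigmoid output is bounded by 1.0, so the scale is a constant
# known ahead of time and no amax reduction is needed at runtime
static_scale = torch.tensor(FP8_MAX / 1.0)
x_fp8 = (x * static_scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
```

Avoiding the runtime amax reduction is the intended source of the savings visible in the benchmarks below.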

Test Plan:

// unit tests
pytest test/test_base.py

// baseline
> python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category                                              
0_gemm         0.637     0.353       0.555       1.803
1_f8_overhead  0.000     0.175         inf       0.000
2_other        0.224     0.199       0.888       1.126
All            0.861     0.727       0.844       1.184

> python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear --scaling_type_x static
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category                                              
0_gemm         0.635     0.360       0.566       1.766
1_f8_overhead  0.000     0.182         inf       0.000
2_other        0.224     0.154       0.688       1.454
All            0.859     0.696       0.810       1.234


Summary:

In certain cases, activations and gradients can have a bounded range.
For example, consider sigmoid -> fc -> ln -> sigmoid:
1. range of sigmoid in the forward is bounded, so we can scale
   statically if we are ok with a slight accuracy drop in the case that
   the observed values do not reach the theoretical bound
2. range of derivative of sigmoid is bounded
   (https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x)
3. the derivative of LN (https://liorsinai.github.io/mathematics/2022/05/18/layernorm.html)
   depends on the incoming gradient and the trainable LN parameters, so we can
   derive a bound from the incoming bound and the max of the LN parameters
   (see the sketch after this list)
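
To make the three points concrete, here is a rough sketch of how the static bounds and scales for this chain could be derived (my own illustration with an assumed hidden size of 4096; not code from this PR, and the LN bound follows the hand-waving above rather than a tight derivation):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

# point 1: sigmoid output lies in (0, 1), so a fixed bound of 1.0 is safe for x
x_bound = 1.0
x_scale = FP8_MAX / x_bound

# point 2: |d sigmoid(x)/dx| = sigmoid(x) * (1 - sigmoid(x)) <= 0.25, so the
# gradient entering ln's backward starts from a known bound
dsigmoid_bound = 0.25

# point 3: loosely bound dL_dY of the fc layer by scaling the incoming bound
# with the largest LN weight magnitude (illustrative only, not a tight bound)
ln = torch.nn.LayerNorm(4096)
dL_dY_bound = dsigmoid_bound * ln.weight.detach().abs().max().item()
dL_dY_scale = FP8_MAX / dL_dY_bound
```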

This PR adds static scaling as an option for x, w, and dL_dY, plus a quick
benchmark to verify that performance is as expected.

TODO add numerics testing.

Test Plan:

```
// baseline
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.160     0.098       0.613       1.632
 1_f8_overhead  0.000     0.100         inf       0.000
 2_other        0.147     0.121       0.823       1.215
 All            0.307     0.319       1.040       0.962

 // static scaling for x (easier to justify numerics given a bounded activation such as sigmoid)
 python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.665     0.362       0.545       1.834
 1_f8_overhead  0.000     0.269         inf       0.000
 2_other        0.396     0.273       0.689       1.452
 All            1.061     0.904       0.853       1.173

 // static scaling for x and dL_dY (handwaving for now, the actual code would
 // need to read the LN params to get the max)
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static --scaling_type_dL_dY static
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.665     0.365       0.549       1.822
 1_f8_overhead  0.000     0.242         inf       0.000
 2_other        0.395     0.273       0.690       1.448
 All            1.060     0.879       0.830       1.205

```


[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jul 5, 2024
ghstack-source-id: 816f87e
Pull Request resolved: #306
@facebook-github-bot added the CLA Signed label Jul 5, 2024 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
vkuzo added a commit that referenced this pull request Jul 5, 2024
ghstack-source-id: 538c24e
Pull Request resolved: #306
Summary:

In certain cases, activations and gradients can have a bounded range.
Example 1: consider sigmoid -> fc -> ln -> sigmoid:
1. range of sigmoid in the forward is bounded, so we can scale
   statically if we are ok with a slight accuracy drop in the case that
   the observed values do not reach the theoretical bound
2. range of derivative of sigmoid is bounded
   (https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x)
3. the derivative of LN (https://liorsinai.github.io/mathematics/2022/05/18/layernorm.html)
   depends on the incoming gradient and the trainable LN parameters, so we can
   derive a bound from the incoming bound and the max of the LN parameters

Example 2: for activations such as silu or GELU, the derivative has a bounded range (a quick numeric check is sketched below).
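
As an informal numeric check of Example 2 (not part of this PR), the derivative bounds of sigmoid, silu, and GELU can be estimated with autograd over a dense grid; the printed values are approximate:

```python
import torch

def derivative_bound(fn, lo=-20.0, hi=20.0, n=200_001) -> float:
    # numerically estimate max |d fn(x) / dx| over a dense grid
    x = torch.linspace(lo, hi, n, requires_grad=True)
    (grad,) = torch.autograd.grad(fn(x).sum(), x)
    return grad.abs().max().item()

print(derivative_bound(torch.sigmoid))              # ~0.25
print(derivative_bound(torch.nn.functional.silu))   # ~1.1
print(derivative_bound(torch.nn.functional.gelu))   # ~1.13
```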

This PR adds static scaling as an option for x, w, and dL_dY, plus a quick
benchmark to verify that performance is as expected.

TODO: add numerics testing (one possible check is sketched after the test plan below).

Test Plan:

```
// baseline
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category                                              
0_gemm         0.664     0.358       0.539       1.857
1_f8_overhead  0.000     0.260         inf       0.000
2_other        0.397     0.318       0.802       1.247
All            1.061     0.935       0.882       1.134

 // static scaling for x (easier to justify numerics given a bounded activation such as sigmoid)
 python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.665     0.362       0.545       1.834
 1_f8_overhead  0.000     0.269         inf       0.000
 2_other        0.396     0.273       0.689       1.452
 All            1.061     0.904       0.853       1.173

 // static scaling for x and dL_dY (handwaving for now, the actual code would
 // need to read the LN params to get the max)
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static --scaling_type_dL_dY static
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.665     0.365       0.549       1.822
 1_f8_overhead  0.000     0.242         inf       0.000
 2_other        0.395     0.273       0.690       1.448
 All            1.060     0.879       0.830       1.205

```
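
For the numerics TODO above, one possible sanity check (a sketch under my own assumptions, not a test from this repo) compares the float8 round-trip error of the static, bound-based scale against a dynamic amax-based scale on a bounded activation:

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max

def fp8_round_trip(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # quantize to float8 with the given scale, then dequantize
    return ((x * scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)).float() / scale

x = torch.sigmoid(torch.randn(1024, 1024))
static_err = (fp8_round_trip(x, torch.tensor(FP8_MAX / 1.0)) - x).abs().max()
dynamic_err = (fp8_round_trip(x, FP8_MAX / x.abs().max()) - x).abs().max()
# with a tight theoretical bound, static scaling should stay close to dynamic
assert static_err <= 2 * dynamic_err
```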


[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jul 8, 2024
ghstack-source-id: 59402fc
Pull Request resolved: #306
@vkuzo vkuzo mentioned this pull request Jul 8, 2024
vkuzo added a commit that referenced this pull request Jul 8, 2024
ghstack-source-id: fb41cc2
Pull Request resolved: #306
vkuzo added a commit that referenced this pull request Jul 8, 2024
ghstack-source-id: fc2f8e3
Pull Request resolved: #306
vkuzo added a commit that referenced this pull request Jul 8, 2024
ghstack-source-id: b8fd4e9
Pull Request resolved: #306
@vkuzo vkuzo changed the title from "[wip] static scaling support for training" to "static scaling support for training" Jul 8, 2024
@vkuzo (Contributor, Author) commented Jul 8, 2024

keeping this in the back pocket until it's needed, abandon for now

@vkuzo vkuzo closed this Jul 8, 2024