This repository was archived by the owner on Aug 7, 2024. It is now read-only.

static scaling support for training #306

Closed
wants to merge 6 commits

Conversation

@vkuzo (Contributor) commented Jul 5, 2024

Stack from ghstack (oldest at bottom):

Summary:

Some activations, such as sigmoid, have a bounded range. This PR adds support for using such a known bound to set a static scale in training.
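To illustrate the idea (a minimal sketch, not code from this repo; the tensor shapes and the cast sequence are my assumptions), the static scale comes from the known bound instead of a per-tensor amax:

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max  # 448.0 for e4m3

# dynamic scaling: the scale depends on the observed amax of each tensor
x = torch.sigmoid(torch.randn(16, 32))
dynamic_scale = FP8_MAX / x.abs().max()

# static scaling: sigmoid output is bounded by 1.0, so the scale is a constant
# known ahead of time and no amax reduction is needed at runtime
static_scale = torch.tensor(FP8_MAX / 1.0)
x_fp8 = (x * static_scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)
```

Avoiding the runtime amax reduction is the intended source of the savings visible in the benchmarks below.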

Test Plan:

// unit tests
pytest test/test_base.py

// baseline
> python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category                                              
0_gemm         0.637     0.353       0.555       1.803
1_f8_overhead  0.000     0.175         inf       0.000
2_other        0.224     0.199       0.888       1.126
All            0.861     0.727       0.844       1.184

> python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear --scaling_type_x static
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category                                              
0_gemm         0.635     0.360       0.566       1.766
1_f8_overhead  0.000     0.182         inf       0.000
2_other        0.224     0.154       0.688       1.454
All            0.859     0.696       0.810       1.234


Summary:

In certain cases, activations and gradients can have a bounded range.
For example, consider sigmoid -> fc -> ln -> sigmoid:
1. range of sigmoid in the forward is bounded, so we can scale
   statically if we are ok with a slight accuracy drop in the case that
   the observed values do not reach the theoretical bound
2. range of derivative of sigmoid is bounded
   (https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x)
3. the derivative of LN (https://liorsinai.github.io/mathematics/2022/05/18/layernorm.html)
   depends on the incoming gradient and the trainable LN parameters, so we can
   derive a bound from the incoming bound and the max of the LN parameters
   (see the sketch after this list)
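
To make the three points concrete, here is a rough sketch of how the static bounds and scales for this chain could be derived (my own illustration with an assumed hidden size of 4096; not code from this PR, and the LN bound follows the hand-waving above rather than a tight derivation):

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

# point 1: sigmoid output lies in (0, 1), so a fixed bound of 1.0 is safe for x
x_bound = 1.0
x_scale = FP8_MAX / x_bound

# point 2: |d sigmoid(x)/dx| = sigmoid(x) * (1 - sigmoid(x)) <= 0.25, so the
# gradient entering ln's backward starts from a known bound
dsigmoid_bound = 0.25

# point 3: loosely bound dL_dY of the fc layer by scaling the incoming bound
# with the largest LN weight magnitude (illustrative only, not a tight bound)
ln = torch.nn.LayerNorm(4096)
dL_dY_bound = dsigmoid_bound * ln.weight.detach().abs().max().item()
dL_dY_scale = FP8_MAX / dL_dY_bound
```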

This PR adds static scaling as an option for x, w, and dL_dY, plus a quick
benchmark to verify that performance is as expected.

TODO add numerics testing.

Test Plan:

```
// baseline
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.160     0.098       0.613       1.632
 1_f8_overhead  0.000     0.100         inf       0.000
 2_other        0.147     0.121       0.823       1.215
 All            0.307     0.319       1.040       0.962

 // static scaling for x (easier to justify numerics given a bounded activation such as sigmoid)
 python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.665     0.362       0.545       1.834
 1_f8_overhead  0.000     0.269         inf       0.000
 2_other        0.396     0.273       0.689       1.452
 All            1.061     0.904       0.853       1.173

 // static scaling for x and dL_dY (handwaving for now, the actual code would
 // need to read the LN params to get the max)
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static --scaling_type_dL_dY static
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.665     0.365       0.549       1.822
 1_f8_overhead  0.000     0.242         inf       0.000
 2_other        0.395     0.273       0.690       1.448
 All            1.060     0.879       0.830       1.205

```


[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jul 5, 2024
ghstack-source-id: 816f87e
Pull Request resolved: #306
@facebook-github-bot added the CLA Signed label Jul 5, 2024 (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
vkuzo added a commit that referenced this pull request Jul 5, 2024
ghstack-source-id: 538c24e
Pull Request resolved: #306
Summary:

In certain cases, activations and gradients can have a bounded range.
Example 1: consider sigmoid -> fc -> ln -> sigmoid:
1. range of sigmoid in the forward is bounded, so we can scale
   statically if we are ok with a slight accuracy drop in the case that
   the observed values do not reach the theoretical bound
2. range of derivative of sigmoid is bounded
   (https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x)
3. the derivative of LN (https://liorsinai.github.io/mathematics/2022/05/18/layernorm.html)
   depends on the incoming gradient and the trainable LN parameters, so we can
   derive a bound from the incoming bound and the max of the LN parameters

Example 2: for activations such as silu or GELU, the derivative has a bounded range (a quick numeric check is sketched below).
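
As an informal numeric check of Example 2 (not part of this PR), the derivative bounds of sigmoid, silu, and GELU can be estimated with autograd over a dense grid; the printed values are approximate:

```python
import torch

def derivative_bound(fn, lo=-20.0, hi=20.0, n=200_001) -> float:
    # numerically estimate max |d fn(x) / dx| over a dense grid
    x = torch.linspace(lo, hi, n, requires_grad=True)
    (grad,) = torch.autograd.grad(fn(x).sum(), x)
    return grad.abs().max().item()

print(derivative_bound(torch.sigmoid))              # ~0.25
print(derivative_bound(torch.nn.functional.silu))   # ~1.1
print(derivative_bound(torch.nn.functional.gelu))   # ~1.13
```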

This PR adds static scaling as an option for x, w, and dL_dY, plus a quick
benchmark to verify that performance is as expected.

TODO: add numerics testing (one possible check is sketched after the test plan below).

Test Plan:

```
// baseline
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category                                              
0_gemm         0.664     0.358       0.539       1.857
1_f8_overhead  0.000     0.260         inf       0.000
2_other        0.397     0.318       0.802       1.247
All            1.061     0.935       0.882       1.134

 // static scaling for x (easier to justify numerics given a bounded activation such as sigmoid)
 python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.665     0.362       0.545       1.834
 1_f8_overhead  0.000     0.269         inf       0.000
 2_other        0.396     0.273       0.689       1.452
 All            1.061     0.904       0.853       1.173

 // static scaling for x and dL_dY (handwaving for now, the actual code would
 // need to read the LN params to get the max)
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static --scaling_type_dL_dY static
...
 experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
 category
 0_gemm         0.665     0.365       0.549       1.822
 1_f8_overhead  0.000     0.242         inf       0.000
 2_other        0.395     0.273       0.690       1.448
 All            1.060     0.879       0.830       1.205

```
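
For the numerics TODO above, one possible sanity check (a sketch under my own assumptions, not a test from this repo) compares the float8 round-trip error of the static, bound-based scale against a dynamic amax-based scale on a bounded activation:

```python
import torch

FP8_DTYPE = torch.float8_e4m3fn
FP8_MAX = torch.finfo(FP8_DTYPE).max

def fp8_round_trip(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # quantize to float8 with the given scale, then dequantize
    return ((x * scale).clamp(-FP8_MAX, FP8_MAX).to(FP8_DTYPE)).float() / scale

x = torch.sigmoid(torch.randn(1024, 1024))
static_err = (fp8_round_trip(x, torch.tensor(FP8_MAX / 1.0)) - x).abs().max()
dynamic_err = (fp8_round_trip(x, FP8_MAX / x.abs().max()) - x).abs().max()
# with a tight theoretical bound, static scaling should stay close to dynamic
assert static_err <= 2 * dynamic_err
```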


[ghstack-poisoned]
vkuzo added a commit that referenced this pull request Jul 8, 2024
ghstack-source-id: 59402fc
Pull Request resolved: #306
@vkuzo vkuzo mentioned this pull request Jul 8, 2024
vkuzo added a commit that referenced this pull request Jul 8, 2024
ghstack-source-id: fb41cc2
Pull Request resolved: #306
vkuzo added a commit that referenced this pull request Jul 8, 2024
ghstack-source-id: fc2f8e3
Pull Request resolved: #306
vkuzo added a commit that referenced this pull request Jul 8, 2024
ghstack-source-id: b8fd4e9
Pull Request resolved: #306
@vkuzo vkuzo changed the title from "[wip] static scaling support for training" to "static scaling support for training" Jul 8, 2024
@vkuzo (Contributor, Author) commented Jul 8, 2024

keeping this in the back pocket until it's needed, abandon for now

@vkuzo vkuzo closed this Jul 8, 2024