Conversation
Summary:

In certain cases, activations and gradients can have a bounded range. For example, consider sigmoid -> fc -> ln -> sigmoid:

1. The range of sigmoid in the forward is bounded, so we can scale statically if we are ok with a slight accuracy drop in the case that the observed values do not reach the theoretical bound.
2. The range of the derivative of sigmoid is bounded (https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x).
3. The derivative of LN (https://liorsinai.github.io/mathematics/2022/05/18/layernorm.html) depends on the incoming gradient and the trainable LN parameters, so we can derive a bound based on the incoming bound and the max of the LN parameters.

This PR adds static scaling as an option for x, w, and dL_dY, and a quick benchmark to verify performance is as we expect. TODO: add numerics testing.

Test Plan:

```
// baseline
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid
...
experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm         0.160     0.098       0.613       1.632
1_f8_overhead  0.000     0.100         inf       0.000
2_other        0.147     0.121       0.823       1.215
All            0.307     0.319       1.040       0.962

// static scaling for x (easier to justify numerics given a bounded activation such as sigmoid)
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static

experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm         0.665     0.362       0.545       1.834
1_f8_overhead  0.000     0.269         inf       0.000
2_other        0.396     0.273       0.689       1.452
All            1.061     0.904       0.853       1.173

// static scaling for x and dL_dY (handwaving for now, the actual code would
// need to read the LN params to get the max)
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static --scaling_type_dL_dY static
...
experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm         0.665     0.365       0.549       1.822
1_f8_overhead  0.000     0.242         inf       0.000
2_other        0.395     0.273       0.690       1.448
All            1.060     0.879       0.830       1.205
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
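As a rough sketch of point 1 above (this is not code from this PR; the helper name and tensor shapes are made up for illustration), a static float8 scale for a sigmoid output can be precomputed from the theoretical bound instead of from an observed amax:

```python
import torch

def static_scale_from_bound(bound: float, float8_dtype=torch.float8_e4m3fn) -> torch.Tensor:
    # map the largest representable float8 magnitude onto the known activation bound
    return torch.tensor(torch.finfo(float8_dtype).max / bound)

# sigmoid outputs lie in (0, 1), so abs(x) is bounded by 1.0
x = torch.sigmoid(torch.randn(16, 32))
scale = static_scale_from_bound(1.0)
x_f8 = (x * scale).to(torch.float8_e4m3fn)   # cast without computing amax(x) at runtime
x_hp = x_f8.to(torch.float32) / scale        # dequantized reference copy
```

If the observed values stay well below the theoretical bound, this trades some quantization accuracy for skipping the per-iteration amax reduction, which is the accuracy caveat mentioned in point 1.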
vkuzo added a commit that referenced this pull request on Jul 5, 2024
Commit message: same as the original PR description above.

ghstack-source-id: 816f87e
Pull Request resolved: #306
vkuzo added a commit that referenced this pull request on Jul 5, 2024
Commit message: same as the original PR description above.

ghstack-source-id: 538c24e
Pull Request resolved: #306
Summary:

In certain cases, activations and gradients can have a bounded range.

Example 1: consider sigmoid -> fc -> ln -> sigmoid:

1. The range of sigmoid in the forward is bounded, so we can scale statically if we are ok with a slight accuracy drop in the case that the observed values do not reach the theoretical bound.
2. The range of the derivative of sigmoid is bounded (https://math.stackexchange.com/questions/78575/derivative-of-sigmoid-function-sigma-x-frac11e-x).
3. The derivative of LN (https://liorsinai.github.io/mathematics/2022/05/18/layernorm.html) depends on the incoming gradient and the trainable LN parameters, so we can derive a bound based on the incoming bound and the max of the LN parameters.

Example 2: for activations such as silu or GELU, the derivative has a bounded range.

This PR adds static scaling as an option for x, w, and dL_dY, and a quick benchmark to verify performance is as we expect. TODO: add numerics testing.

Test Plan:

```
// baseline
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid
...
experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm         0.664     0.358       0.539       1.857
1_f8_overhead  0.000     0.260         inf       0.000
2_other        0.397     0.318       0.802       1.247
All            1.061     0.935       0.882       1.134

// static scaling for x (easier to justify numerics given a bounded activation such as sigmoid)
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static

experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm         0.665     0.362       0.545       1.834
1_f8_overhead  0.000     0.269         inf       0.000
2_other        0.396     0.273       0.689       1.452
All            1.061     0.904       0.853       1.173

// static scaling for x and dL_dY (handwaving for now, the actual code would
// need to read the LN params to get the max)
python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear_ln_sigmoid --scaling_type_x static --scaling_type_dL_dY static
...
experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm         0.665     0.365       0.549       1.822
1_f8_overhead  0.000     0.242         inf       0.000
2_other        0.395     0.273       0.690       1.448
All            1.060     0.879       0.830       1.205
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
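The bounded-derivative claims in point 2 and Example 2 are easy to sanity-check numerically; the snippet below is an illustration only (not part of this PR) and simply probes the gradients of sigmoid, silu, and GELU over a wide input range:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-20.0, 20.0, steps=100001, requires_grad=True)
for name, fn in [("sigmoid", torch.sigmoid), ("silu", F.silu), ("gelu", F.gelu)]:
    (grad,) = torch.autograd.grad(fn(x).sum(), x)
    # prints roughly 0.25 for sigmoid and values slightly above 1.0 for silu/gelu
    print(name, grad.abs().max().item())
```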
vkuzo added a commit that referenced this pull request on Jul 8, 2024
Commit message: same as the original PR description above.

ghstack-source-id: 59402fc
Pull Request resolved: #306
Closed
vkuzo added a commit that referenced this pull request on Jul 8, 2024
Commit message: same as the original PR description above.

ghstack-source-id: fb41cc2
Pull Request resolved: #306
vkuzo added a commit that referenced this pull request on Jul 8, 2024
Commit message: same as the original PR description above.

ghstack-source-id: fc2f8e3
Pull Request resolved: #306
Summary:

Some activations such as sigmoid can have a bounded range. This PR adds support for setting a bounded range in training.

Test Plan:

```
// unit tests
pytest test/test_base.py

// baseline
> python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear
...
experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm         0.637     0.353       0.555       1.803
1_f8_overhead  0.000     0.175         inf       0.000
2_other        0.224     0.199       0.888       1.126
All            0.861     0.727       0.844       1.184

> python benchmarks/profile_linear_float8.py ~/local/tmp/test --model_type sigmoid_linear --scaling_type_x static
...
experiment     0_ref  1_float8  f8_div_ref  ref_div_f8
category
0_gemm         0.635     0.360       0.566       1.766
1_f8_overhead  0.000     0.182         inf       0.000
2_other        0.224     0.154       0.688       1.454
All            0.859     0.696       0.810       1.234
```

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
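For intuition about what the sigmoid_linear benchmark is exercising, here is a hypothetical toy module (not the library's actual float8 linear implementation; it only simulates the float8 round trip in plain PyTorch) where the scale for the linear's input is a fixed buffer derived from the sigmoid bound instead of being recomputed from amax on every step:

```python
import torch
import torch.nn as nn

class SigmoidLinearStaticX(nn.Module):
    """Toy sigmoid -> linear block with a statically scaled (simulated) float8 input."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # static scale from the theoretical bound abs(sigmoid(x)) <= 1.0
        f8_max = torch.finfo(torch.float8_e4m3fn).max
        self.register_buffer("x_scale", torch.tensor(f8_max / 1.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = torch.sigmoid(x)
        # simulate the float8 cast of x; w and dL_dY would be handled analogously
        x_f8 = (x * self.x_scale).to(torch.float8_e4m3fn)
        x_hp = x_f8.to(x.dtype) / self.x_scale
        return self.linear(x_hp)

m = SigmoidLinearStaticX(32, 16)
print(m(torch.randn(4, 32)).shape)  # torch.Size([4, 16])
```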
vkuzo added a commit that referenced this pull request on Jul 8, 2024
Commit message: same as the original PR description above.

ghstack-source-id: b8fd4e9
Pull Request resolved: #306
keeping this in the back pocket until it's needed, abandoning for now
Labels
CLA Signed
Stack from ghstack (oldest at bottom):
Summary:
Some activations such as sigmoid can have a bounded range. This PR adds support for setting a bounded range in training.
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags: