feat(fp8): use fbgemm kernels and load fp8 weights directly #2248
Conversation
Force-pushed from 657071c to 7453b85.
@@ -302,6 +302,9 @@ def get_model(
     if quantize in ["awq", "exl2", "gptq", "marlin"]:
         # These quantizers only work with float16 params.
         dtype = torch.float16
+    elif quantize == "fp8":
+        # gemm kernels are fp8xfp8->bf16
+        dtype = torch.bfloat16
@danieldk is this compatible with the marlin kernels?
Supports both `float16` and `bfloat16`.

However, do we want to set the default to this, since most models are `float16`? Is it needed for the fbgemm quantization kernel?
Yes, it's required for fbgemm. I can add a method in `layers/fp8.py` to check whether we will use fbgemm and set the default appropriately.
Added a check.
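For reference, a minimal sketch of what such a check in `layers/fp8.py` could look like. The helper names, the `fbgemm_gpu.experimental.gen_ai` import, and the compute-capability threshold are assumptions for illustration, not necessarily what the PR actually added:

```python
# A rough sketch, not the PR's actual code: helper names and the
# device-capability threshold are assumptions.
import torch


def is_fbgemm_gpu_available() -> bool:
    # The fbgemm-gpu GenAI fp8 kernels need both the package and a recent
    # CUDA device; treat anything else as unavailable.
    try:
        import fbgemm_gpu.experimental.gen_ai  # noqa: F401
    except ImportError:
        return False
    return torch.cuda.is_available() and torch.cuda.get_device_capability() >= (8, 9)


def get_fp8_default_dtype() -> torch.dtype:
    # The fbgemm GEMM is fp8 x fp8 -> bf16, so default to bfloat16 when the
    # fbgemm path will be used; otherwise keep the usual float16 default.
    return torch.bfloat16 if is_fbgemm_gpu_available() else torch.float16
```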
            )

        if w.dtype == torch.float8_e4m3fn:
            # FIXME: here to avoid circular import
            from text_generation_server.layers.fp8 import Fp8Weight
Not happy about this.
Hmmm, this breaks the abstraction quite a bit. This class is also used by other (explicit) quantizers like eetq.

I think it would make more sense to put this implementation in something like a `HybridFP8FP16Loader` in the `fp8` module. Then we could add some logic to the `get_loader` function, along the lines of: when `quantizer==None` is set, enumerate over the tensors (should be cheap, I think, for just getting the dtypes?) and then, if an FP8 weight is encountered, return the hybrid loader.

That puts the implementation nicely with the fp8 code, and it wouldn't clutter `UnquantizedWeight` further if we e.g. also want to support bitsandbytes or eetq checkpoints in the future.
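A rough sketch of that dispatch, assuming safetensors checkpoints expose per-tensor dtypes via `get_slice(...).get_dtype()`; the loader class name follows the comment above, and `get_loader` is reduced to the relevant branch:

```python
# Illustrative only: loader names follow the review comment, not
# necessarily the merged implementation.
from safetensors import safe_open


def _checkpoint_has_fp8(filenames) -> bool:
    # Only the safetensors header is parsed here, so scanning the dtypes
    # is cheap even for large checkpoints.
    for filename in filenames:
        with safe_open(filename, framework="pt") as f:
            for name in f.keys():
                if f.get_slice(name).get_dtype() == "F8_E4M3":
                    return True
    return False


def get_loader(quantize, filenames):
    if quantize is None and _checkpoint_has_fp8(filenames):
        # Some weights are stored in fp8: load those directly and fall back
        # to the unquantized path for everything else.
        from text_generation_server.layers.fp8 import HybridFP8FP16Loader

        return HybridFP8FP16Loader()
    # ... existing dispatch for explicit quantizers and the default path ...
```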
Agreed, modified the code to add another loader.
        if self.weight_scale is None:
            return get_fp8_linear().from_unquant(self.weight, bias, self.dtype)
        return get_fp8_linear().from_fp8(
            self.weight, self.weight_scale, self.input_scale, bias, self.dtype
        )
Nice! Looks like a pattern we could reuse in the future for weights that are either pre-quantized or quantized on the fly.
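Spelled out, a condensed sketch of that pattern: a single weight container whose scale field decides, at linear-construction time, between on-the-fly quantization and the pre-quantized path. The dataclass shape is assumed for illustration; only the `get_linear` body mirrors the snippet above, and `get_fp8_linear` is the helper from `layers/fp8.py` referenced there:

```python
# Condensed sketch; the dataclass shape is assumed for illustration, only
# the get_linear body mirrors the PR snippet.
from dataclasses import dataclass
from typing import Optional

import torch

from text_generation_server.layers.fp8 import get_fp8_linear


@dataclass
class Fp8Weight:
    weight: torch.Tensor
    dtype: torch.dtype
    weight_scale: Optional[torch.Tensor] = None
    input_scale: Optional[torch.Tensor] = None

    def get_linear(self, bias: Optional[torch.Tensor]):
        if self.weight_scale is None:
            # Checkpoint stored higher-precision weights: quantize on the fly.
            return get_fp8_linear().from_unquant(self.weight, bias, self.dtype)
        # Checkpoint was already fp8: reuse the stored scales directly.
        return get_fp8_linear().from_fp8(
            self.weight, self.weight_scale, self.input_scale, bias, self.dtype
        )
```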
        input_scale = weights.get_tensor(f"{prefix}.input_scale", cast=False)
        return Fp8Weight(
            weight=w,
            weight_scale=scale,
            input_scale=input_scale,
            dtype=weights.dtype,
        )
I changed `input_scale` to `input_scale_ub`, which is less ambiguous.
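Applied to the loading snippet above, the rename would look roughly like this. The helper wrapper is hypothetical, the checkpoint key stays `input_scale` because that is what the quoted code reads, and the interpretation as an upper bound for activation quantization is an inference from the new name:

```python
# Sketch of the rename applied to the snippet above; `weights`, `prefix`,
# `w`, and `scale` stand in for the surrounding loader code being quoted,
# and the helper itself is hypothetical.
def _load_fp8_weight(weights, prefix, w, scale):
    input_scale_ub = weights.get_tensor(f"{prefix}.input_scale", cast=False)
    return Fp8Weight(
        weight=w,
        weight_scale=scale,
        # `input_scale_ub` makes explicit that this is an upper bound applied
        # when quantizing activations, not a fixed per-tensor input scale.
        input_scale_ub=input_scale_ub,
        dtype=weights.dtype,
    )
```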
Force-pushed from af89ce0 to 5789139.
Looks great!
* feat(fp8): add support for fbgemm
* allow loading fp8 weights directly
* update outlines
* fix makefile
* build fbgemm
* avoid circular import and fix dockerfile
* add default dtype
* refactored weights loader
* fix auto conversion
* fix quantization config parsing
* force new nccl on install
* missing get_weights implementation
* increase timeout
@danieldk, since you were the one who reworked the weights logic, do you think there is a better way to plug the new fp8 weights into Transformers?