v2.3.1 f16x8 Pack Elementwise
What's Changed
- [FA2][Half] Add FA2 f16_mma_m16n8k16 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/35
- [Refactor][7/N] CUDA Learn Notes refactor Part-7 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/36
- Clamped input range in Sigmoid kernel to prevent overflow by @Phoenix8215 in https://github.com/DefTruth/CUDA-Learn-Notes/pull/37
- [Sigmoid][F16] Add f16x8_pack kernel, boost ~1.5x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/39
- [Elementwise][Half] Support f16x8_pack kernel, boost ~1.1x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/40
- [FlashAttention] Replace FLOAT4 with LDST128BITS macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/41
- [RELU][FP16] Add f16x8_pack kernel, boost 2.1x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/42
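The f16x8_pack entries above share one pattern: each thread loads eight half values in a single 128-bit transaction, computes with half2 intrinsics, and stores the result back in one 128-bit write. A minimal sketch of that pattern for elementwise add, assuming macro names (LDST128BITS, HALF2) in the spirit of the repo's conventions rather than copied from it:

```cuda
#include <cuda_fp16.h>

// Reinterpret 8 contiguous halves (128 bits) as one float4 for a
// single vectorized load/store instruction.
#define LDST128BITS(value) (reinterpret_cast<float4 *>(&(value))[0])
// Reinterpret 2 contiguous halves as one half2 for paired arithmetic.
#define HALF2(value) (reinterpret_cast<half2 *>(&(value))[0])

// Elementwise add, 8 halves per thread: fewer load/store instructions
// and better coalescing than the scalar f16 kernel.
__global__ void elementwise_add_f16x8_pack_kernel(half *a, half *b,
                                                  half *c, int n) {
  int idx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (idx + 7 >= n) return;  // tail elements need a scalar fallback

  half pack_a[8], pack_b[8], pack_c[8];
  LDST128BITS(pack_a[0]) = LDST128BITS(a[idx]);  // one 128-bit load
  LDST128BITS(pack_b[0]) = LDST128BITS(b[idx]);

#pragma unroll
  for (int i = 0; i < 8; i += 2) {
    // __hadd2 adds two f16 values per instruction.
    HALF2(pack_c[i]) = __hadd2(HALF2(pack_a[i]), HALF2(pack_b[i]));
  }
  LDST128BITS(c[idx]) = LDST128BITS(pack_c[0]);  // one 128-bit store
}
```

The same packing applies to the RELU and Sigmoid variants above; only the per-pair intrinsic in the unrolled loop changes.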
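The Sigmoid fix in PR #37 clamps the input before exponentiation so intermediate values stay within half's finite range (max ~65504). A hedged sketch of the idea, with an illustrative clamp bound that is an assumption, not the value used in the PR:

```cuda
#include <cuda_fp16.h>

// Scalar f16 sigmoid with a clamped input. Without the clamp,
// expf(-x) for large |x| can overflow when converted to half.
// The ±11 bound is illustrative: exp(11) ≈ 59874 < 65504, and
// sigmoid is already saturated to 0 or 1 well before |x| = 11.
__global__ void sigmoid_f16_kernel(const half *x, half *y, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;

  float v = __half2float(x[idx]);
  v = fminf(fmaxf(v, -11.0f), 11.0f);          // clamp input range
  y[idx] = __float2half(1.0f / (1.0f + expf(-v)));
}
```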
New Contributors
- @Phoenix8215 made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/37
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.3...v2.3.1