v2.3.1 f16x8 Pack Elementwise
What's Changed
- [FA2][Half] Add FA2 f16_mma_m16n8k16 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/35
- [Refactor][7/N] CUDA Learn Notes refactor Part-7 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/36
- Clamped input range in Sigmoid kernel to prevent overflow by @Phoenix8215 in https://github.com/DefTruth/CUDA-Learn-Notes/pull/37
- [Sigmoid][F16] Add f16x8_pack kernel, boost ~1.5x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/39
- [Elementwise][Half] Support f16x8_pack kernel, boost ~1.1x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/40
- [FlashAttention] Replace FLOAT4 with LDST128BITS macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/41
- [RELU][FP16] Add f16x8_pack kernel, boost 2.1x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/42
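The f16x8_pack entries above share one pattern: each thread loads eight half values in a single 128-bit transaction, computes with half2 intrinsics, and stores the result back in one 128-bit write. A minimal sketch of that pattern for elementwise add, assuming macro names (LDST128BITS, HALF2) in the spirit of the repo's conventions rather than copied from it:

```cuda
#include <cuda_fp16.h>

// Reinterpret 8 contiguous halves (128 bits) as one float4 for a
// single vectorized load/store instruction.
#define LDST128BITS(value) (reinterpret_cast<float4 *>(&(value))[0])
// Reinterpret 2 contiguous halves as one half2 for paired arithmetic.
#define HALF2(value) (reinterpret_cast<half2 *>(&(value))[0])

// Elementwise add, 8 halves per thread: fewer load/store instructions
// and better coalescing than the scalar f16 kernel.
__global__ void elementwise_add_f16x8_pack_kernel(half *a, half *b,
                                                  half *c, int n) {
  int idx = 8 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (idx + 7 >= n) return;  // tail elements need a scalar fallback

  half pack_a[8], pack_b[8], pack_c[8];
  LDST128BITS(pack_a[0]) = LDST128BITS(a[idx]);  // one 128-bit load
  LDST128BITS(pack_b[0]) = LDST128BITS(b[idx]);

#pragma unroll
  for (int i = 0; i < 8; i += 2) {
    // __hadd2 adds two f16 values per instruction.
    HALF2(pack_c[i]) = __hadd2(HALF2(pack_a[i]), HALF2(pack_b[i]));
  }
  LDST128BITS(c[idx]) = LDST128BITS(pack_c[0]);  // one 128-bit store
}
```

The same packing applies to the RELU and Sigmoid variants above; only the per-pair intrinsic in the unrolled loop changes.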
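The Sigmoid fix in PR #37 clamps the input before exponentiation so intermediate values stay within half's finite range (max ~65504). A hedged sketch of the idea, with an illustrative clamp bound that is an assumption, not the value used in the PR:

```cuda
#include <cuda_fp16.h>

// Scalar f16 sigmoid with a clamped input. Without the clamp,
// expf(-x) for large |x| can overflow when converted to half.
// The ±11 bound is illustrative: exp(11) ≈ 59874 < 65504, and
// sigmoid is already saturated to 0 or 1 well before |x| = 11.
__global__ void sigmoid_f16_kernel(const half *x, half *y, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx >= n) return;

  float v = __half2float(x[idx]);
  v = fminf(fmaxf(v, -11.0f), 11.0f);          // clamp input range
  y[idx] = __float2half(1.0f / (1.0f + expf(-v)));
}
```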
New Contributors
- @Phoenix8215 made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/37
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.3...v2.3.1