Releases · xlite-dev/LeetCUDA
v2.4.6 HGEMM Copy Async
What's Changed
- [Softmax] Add online softmax according to the NVIDIA paper (see the sketch below) by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/60
- [HGEMM][Async] support K16/32 pack+cp.async+dbuf by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/62
- [Softmax][Bugfix] fixed softmax compile error by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/63
New Contributors
- @bear-zd made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/60
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.5...v2.4.6
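PR #60 implements online softmax from the NVIDIA paper, whose point is that the row maximum and the normalizing sum can be built in a single pass by rescaling the running denominator whenever a larger element appears. A rough illustration of that recurrence (one thread per row, f32; names and launch shape are invented for the sketch, this is not the repo's kernel):

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Hypothetical one-thread-per-row illustration of the online softmax
// recurrence: keep a running max m and running denominator d, rescaling d by
// exp(m - m_new) whenever a larger element appears, so max and sum are
// obtained in a single pass over the row.
__global__ void online_softmax_rowwise(const float* __restrict__ x,
                                       float* __restrict__ y,
                                       int rows, int cols) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= rows) return;
  const float* xr = x + row * cols;

  float m = -INFINITY, d = 0.0f;
  for (int j = 0; j < cols; ++j) {          // single pass builds (m, d)
    float v = xr[j];
    float m_new = fmaxf(m, v);
    d = d * expf(m - m_new) + expf(v - m_new);
    m = m_new;
  }
  for (int j = 0; j < cols; ++j)            // normalize with the final (m, d)
    y[row * cols + j] = expf(xr[j] - m) / d;
}
```

PR #62's cp.async + double-buffer HGEMM is the asynchronous variant of the staging pattern sketched under v2.4.5 below.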
v2.4.5 HGEMM Double Buffers
What's Changed
- [FlashAttention] Refactor FlashAttention PyTorch bindings by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/55
- [SGEMM] test bank-conflict-free smem access with offset by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/56
- [HGEMM] HGEMM kernel with double buffers (see the sketch below) by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/57
- [Docs] Add docs for HGEMM/SGEMM double buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/58
- [HGEMM] Add PyTorch HGEMM profile by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/59
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.4...v2.4.5
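PRs #56-#58 revolve around two ideas: offsetting/padding the shared-memory tile so consecutive rows land in different banks, and keeping two tile buffers so the next K-slice is loaded while the current one is consumed. A minimal sketch of both as a plain f32 tiled GEMM (tile size and names are invented; assumes M, N, K are multiples of TILE and a `dim3 block(TILE, TILE)`, `dim3 grid(N/TILE, M/TILE)` launch; this is not the repo's HGEMM kernel):

```cuda
#include <cuda_runtime.h>

#define TILE 16   // illustrative tile size, not the repo's

// Simplified double-buffered tiled SGEMM (C = A * B, row-major). Two shared
// memory buffers ping-pong: the loads for step t+1 are issued before the math
// on step t, so the two can overlap.
__global__ void sgemm_tiled_dbuf(const float* __restrict__ A,
                                 const float* __restrict__ B,
                                 float* __restrict__ C,
                                 int M, int N, int K) {
  // +1 column of padding shifts successive rows onto different banks
  // (the shared-memory offset trick benchmarked in PR #56).
  __shared__ float sA[2][TILE][TILE + 1];
  __shared__ float sB[2][TILE][TILE + 1];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;

  int buf = 0;
  // Preload the first K-tile into buffer 0.
  sA[buf][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
  sB[buf][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
  __syncthreads();

  for (int t = 1; t < K / TILE; ++t) {
    int nxt = buf ^ 1;
    // Issue loads for the next tile into the spare buffer.
    sA[nxt][threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
    sB[nxt][threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
    // Consume the current buffer.
    #pragma unroll
    for (int k = 0; k < TILE; ++k)
      acc += sA[buf][threadIdx.y][k] * sB[buf][k][threadIdx.x];
    __syncthreads();     // next buffer is now fully written and visible
    buf = nxt;
  }
  // Tail: consume the last loaded buffer.
  #pragma unroll
  for (int k = 0; k < TILE; ++k)
    acc += sA[buf][threadIdx.y][k] * sB[buf][k][threadIdx.x];

  C[row * N + col] = acc;
}
```

On sm_80+, the two global loads inside the loop are the natural candidates for cp.async (v2.4.6), which moves data global→shared without staging it in registers first.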
v2.4.4 Pack HGEMM
What's Changed
- [SGEMM] Add naive sgemm kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/51
- [SGEMM] bank-conflict-free & double buffers by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/52
- [Misc][Benchmark] optimize benchmarks by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/53
- [HGEMM] Pack sliced_k f16x4/fp16x8 HGEMM by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/54
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.3...v2.4.4
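PR #51 adds a naive SGEMM as the baseline the packed and sliced-K variants are measured against. For reference, the naive version is just one thread per output element with the K-loop reading straight from global memory (a sketch, not the repo's code):

```cuda
#include <cuda_runtime.h>

// Naive SGEMM baseline: one thread per C element, no tiling, no shared memory.
// Each A element is re-read by every thread computing the same row, and each
// B element by every thread computing the same column -- exactly the traffic
// the tiled/double-buffered versions avoid.
__global__ void sgemm_naive(const float* __restrict__ A,
                            const float* __restrict__ B,
                            float* __restrict__ C,
                            int M, int N, int K) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= M || col >= N) return;
  float acc = 0.0f;
  for (int k = 0; k < K; ++k)
    acc += A[row * K + k] * B[k * N + col];
  C[row * N + col] = acc;
}
```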
v2.4.3 Pack Softmax
What's Changed
- [LayerNorm][FP16] support fp16x8_pack_f32 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/48
- [Softmax][FP16] Pack f16x8 softmax kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/49
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.2...v2.4.3
v2.4.2 Pack RMSNorm
What's Changed
- [RMSNorm][FP16] Pack f16x8 rmsnorm by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/47
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.1...v2.4.2
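As a reminder of what the packed kernel computes: RMSNorm needs only a per-row mean of squares, y = x * g / sqrt(mean(x^2) + eps), with no mean subtraction as in LayerNorm. A plain f32, one-block-per-row sketch with a shared-memory reduction (block size and names are assumptions; the release's kernel is the f16x8-packed variant of this):

```cuda
#include <cuda_runtime.h>
#include <math.h>

#define NUM_THREADS 256   // illustrative block size

// One block per row: y = x / sqrt(mean(x^2) + eps) * g
__global__ void rmsnorm_f32(const float* __restrict__ x,
                            const float* __restrict__ g,
                            float* __restrict__ y,
                            int cols, float eps) {
  __shared__ float s_sum[NUM_THREADS];
  const float* xr = x + blockIdx.x * cols;
  float* yr = y + blockIdx.x * cols;

  // Each thread accumulates a partial sum of squares over a strided slice.
  float part = 0.0f;
  for (int j = threadIdx.x; j < cols; j += NUM_THREADS)
    part += xr[j] * xr[j];
  s_sum[threadIdx.x] = part;
  __syncthreads();

  // Tree reduction in shared memory.
  for (int stride = NUM_THREADS / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) s_sum[threadIdx.x] += s_sum[threadIdx.x + stride];
    __syncthreads();
  }
  float rms_inv = rsqrtf(s_sum[0] / cols + eps);

  for (int j = threadIdx.x; j < cols; j += NUM_THREADS)
    yr[j] = xr[j] * rms_inv * g[j];
}
```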
v2.4.1 Pack LayerNorm
What's Changed
- [Nsight] Add nsys/ncu usage, ptx/sass by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/44
- [DotProd][FP16] support f16x8_pack kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/45
- [LayerNorm][FP16] Add pack support for f16x8 LD/ST by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/46
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4...v2.4.1
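The f16x8 "pack" idea behind PR #45 (and the other pack kernels in this series) is to move eight halfs per thread with a single 128-bit load and then operate on them as half2 pairs. A hedged sketch of a packed dot product, accumulating in f32 and reducing across the warp with shuffles (assumes 16-byte-aligned inputs, n a multiple of 8, and a single-warp block; none of this is taken from the repo):

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Illustrative f16x8 packed dot product: each thread loads 8 halfs of a and b
// with one 128-bit access (viewed as float4), multiplies them as half2 pairs
// converted to f32, then the warp combines partial sums with shuffles.
__global__ void dot_f16x8_pack(const half* __restrict__ a,
                               const half* __restrict__ b,
                               float* __restrict__ out, int n) {
  float acc = 0.0f;
  for (int i = threadIdx.x * 8; i < n; i += blockDim.x * 8) {
    float4 pa = *reinterpret_cast<const float4*>(a + i);   // 8 halfs of a
    float4 pb = *reinterpret_cast<const float4*>(b + i);   // 8 halfs of b
    const half2* ha = reinterpret_cast<const half2*>(&pa);
    const half2* hb = reinterpret_cast<const half2*>(&pb);
    #pragma unroll
    for (int k = 0; k < 4; ++k) {
      float2 fa = __half22float2(ha[k]);
      float2 fb = __half22float2(hb[k]);
      acc += fa.x * fb.x + fa.y * fb.y;
    }
  }
  // Warp-level sum via shuffle (single-warp block assumed).
  for (int offset = 16; offset > 0; offset >>= 1)
    acc += __shfl_xor_sync(0xffffffff, acc, offset);
  if (threadIdx.x == 0) *out = acc;
}
```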
v2.4 Pack Reduce LDST
What's Changed
- [Reduce][Kernel] Pack f16/bf16x8 & fp8/i8x16 LD/ST by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/43
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.3.1...v2.4
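PR #43 generalizes the packed load/store path: one 16-byte (128-bit) access covers 8 f16/bf16 values or 16 fp8/int8 values. The macro below mirrors the LDST128BITS naming mentioned in v2.3.1 but is rewritten here as a sketch; the tail for n not divisible by 8 is left to a scalar path and omitted, and 16-byte alignment of src/dst is assumed:

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Sketch of a 128-bit packed load/store: viewing a 16-byte-aligned group of
// elements as a float4 turns 8 half (or bf16) accesses -- or 16 fp8/int8
// accesses -- into a single LDG.128 / STG.128.
#define LDST128BITS(value) (reinterpret_cast<float4*>(&(value))[0])

__global__ void copy_f16x8_pack(half* __restrict__ src,
                                half* __restrict__ dst, int n) {
  int idx = (blockIdx.x * blockDim.x + threadIdx.x) * 8;  // 8 halfs per thread
  if (idx + 7 < n)
    LDST128BITS(dst[idx]) = LDST128BITS(src[idx]);
}
```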
v2.3.1 f16x8 Pack Elementwise
What's Changed
- [FA2][Half] Add FA2 f16_mma_m16n8k16 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/35
- [Refactor][7/N] CUDA Learn Notes refactor Part-7 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/36
- Clamped input range in Sigmoid kernel to prevent overflow by @Phoenix8215 in https://github.com/DefTruth/CUDA-Learn-Notes/pull/37
- [Sigmoid][F16] Add f16x8_pack kernel, boost ~1.5x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/39
- [Elementwise][Half] support f16x8_pack kernel, boost 1.1x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/40
- [FlashAttention] replace FLOAT4 with LDST128BITS macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/41
- [RELU][FP16] Add f16x8_pack kernel, boost 2.1x by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/42
New Contributors
- @Phoenix8215 made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/37
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.3...v2.3.1
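On the PR #37 fix: in half precision, exp(-x) exceeds the fp16 range (max 65504, roughly e^11.09) once -x grows past about 11, so clamping the input before the exponential keeps the sigmoid finite. A small illustration, with an assumed clamp bound that is not necessarily the repo's value:

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Illustrative half-precision sigmoid with input clamping: the bound below is
// an assumption for the sketch, chosen so hexp never leaves the fp16 range.
#define SIGMOID_MAX_EXP 10.0f

__global__ void sigmoid_f16_clamped(const half* __restrict__ x,
                                    half* __restrict__ y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;
  float vf = __half2float(x[i]);
  vf = fminf(fmaxf(vf, -SIGMOID_MAX_EXP), SIGMOID_MAX_EXP);  // clamp before exp
  half v = __float2half(vf);
  const half one = __float2half(1.0f);
  y[i] = __hdiv(one, __hadd(one, hexp(__hneg(v))));          // 1 / (1 + e^-v)
}
```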
v2.3 Refactor 6/N
What's Changed
- [Refactor][6/N] CUDA Learn Notes refactor Part-6 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/17
- [Refactor][5/N] CUDA Learn Notes refactor Part-6 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/18
- [LayerNorm][Half] support fp16x8 packed LayerNorm by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/19
- [Reduce][Half] add HALF2 & BFLOAT2 macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/21
- [RMSNorm][Half] support fp16x8 packed RMSNorm by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/22
- [Bugfix][Kernel] fixed block count calculation errors in some kernels by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/23
- [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/24
- [Elementwise][Half] support fp16x8 packed Elementwise by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/25
- [RELU][Half] support fp16x8 RELU kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/26
- [RMSNorm] support f16x8_f32 RMSNorm by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/28
- [RMSNorm][Kernel] Add FLOAT2/HALF2_VARIANCE macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/29
- [LayerNorm][Kernel] Add HALF2 SUM/SUB/VAR macro by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/30
- [HGEMM] Add sliced_k & t_8x8_sliced_k_f16x4 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/31
- [HGEMV][Half] support hgemv k32/k128/f16 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/32
- [FlashAttention] Refactor flash_attn_1_fwd_f32 kernel by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/33
- Bump up to v2.3 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/34
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.2...v2.3
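For the HGEMV kernels from PR #32, a common mapping is one warp per output row: lanes stride over K, accumulate in f32, and combine partials with the same warp shuffles as in the dot-product sketch further up. A hedged sketch (launch shape and names are assumptions, not the repo's k32/k128 variants):

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>

// Illustrative warp-per-row HGEMV (y = A * x, A is M x K, row-major):
// each warp owns one output row; accumulation is done in f32 for safety.
// Intended launch: dim3 block(32, WARPS_PER_BLOCK), grid covering M rows.
__global__ void hgemv_warp_per_row(const half* __restrict__ A,
                                   const half* __restrict__ x,
                                   half* __restrict__ y, int M, int K) {
  int row  = blockIdx.x * blockDim.y + threadIdx.y;  // one warp per row
  int lane = threadIdx.x;                            // 0..31
  if (row >= M) return;

  float acc = 0.0f;
  for (int k = lane; k < K; k += 32)
    acc += __half2float(A[row * K + k]) * __half2float(x[k]);

  for (int offset = 16; offset > 0; offset >>= 1)     // warp reduction
    acc += __shfl_xor_sync(0xffffffff, acc, offset);

  if (lane == 0) y[row] = __float2half(acc);
}
```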
v2.2 Refactor 5/N
What's Changed
- [Refactor][5/N] CUDA Learn Notes refactor Part-5 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/15
- Bump up to v2.2 by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/16
Full Changelog: DefTruth/CUDA-Learn-Notes@2.1...v2.2