Releases: xlite-dev/LeetCUDA
v3.0.11
What's Changed
- feat: add CuTe hgemv implementation by @kitecats in #331
- Update README.md by @DefTruth in #333
- feat: add a CuTe bank-conflict-free vectorized mat transpose implementation by @kitecats in #334
- bugfix: fix layernorm & rmsnorm f16 overflow by @hebangwen in #335
- Bugfix: fix a compilation error by @lixiaoquan in #336
New Contributors
- @hebangwen made their first contribution in #335
- @lixiaoquan made their first contribution in #336
Full Changelog: v3.0.10...v3.0.11
v3.0.10
What's Changed
- Update README.md by @DefTruth in #322
- Update README.md by @DefTruth in #323
- Fix: missing source by @botbw in #325
- Use 128-bit data loading by @kitecats in #326
- Create FUNDING.yml by @DefTruth in #327
- Add open-collective badge by @DefTruth in #328
- Update open-collective contributors badge by @DefTruth in #329
Full Changelog: v3.0.9...v3.0.10
v3.0.9
What's Changed
- feat: add some torch.distributed examples by @DefTruth in #313
- feat: add some torch.distributed examples by @DefTruth in #315
- feat: add a naive CuTe flash-attn by @botbw in #314
- fix(kernels): correct typos in LayerNorm kernel at lines 73, 110, 346, 443 by @nxdxml in #317
- misc: manually update submodules by @DefTruth in #318
- chore: add naive cute flash-attn index by @DefTruth in #319
- add triton merge_attn_states zhihu blog by @DefTruth in #320
Full Changelog: v3.0.8...v3.0.9
v3.0.8
LeetCUDA v3.0.7
What's Changed
- Update mat-transpose/README.md by @DefTruth in #300
- feat: add triton fused-softmax by @DefTruth in #301
- misc: add pre-commit & format by @DefTruth in #302
- misc: add developer guide by @DefTruth in #303
- misc: add developer guide by @DefTruth in #304
- misc: fix typo by @DefTruth in #305
- Update CONTRIBUTE.md by @DefTruth in #306
- feat: update pre-commit max-length=80 by @DefTruth in #307
Full Changelog: v3.0.6...v3.0.7
LeetCUDA v3.0.6
What's Changed
- misc: update merge_attn_states unit tests by @DefTruth in #281
- misc: update merge_attn_states docs by @DefTruth in #282
- misc: update merge_attn_states docs by @DefTruth in #283
- feat: remove merge_attn_states kernel help func by @DefTruth in #284
- misc: remove static flag for to/from_float by @DefTruth in #285
- misc: add new zhihu tech blog link by @DefTruth in #287
- misc: add debug flag for ncu profile by @DefTruth in #288
- bugfix: corrected theta calculation in RoPE CUDA kernel by @jiaau in #290
- docs: Add my ring-attention zhihu blog by @DefTruth in #291
- Add simple CuTe mat-transpose implementations by @botbw in #292
- Update README.md by @DefTruth in #296
- Update README.md by @DefTruth in #297
- Update README.md by @DefTruth in #298
- Rename to LeetCUDA by @DefTruth in #299
Full Changelog: v3.0.5...v3.0.6
v3.0.5
What's Changed
- [Misc] Automated submodule update by @DefTruth in #261
- Update README.md by @tpoisonooo in #264
- Update README.md by @DefTruth in #265
- bugfix: only export per token softmax kernels by @DefTruth in #266
- misc: update vllm latest slides by @DefTruth in #267
- feat: add triton vector_add kernel by @DefTruth in #268
- feat: add triton merge_attn_states kernel by @DefTruth in #269
- feat: add cuda merge_attn_states kernel by @DefTruth in #270
- feat: update cuda merge_attn_states kernel by @DefTruth in #271
- misc: dispatch CUDA merge_attn_states by @DefTruth in #273
- misc: add triton kernel index by @DefTruth in #274
- Fix grid initialization mistake in 2D mat transpose by @bear-zd in #275
- misc: update cuda merge_attn_states kernel by @DefTruth in #276
- kernel: optimize merge_attn_states CUDA kernel dispatch by @DefTruth in #278
- feat: optimize merge_attn_states thread block dispatch by @DefTruth in #279
New Contributors
- @tpoisonooo made their first contribution in #264
Full Changelog: v3.0.4...v3.0.5
v3.0.4
What's Changed
- [Docs] Add vLLM + DeepSeek-R1 671B deploy blog by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/259
Full Changelog: DefTruth/CUDA-Learn-Notes@v3.0.3...v3.0.4
v3.0.3
What's Changed
- [Misc] Automated submodule update by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/257
Full Changelog: DefTruth/CUDA-Learn-Notes@v3.0.2...v3.0.3
v3.0.2
What's Changed
- Fix typo in block_all_reduce.cu by @wplf in https://github.com/DefTruth/CUDA-Learn-Notes/pull/247
- fix "enougth" typo by @wplf in https://github.com/DefTruth/CUDA-Learn-Notes/pull/248
- [FFPA] Add FFPA tech zhihu blog by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/252
- [FFPA] Update FFPA(Split-D) blog title by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/253
- [Misc] Automated submodule update by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/254
New Contributors
- @wplf made their first contribution in https://github.com/DefTruth/CUDA-Learn-Notes/pull/247
Full Changelog: DefTruth/CUDA-Learn-Notes@v3.0.1...v3.0.2