HGEMM Up to 113 TFLOPS
What's Changed
- [Mat][Trans] Add f32/f32x4 row/col first kernel by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/89
- [Docs][Contribute] Add How to contribute Notes by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/90
- [HGEMM] optimize SMEM padding, up to 113 TFLOPS by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/92 (see the padding sketch below this list)
- [Mat][Trans] Add f32x4_shared/bcf row/col first kernel. by @bear-zd in https://github.com/DefTruth/CUDA-Learn-Notes/pull/91
- [Docs] rename mat_transpose -> mat-transpose by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/93
- [HGEMM] Add GeForce RTX 3080 Laptop benchmark by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/94
- [HGEMM] update HGEMM benchmark option by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/95
- [HGEMM] Refactor HGEMM WMMA 161616 kernels by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/96 (a minimal WMMA sketch appears at the end of these notes)
- [HGEMM] Update HGEMM WMMA Benchmark by @DefTruth in https://github.com/DefTruth/CUDA-Learn-Notes/pull/97
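
For readers curious what the SMEM padding change in #92 is about, below is a minimal CUDA sketch of the general technique: padding each shared-memory tile row so that strided accesses from a warp land in different banks. The kernel name, tile sizes (BM/BK), and PAD value here are illustrative assumptions, not the repository's actual kernel.

```cuda
// Minimal sketch of shared-memory (SMEM) padding to reduce bank conflicts.
// Hypothetical example: BM/BK/PAD and the kernel name are assumptions,
// not taken from the CUDA-Learn-Notes HGEMM implementation.
#include <cuda_fp16.h>

#define BM 128   // tile rows of A held in SMEM (assumed)
#define BK 16    // tile depth along the K dimension (assumed)
#define PAD 8    // extra half elements per row to stagger bank accesses

__global__ void hgemm_tile_load_sketch(const half* __restrict__ A, int K) {
    // Without +PAD, rows start at the same 32-bit bank offsets, so
    // column-wise reads by a warp can serialize; the pad shifts each
    // row's starting bank at the cost of a little unused SMEM.
    __shared__ half s_a[BM][BK + PAD];

    int row = threadIdx.y;        // which tile row this thread loads
    int col = threadIdx.x * 8;    // 8 halves = one 128-bit vector load

    // 128-bit vectorized load from global memory into the padded tile.
    *reinterpret_cast<float4*>(&s_a[row][col]) =
        *reinterpret_cast<const float4*>(&A[row * K + col]);
    __syncthreads();
    // ... the compute stage would consume s_a here ...
}
```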
Full Changelog: DefTruth/CUDA-Learn-Notes@v2.4.12...v2.4.13
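
For context on the WMMA refactor in #96, here is a minimal sketch of the m16n16k16 Tensor Core primitive those kernels build on: one warp computing a single 16x16 output tile. Matrix shapes, leading dimensions, and the kernel name are assumptions for illustration, not the repository's kernels.

```cuda
// Minimal WMMA m16n16k16 usage sketch (requires sm_70+). Hypothetical
// example, not the CUDA-Learn-Notes HGEMM WMMA implementation.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes one 16x16 tile of C = A * B (half in, half out).
__global__ void wmma_16x16x16_sketch(const half* A, const half* B, half* C,
                                     int N, int K) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, half> c_frag;
    wmma::fill_fragment(c_frag, __float2half(0.0f));

    // Walk the K dimension 16 elements at a time; each step issues one
    // Tensor Core MMA for the whole warp.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + k, K);       // lda = K
        wmma::load_matrix_sync(b_frag, B + k * N, N);   // ldb = N
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C, c_frag, N, wmma::mem_row_major);
}
```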