Avoid fp32->fp16->fp32 conversion on cdna in ggml_cuda_op_mul_mat_cublas #11356
Conversation
I fear any further improvements will require extensive changes to mmq and the use of __builtin_amdgcn_mfma_i32_32x32x8i8
I would be happy to cooperate on changes to MMQ, but my expectation is that it will not at all be easy. The performance is very sensitive to the exact memory layout due to shared memory bank conflicts, so it may be better to just write an explicit ROCm kernel. The correct way to utilize the AMD equivalent of tensor cores via HIP would be to modify mma.cuh (unless this interface for some reason cannot be used). I am currently working on language model training and have a working CPU/CUDA prototype. If this is of interest to you, it would make sense to at some point investigate similar performance optimizations for OUT_PROD as you did for MUL_MAT in this PR. Also, one of my next goals will be to re-write the FlashAttention code to use primitives like those in mma.cuh instead of nvcuda::wmma. So if you are interested in AMD support, it may make sense to check ahead of time whether the mma.cuh interface will need to be adjusted.
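For illustration, here is a minimal sketch of how the CDNA MFMA builtin mentioned above could be wrapped behind small int8 tile primitives in the spirit of mma.cuh. This is not the actual mma.cuh interface; the tile type names, the function name, and the target-macro list are assumptions.

```cpp
// Hypothetical sketch, assuming HIP on CDNA (gfx908/gfx90a/gfx942); not the real mma.cuh API.
#include <hip/hip_runtime.h>
#include <cstdint>

// 16 accumulator registers per lane; one 64-lane wavefront holds a 32x32 int32 tile.
typedef int32_t acc16_t __attribute__((ext_vector_type(16)));

struct tile_a   { int32_t packed; };  // 4 signed int8 values of the A fragment packed into one int32
struct tile_b   { int32_t packed; };  // 4 signed int8 values of the B fragment packed into one int32
struct tile_acc { acc16_t d;       }; // per-lane slice of the 32x32 int32 accumulator tile

// D = A (32x8, int8) * B (8x32, int8) + C, issued as a single MFMA instruction per wavefront.
__device__ __forceinline__ void mma_i8(tile_acc & acc, const tile_a a, const tile_b b) {
#if defined(__gfx908__) || defined(__gfx90a__) || defined(__gfx942__)
    acc.d = __builtin_amdgcn_mfma_i32_32x32x8i8(a.packed, b.packed, acc.d, 0, 0, 0);
#else
    (void) a; (void) b; (void) acc; // stub on non-CDNA targets
#endif
}
```

Whether the existing mma.cuh tile layout maps cleanly onto these 32x32x8 fragments and the 64-wide wavefront is exactly the kind of thing that would need to be checked ahead of time.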
Yeah, I know. I think eventually we would have to effectively split the CUDA and HIP backends, but I don't expect I will ever have the bandwidth to maintain the result, so for now we need to keep things as is. Annoyingly, for best performance you would need two sets of kernels even for HIP, since the difference in performance characteristics between RDNA and CDNA/GCN GPUs is pretty big. I will try to look into mma soon from the perspective of CDNA.
Right now I only defined int8 primitives in mma.cuh.
@JohannesGaessler I fixed the nit. Btw, do you know offhand (I have not tried profiling this yet) where this discrepancy comes from:
Feels a bit large to me.
I could be misremembering, but I think the server had a comparatively small default physical batch size of 128. If you are using cuBLAS/rocBLAS GEMM, that imposes a large overhead for dequantizing the weight matrices to FP16/FP32. You can check this by comparing performance for an FP32 model.
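To make the overhead argument concrete, a back-of-the-envelope sketch (the matrix dimensions are made up for illustration; nothing here is measured): the dequantization work per GEMM call is fixed by the weight matrix size, while the useful GEMM work scales with the physical batch size, so a small batch like 128 amortizes the conversion poorly.

```cpp
// Illustrative only: hypothetical weight matrix size, no measured data.
#include <cstdio>

int main() {
    const double k = 4096.0, m = 4096.0;               // assumed weight matrix dimensions (K x M)
    for (const double n_batch : {128.0, 512.0}) {
        const double dequant_elems = k * m;            // per-call dequantization work (batch-independent)
        const double gemm_macs     = k * m * n_batch;  // useful GEMM work (scales with batch size)
        std::printf("batch %4.0f: dequant elements per GEMM MAC = 1/%.0f\n",
                    n_batch, gemm_macs / dequant_elems);
    }
    return 0;
}
```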
This PR broke FP16 GEMM; fixed by #11396.
This further improves on #10498 by removing the fp32->fp16->fp32 conversion on CDNA in ggml_cuda_op_mul_mat_cublas. Unlike what is stated in #10498, this actually does improve performance, as the issue fixed by #11244 was simply hiding the change. The issue fixed by #11244 was also hiding a pessimization introduced in #10498, which this PR also reverts.
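To illustrate the idea, here is a simplified sketch, not the actual ggml_cuda_op_mul_mat_cublas code (the function name and leading-dimension choices are assumptions): when both operands are already FP32, the GEMM can be issued directly in FP32 instead of converting the inputs to FP16, running an FP16 GEMM, and converting the result back to FP32.

```cpp
// Simplified sketch, not the real ggml code: direct FP32 GEMM on already-FP32 operands.
#include <cublas_v2.h>

// dst (m x n, column-major) = src0^T (m x k) * src1 (k x n), all FP32, with no FP16 round-trip.
static void mul_mat_f32_direct(cublasHandle_t handle,
                               const float * src0, const float * src1, float * dst,
                               int m, int n, int k) {
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                m, n, k,
                &alpha, src0, k,
                        src1, k,
                &beta,  dst,  m);
}
```

Under a ROCm build the equivalent hipBLAS/rocBLAS FP32 GEMM call would be used; the point is simply that CDNA does not need the FP16 round-trip to reach good GEMM throughput here.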
Master:
This PR + #11244: