CUDA: faster large batch FA without tensor cores #7314
Conversation
This PR should provide a good speedup for the P100 but unfortunately I don't own one with which I could test the code. I would appreciate it if a P100 owner could post the output of
with the path to an actual model. |
I have 4 M40s if that will help. If this works I may just drop the money for 4x P100s. |
Here ya' go! I added -ts 1 to restrict it to one P100. I can redo the test without it if you like - I have 5 available. I tried to use a similar model to yours. Command: ./llama-bench --model ../../mod/gguf/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -ts 1 Output:
You seem to be getting dramatically faster results with your P40 than my P100, which has me curious. |
Using Dell PowerEdge R730 with Dual Intel Xeon E5-2697 V3 2.6 GHz 14 Core
|
@dirkson @richginsberg thank you. |
Seeing about +5% on the P100; it doesn't matter if it's 1 or 2 GPUs. However I'm getting very different P40 results from what you've posted above - did you run the test with 4xP40? I don't have 4, I only have 2. With 1xP40 I observe a large (30%) improvement at low batch sizes, but past batch 512 it gets a tiny bit slower. With 2xP40 things really open up: the 50% performance improvement is across the board and massive.. well done 🤯 💪
Single P100:
ggml_cuda_init: found 1 CUDA devices:
Dual P100, master: FA is slower
ggml_cuda_init: found 2 CUDA devices:
Dual P100, this branch: FA is 5% faster!
ggml_cuda_init: found 2 CUDA devices:
Single P40 (faster up to 256 only):
ggml_cuda_init: found 1 CUDA devices:
Dual P40, master: FA slower past ctx 256
ggml_cuda_init: found 2 CUDA devices:
Dual P40, this branch: 🤯 🐎
Device 0: Tesla P40, compute capability 6.1, VMM: yes
|
The numbers are for Mistral 7b q4_0 on 1x P40, running on Linux 6.6.26-1-MANJARO. Are you using Windows? |
@JohannesGaessler I am running Ubuntu 22; the numbers I posted were for llama2-7b, but switching to mistral-7b doesn't make much difference - I see the same pattern: a single P40 is slower after b=256 and doesn't hit anywhere near the speeds you're reporting:
ggml_cuda_init: found 1 CUDA devices:
For reference, here is an RTX 3060 in the same machine on the same model:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
|
Keep in mind that to increase the batch size that is submitted to the CUDA backend, you need to increase the ubatch-size alongside the batch size (see the sketch below). |
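As a concrete sketch (my own example, not from the thread; it assumes the -ub/--ubatch-size flag of llama-bench and uses a placeholder model path):

```sh
# Keep the ubatch size equal to the batch size so the full batch is actually
# submitted to the CUDA backend in a single call.
./llama-bench --model ./models/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 \
    -p 4096 -b 4096 -ub 4096
```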
Do you have ECC memory enabled? If it's disabled, that rules out one difference. Are you disabling the other GPUs via CUDA_VISIBLE_DEVICES or via -ts? |
@slaren thank you for the clarification, in this particular case it luckily does not seem to affect the conclusions:
For very large batch sizes the performance with FlashAttention decreases, but the performance seems to be optimal with a batch size of 512 anyway. |
It seems less optimal for Qwen2 32B at larger batch sizes. |
The closest AMD alternative I know to NVIDIA NSight Compute would be Radeon GPU Profiler. It's still a bit different, but may be enough to get started. On the command-line,
|
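One command-line possibility (my assumption, not necessarily the tool the commenter had in mind) is ROCm's rocprof, which can at least give per-kernel timing statistics:

```sh
# Hypothetical example: collect per-kernel stats for a FlashAttention-enabled run
# on an AMD GPU via ROCm's rocprof (model path and flags are placeholders).
rocprof --stats ./llama-bench --model ./models/llama-2-7b.Q4_0.gguf -fa 1 -n 0 -p 4096 -b 4096
```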
Another run, using an Asus ESC4000 G4 with an Intel Xeon Gold 6138 (LGA3647, 1.8 GHz, 20 cores / 40 threads):
|
ECC is disabled.
I was disabling the other GPUs via CUDA_VISIBLE_DEVICES, but I just tried via -ts and the results were the same. My P100 numbers match what others are reporting, but your P40 numbers are somehow ~4x mine. I guess we need another set of P40 benchmarks. |
@sorasoras I am not able to reproduce the performance issue with qwen 1.5 q4_0:
|
It could be something to do with "-DLLAMA_CUDA_DMMV_X=64 -DLLAMA_CUDA_MMV_Y=4". |
Yup, it should work without -DLLAMA_CUDA_DMMV_X=64 -DLLAMA_CUDA_MMV_Y=4 at compile time. |
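For reference, a sketch of what a clean rebuild without those overrides could look like, assuming a CMake CUDA build from around the time of this PR (the exact CUDA option name has changed between llama.cpp versions):

```sh
# Reconfigure from scratch without -DLLAMA_CUDA_DMMV_X / -DLLAMA_CUDA_MMV_Y
# so the default kernel parameters are used.
rm -rf build
cmake -B build -DLLAMA_CUDA=ON
cmake --build build --config Release -j
```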
Seeing great results with this PR @JohannesGaessler, thanks! Here are the numbers from a P40 that I've power limited to 130W (because it keeps the card cooler):
P40:
RTX 3060 (power limited to 150W):
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
|
@JohannesGaessler Looks like you were right and there was something power limiting the P40s in my main rig to around 70W. I've moved them to the secondary and now they're >200W during these tests. My observation from the severely power-limited rig stands: with 2xP40 the performance gains here are HUGE.
Single P40:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2xP40, split layer:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2xP40, split row:
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
Llama-3-70B-Instruct: Not as drastic but still some very welcome improvements, staying above 8 tok/sec:
CUDA_VISIBLE_DEVICES=0,1 ./llama-bench --model /disk-0/models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r 1 -fa 0,1 -b 256,512 -sm layer,row
|
I don't want to block merging this, but I will point out the obvious: there is a lot of code duplication here, and that is going to complicate maintaining this code in the future.
@JohannesGaessler This was working great after the merge, but with the new Phi-3 related commits I'm now getting a crash when FA is enabled. Current version from master that's crashing with FA:
Startup command:
Phi-3 Medium GGUF from here: https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF
Crash output:
|
There was an incorrect check for precision which is now fixed on master. However, if Phi-3, like Phi-2, uses a head size of 80, the code will still not work. |
Thanks for the quick fix @JohannesGaessler! After merging the latest changes, inference is now working well on the P40 with FA with the Phi-3 model I linked above. |
This PR adds CUDA FlashAttention kernels that do not use tensor cores and are optimized for large batch sizes. On my P40 enabling FlashAttention is now consistently faster:
On my RX 6800 these new kernels unfortunately perform quite poorly, which is why I'm not enabling them for AMD. I don't know what the issue is, and I cannot use NVIDIA NSight Compute to find out either. To my knowledge there is simply no equivalent AMD tool; if it turns out that I am just ignorant, I would love for someone to correct me.
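For anyone who wants to check the effect on their own Pascal card, a benchmark invocation along the lines of the ones used in this thread (the model path is a placeholder) should show the FA on/off difference directly:

```sh
# Compare FlashAttention off vs. on across batch sizes, restricted to a single GPU.
./llama-bench --model ./models/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 \
    -p 4096 -b 256,512,1024,2048,4096 -ts 1
```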