
CUDA: faster large batch FA without tensor cores #7314


Merged
merged 1 commit on May 17, 2024

Conversation

JohannesGaessler
Collaborator

This PR adds CUDA FlashAttention kernels that do not use tensor cores and are optimized for large batch sizes. On my P40 enabling FlashAttention is now consistently faster:

model backend ngl n_batch fa test t/s
llama 7B Q4_0 CUDA 99 1 0 pp4096 40.88 ± 0.00
llama 7B Q4_0 CUDA 99 1 1 pp4096 52.58 ± 0.00
llama 7B Q4_0 CUDA 99 2 0 pp4096 43.23 ± 0.00
llama 7B Q4_0 CUDA 99 2 1 pp4096 96.79 ± 0.00
llama 7B Q4_0 CUDA 99 4 0 pp4096 72.15 ± 0.00
llama 7B Q4_0 CUDA 99 4 1 pp4096 115.52 ± 0.00
llama 7B Q4_0 CUDA 99 8 0 pp4096 86.29 ± 0.00
llama 7B Q4_0 CUDA 99 8 1 pp4096 147.49 ± 0.00
llama 7B Q4_0 CUDA 99 16 0 pp4096 113.10 ± 0.00
llama 7B Q4_0 CUDA 99 16 1 pp4096 179.72 ± 0.00
llama 7B Q4_0 CUDA 99 32 0 pp4096 215.08 ± 0.00
llama 7B Q4_0 CUDA 99 32 1 pp4096 336.38 ± 0.00
llama 7B Q4_0 CUDA 99 64 0 pp4096 394.15 ± 0.00
llama 7B Q4_0 CUDA 99 64 1 pp4096 560.28 ± 0.00
llama 7B Q4_0 CUDA 99 128 0 pp4096 537.85 ± 0.00
llama 7B Q4_0 CUDA 99 128 1 pp4096 663.37 ± 0.00
llama 7B Q4_0 CUDA 99 256 0 pp4096 661.92 ± 0.00
llama 7B Q4_0 CUDA 99 256 1 pp4096 741.85 ± 0.00
llama 7B Q4_0 CUDA 99 512 0 pp4096 724.98 ± 0.00
llama 7B Q4_0 CUDA 99 512 1 pp4096 740.43 ± 0.00
llama 7B Q4_0 CUDA 99 1024 0 pp4096 711.43 ± 0.00
llama 7B Q4_0 CUDA 99 1024 1 pp4096 729.14 ± 0.00
llama 7B Q4_0 CUDA 99 2048 0 pp4096 699.79 ± 0.00
llama 7B Q4_0 CUDA 99 2048 1 pp4096 725.82 ± 0.00
llama 7B Q4_0 CUDA 99 4096 0 pp4096 705.89 ± 0.00
llama 7B Q4_0 CUDA 99 4096 1 pp4096 724.40 ± 0.00

On my RX 6800 these new kernels unfortunately perform quite poorly, which is why I'm not enabling them for AMD. I don't know what the issue is, and I cannot use NVIDIA Nsight Compute to find out either. To my knowledge there is simply no equivalent AMD tool; if it turns out that I am just ignorant, I would love for someone to correct me.

@JohannesGaessler
Collaborator Author

JohannesGaessler commented May 15, 2024

This PR should provide a good speedup for the P100 but unfortunately I don't own one with which I could test the code. I would appreciate it if a P100 owner could post the output of

./llama-bench --model models/opt/${model_name}-${quantization}.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

with the path to an actual model.

@Tom-Neverwinter

I have 4 M40s if that will help. If this works I may just drop the money for 4x P100s.

@mofosyne added the "Nvidia GPU (issues specific to Nvidia GPUs)" and "Review Complexity: High (generally requires in-depth knowledge of LLMs or GPUs)" labels on May 16, 2024
@dirkson

dirkson commented May 16, 2024

Here ya' go!

I added -ts 1 to restrict it to one P100. I can redo the test without it if you like; I have 5 available. I tried to use a similar model to yours.

Command:

./llama-bench --model ../../mod/gguf/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -ts 1

Output:

ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 5 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 3: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 4: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
| model                          |       size |     params | backend    | ngl |    n_batch |         fa | ts           |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------ | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          0 | 1.00         |        pp4096 |     17.47 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          1 | 1.00         |        pp4096 |     18.96 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          0 | 1.00         |        pp4096 |      7.62 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          1 | 1.00         |        pp4096 |      7.76 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          0 | 1.00         |        pp4096 |     14.67 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          1 | 1.00         |        pp4096 |     15.10 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          0 | 1.00         |        pp4096 |     27.46 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          1 | 1.00         |        pp4096 |     28.60 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          0 | 1.00         |        pp4096 |     51.89 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          1 | 1.00         |        pp4096 |     57.58 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          0 | 1.00         |        pp4096 |    109.10 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          1 | 1.00         |        pp4096 |    113.68 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          0 | 1.00         |        pp4096 |    174.80 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          1 | 1.00         |        pp4096 |    179.33 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          0 | 1.00         |        pp4096 |    257.94 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          1 | 1.00         |        pp4096 |    267.08 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          0 | 1.00         |        pp4096 |    348.06 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          1 | 1.00         |        pp4096 |    363.57 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          0 | 1.00         |        pp4096 |    419.89 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          1 | 1.00         |        pp4096 |    442.53 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          0 | 1.00         |        pp4096 |    419.85 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          1 | 1.00         |        pp4096 |    443.04 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          0 | 1.00         |        pp4096 |    419.94 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          1 | 1.00         |        pp4096 |    443.18 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          0 | 1.00         |        pp4096 |    420.50 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          1 | 1.00         |        pp4096 |    443.35 ± 0.00 |

build: cc0332df (2873)

You seem to be getting dramatically faster results with your P40 than my P100, which has me curious.

@richginsberg

richginsberg commented May 16, 2024

Using a Dell PowerEdge R730 with dual Intel Xeon E5-2697 v3 (2.6 GHz, 14 cores), CUDA 12.4, and a single Tesla P100-PCIE-12GB.

./llama-bench --model ~/models-gguf/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-12GB, compute capability 6.0, VMM: yes

| model                          |       size |     params | backend    | ngl |    n_batch |         fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          0 |        pp4096 |     16.93 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          1 |        pp4096 |     18.43 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          0 |        pp4096 |      6.28 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          1 |        pp4096 |      6.38 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          0 |        pp4096 |     12.18 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          1 |        pp4096 |     12.47 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          0 |        pp4096 |     23.46 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          1 |        pp4096 |     24.33 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          0 |        pp4096 |     44.70 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          1 |        pp4096 |     49.00 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          0 |        pp4096 |     92.42 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          1 |        pp4096 |     96.83 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          0 |        pp4096 |    119.35 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          1 |        pp4096 |    122.89 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          0 |        pp4096 |    166.86 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          1 |        pp4096 |    174.27 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          0 |        pp4096 |    309.36 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          1 |        pp4096 |    325.39 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          0 |        pp4096 |    380.81 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          1 |        pp4096 |    412.09 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          0 |        pp4096 |    380.95 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          1 |        pp4096 |    412.34 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          0 |        pp4096 |    381.21 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          1 |        pp4096 |    412.26 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          0 |        pp4096 |    381.38 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          1 |        pp4096 |    412.76 ± 0.00 |

build: cc0332df (2873)

@JohannesGaessler
Collaborator Author

@dirkson @richginsberg thank you.

Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 559 iterations 🚀

Details (performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8364.72ms p(95)=20051.49ms fails=, finish reason: stop=506 truncated=53
  • Prompt processing (pp): avg=104.54tk/s p(95)=494.32tk/s
  • Token generation (tg): avg=46.42tk/s p(95)=48.95tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=cuda-fa-no-tc-14 commit=cc0332dfd7979214d8b6dc0295396edd1b813c2c

[chart] prompt_tokens_seconds: llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 559 iterations
[chart] predicted_tokens_seconds: same run
[chart] kv_cache_usage_ratio: same run
[chart] requests_processing: same run

@the-crypt-keeper

the-crypt-keeper commented May 16, 2024

Seeing about +5% on the P100; it doesn't matter whether 1 or 2 GPUs are used.

However, I'm getting very different P40 results from what you've posted above. Did you run the test with 4x P40? I don't have 4, only 2:

With 1x P40 I observe a large (~30%) improvement at low batch sizes, but past batch size 512 it gets a tiny bit slower.

With 2x P40 things really open up: the ~50% performance improvement is across the board and massive. Well done 🤯 💪

Single P100

ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes

model size params backend ngl n_batch fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 0 pp4096 174.33 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 1 pp4096 178.91 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 0 pp4096 255.56 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 1 pp4096 265.40 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 0 pp4096 343.98 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 1 pp4096 358.99 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 0 pp4096 413.50 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 1 pp4096 435.39 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 0 pp4096 413.56 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 1 pp4096 435.61 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 0 pp4096 413.62 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 1 pp4096 435.76 ± 0.00

Dual P100

Master: FA is slower

ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes

model size params backend ngl n_batch fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 0 pp4096 314.42 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 1 pp4096 300.14 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 0 pp4096 453.99 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 1 pp4096 427.89 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 0 pp4096 592.63 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 1 pp4096 545.82 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 0 pp4096 677.33 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 1 pp4096 606.22 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 0 pp4096 676.39 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 1 pp4096 606.38 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 0 pp4096 676.74 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 1 pp4096 606.43 ± 0.00

This branch: FA is 5% faster!

ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes

model size params backend ngl n_batch fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 0 pp4096 314.35 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 1 pp4096 320.56 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 0 pp4096 453.88 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 1 pp4096 467.64 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 0 pp4096 592.56 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 1 pp4096 613.77 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 0 pp4096 677.15 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 1 pp4096 706.84 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 0 pp4096 676.32 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 1 pp4096 706.52 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 0 pp4096 676.31 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 1 pp4096 706.50 ± 0.00

Single P40 (FA faster only up to batch size 256)

ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes

model size params backend ngl n_batch fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 0 pp4096 101.99 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 1 pp4096 133.83 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 0 pp4096 138.30 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 1 pp4096 155.01 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 0 pp4096 167.19 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 1 pp4096 171.62 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 0 pp4096 184.18 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 1 pp4096 181.64 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 0 pp4096 184.26 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 1 pp4096 181.65 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 0 pp4096 184.22 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 1 pp4096 181.72 ± 0.00

Dual P40

Master: FA slower past batch size 256

ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes

model size params backend ngl n_batch fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 0 pp4096 121.92 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 1 pp4096 162.49 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 0 pp4096 161.60 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 1 pp4096 180.83 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 0 pp4096 190.74 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 1 pp4096 180.72 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 0 pp4096 205.68 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 1 pp4096 170.86 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 0 pp4096 206.08 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 1 pp4096 170.89 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 0 pp4096 206.50 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 1 pp4096 170.88 ± 0.00

This branch: 🤯 🐎

Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes

model size params backend ngl n_batch fa test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 0 pp4096 121.98 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 64 1 pp4096 244.97 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 0 pp4096 161.69 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 128 1 pp4096 279.14 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 0 pp4096 190.61 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 256 1 pp4096 299.07 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 0 pp4096 205.79 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 512 1 pp4096 298.24 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 0 pp4096 205.97 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1024 1 pp4096 298.04 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 0 pp4096 206.33 ± 0.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 2048 1 pp4096 298.14 ± 0.00

@JohannesGaessler
Collaborator Author

The numbers are for Mistral 7b q4_0 on 1x P40, running on Linux 6.6.26-1-MANJARO. Are you using Windows?

@the-crypt-keeper

the-crypt-keeper commented May 16, 2024

@JohannesGaessler I am running Ubuntu 22. The numbers I posted were for Llama 2 7B, but switching to Mistral 7B doesn't make much difference; I see the same pattern: a single P40 is slower after b=256 and doesn't hit anywhere near the speeds you're reporting:

ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes

model size params backend ngl n_batch fa test t/s
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 64 0 pp4096 101.38 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 64 1 pp4096 127.41 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 128 0 pp4096 134.83 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 128 1 pp4096 147.56 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 256 0 pp4096 163.85 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 256 1 pp4096 166.03 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 512 0 pp4096 178.55 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 512 1 pp4096 174.84 ± 0.00

For reference here is a RTX3060 in the same machine on the same model:

Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes

model size params backend ngl n_batch fa test t/s
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 64 0 pp4096 476.25 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 64 1 pp4096 514.05 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 128 0 pp4096 785.99 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 128 1 pp4096 927.54 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 256 0 pp4096 1098.38 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 256 1 pp4096 1368.08 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 512 0 pp4096 1277.03 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 512 1 pp4096 1651.68 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 1024 0 pp4096 1276.42 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 1024 1 pp4096 1650.07 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 2048 0 pp4096 1273.72 ± 0.00
llama 7B Q4_0 3.83 GiB 7.24 B CUDA 99 2048 1 pp4096 1645.75 ± 0.00

@slaren
Member

slaren commented May 16, 2024

Keep in mind that to increase the batch size that is submitted to the CUDA backend, you need to increase the ubatch size alongside the batch size. Adding -ub 4096 to the command line should do this.
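
For example, the benchmark command used earlier in the thread would then become something like the following (a sketch; hypothetical model path, otherwise the same parameters as above):

# same sweep as before, but with the ubatch size raised to match the largest batch
./llama-bench --model models/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -ub 4096

This way the physical batch submitted to the CUDA backend can actually grow with -b instead of being capped at the default ubatch size.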

@JohannesGaessler
Collaborator Author

Do you have ECC memory enabled? If it's disabled, nvidia-smi should be reporting 24576 MiB; if it's enabled, it should be less.

Are you disabling the other GPUs via CUDA_VISIBLE_DEVICES or via --tensor-split?
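
For reference, the two approaches look roughly like this (a sketch; hypothetical model path, abbreviated arguments):

# hide all but the first GPU from the CUDA runtime
CUDA_VISIBLE_DEVICES=0 ./llama-bench --model models/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -p 4096 -b 512

# or keep all GPUs visible but assign the whole tensor split to device 0
./llama-bench --model models/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -p 4096 -b 512 -ts 1

The two are not necessarily equivalent: with CUDA_VISIBLE_DEVICES the other cards are invisible to the process, while with -ts they are still enumerated by ggml_cuda_init, as the device listings earlier in this thread show.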

@JohannesGaessler
Collaborator Author

@slaren thank you for the clarification; in this particular case it luckily does not seem to affect the conclusions:

model backend ngl n_batch n_ubatch fa test t/s
llama 7B Q4_0 CUDA 99 1 4096 0 pp4096 40.87 ± 0.00
llama 7B Q4_0 CUDA 99 1 4096 1 pp4096 52.58 ± 0.00
llama 7B Q4_0 CUDA 99 2 4096 0 pp4096 43.22 ± 0.00
llama 7B Q4_0 CUDA 99 2 4096 1 pp4096 96.77 ± 0.00
llama 7B Q4_0 CUDA 99 4 4096 0 pp4096 72.14 ± 0.00
llama 7B Q4_0 CUDA 99 4 4096 1 pp4096 115.39 ± 0.00
llama 7B Q4_0 CUDA 99 8 4096 0 pp4096 86.43 ± 0.00
llama 7B Q4_0 CUDA 99 8 4096 1 pp4096 147.52 ± 0.00
llama 7B Q4_0 CUDA 99 16 4096 0 pp4096 113.30 ± 0.00
llama 7B Q4_0 CUDA 99 16 4096 1 pp4096 179.78 ± 0.00
llama 7B Q4_0 CUDA 99 32 4096 0 pp4096 215.13 ± 0.00
llama 7B Q4_0 CUDA 99 32 4096 1 pp4096 336.41 ± 0.00
llama 7B Q4_0 CUDA 99 64 4096 0 pp4096 394.21 ± 0.00
llama 7B Q4_0 CUDA 99 64 4096 1 pp4096 560.31 ± 0.00
llama 7B Q4_0 CUDA 99 128 4096 0 pp4096 537.90 ± 0.00
llama 7B Q4_0 CUDA 99 128 4096 1 pp4096 663.18 ± 0.00
llama 7B Q4_0 CUDA 99 256 4096 0 pp4096 663.99 ± 0.00
llama 7B Q4_0 CUDA 99 256 4096 1 pp4096 715.87 ± 0.00
llama 7B Q4_0 CUDA 99 512 4096 0 pp4096 709.71 ± 0.00
llama 7B Q4_0 CUDA 99 512 4096 1 pp4096 719.96 ± 0.00
llama 7B Q4_0 CUDA 99 1024 4096 0 pp4096 703.04 ± 0.00
llama 7B Q4_0 CUDA 99 1024 4096 1 pp4096 706.27 ± 0.00
llama 7B Q4_0 CUDA 99 2048 4096 0 pp4096 699.01 ± 0.00
llama 7B Q4_0 CUDA 99 2048 4096 1 pp4096 677.62 ± 0.00
llama 7B Q4_0 CUDA 99 4096 4096 0 pp4096 652.36 ± 0.00
llama 7B Q4_0 CUDA 99 4096 4096 1 pp4096 623.75 ± 0.00

For very large batch sizes the performance with FlashAttention decreases, but performance seems to be optimal with a batch size of 512 anyway.

@sorasoras

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
| model                          |       size |     params | backend    | ngl |    n_batch |         fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------: | ---------------: |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          1 |          0 |        pp4096 |      9.27 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          1 |          1 |        pp4096 |     10.85 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          2 |          0 |        pp4096 |      9.27 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          2 |          1 |        pp4096 |     19.68 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          4 |          0 |        pp4096 |     15.45 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          4 |          1 |        pp4096 |     17.02 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          8 |          0 |        pp4096 |     19.39 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          8 |          1 |        pp4096 |     31.13 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         16 |          0 |        pp4096 |     28.04 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         16 |          1 |        pp4096 |     34.33 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         32 |          0 |        pp4096 |     54.27 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         32 |          1 |        pp4096 |     61.23 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         64 |          0 |        pp4096 |    103.43 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         64 |          1 |        pp4096 |     94.47 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        128 |          0 |        pp4096 |    141.43 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        128 |          1 |        pp4096 |    108.73 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        256 |          0 |        pp4096 |    163.13 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        256 |          1 |        pp4096 |    110.92 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        512 |          0 |        pp4096 |    177.99 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        512 |          1 |        pp4096 |    111.39 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       1024 |          0 |        pp4096 |    178.44 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       1024 |          1 |        pp4096 |    111.37 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       2048 |          0 |        pp4096 |    178.38 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       2048 |          1 |        pp4096 |    111.25 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       4096 |          0 |        pp4096 |    178.54 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       4096 |          1 |        pp4096 |    111.39 ± 0.00 |

build: aa9cbd76 (2853)

It seems less optimal for Qwen2 32B at larger batch sizes.

@GZGavinZhao
Contributor

GZGavinZhao commented May 16, 2024

The closest AMD alternative I know of to NVIDIA Nsight Compute would be Radeon GPU Profiler. It's still a bit different, but it may be enough to get started.

On the command-line, rocprofiler is what I often use. You can run rocprofiler --hip-trace <program-and-flags>, and it will give you a few CSV files containing API usage percentages/runtime plus a results.json that you can drop into Perfetto to inspect the timeline.
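
As a concrete sketch (hypothetical paths; the rocprofiler invocation is the one described above), profiling a single benchmark configuration could look like:

# trace the HIP API calls of one llama-bench run
rocprofiler --hip-trace ./llama-bench --model models/llama-2-7b.Q4_0.gguf -fa 1 -n 0 -p 4096 -b 512

# the run leaves CSV summaries plus a results.json that can be opened in the Perfetto UI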

llama-bench takes too long for my RX6600M to run, but I profiled a few runs with rocprofiler using the same model and parameters as in the previous comments, and found that hipStreamSynchronize takes up 26% of the runtime, compared to 10% with the old behavior. All other metrics look roughly the same.

@richginsberg

Another run, using an Asus ESC4000 G4 with an Intel Xeon Gold 6138 (LGA3647, 1.8 GHz, 20 cores / 40 threads), CUDA 12.2, and a single P100-PCIE-16GB out of 4.

./llama-bench --model ~/gguf-models/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -ts 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 4 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 3: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
| model                          |       size |     params | backend    | ngl |    n_batch |         fa | ts           |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------ | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          0 | 1.00         |        pp4096 |     19.19 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          1 | 1.00         |        pp4096 |     20.99 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          0 | 1.00         |        pp4096 |      7.63 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          1 | 1.00         |        pp4096 |      7.77 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          0 | 1.00         |        pp4096 |     14.68 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          1 | 1.00         |        pp4096 |     15.11 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          0 | 1.00         |        pp4096 |     27.48 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          1 | 1.00         |        pp4096 |     28.62 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          0 | 1.00         |        pp4096 |     51.93 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          1 | 1.00         |        pp4096 |     57.64 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          0 | 1.00         |        pp4096 |    109.20 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          1 | 1.00         |        pp4096 |    113.79 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          0 | 1.00         |        pp4096 |    174.94 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          1 | 1.00         |        pp4096 |    179.43 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          0 | 1.00         |        pp4096 |    257.89 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          1 | 1.00         |        pp4096 |    267.46 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          0 | 1.00         |        pp4096 |    348.21 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          1 | 1.00         |        pp4096 |    363.76 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          0 | 1.00         |        pp4096 |    419.83 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          1 | 1.00         |        pp4096 |    442.68 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          0 | 1.00         |        pp4096 |    419.91 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          1 | 1.00         |        pp4096 |    442.85 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          0 | 1.00         |        pp4096 |    420.06 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          1 | 1.00         |        pp4096 |    443.13 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          0 | 1.00         |        pp4096 |    420.49 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          1 | 1.00         |        pp4096 |    443.28 ± 0.00 |

build: cc0332df (2873)

@the-crypt-keeper

@JohannesGaessler

Do you have ECC memory enabled? If it's disabled, nvidia-smi should be reporting 24576 MiB; if it's enabled, it should be less.

ECC is disabled.

Are you disabling the other GPUs via CUDA_VISIBLE_DEVICES or via --tensor-split?

via CUDA_VISIBLE_DEVICES, but I just tried via -ts and the results were the same.

My P100 numbers match what others are reporting, but your P40 numbers are somehow ~4x mine. I guess we need another set of P40 benchmarks.

@JohannesGaessler
Collaborator Author

@sorasoras I am not able to reproduce the performance issue with qwen 1.5 q4_0:

model backend ngl n_batch n_ubatch fa test t/s
qwen2 32B Q4_0 CUDA 99 1 4096 0 pp4096 11.11 ± 0.00
qwen2 32B Q4_0 CUDA 99 1 4096 1 pp4096 13.14 ± 0.00
qwen2 32B Q4_0 CUDA 99 2 4096 0 pp4096 14.02 ± 0.00
qwen2 32B Q4_0 CUDA 99 2 4096 1 pp4096 25.01 ± 0.00
qwen2 32B Q4_0 CUDA 99 4 4096 0 pp4096 22.65 ± 0.00
qwen2 32B Q4_0 CUDA 99 4 4096 1 pp4096 31.86 ± 0.00
qwen2 32B Q4_0 CUDA 99 8 4096 0 pp4096 26.74 ± 0.00
qwen2 32B Q4_0 CUDA 99 8 4096 1 pp4096 38.82 ± 0.00
qwen2 32B Q4_0 CUDA 99 16 4096 0 pp4096 33.67 ± 0.00
qwen2 32B Q4_0 CUDA 99 16 4096 1 pp4096 44.59 ± 0.00
qwen2 32B Q4_0 CUDA 99 32 4096 0 pp4096 64.52 ± 0.00
qwen2 32B Q4_0 CUDA 99 32 4096 1 pp4096 85.34 ± 0.00
qwen2 32B Q4_0 CUDA 99 64 4096 0 pp4096 119.98 ± 0.00
qwen2 32B Q4_0 CUDA 99 64 4096 1 pp4096 151.90 ± 0.00
qwen2 32B Q4_0 CUDA 99 128 4096 0 pp4096 151.88 ± 0.00
qwen2 32B Q4_0 CUDA 99 128 4096 1 pp4096 173.24 ± 0.00
qwen2 32B Q4_0 CUDA 99 256 4096 0 pp4096 166.88 ± 0.00
qwen2 32B Q4_0 CUDA 99 256 4096 1 pp4096 176.18 ± 0.00
qwen2 32B Q4_0 CUDA 99 512 4096 0 pp4096 178.37 ± 0.00
qwen2 32B Q4_0 CUDA 99 512 4096 1 pp4096 181.52 ± 0.00
qwen2 32B Q4_0 CUDA 99 1024 4096 0 pp4096 178.97 ± 0.00
qwen2 32B Q4_0 CUDA 99 1024 4096 1 pp4096 179.72 ± 0.00
qwen2 32B Q4_0 CUDA 99 2048 4096 0 pp4096 178.99 ± 0.00
qwen2 32B Q4_0 CUDA 99 2048 4096 1 pp4096 172.05 ± 0.00
qwen2 32B Q4_0 CUDA 99 4096 4096 0 pp4096 170.47 ± 0.00
qwen2 32B Q4_0 CUDA 99 4096 4096 1 pp4096 164.67 ± 0.00

@sorasoras

@sorasoras I am not able to reproduce the performance issue with qwen 1.5 q4_0: [table quoted above]

It could be something to do with "-DLLAMA_CUDA_DMMV_X=64 -DLLAMA_CUDA_MMV_Y=4".
I will try it again later.
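
For anyone who wants to check this, the comparison build could look roughly like the following (a sketch; LLAMA_CUDA is assumed to be the CUDA build switch of this llama.cpp version, and the DMMV/MMV values are the ones quoted above):

# build with the extra tuning options
cmake -B build-tuned -DLLAMA_CUDA=ON -DLLAMA_CUDA_DMMV_X=64 -DLLAMA_CUDA_MMV_Y=4
cmake --build build-tuned --config Release -j

# build with the default values for comparison
cmake -B build-default -DLLAMA_CUDA=ON
cmake --build build-default --config Release -j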

@sorasoras

@sorasoras I am not able to reproduce the performance issue with qwen 1.5 q4_0: [table quoted above]

  Device 0: Tesla P40, compute capability 6.1, VMM: yes
| model                          |       size |     params | backend    | ngl |    n_batch |         fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------: | ---------------: |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          1 |          0 |        pp4096 |      9.47 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          1 |          1 |        pp4096 |     11.00 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          2 |          0 |        pp4096 |      9.55 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          2 |          1 |        pp4096 |     19.84 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          4 |          0 |        pp4096 |     15.55 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          4 |          1 |        pp4096 |     25.31 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          8 |          0 |        pp4096 |     19.80 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          8 |          1 |        pp4096 |     30.91 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         16 |          0 |        pp4096 |     28.34 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         16 |          1 |        pp4096 |     41.38 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         32 |          0 |        pp4096 |     54.80 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         32 |          1 |        pp4096 |     79.62 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         64 |          0 |        pp4096 |    104.02 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         64 |          1 |        pp4096 |    142.88 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        128 |          0 |        pp4096 |    140.53 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        128 |          1 |        pp4096 |    171.10 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        256 |          0 |        pp4096 |    161.36 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        256 |          1 |        pp4096 |    176.00 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        512 |          0 |        pp4096 |    175.13 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        512 |          1 |        pp4096 |    181.64 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       1024 |          0 |        pp4096 |    175.06 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       1024 |          1 |        pp4096 |    181.66 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       2048 |          0 |        pp4096 |    174.81 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       2048 |          1 |        pp4096 |    181.31 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       4096 |          0 |        pp4096 |    174.57 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       4096 |          1 |        pp4096 |    181.30 ± 0.00 |

Yup, it works without -DLLAMA_CUDA_DMMV_X=64 -DLLAMA_CUDA_MMV_Y=4 at compile time.

@LoopControl

LoopControl commented May 16, 2024

Seeing great results with this PR, @JohannesGaessler, thanks!

Here are the numbers from a P40 that I've power-limited to 130 W (because it keeps the card cooler):

P40

./llama-bench --model ./models-all/Meta-Llama-3-8B.Q5_K_M.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 16,32,64,128,256,512,1024
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
model size params backend ngl n_batch fa test t/s
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 16 0 pp4096 93.88 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 16 1 pp4096 127.49 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 32 0 pp4096 179.10 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 32 1 pp4096 240.82 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 64 0 pp4096 328.36 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 64 1 pp4096 417.04 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 128 0 pp4096 408.50 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 128 1 pp4096 467.78 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 256 0 pp4096 470.77 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 256 1 pp4096 505.75 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 512 0 pp4096 509.06 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 512 1 pp4096 522.42 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 1024 0 pp4096 509.93 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 1024 1 pp4096 523.24 ± 0.00

RTX 3060

Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes (Power limited to 150W)

model size params backend ngl n_batch fa test t/s
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 16 0 pp4096 129.29 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 16 1 pp4096 154.47 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 32 0 pp4096 267.50 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 32 1 pp4096 279.46 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 64 0 pp4096 498.66 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 64 1 pp4096 549.97 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 128 0 pp4096 578.19 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 128 1 pp4096 642.22 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 256 0 pp4096 580.37 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 256 1 pp4096 661.37 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 512 0 pp4096 590.96 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 512 1 pp4096 676.29 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 1024 0 pp4096 591.65 ± 0.00
llama 8B Q5_K - Medium 5.33 GiB 8.03 B CUDA 99 1024 1 pp4096 675.74 ± 0.00
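
As a side note on reproducing the power-limited setup mentioned above: such a cap is typically applied with nvidia-smi, roughly like this (illustrative values; requires root):

# cap GPU 0 at 130 W, then check the resulting limit and draw
sudo nvidia-smi -i 0 -pl 130
nvidia-smi -q -d POWER -i 0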

@the-crypt-keeper

the-crypt-keeper commented May 17, 2024

@JohannesGaessler Looks like you were right and there was something power limiting the P40s in my main rig to around 70W. I've moved them to the secondary and now they're >200W during these tests. My observation from the severely power-limited rig stands: with 2xP40 the performance gains here are HUGE.

Single P40

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance

| model | size | params | backend | ngl | n_batch | fa | ts | test | t/s |
| ----- | ---- | ------ | ------- | --- | ------- | -- | -- | ---- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 64 | 0 | 1.00 | pp4096 | 388.86 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 64 | 1 | 1.00 | pp4096 | 583.08 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | 0 | 1.00 | pp4096 | 543.88 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | 1 | 1.00 | pp4096 | 693.23 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | 0 | 1.00 | pp4096 | 674.47 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | 1 | 1.00 | pp4096 | 763.50 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | 0 | 1.00 | pp4096 | 771.49 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | 1 | 1.00 | pp4096 | 810.98 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | 0 | 1.00 | pp4096 | 771.96 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | 1 | 1.00 | pp4096 | 811.34 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 2048 | 0 | 1.00 | pp4096 | 771.81 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 2048 | 1 | 1.00 | pp4096 | 812.09 ± 0.00 |

2xP40 split layer

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes

| model | size | params | backend | ngl | n_batch | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | ------- | -- | ---- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 64 | 0 | pp4096 | 440.58 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 64 | 1 | pp4096 | 1066.93 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | 0 | pp4096 | 612.96 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | 1 | pp4096 | 1240.84 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | 0 | pp4096 | 754.97 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | 1 | pp4096 | 1321.64 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | 0 | pp4096 | 854.14 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | 1 | pp4096 | 1323.12 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | 0 | pp4096 | 855.33 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | 1 | pp4096 | 1322.56 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 2048 | 0 | pp4096 | 856.13 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 2048 | 1 | pp4096 | 1322.95 ± 0.00 |

2xP40 split row

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes

| model | size | params | backend | ngl | n_batch | sm | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | ------- | -- | -- | ---- | --- |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 64 | row | 0 | pp4096 | 310.54 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 64 | row | 1 | pp4096 | 430.14 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | row | 0 | pp4096 | 415.10 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | row | 1 | pp4096 | 503.06 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | row | 0 | pp4096 | 528.24 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | row | 1 | pp4096 | 634.75 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | row | 0 | pp4096 | 701.42 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | row | 1 | pp4096 | 794.70 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | row | 0 | pp4096 | 703.84 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | row | 1 | pp4096 | 795.85 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 2048 | row | 0 | pp4096 | 706.40 ± 0.00 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 2048 | row | 1 | pp4096 | 800.63 ± 0.00 |

Llama-3-70B-Instruct

Not as drastic but still some very welcome improvements, staying above 8 tok/sec:

CUDA_VISIBLE_DEVICES=0,1 ./llama-bench --model /disk-0/models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r 1 -fa 0,1 -b 256,512 -sm layer,row
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance

| model | size | params | backend | ngl | n_batch | sm | fa | test | t/s |
| ----- | ---- | ------ | ------- | --- | ------- | -- | -- | ---- | --- |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | layer | 0 | pp512 | 94.07 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | layer | 0 | tg128 | 5.46 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | layer | 0 | pp512+tg128 | 21.27 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | layer | 1 | pp512 | 125.49 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | layer | 1 | tg128 | 5.55 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | layer | 1 | pp512+tg128 | 23.42 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | layer | 0 | pp512 | 95.40 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | layer | 0 | tg128 | 5.46 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | layer | 0 | pp512+tg128 | 21.33 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | layer | 1 | pp512 | 97.77 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | layer | 1 | tg128 | 5.55 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | layer | 1 | pp512+tg128 | 22.46 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | row | 0 | pp512 | 109.72 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | row | 0 | tg128 | 7.84 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | row | 0 | pp512+tg128 | 28.82 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | row | 1 | pp512 | 121.51 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | row | 1 | tg128 | 8.14 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 256 | row | 1 | pp512+tg128 | 31.76 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | row | 0 | pp512 | 133.60 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | row | 0 | tg128 | 7.86 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | row | 0 | pp512+tg128 | 29.98 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | row | 1 | pp512 | 143.73 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | row | 1 | tg128 | 8.14 ± 0.00 |
| llama 70B Q4_K - Medium | 39.59 GiB | 70.55 B | CUDA | 99 | 512 | row | 1 | pp512+tg128 | 32.86 ± 0.00 |
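
Side note on the numa_balancing warning in the logs above: it can be turned off for the current boot with the usual sysctl knob (a sketch; needs root and resets on reboot):

echo 0 | sudo tee /proc/sys/kernel/numa_balancing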

Member

@slaren slaren left a comment


I don't want to block merging this, but I will point out the obvious: there is a lot of code duplication here, and that is going to complicate maintaining this code in the future.

@JohannesGaessler JohannesGaessler merged commit 0fc1e82 into ggml-org:master May 17, 2024
66 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request May 17, 2024
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request May 18, 2024
@LoopControl

LoopControl commented May 22, 2024

@JohannesGaessler This was working great after the merge, but with the new Phi-3-related commits I'm now getting a crash when -fa is enabled on my Tesla P40.

When -fa is off (or when I use my RTX 3060 with -fa), it works fine.

Current version from master that's crashing with FA: 201cc11afa0a1950e1f632390b2ac6c937a0d8f0

Startup command:
./server -fa --batch-size 1024 --ubatch-size 1024 -c 32768 -m Phi-3-medium-128k-instruct-Q4_K_M.gguf

Phi-3 Medium gguf from here: https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF
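
For anyone reproducing, the file can be fetched with huggingface-cli along these lines (a sketch; the filename is taken from the startup command above and assumed to match the repo):

huggingface-cli download bartowski/Phi-3-medium-128k-instruct-GGUF Phi-3-medium-128k-instruct-Q4_K_M.gguf --local-dir .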

Crash output:

{"tid":"140132145360896","timestamp":1716337795,"level":"INFO","function":"update_slots","line":2091,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":0,"p0":0}
GGML_ASSERT: ggml-cuda/fattn-tile-f32.cu:290: precision == GGML_PREC_DEFAULT
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
./server-phi3-medium-128k.sh: line 4: 506539 Aborted 
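
The "Could not attach to process" lines are the assert handler trying to attach gdb for a backtrace and being blocked by Yama. To get a usable stack next time, ptrace can be relaxed temporarily or the server run under gdb directly (a sketch; needs root, and the sysctl resets on reboot):

echo 0 | sudo tee /proc/sys/kernel/yama/ptrace_scope
# or launch under gdb and reproduce the crash:
gdb -ex run --args ./server -fa --batch-size 1024 --ubatch-size 1024 -c 32768 -m Phi-3-medium-128k-instruct-Q4_K_M.gguf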

@JohannesGaessler
Collaborator Author

There was an incorrect precision check, which is now fixed on master. However, if Phi-3, like Phi-2, uses a head size of 80, the code will still not work.

@LoopControl

> There was an incorrect precision check, which is now fixed on master. However, if Phi-3, like Phi-2, uses a head size of 80, the code will still not work.

Thanks for the quick fix @JohannesGaessler !

After pulling the latest changes, inference now works well on the P40 with FA, using the Phi-3 model I linked above.
