CUDA: faster large batch FA without tensor cores #7314

JohannesGaessler · 2024-05-15T22:04:52Z

This PR adds CUDA FlashAttention kernels that do not use tensor cores and are optimized for large batch sizes. On my P40 enabling FlashAttention is now consistently faster:

model	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	CUDA	99	1	0	pp4096	40.88 ± 0.00
llama 7B Q4_0	CUDA	99	1	1	pp4096	52.58 ± 0.00
llama 7B Q4_0	CUDA	99	2	0	pp4096	43.23 ± 0.00
llama 7B Q4_0	CUDA	99	2	1	pp4096	96.79 ± 0.00
llama 7B Q4_0	CUDA	99	4	0	pp4096	72.15 ± 0.00
llama 7B Q4_0	CUDA	99	4	1	pp4096	115.52 ± 0.00
llama 7B Q4_0	CUDA	99	8	0	pp4096	86.29 ± 0.00
llama 7B Q4_0	CUDA	99	8	1	pp4096	147.49 ± 0.00
llama 7B Q4_0	CUDA	99	16	0	pp4096	113.10 ± 0.00
llama 7B Q4_0	CUDA	99	16	1	pp4096	179.72 ± 0.00
llama 7B Q4_0	CUDA	99	32	0	pp4096	215.08 ± 0.00
llama 7B Q4_0	CUDA	99	32	1	pp4096	336.38 ± 0.00
llama 7B Q4_0	CUDA	99	64	0	pp4096	394.15 ± 0.00
llama 7B Q4_0	CUDA	99	64	1	pp4096	560.28 ± 0.00
llama 7B Q4_0	CUDA	99	128	0	pp4096	537.85 ± 0.00
llama 7B Q4_0	CUDA	99	128	1	pp4096	663.37 ± 0.00
llama 7B Q4_0	CUDA	99	256	0	pp4096	661.92 ± 0.00
llama 7B Q4_0	CUDA	99	256	1	pp4096	741.85 ± 0.00
llama 7B Q4_0	CUDA	99	512	0	pp4096	724.98 ± 0.00
llama 7B Q4_0	CUDA	99	512	1	pp4096	740.43 ± 0.00
llama 7B Q4_0	CUDA	99	1024	0	pp4096	711.43 ± 0.00
llama 7B Q4_0	CUDA	99	1024	1	pp4096	729.14 ± 0.00
llama 7B Q4_0	CUDA	99	2048	0	pp4096	699.79 ± 0.00
llama 7B Q4_0	CUDA	99	2048	1	pp4096	725.82 ± 0.00
llama 7B Q4_0	CUDA	99	4096	0	pp4096	705.89 ± 0.00
llama 7B Q4_0	CUDA	99	4096	1	pp4096	724.40 ± 0.00

On my RX 6800 these new kernels unfortunately perform quite poorly which is why I'm not enabling them for AMD. I don't know what the issue is and I cannot use NVIDIA NSight Compute to find out either. To my knowledge there is simply no equivalent AMD tool; if it turns out that I am just ignorant I would love for someone to correct me.

JohannesGaessler · 2024-05-15T22:07:14Z

This PR should provide a good speedup for the P100 but unfortunately I don't own one with which I could test the code. I would appreciate it if a P100 owner could post the output of

./llama-bench --model models/opt/${model_name}-${quantization}.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096

with the path to an actual model.

Tom-Neverwinter · 2024-05-15T23:59:48Z

I have 4 m40's if that will help. If this works I may just drop the money for 4x p100s

dirkson · 2024-05-16T04:28:11Z

Here ya' go!

I added -ts 1 to restrict it to one p100. I can redo the test without it if you like - I have 5 available. I tried to use a similar model to yours.

Command:

./llama-bench --model ../../mod/gguf/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -ts 1

Output:

ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 5 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 3: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 4: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
| model                          |       size |     params | backend    | ngl |    n_batch |         fa | ts           |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------ | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          0 | 1.00         |        pp4096 |     17.47 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          1 | 1.00         |        pp4096 |     18.96 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          0 | 1.00         |        pp4096 |      7.62 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          1 | 1.00         |        pp4096 |      7.76 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          0 | 1.00         |        pp4096 |     14.67 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          1 | 1.00         |        pp4096 |     15.10 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          0 | 1.00         |        pp4096 |     27.46 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          1 | 1.00         |        pp4096 |     28.60 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          0 | 1.00         |        pp4096 |     51.89 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          1 | 1.00         |        pp4096 |     57.58 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          0 | 1.00         |        pp4096 |    109.10 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          1 | 1.00         |        pp4096 |    113.68 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          0 | 1.00         |        pp4096 |    174.80 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          1 | 1.00         |        pp4096 |    179.33 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          0 | 1.00         |        pp4096 |    257.94 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          1 | 1.00         |        pp4096 |    267.08 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          0 | 1.00         |        pp4096 |    348.06 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          1 | 1.00         |        pp4096 |    363.57 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          0 | 1.00         |        pp4096 |    419.89 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          1 | 1.00         |        pp4096 |    442.53 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          0 | 1.00         |        pp4096 |    419.85 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          1 | 1.00         |        pp4096 |    443.04 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          0 | 1.00         |        pp4096 |    419.94 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          1 | 1.00         |        pp4096 |    443.18 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          0 | 1.00         |        pp4096 |    420.50 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          1 | 1.00         |        pp4096 |    443.35 ± 0.00 |

build: cc0332df (2873)

You seem to be getting dramatically faster results with your p40 than my p100, which has me curious.

richginsberg · 2024-05-16T04:39:36Z

Using Dell PowerEdge R730 with Dual Intel Xeon E5-2697 V3 2.6 GHz 14 Core
And CUDA 12.4
Using single Tesla P100-PCIE-12GB

./llama-bench --model ~/models-gguf/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P100-PCIE-12GB, compute capability 6.0, VMM: yes

| model                          |       size |     params | backend    | ngl |    n_batch |         fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          0 |        pp4096 |     16.93 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          1 |        pp4096 |     18.43 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          0 |        pp4096 |      6.28 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          1 |        pp4096 |      6.38 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          0 |        pp4096 |     12.18 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          1 |        pp4096 |     12.47 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          0 |        pp4096 |     23.46 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          1 |        pp4096 |     24.33 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          0 |        pp4096 |     44.70 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          1 |        pp4096 |     49.00 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          0 |        pp4096 |     92.42 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          1 |        pp4096 |     96.83 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          0 |        pp4096 |    119.35 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          1 |        pp4096 |    122.89 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          0 |        pp4096 |    166.86 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          1 |        pp4096 |    174.27 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          0 |        pp4096 |    309.36 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          1 |        pp4096 |    325.39 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          0 |        pp4096 |    380.81 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          1 |        pp4096 |    412.09 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          0 |        pp4096 |    380.95 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          1 |        pp4096 |    412.34 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          0 |        pp4096 |    381.21 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          1 |        pp4096 |    412.26 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          0 |        pp4096 |    381.38 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          1 |        pp4096 |    412.76 ± 0.00 |

build: cc0332df (2873)

JohannesGaessler · 2024-05-16T06:34:45Z

@dirkson @richginsberg thank you.

github-actions · 2024-05-16T11:54:50Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 559 iterations 🚀

Expand details for performance related PR only

Concurrent users: 8, duration: 10m
HTTP request : avg=8364.72ms p(95)=20051.49ms fails=, finish reason: stop=506 truncated=53
Prompt processing (pp): avg=104.54tk/s p(95)=494.32tk/s
Token generation (tg): avg=46.42tk/s p(95)=48.95tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=cuda-fa-no-tc-14 commit=cc0332dfd7979214d8b6dc0295396edd1b813c2c

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 559 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1715859858 --> 1715860484
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 508.36, 508.36, 508.36, 508.36, 508.36, 671.01, 671.01, 671.01, 671.01, 671.01, 697.72, 697.72, 697.72, 697.72, 697.72, 778.01, 778.01, 778.01, 778.01, 778.01, 793.97, 793.97, 793.97, 793.97, 793.97, 789.33, 789.33, 789.33, 789.33, 789.33, 819.45, 819.45, 819.45, 819.45, 819.45, 820.8, 820.8, 820.8, 820.8, 820.8, 834.86, 834.86, 834.86, 834.86, 834.86, 848.85, 848.85, 848.85, 848.85, 848.85, 885.82, 885.82, 885.82, 885.82, 885.82, 894.33, 894.33, 894.33, 894.33, 894.33, 877.32, 877.32, 877.32, 877.32, 877.32, 879.49, 879.49, 879.49, 879.49, 879.49, 883.95, 883.95, 883.95, 883.95, 883.95, 883.82, 883.82, 883.82, 883.82, 883.82, 901.37, 901.37, 901.37, 901.37, 901.37, 899.96, 899.96, 899.96, 899.96, 899.96, 896.53, 896.53, 896.53, 896.53, 896.53, 902.43, 902.43, 902.43, 902.43, 902.43, 902.82, 902.82, 902.82, 902.82, 902.82, 911.62, 911.62, 911.62, 911.62, 911.62, 912.73, 912.73, 912.73, 912.73, 912.73, 914.02, 914.02, 914.02, 914.02, 914.02, 913.82, 913.82, 913.82, 913.82, 913.82, 843.99, 843.99, 843.99, 843.99, 843.99, 843.37, 843.37, 843.37, 843.37, 843.37, 843.02, 843.02, 843.02, 843.02, 843.02, 847.99, 847.99, 847.99, 847.99, 847.99, 846.98, 846.98, 846.98, 846.98, 846.98, 847.52, 847.52, 847.52, 847.52, 847.52, 854.89, 854.89, 854.89, 854.89, 854.89, 850.23, 850.23, 850.23, 850.23, 850.23, 853.43, 853.43, 853.43, 853.43, 853.43, 852.97, 852.97, 852.97, 852.97, 852.97, 849.89, 849.89, 849.89, 849.89, 849.89, 847.43, 847.43, 847.43, 847.43, 847.43, 847.11, 847.11, 847.11, 847.11, 847.11, 851.64, 851.64, 851.64, 851.64, 851.64, 852.65, 852.65, 852.65, 852.65, 852.65, 853.76, 853.76, 853.76, 853.76, 853.76, 849.59, 849.59, 849.59, 849.59, 849.59, 840.86, 840.86, 840.86, 840.86, 840.86, 839.65, 839.65, 839.65, 839.65, 839.65, 841.57, 841.57, 841.57, 841.57, 841.57, 844.22, 844.22, 844.22, 844.22, 844.22, 843.6, 843.6, 843.6, 843.6, 843.6, 849.17, 849.17, 849.17, 849.17, 849.17, 847.85, 847.85, 847.85, 847.85, 847.85, 850.96, 850.96, 850.96, 850.96, 850.96, 854.5, 854.5, 854.5, 854.5, 854.5, 853.32, 853.32, 853.32, 853.32, 853.32, 858.8, 858.8, 858.8, 858.8, 858.8, 859.49, 859.49, 859.49, 859.49, 859.49, 859.43, 859.43, 859.43, 859.43, 859.43, 856.82, 856.82, 856.82, 856.82, 856.82, 856.31, 856.31, 856.31, 856.31, 856.31, 857.53, 857.53, 857.53, 857.53, 857.53, 859.42, 859.42, 859.42, 859.42, 859.42, 858.21, 858.21, 858.21, 858.21, 858.21, 857.8, 857.8, 857.8, 857.8]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 559 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1715859858 --> 1715860484
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 40.56, 40.56, 40.56, 40.56, 40.56, 26.95, 26.95, 26.95, 26.95, 26.95, 28.56, 28.56, 28.56, 28.56, 28.56, 30.54, 30.54, 30.54, 30.54, 30.54, 31.6, 31.6, 31.6, 31.6, 31.6, 32.64, 32.64, 32.64, 32.64, 32.64, 33.95, 33.95, 33.95, 33.95, 33.95, 34.14, 34.14, 34.14, 34.14, 34.14, 34.56, 34.56, 34.56, 34.56, 34.56, 33.84, 33.84, 33.84, 33.84, 33.84, 33.8, 33.8, 33.8, 33.8, 33.8, 32.79, 32.79, 32.79, 32.79, 32.79, 32.81, 32.81, 32.81, 32.81, 32.81, 30.83, 30.83, 30.83, 30.83, 30.83, 30.81, 30.81, 30.81, 30.81, 30.81, 29.71, 29.71, 29.71, 29.71, 29.71, 30.02, 30.02, 30.02, 30.02, 30.02, 29.72, 29.72, 29.72, 29.72, 29.72, 29.76, 29.76, 29.76, 29.76, 29.76, 29.72, 29.72, 29.72, 29.72, 29.72, 29.72, 29.72, 29.72, 29.72, 29.72, 29.77, 29.77, 29.77, 29.77, 29.77, 29.53, 29.53, 29.53, 29.53, 29.53, 29.78, 29.78, 29.78, 29.78, 29.78, 30.08, 30.08, 30.08, 30.08, 30.08, 30.15, 30.15, 30.15, 30.15, 30.15, 30.14, 30.14, 30.14, 30.14, 30.14, 30.5, 30.5, 30.5, 30.5, 30.5, 30.52, 30.52, 30.52, 30.52, 30.52, 30.75, 30.75, 30.75, 30.75, 30.75, 30.85, 30.85, 30.85, 30.85, 30.85, 31.01, 31.01, 31.01, 31.01, 31.01, 30.98, 30.98, 30.98, 30.98, 30.98, 30.68, 30.68, 30.68, 30.68, 30.68, 30.2, 30.2, 30.2, 30.2, 30.2, 29.94, 29.94, 29.94, 29.94, 29.94, 29.82, 29.82, 29.82, 29.82, 29.82, 29.94, 29.94, 29.94, 29.94, 29.94, 30.15, 30.15, 30.15, 30.15, 30.15, 30.24, 30.24, 30.24, 30.24, 30.24, 30.29, 30.29, 30.29, 30.29, 30.29, 30.06, 30.06, 30.06, 30.06, 30.06, 29.63, 29.63, 29.63, 29.63, 29.63, 28.87, 28.87, 28.87, 28.87, 28.87, 28.74, 28.74, 28.74, 28.74, 28.74, 28.74, 28.74, 28.74, 28.74, 28.74, 28.73, 28.73, 28.73, 28.73, 28.73, 28.83, 28.83, 28.83, 28.83, 28.83, 28.9, 28.9, 28.9, 28.9, 28.9, 28.97, 28.97, 28.97, 28.97, 28.97, 28.97, 28.97, 28.97, 28.97, 28.97, 28.91, 28.91, 28.91, 28.91, 28.91, 28.93, 28.93, 28.93, 28.93, 28.93, 29.06, 29.06, 29.06, 29.06, 29.06, 29.09, 29.09, 29.09, 29.09, 29.09, 29.25, 29.25, 29.25, 29.25, 29.25, 29.29, 29.29, 29.29, 29.29, 29.29, 29.39, 29.39, 29.39, 29.39, 29.39, 29.41, 29.41, 29.41, 29.41, 29.41, 29.41, 29.41, 29.41, 29.41, 29.41, 29.45, 29.45, 29.45, 29.45]

Details

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 559 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1715859858 --> 1715860484
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16, 0.16, 0.16, 0.16, 0.16, 0.26, 0.26, 0.26, 0.26, 0.26, 0.21, 0.21, 0.21, 0.21, 0.21, 0.19, 0.19, 0.19, 0.19, 0.19, 0.2, 0.2, 0.2, 0.2, 0.2, 0.09, 0.09, 0.09, 0.09, 0.09, 0.11, 0.11, 0.11, 0.11, 0.11, 0.1, 0.1, 0.1, 0.1, 0.1, 0.16, 0.16, 0.16, 0.16, 0.16, 0.21, 0.21, 0.21, 0.21, 0.21, 0.27, 0.27, 0.27, 0.27, 0.27, 0.21, 0.21, 0.21, 0.21, 0.21, 0.44, 0.44, 0.44, 0.44, 0.44, 0.23, 0.23, 0.23, 0.23, 0.23, 0.3, 0.3, 0.3, 0.3, 0.3, 0.18, 0.18, 0.18, 0.18, 0.18, 0.34, 0.34, 0.34, 0.34, 0.34, 0.21, 0.21, 0.21, 0.21, 0.21, 0.13, 0.13, 0.13, 0.13, 0.13, 0.24, 0.24, 0.24, 0.24, 0.24, 0.23, 0.23, 0.23, 0.23, 0.23, 0.25, 0.25, 0.25, 0.25, 0.25, 0.1, 0.1, 0.1, 0.1, 0.1, 0.12, 0.12, 0.12, 0.12, 0.12, 0.21, 0.21, 0.21, 0.21, 0.21, 0.26, 0.26, 0.26, 0.26, 0.26, 0.09, 0.09, 0.09, 0.09, 0.09, 0.13, 0.13, 0.13, 0.13, 0.13, 0.14, 0.14, 0.14, 0.14, 0.14, 0.18, 0.18, 0.18, 0.18, 0.18, 0.14, 0.14, 0.14, 0.14, 0.14, 0.2, 0.2, 0.2, 0.2, 0.2, 0.26, 0.26, 0.26, 0.26, 0.26, 0.38, 0.38, 0.38, 0.38, 0.38, 0.2, 0.2, 0.2, 0.2, 0.2, 0.3, 0.3, 0.3, 0.3, 0.3, 0.18, 0.18, 0.18, 0.18, 0.18, 0.08, 0.08, 0.08, 0.08, 0.08, 0.12, 0.12, 0.12, 0.12, 0.12, 0.16, 0.16, 0.16, 0.16, 0.16, 0.3, 0.3, 0.3, 0.3, 0.3, 0.61, 0.61, 0.61, 0.61, 0.61, 0.35, 0.35, 0.35, 0.35, 0.35, 0.31, 0.31, 0.31, 0.31, 0.31, 0.17, 0.17, 0.17, 0.17, 0.17, 0.21, 0.21, 0.21, 0.21, 0.21, 0.13, 0.13, 0.13, 0.13, 0.13, 0.23, 0.23, 0.23, 0.23, 0.23, 0.13, 0.13, 0.13, 0.13, 0.13, 0.22, 0.22, 0.22, 0.22, 0.22, 0.3, 0.3, 0.3, 0.3, 0.3, 0.11, 0.11, 0.11, 0.11, 0.11, 0.18, 0.18, 0.18, 0.18, 0.18, 0.14, 0.14, 0.14, 0.14, 0.14, 0.18, 0.18, 0.18, 0.18, 0.18, 0.12, 0.12, 0.12, 0.12, 0.12, 0.13, 0.13, 0.13, 0.13, 0.13, 0.2, 0.2, 0.2, 0.2, 0.2, 0.27, 0.27, 0.27, 0.27, 0.27, 0.13, 0.13, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12, 0.12]

More

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3
 duration=10m 559 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1715859858 --> 1715860484
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 2.0, 2.0, 2.0, 2.0, 2.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 2.0, 2.0, 2.0, 2.0, 2.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 1.0, 1.0, 1.0, 1.0, 1.0, 7.0, 7.0, 7.0, 7.0, 7.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 1.0, 1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0, 3.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0]

the-crypt-keeper · 2024-05-16T12:48:05Z

Seeing about +5% on the P100, doesn't matter if 1 or 2 GPUs.

However I'm getting very different P40 results from what you've posted above - I wonder did you run the test with 4xP40? I don't have 4, I only have 2:

With 1xP40 I observe a large (30%) improvement at low batch sizes but past batch 512 it gets a tiny bit slower.

With 2xP40 things really open up the 50% performance improvement is across the board and massive.. well done 🤯 💪

Single P100

ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	0	pp4096	174.33 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	1	pp4096	178.91 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	0	pp4096	255.56 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	1	pp4096	265.40 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	0	pp4096	343.98 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	1	pp4096	358.99 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	0	pp4096	413.50 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	1	pp4096	435.39 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	0	pp4096	413.56 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	1	pp4096	435.61 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	0	pp4096	413.62 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	1	pp4096	435.76 ± 0.00

Dual P100

Master: FA is slower

ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	0	pp4096	314.42 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	1	pp4096	300.14 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	0	pp4096	453.99 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	1	pp4096	427.89 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	0	pp4096	592.63 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	1	pp4096	545.82 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	0	pp4096	677.33 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	1	pp4096	606.22 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	0	pp4096	676.39 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	1	pp4096	606.38 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	0	pp4096	676.74 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	1	pp4096	606.43 ± 0.00

This branch: FA is 5% faster!

ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	0	pp4096	314.35 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	1	pp4096	320.56 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	0	pp4096	453.88 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	1	pp4096	467.64 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	0	pp4096	592.56 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	1	pp4096	613.77 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	0	pp4096	677.15 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	1	pp4096	706.84 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	0	pp4096	676.32 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	1	pp4096	706.52 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	0	pp4096	676.31 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	1	pp4096	706.50 ± 0.00

Single P40 (faster up to 256 only)

ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	0	pp4096	101.99 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	1	pp4096	133.83 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	0	pp4096	138.30 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	1	pp4096	155.01 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	0	pp4096	167.19 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	1	pp4096	171.62 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	0	pp4096	184.18 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	1	pp4096	181.64 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	0	pp4096	184.26 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	1	pp4096	181.65 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	0	pp4096	184.22 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	1	pp4096	181.72 ± 0.00

Dual P40

Master: FA slower past ctx 256

ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	0	pp4096	121.92 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	1	pp4096	162.49 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	0	pp4096	161.60 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	1	pp4096	180.83 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	0	pp4096	190.74 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	1	pp4096	180.72 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	0	pp4096	205.68 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	1	pp4096	170.86 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	0	pp4096	206.08 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	1	pp4096	170.89 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	0	pp4096	206.50 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	1	pp4096	170.88 ± 0.00

This branch: 🤯 🐎

Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	0	pp4096	121.98 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	1	pp4096	244.97 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	0	pp4096	161.69 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	1	pp4096	279.14 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	0	pp4096	190.61 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	1	pp4096	299.07 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	0	pp4096	205.79 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	1	pp4096	298.24 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	0	pp4096	205.97 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	1	pp4096	298.04 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	0	pp4096	206.33 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	1	pp4096	298.14 ± 0.00

JohannesGaessler · 2024-05-16T13:29:49Z

The numbers are for Mistral 7b q4_0 on 1x P40, running on Linux 6.6.26-1-MANJARO. Are you using Windows?

the-crypt-keeper · 2024-05-16T13:50:37Z

@JohannesGaessler I am running Ubuntu-22, the numbers I posted were for llama2-7b but switching to mistral-7b doesn't make much difference I see the same pattern, a single P40 is slower after b=256 and doesn't hit anywhere near the speeds you're reporting:

ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	64	0	pp4096	101.38 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	64	1	pp4096	127.41 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	128	0	pp4096	134.83 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	128	1	pp4096	147.56 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	256	0	pp4096	163.85 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	256	1	pp4096	166.03 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	512	0	pp4096	178.55 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	512	1	pp4096	174.84 ± 0.00

For reference here is a RTX3060 in the same machine on the same model:

Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	64	0	pp4096	476.25 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	64	1	pp4096	514.05 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	128	0	pp4096	785.99 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	128	1	pp4096	927.54 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	256	0	pp4096	1098.38 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	256	1	pp4096	1368.08 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	512	0	pp4096	1277.03 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	512	1	pp4096	1651.68 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	1024	0	pp4096	1276.42 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	1024	1	pp4096	1650.07 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	2048	0	pp4096	1273.72 ± 0.00
llama 7B Q4_0	3.83 GiB	7.24 B	CUDA	99	2048	1	pp4096	1645.75 ± 0.00

slaren · 2024-05-16T13:56:35Z

Keep in mind that to increase the batch size that is submitted to the CUDA backend, you need to increase the ubatch-size alongside the batch size. Adding -ub 4096 to the command line should do this

JohannesGaessler · 2024-05-16T13:58:24Z

Do you have ECC memory enabled? If it's disabled nvidia-smi should be reporting 24576MiB, if it's enabled it should be less.

Are you disabling the other GPUs via CUDA_VISIABLE_DEVICES or via --tensor-split?

JohannesGaessler · 2024-05-16T14:36:36Z

@slaren thank you for the clarification, in this particular case it luckily does not seem to affect the conclusions:

model	backend	ngl	n_batch	n_ubatch	fa	test	t/s
llama 7B Q4_0	CUDA	99	1	4096	0	pp4096	40.87 ± 0.00
llama 7B Q4_0	CUDA	99	1	4096	1	pp4096	52.58 ± 0.00
llama 7B Q4_0	CUDA	99	2	4096	0	pp4096	43.22 ± 0.00
llama 7B Q4_0	CUDA	99	2	4096	1	pp4096	96.77 ± 0.00
llama 7B Q4_0	CUDA	99	4	4096	0	pp4096	72.14 ± 0.00
llama 7B Q4_0	CUDA	99	4	4096	1	pp4096	115.39 ± 0.00
llama 7B Q4_0	CUDA	99	8	4096	0	pp4096	86.43 ± 0.00
llama 7B Q4_0	CUDA	99	8	4096	1	pp4096	147.52 ± 0.00
llama 7B Q4_0	CUDA	99	16	4096	0	pp4096	113.30 ± 0.00
llama 7B Q4_0	CUDA	99	16	4096	1	pp4096	179.78 ± 0.00
llama 7B Q4_0	CUDA	99	32	4096	0	pp4096	215.13 ± 0.00
llama 7B Q4_0	CUDA	99	32	4096	1	pp4096	336.41 ± 0.00
llama 7B Q4_0	CUDA	99	64	4096	0	pp4096	394.21 ± 0.00
llama 7B Q4_0	CUDA	99	64	4096	1	pp4096	560.31 ± 0.00
llama 7B Q4_0	CUDA	99	128	4096	0	pp4096	537.90 ± 0.00
llama 7B Q4_0	CUDA	99	128	4096	1	pp4096	663.18 ± 0.00
llama 7B Q4_0	CUDA	99	256	4096	0	pp4096	663.99 ± 0.00
llama 7B Q4_0	CUDA	99	256	4096	1	pp4096	715.87 ± 0.00
llama 7B Q4_0	CUDA	99	512	4096	0	pp4096	709.71 ± 0.00
llama 7B Q4_0	CUDA	99	512	4096	1	pp4096	719.96 ± 0.00
llama 7B Q4_0	CUDA	99	1024	4096	0	pp4096	703.04 ± 0.00
llama 7B Q4_0	CUDA	99	1024	4096	1	pp4096	706.27 ± 0.00
llama 7B Q4_0	CUDA	99	2048	4096	0	pp4096	699.01 ± 0.00
llama 7B Q4_0	CUDA	99	2048	4096	1	pp4096	677.62 ± 0.00
llama 7B Q4_0	CUDA	99	4096	4096	0	pp4096	652.36 ± 0.00
llama 7B Q4_0	CUDA	99	4096	4096	1	pp4096	623.75 ± 0.00

For very large batch sizes the performance with FlashAttention decreases but the performance seems to be optimal with a batch size of 512 anyways.

sorasoras · 2024-05-16T14:48:27Z

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes
| model                          |       size |     params | backend    | ngl |    n_batch |         fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------: | ---------------: |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          1 |          0 |        pp4096 |      9.27 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          1 |          1 |        pp4096 |     10.85 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          2 |          0 |        pp4096 |      9.27 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          2 |          1 |        pp4096 |     19.68 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          4 |          0 |        pp4096 |     15.45 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          4 |          1 |        pp4096 |     17.02 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          8 |          0 |        pp4096 |     19.39 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          8 |          1 |        pp4096 |     31.13 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         16 |          0 |        pp4096 |     28.04 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         16 |          1 |        pp4096 |     34.33 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         32 |          0 |        pp4096 |     54.27 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         32 |          1 |        pp4096 |     61.23 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         64 |          0 |        pp4096 |    103.43 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         64 |          1 |        pp4096 |     94.47 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        128 |          0 |        pp4096 |    141.43 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        128 |          1 |        pp4096 |    108.73 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        256 |          0 |        pp4096 |    163.13 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        256 |          1 |        pp4096 |    110.92 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        512 |          0 |        pp4096 |    177.99 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        512 |          1 |        pp4096 |    111.39 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       1024 |          0 |        pp4096 |    178.44 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       1024 |          1 |        pp4096 |    111.37 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       2048 |          0 |        pp4096 |    178.38 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       2048 |          1 |        pp4096 |    111.25 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       4096 |          0 |        pp4096 |    178.54 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       4096 |          1 |        pp4096 |    111.39 ± 0.00 |

build: aa9cbd76 (2853)

it seems less optimal for qwen2 32B at larger batch

GZGavinZhao · 2024-05-16T14:53:23Z

The closest AMD alternative I know to NVIDIA NSight Compute would be Radeon GPU Profiler. It's still a bit different, but may be enough to get started.

On the command-line, rocprofiler is what I often use. You can run rocprofiler --hip-trace <program-and-flags>, and it will give you a few CSV files containing API usage percentages/runtime plus a results.json that you can drop into Perfetto to inspect the timeline.

llama-bench takes too long for my RX6600M to run, but I profiled a few runs with rocprofiler with the same model and parameters used in previous comments, and found that hipStreamSynchronize takes up 26% of the runtime, compared to the 10% in the old behavior. All other metrics look roughly the same.

richginsberg · 2024-05-16T15:58:12Z

Another run using Asus ESC4000 G4 with Intel Xeon Gold 6138 Processor LGA3647 1.8Ghz 20 Core 40 Thread
And CUDA 12.2
Using single P100-PCIE-16GB of 4

./llama-bench --model ~/gguf-models/llama-2-7b.Q4_0.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 1,2,4,8,16,32,64,128,256,512,1024,2048,4096 -ts 1
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 4 CUDA devices:
  Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 2: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
  Device 3: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
| model                          |       size |     params | backend    | ngl |    n_batch |         fa | ts           |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------ | ------------: | ---------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          0 | 1.00         |        pp4096 |     19.19 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          1 |          1 | 1.00         |        pp4096 |     20.99 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          0 | 1.00         |        pp4096 |      7.63 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          2 |          1 | 1.00         |        pp4096 |      7.77 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          0 | 1.00         |        pp4096 |     14.68 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          4 |          1 | 1.00         |        pp4096 |     15.11 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          0 | 1.00         |        pp4096 |     27.48 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |          8 |          1 | 1.00         |        pp4096 |     28.62 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          0 | 1.00         |        pp4096 |     51.93 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         16 |          1 | 1.00         |        pp4096 |     57.64 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          0 | 1.00         |        pp4096 |    109.20 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         32 |          1 | 1.00         |        pp4096 |    113.79 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          0 | 1.00         |        pp4096 |    174.94 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |         64 |          1 | 1.00         |        pp4096 |    179.43 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          0 | 1.00         |        pp4096 |    257.89 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        128 |          1 | 1.00         |        pp4096 |    267.46 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          0 | 1.00         |        pp4096 |    348.21 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        256 |          1 | 1.00         |        pp4096 |    363.76 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          0 | 1.00         |        pp4096 |    419.83 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |        512 |          1 | 1.00         |        pp4096 |    442.68 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          0 | 1.00         |        pp4096 |    419.91 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       1024 |          1 | 1.00         |        pp4096 |    442.85 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          0 | 1.00         |        pp4096 |    420.06 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       2048 |          1 | 1.00         |        pp4096 |    443.13 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          0 | 1.00         |        pp4096 |    420.49 ± 0.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CUDA       |  99 |       4096 |          1 | 1.00         |        pp4096 |    443.28 ± 0.00 |

build: cc0332df (2873)

the-crypt-keeper · 2024-05-16T16:26:26Z

@JohannesGaessler

Do you have ECC memory enabled? If it's disabled nvidia-smi should be reporting 24576MiB, if it's enabled it should be less.

ECC is disabled.

Are you disabling the other GPUs via CUDA_VISIABLE_DEVICES or via --tensor-split?

via CUDA_VISIABLE_DEVICES, but I just tried via -ts and the results were the same.

My P100 numbers match what others are reporting, but your P40 numbers are somehow ~4x mine. I guess we need another set of P40 benchmarks.

JohannesGaessler · 2024-05-16T17:24:19Z

@sorasoras I am not able to reproduce the performance issue with qwen 1.5 q4_0:

model	backend	ngl	n_batch	n_ubatch	fa	test	t/s
qwen2 32B Q4_0	CUDA	99	1	4096	0	pp4096	11.11 ± 0.00
qwen2 32B Q4_0	CUDA	99	1	4096	1	pp4096	13.14 ± 0.00
qwen2 32B Q4_0	CUDA	99	2	4096	0	pp4096	14.02 ± 0.00
qwen2 32B Q4_0	CUDA	99	2	4096	1	pp4096	25.01 ± 0.00
qwen2 32B Q4_0	CUDA	99	4	4096	0	pp4096	22.65 ± 0.00
qwen2 32B Q4_0	CUDA	99	4	4096	1	pp4096	31.86 ± 0.00
qwen2 32B Q4_0	CUDA	99	8	4096	0	pp4096	26.74 ± 0.00
qwen2 32B Q4_0	CUDA	99	8	4096	1	pp4096	38.82 ± 0.00
qwen2 32B Q4_0	CUDA	99	16	4096	0	pp4096	33.67 ± 0.00
qwen2 32B Q4_0	CUDA	99	16	4096	1	pp4096	44.59 ± 0.00
qwen2 32B Q4_0	CUDA	99	32	4096	0	pp4096	64.52 ± 0.00
qwen2 32B Q4_0	CUDA	99	32	4096	1	pp4096	85.34 ± 0.00
qwen2 32B Q4_0	CUDA	99	64	4096	0	pp4096	119.98 ± 0.00
qwen2 32B Q4_0	CUDA	99	64	4096	1	pp4096	151.90 ± 0.00
qwen2 32B Q4_0	CUDA	99	128	4096	0	pp4096	151.88 ± 0.00
qwen2 32B Q4_0	CUDA	99	128	4096	1	pp4096	173.24 ± 0.00
qwen2 32B Q4_0	CUDA	99	256	4096	0	pp4096	166.88 ± 0.00
qwen2 32B Q4_0	CUDA	99	256	4096	1	pp4096	176.18 ± 0.00
qwen2 32B Q4_0	CUDA	99	512	4096	0	pp4096	178.37 ± 0.00
qwen2 32B Q4_0	CUDA	99	512	4096	1	pp4096	181.52 ± 0.00
qwen2 32B Q4_0	CUDA	99	1024	4096	0	pp4096	178.97 ± 0.00
qwen2 32B Q4_0	CUDA	99	1024	4096	1	pp4096	179.72 ± 0.00
qwen2 32B Q4_0	CUDA	99	2048	4096	0	pp4096	178.99 ± 0.00
qwen2 32B Q4_0	CUDA	99	2048	4096	1	pp4096	172.05 ± 0.00
qwen2 32B Q4_0	CUDA	99	4096	4096	0	pp4096	170.47 ± 0.00
qwen2 32B Q4_0	CUDA	99	4096	4096	1	pp4096	164.67 ± 0.00

sorasoras · 2024-05-16T18:01:40Z

@sorasoras I am not able to reproduce the performance issue with qwen 1.5 q4_0:

model backend ngl n_batch n_ubatch fa test t/s
qwen2 32B Q4_0 CUDA 99 1 4096 0 pp4096 11.11 ± 0.00
qwen2 32B Q4_0 CUDA 99 1 4096 1 pp4096 13.14 ± 0.00
qwen2 32B Q4_0 CUDA 99 2 4096 0 pp4096 14.02 ± 0.00
qwen2 32B Q4_0 CUDA 99 2 4096 1 pp4096 25.01 ± 0.00
qwen2 32B Q4_0 CUDA 99 4 4096 0 pp4096 22.65 ± 0.00
qwen2 32B Q4_0 CUDA 99 4 4096 1 pp4096 31.86 ± 0.00
qwen2 32B Q4_0 CUDA 99 8 4096 0 pp4096 26.74 ± 0.00
qwen2 32B Q4_0 CUDA 99 8 4096 1 pp4096 38.82 ± 0.00
qwen2 32B Q4_0 CUDA 99 16 4096 0 pp4096 33.67 ± 0.00
qwen2 32B Q4_0 CUDA 99 16 4096 1 pp4096 44.59 ± 0.00
qwen2 32B Q4_0 CUDA 99 32 4096 0 pp4096 64.52 ± 0.00
qwen2 32B Q4_0 CUDA 99 32 4096 1 pp4096 85.34 ± 0.00
qwen2 32B Q4_0 CUDA 99 64 4096 0 pp4096 119.98 ± 0.00
qwen2 32B Q4_0 CUDA 99 64 4096 1 pp4096 151.90 ± 0.00
qwen2 32B Q4_0 CUDA 99 128 4096 0 pp4096 151.88 ± 0.00
qwen2 32B Q4_0 CUDA 99 128 4096 1 pp4096 173.24 ± 0.00
qwen2 32B Q4_0 CUDA 99 256 4096 0 pp4096 166.88 ± 0.00
qwen2 32B Q4_0 CUDA 99 256 4096 1 pp4096 176.18 ± 0.00
qwen2 32B Q4_0 CUDA 99 512 4096 0 pp4096 178.37 ± 0.00
qwen2 32B Q4_0 CUDA 99 512 4096 1 pp4096 181.52 ± 0.00
qwen2 32B Q4_0 CUDA 99 1024 4096 0 pp4096 178.97 ± 0.00
qwen2 32B Q4_0 CUDA 99 1024 4096 1 pp4096 179.72 ± 0.00
qwen2 32B Q4_0 CUDA 99 2048 4096 0 pp4096 178.99 ± 0.00
qwen2 32B Q4_0 CUDA 99 2048 4096 1 pp4096 172.05 ± 0.00
qwen2 32B Q4_0 CUDA 99 4096 4096 0 pp4096 170.47 ± 0.00
qwen2 32B Q4_0 CUDA 99 4096 4096 1 pp4096 164.67 ± 0.00

It could be something to do with " -DLLAMA_CUDA_DMMV_X=64 -DLLAMA_CUDA_MMV_Y=4"
I would try it again latter.

sorasoras · 2024-05-16T20:02:46Z

@sorasoras I am not able to reproduce the performance issue with qwen 1.5 q4_0:

model backend ngl n_batch n_ubatch fa test t/s
qwen2 32B Q4_0 CUDA 99 1 4096 0 pp4096 11.11 ± 0.00
qwen2 32B Q4_0 CUDA 99 1 4096 1 pp4096 13.14 ± 0.00
qwen2 32B Q4_0 CUDA 99 2 4096 0 pp4096 14.02 ± 0.00
qwen2 32B Q4_0 CUDA 99 2 4096 1 pp4096 25.01 ± 0.00
qwen2 32B Q4_0 CUDA 99 4 4096 0 pp4096 22.65 ± 0.00
qwen2 32B Q4_0 CUDA 99 4 4096 1 pp4096 31.86 ± 0.00
qwen2 32B Q4_0 CUDA 99 8 4096 0 pp4096 26.74 ± 0.00
qwen2 32B Q4_0 CUDA 99 8 4096 1 pp4096 38.82 ± 0.00
qwen2 32B Q4_0 CUDA 99 16 4096 0 pp4096 33.67 ± 0.00
qwen2 32B Q4_0 CUDA 99 16 4096 1 pp4096 44.59 ± 0.00
qwen2 32B Q4_0 CUDA 99 32 4096 0 pp4096 64.52 ± 0.00
qwen2 32B Q4_0 CUDA 99 32 4096 1 pp4096 85.34 ± 0.00
qwen2 32B Q4_0 CUDA 99 64 4096 0 pp4096 119.98 ± 0.00
qwen2 32B Q4_0 CUDA 99 64 4096 1 pp4096 151.90 ± 0.00
qwen2 32B Q4_0 CUDA 99 128 4096 0 pp4096 151.88 ± 0.00
qwen2 32B Q4_0 CUDA 99 128 4096 1 pp4096 173.24 ± 0.00
qwen2 32B Q4_0 CUDA 99 256 4096 0 pp4096 166.88 ± 0.00
qwen2 32B Q4_0 CUDA 99 256 4096 1 pp4096 176.18 ± 0.00
qwen2 32B Q4_0 CUDA 99 512 4096 0 pp4096 178.37 ± 0.00
qwen2 32B Q4_0 CUDA 99 512 4096 1 pp4096 181.52 ± 0.00
qwen2 32B Q4_0 CUDA 99 1024 4096 0 pp4096 178.97 ± 0.00
qwen2 32B Q4_0 CUDA 99 1024 4096 1 pp4096 179.72 ± 0.00
qwen2 32B Q4_0 CUDA 99 2048 4096 0 pp4096 178.99 ± 0.00
qwen2 32B Q4_0 CUDA 99 2048 4096 1 pp4096 172.05 ± 0.00
qwen2 32B Q4_0 CUDA 99 4096 4096 0 pp4096 170.47 ± 0.00
qwen2 32B Q4_0 CUDA 99 4096 4096 1 pp4096 164.67 ± 0.00

  Device 0: Tesla P40, compute capability 6.1, VMM: yes
| model                          |       size |     params | backend    | ngl |    n_batch |         fa |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---------: | ---------: | ------------: | ---------------: |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          1 |          0 |        pp4096 |      9.47 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          1 |          1 |        pp4096 |     11.00 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          2 |          0 |        pp4096 |      9.55 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          2 |          1 |        pp4096 |     19.84 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          4 |          0 |        pp4096 |     15.55 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          4 |          1 |        pp4096 |     25.31 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          8 |          0 |        pp4096 |     19.80 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |          8 |          1 |        pp4096 |     30.91 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         16 |          0 |        pp4096 |     28.34 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         16 |          1 |        pp4096 |     41.38 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         32 |          0 |        pp4096 |     54.80 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         32 |          1 |        pp4096 |     79.62 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         64 |          0 |        pp4096 |    104.02 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |         64 |          1 |        pp4096 |    142.88 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        128 |          0 |        pp4096 |    140.53 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        128 |          1 |        pp4096 |    171.10 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        256 |          0 |        pp4096 |    161.36 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        256 |          1 |        pp4096 |    176.00 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        512 |          0 |        pp4096 |    175.13 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |        512 |          1 |        pp4096 |    181.64 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       1024 |          0 |        pp4096 |    175.06 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       1024 |          1 |        pp4096 |    181.66 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       2048 |          0 |        pp4096 |    174.81 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       2048 |          1 |        pp4096 |    181.31 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       4096 |          0 |        pp4096 |    174.57 ± 0.00 |
| qwen2 ?B Q4_K - Medium         |  18.34 GiB |    32.51 B | CUDA       |  99 |       4096 |          1 |        pp4096 |    181.30 ± 0.00 |

yup, it should work without -DLLAMA_CUDA_DMMV_X=64 -DLLAMA_CUDA_MMV_Y=4 when compile

LoopControl · 2024-05-16T22:13:34Z

Seeing great results with this PR @JohannesGaessler thanks!

Here's the numbers from a P40 that I've power limited to 130W (because it keeps the card cooler):

P40

./llama-bench --model ./models-all/Meta-Llama-3-8B.Q5_K_M.gguf -r 1 -fa 0,1 -n 0 -pg 0,0 -p 4096 -b 16,32,64,128,256,512,1024
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla P40, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	16	0	pp4096	93.88 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	16	1	pp4096	127.49 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	32	0	pp4096	179.10 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	32	1	pp4096	240.82 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	64	0	pp4096	328.36 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	64	1	pp4096	417.04 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	128	0	pp4096	408.50 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	128	1	pp4096	467.78 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	256	0	pp4096	470.77 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	256	1	pp4096	505.75 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	512	0	pp4096	509.06 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	512	1	pp4096	522.42 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	1024	0	pp4096	509.93 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	1024	1	pp4096	523.24 ± 0.00

RTX 3060

Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes (Power limited to 150W)

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	16	0	pp4096	129.29 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	16	1	pp4096	154.47 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	32	0	pp4096	267.50 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	32	1	pp4096	279.46 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	64	0	pp4096	498.66 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	64	1	pp4096	549.97 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	128	0	pp4096	578.19 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	128	1	pp4096	642.22 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	256	0	pp4096	580.37 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	256	1	pp4096	661.37 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	512	0	pp4096	590.96 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	512	1	pp4096	676.29 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	1024	0	pp4096	591.65 ± 0.00
llama 8B Q5_K - Medium	5.33 GiB	8.03 B	CUDA	99	1024	1	pp4096	675.74 ± 0.00

the-crypt-keeper · 2024-05-17T00:13:49Z

@JohannesGaessler Looks like you were right and there was something power limiting the P40s in my main rig to around 70W. I've moved them to the secondary and now they're >200W during these tests. My observation from the severely power-limited rig stands: with 2xP40 the performance gains here are HUGE.

Single P40

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance

model	size	params	backend	ngl	n_batch	fa	ts	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	0	1.00	pp4096	388.86 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	1	1.00	pp4096	583.08 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	0	1.00	pp4096	543.88 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	1	1.00	pp4096	693.23 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	0	1.00	pp4096	674.47 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	1	1.00	pp4096	763.50 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	0	1.00	pp4096	771.49 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	1	1.00	pp4096	810.98 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	0	1.00	pp4096	771.96 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	1	1.00	pp4096	811.34 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	0	1.00	pp4096	771.81 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	1	1.00	pp4096	812.09 ± 0.00

2xP40 split layer

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	n_batch	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	0	pp4096	440.58 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	1	pp4096	1066.93 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	0	pp4096	612.96 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	1	pp4096	1240.84 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	0	pp4096	754.97 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	1	pp4096	1321.64 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	0	pp4096	854.14 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	1	pp4096	1323.12 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	0	pp4096	855.33 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	1	pp4096	1322.56 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	0	pp4096	856.13 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	1	pp4096	1322.95 ± 0.00

2xP40 Split row

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes

model	size	params	backend	ngl	n_batch	sm	fa	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	row	0	pp4096	310.54 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	64	row	1	pp4096	430.14 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	row	0	pp4096	415.10 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	128	row	1	pp4096	503.06 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	row	0	pp4096	528.24 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	256	row	1	pp4096	634.75 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	row	0	pp4096	701.42 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	512	row	1	pp4096	794.70 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	row	0	pp4096	703.84 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	1024	row	1	pp4096	795.85 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	row	0	pp4096	706.40 ± 0.00
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	2048	row	1	pp4096	800.63 ± 0.00

Llama-3-70B-Instruct

Not as drastic but still some very welcome improvements, staying above 8 tok/sec:

CUDA_VISIBLE_DEVICES=0,1 ./llama-bench --model /disk-0/models/Meta-Llama-3-70B-Instruct.Q4_K_M.gguf -r 1 -fa 0,1 -b 256,512 -sm layer,row
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P40, compute capability 6.1, VMM: yes
Device 1: Tesla P40, compute capability 6.1, VMM: yes
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance

model	size	params	backend	ngl	n_batch	sm	fa	test	t/s
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	layer	0	pp512	94.07 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	layer	0	tg128	5.46 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	layer	0	pp512+tg128	21.27 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	layer	1	pp512	125.49 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	layer	1	tg128	5.55 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	layer	1	pp512+tg128	23.42 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	layer	0	pp512	95.40 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	layer	0	tg128	5.46 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	layer	0	pp512+tg128	21.33 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	layer	1	pp512	97.77 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	layer	1	tg128	5.55 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	layer	1	pp512+tg128	22.46 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	row	0	pp512	109.72 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	row	0	tg128	7.84 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	row	0	pp512+tg128	28.82 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	row	1	pp512	121.51 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	row	1	tg128	8.14 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	256	row	1	pp512+tg128	31.76 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	row	0	pp512	133.60 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	row	0	tg128	7.86 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	row	0	pp512+tg128	29.98 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	row	1	pp512	143.73 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	row	1	tg128	8.14 ± 0.00
llama 70B Q4_K - Medium	39.59 GiB	70.55 B	CUDA	99	512	row	1	pp512+tg128	32.86 ± 0.00

slaren

I don't want to block merging this, but I will point the obvious that there is a lot of code duplication here and that is going to complicate maintaining this code in the future.

LoopControl · 2024-05-22T00:34:54Z

@JohannesGaessler This was working great after merge but with the new Phi3 related commits, I'm now getting a crash when -fa is enabled on my Tesla P40.

When -fa is off (or when I use my RTX 3060 with -fa), it works fine.

Current version from master that's crashing with FA: 201cc11afa0a1950e1f632390b2ac6c937a0d8f0

Startup command:
./server -fa --batch-size 1024 --ubatch-size 1024 -c 32768 -m Phi-3-medium-128k-instruct-Q4_K_M.gguf

Phi-3 Medium gguf from here: https://huggingface.co/bartowski/Phi-3-medium-128k-instruct-GGUF

Crash output:

{"tid":"140132145360896","timestamp":1716337795,"level":"INFO","function":"update_slots","line":2091,"msg":"kv cache rm [p0, end)","id_slot":0,"id_task":0,"p0":0}
GGML_ASSERT: ggml-cuda/fattn-tile-f32.cu:290: precision == GGML_PREC_DEFAULT
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
./server-phi3-medium-128k.sh: line 4: 506539 Aborted

JohannesGaessler · 2024-05-22T08:25:32Z

There was an incorrect check for precision which is now fixed on master. However, if like Phi-2 Phi-3 is using a head size of 80 the code will still not work.

LoopControl · 2024-05-22T17:11:40Z

There was an incorrect check for precision which is now fixed on master. However, if like Phi-2 Phi-3 is using a head size of 80 the code will still not work.

Thanks for the quick fix @JohannesGaessler !

After merging latest changes, inference is now working well on the P40 with FA with the Phi 3 model I linked above.

CUDA: faster large batch FA without tensor cores

cc0332d

mofosyne added Nvidia GPU Issues specific to Nvidia GPUs Review Complexity : High Generally require indepth knowledge of LLMs or GPUs labels May 16, 2024

slaren approved these changes May 17, 2024

View reviewed changes

JohannesGaessler merged commit 0fc1e82 into ggml-org:master May 17, 2024
66 checks passed

JohannesGaessler mentioned this pull request May 17, 2024

CUDA: deduplicate FlashAttention code #7352

Merged

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request May 17, 2024

CUDA: faster large batch FA without tensor cores (ggml-org#7314)

733faeb

Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request May 18, 2024

CUDA: faster large batch FA without tensor cores (ggml-org#7314)

7dd012c

JohannesGaessler mentioned this pull request May 22, 2024

CUDA: remove incorrect precision check #7454

Merged

CUDA: faster large batch FA without tensor cores #7314

CUDA: faster large batch FA without tensor cores #7314

Uh oh!

Conversation

JohannesGaessler commented May 15, 2024

Uh oh!

JohannesGaessler commented May 15, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Tom-Neverwinter commented May 15, 2024

Uh oh!

dirkson commented May 16, 2024

Uh oh!

richginsberg commented May 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented May 16, 2024

Uh oh!

github-actions bot commented May 16, 2024

Uh oh!

the-crypt-keeper commented May 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Single P100

Dual P100

Master: FA is slower

This branch: FA is 5% faster!

Single P40 (faster up to 256 only)

Dual P40

Master: FA slower past ctx 256

This branch: 🤯 🐎

Uh oh!

JohannesGaessler commented May 16, 2024

Uh oh!

the-crypt-keeper commented May 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

slaren commented May 16, 2024

Uh oh!

JohannesGaessler commented May 16, 2024

Uh oh!

JohannesGaessler commented May 16, 2024

Uh oh!

sorasoras commented May 16, 2024

Uh oh!

GZGavinZhao commented May 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

richginsberg commented May 16, 2024

Uh oh!

the-crypt-keeper commented May 16, 2024

Uh oh!

JohannesGaessler commented May 16, 2024

Uh oh!

sorasoras commented May 16, 2024

Uh oh!

sorasoras commented May 16, 2024

Uh oh!

LoopControl commented May 16, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

P40

RTX 3060

Uh oh!

the-crypt-keeper commented May 17, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Single P40

2xP40 split layer

2xP40 Split row

Llama-3-70B-Instruct

Uh oh!

slaren left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

LoopControl commented May 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JohannesGaessler commented May 22, 2024

Uh oh!

LoopControl commented May 22, 2024

Uh oh!

JohannesGaessler commented May 15, 2024 •

edited

Loading

richginsberg commented May 16, 2024 •

edited

Loading

the-crypt-keeper commented May 16, 2024 •

edited

Loading

the-crypt-keeper commented May 16, 2024 •

edited

Loading

GZGavinZhao commented May 16, 2024 •

edited

Loading

LoopControl commented May 16, 2024 •

edited

Loading

the-crypt-keeper commented May 17, 2024 •

edited

Loading

LoopControl commented May 22, 2024 •

edited

Loading