ggml-opencl, llama: using reserve() if count already known #7272

GermanAizek · 2024-05-14T01:35:01Z

This significantly affects the ggml_cl_mul_mat_q_f32 function.

github-actions · 2024-05-14T02:04:31Z

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 547 iterations 🚀

Expand details for performance related PR only

Concurrent users: 8, duration: 10m
HTTP request : avg=8563.44ms p(95)=20815.89ms fails=, finish reason: stop=478 truncated=69
Prompt processing (pp): avg=105.13tk/s p(95)=469.6tk/s
Token generation (tg): avg=33.15tk/s p(95)=46.6tk/s
ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=reserve-vec commit=4ee29e5e1caf29e1bc7b094226faa890ae0e98d6

[Chart: llamacpp:prompt_tokens_seconds over time (Unix timestamps 1716181689 to 1716182321), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations]

[Chart: llamacpp:predicted_tokens_seconds over time (Unix timestamps 1716181689 to 1716182321), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations]

[Chart: llamacpp:kv_cache_usage_ratio over time (Unix timestamps 1716181689 to 1716182321), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations]

[Chart: llamacpp:requests_processing over time (Unix timestamps 1716181689 to 1716182321), llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 547 iterations]

ggerganov · 2024-05-14T07:01:47Z

llama.cpp

@@ -6116,6 +6116,7 @@ static bool llm_load_tensors(
                mlock_buf->init   (ggml_backend_buffer_get_base(buf));
                mlock_buf->grow_to(ggml_backend_buffer_get_size(buf));
            }
+            bufs.reserve(ml.files.size());


Already reserved on line 6060

GermanAizek · May 20, 2024

Fixed in 4ee29e5.
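The redundancy pointed out in this thread follows from how `std::vector::reserve` is specified: a request that does not exceed the current capacity has no effect. A minimal sketch, assuming a plain `std::vector<int>` as a stand-in for `bufs`; the function name is hypothetical and not part of llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical check: returns true if repeating reserve(n) leaves both the
// capacity and the underlying storage untouched.
bool second_reserve_is_noop(std::size_t n) {
    std::vector<int> bufs;
    bufs.reserve(n);                       // first reservation does the work
    assert(bufs.capacity() >= n);

    const std::size_t cap  = bufs.capacity();
    const int *       data = bufs.data();

    bufs.reserve(n);                       // n <= capacity(): nothing happens
    return bufs.capacity() == cap && bufs.data() == data;
}
```

The second call is harmless but adds nothing, so reserving once where the count first becomes known is enough.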

ggerganov · 2024-05-14T07:01:59Z

ggml-opencl.cpp

+                int64_t i12 = i02 * r2;
+                int64_t e12 = i12 + r2;
+                events.reserve(e12 - i12);
+                while (i12 < e12) {


Better to keep the for loop

GermanAizek · May 20, 2024

Fixed in 4ee29e5.
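One way to reserve up front while keeping the `for` loop is to note that the trip count `e12 - i12` is simply `r2`, so the bounds do not need to be hoisted out of the loop header. The sketch below is an illustration under that assumption only; `collect_indices` is a hypothetical stand-in and does not reproduce the OpenCL code under review.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the reviewed loop: records one value per
// iteration over [i02 * r2, i02 * r2 + r2).
std::vector<int64_t> collect_indices(int64_t i02, int64_t r2) {
    std::vector<int64_t> out;
    if (r2 > 0) {
        out.reserve(static_cast<std::size_t>(r2));  // trip count is exactly r2
    }
    const std::size_t cap = out.capacity();

    // The loop header is unchanged from the original for-loop form.
    for (int64_t i12 = i02 * r2, e12 = i12 + r2; i12 < e12; i12++) {
        out.push_back(i12);
    }

    assert(out.capacity() == cap);  // the single reserve() covered every push_back
    return out;
}
```

For example, `collect_indices(2, 3)` yields the indices 6, 7 and 8 without any reallocation inside the loop.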

cebtenzzre · 2024-07-05T19:39:10Z

ggml-opencl.cpp

-                for (int64_t i12 = i02 * r2, e12 = i12 + r2; i12 < e12; i12++) {
+                int64_t i12 = i02 * r2;
+                int64_t e12 = i12 + r2;
+                events.reserve(e12 - i12);


For future reference: events is cleared at the end of this inner loop, so its actual maximum capacity is 3. Even ignoring the clear(), reserve() does not grow the vector by the specified amount, it increases the capacity to the specified amount—so you would need to reserve events.size() + e12 - i12 instead, if you were to even bother.

Luckily, this file is gone now, so this particular instance doesn't matter. But we should be more careful going forward.

GermanAizek · Jul 5, 2024

@cebtenzzre, good catch. The more reviewers there are, the lower the chance of making a mistake. You're right.
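To illustrate the `reserve()` semantics discussed in this thread: `std::vector::reserve(n)` requests a total capacity of at least `n`; it does not add `n` slots on top of the current size. The following is a minimal sketch, not code from llama.cpp; `reserve_additional` and `appends_without_realloc` are hypothetical helpers introduced only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: make room for `extra` more elements on top of what
// the vector already holds.
template <typename T>
void reserve_additional(std::vector<T> & v, std::size_t extra) {
    v.reserve(v.size() + extra);               // total capacity, not a delta
    assert(v.capacity() >= v.size() + extra);
}

// Returns true if appending `extra` elements after reserve_additional()
// does not move the underlying storage, i.e. no reallocation occurred.
bool appends_without_realloc(std::vector<int> & v, std::size_t extra) {
    reserve_additional(v, extra);
    const int * before = v.data();
    for (std::size_t i = 0; i < extra; i++) {
        v.push_back(static_cast<int>(i));
    }
    return v.data() == before;
}
```

Calling `events.reserve(2)` on a vector that already holds three elements changes nothing, whereas `reserve_additional(events, 4)` guarantees room for seven. Because `clear()` leaves the capacity in place, a vector that is cleared on every iteration never needs more capacity than its largest single batch.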

mofosyne added the "refactoring" and "Review Complexity : High" labels on May 14, 2024

ggerganov reviewed May 14, 2024


mofosyne marked this pull request as draft May 14, 2024 07:32

GermanAizek force-pushed the reserve-vec branch from f5aef46 to 4ee29e5 on May 20, 2024 02:25

ggml-opencl, llama: using reserve() if count already known

4ee29e5

GermanAizek marked this pull request as ready for review May 20, 2024 02:25

ggerganov merged commit 213e90e into ggml-org:master May 20, 2024
66 checks passed

cebtenzzre reviewed Jul 5, 2024
