
Allow multiple copy function pointers for CUDA graph kernel updates #7565


Merged

Conversation

@agray3 (Contributor) commented May 27, 2024

CUDA graphs require parameter updates to kernels associated with GGML_OP_CPY nodes. Previously, the implementation checked for only a single CUDA copy kernel in such nodes, which caused a bug when a graph contained two such kernels. This PR fixes the issue by storing the function pointers in a vector, so that any number of copy kernels can be checked against.

Fixes #7492
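
For readers unfamiliar with the mechanism, here is a minimal C++/CUDA-runtime sketch of the pattern described above. The names (`cpy_fn_ptrs`, `update_cpy_kernel_params`) and the kernel-argument layout are illustrative assumptions, not the actual llama.cpp code:

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Hypothetical: the set of copy-kernel function pointers recorded while
// building the graph. Storing a single pointer (the old approach) misses
// graphs that contain more than one distinct copy kernel.
static std::vector<void *> cpy_fn_ptrs;

// Patch the destination-pointer argument of every copy kernel in an
// instantiated graph, without re-capturing the graph. The argument index
// used below is an assumption for illustration.
static void update_cpy_kernel_params(cudaGraphExec_t graph_exec,
                                     cudaGraph_t graph,
                                     void ** new_dest_ptr) {
    size_t num_nodes = 0;
    cudaGraphGetNodes(graph, nullptr, &num_nodes);       // query node count
    std::vector<cudaGraphNode_t> nodes(num_nodes);
    cudaGraphGetNodes(graph, nodes.data(), &num_nodes);  // fetch the nodes

    for (cudaGraphNode_t node : nodes) {
        cudaGraphNodeType type;
        cudaGraphNodeGetType(node, &type);
        if (type != cudaGraphNodeTypeKernel) {
            continue;
        }

        cudaKernelNodeParams params;
        if (cudaGraphKernelNodeGetParams(node, &params) != cudaSuccess) {
            continue;  // params of some kernel nodes cannot be queried
        }

        // Check against *all* recorded copy-kernel pointers, not just one.
        if (std::find(cpy_fn_ptrs.begin(), cpy_fn_ptrs.end(),
                      params.func) == cpy_fn_ptrs.end()) {
            continue;
        }

        // Replace the destination argument and push the update into the
        // executable graph (parameter values are copied by this call).
        params.kernelParams[1] = new_dest_ptr;  // index 1: assumed dst arg
        cudaGraphExecKernelNodeSetParams(graph_exec, node, &params);
    }
}
```

During graph capture, each distinct copy-kernel pointer would be appended to `cpy_fn_ptrs` once; the vector then lets the update loop recognize every copy kernel in the graph instead of only the single one previously stored.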

Commit: Allow multiple copy function pointers for CUDA graph kernel updates

Fixes ggml-org#7492
@agray3 (Contributor, Author) commented May 27, 2024

@JohannesGaessler Can you check whether this works for #7527?


📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 529 iterations 🚀

Details (for performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8873.23ms p(95)=21807.9ms fails=, finish reason: stop=476 truncated=53
  • Prompt processing (pp): avg=105.19tk/s p(95)=468.51tk/s
  • Token generation (tg): avg=58.91tk/s p(95)=46.72tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=ag_allow_multiple_cuda_cpy_fn_ptrs commit=21826514dfac9237a32cad6d1f2312298800ebf9

[chart: llamacpp:prompt_tokens_seconds — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 529 iterations]
[chart: llamacpp:predicted_tokens_seconds — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 529 iterations]

[chart: llamacpp:kv_cache_usage_ratio — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 529 iterations]
[chart: llamacpp:requests_processing — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 529 iterations]

@JohannesGaessler (Collaborator) left a comment


I can confirm that this fixes the issue both on master and for my PR.

@JohannesGaessler JohannesGaessler merged commit 197c006 into ggml-org:master May 27, 2024
71 checks passed
Development

Successfully merging this pull request may close these issues.

CUDA graphs break quantized K cache
2 participants