cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (#10976) #12000

gcp · 2025-02-21T09:32:39Z

Using templates and reusing the dequant_qX_Y functions.
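
To illustrate the idea of one templated copy routine that reuses per-format dequantize functions, here is a minimal host-only C++ sketch. It is not the code from this PR. The struct names `BlockQ4_0` and `BlockQ4_1`, the `float` scales (ggml stores them as fp16), the helper name `copy_block_q_f32`, and the exact nibble layout are simplifying assumptions made for illustration.

```cpp
#include <cstdint>

// Values per quantized block (assumption: 32, as in ggml's 4-bit formats).
constexpr int QK = 32;

// Simplified stand-ins for the quantized blocks. Assumption: scales are
// plain floats here so the sketch compiles without any fp16 support.
struct BlockQ4_0 {
    float   d;           // scale
    uint8_t qs[QK / 2];  // two 4-bit quants per byte
};

struct BlockQ4_1 {
    float   d;           // scale
    float   m;           // minimum (offset added after scaling)
    uint8_t qs[QK / 2];  // two 4-bit quants per byte
};

// A dequantizer decodes byte `iqs` of a block into two floats:
// `lo` from the low nibble, `hi` from the high nibble.
using dequantize_fn = void (*)(const void * vx, int iqs, float & lo, float & hi);

void dequantize_q4_0(const void * vx, int iqs, float & lo, float & hi) {
    const BlockQ4_0 * b = static_cast<const BlockQ4_0 *>(vx);
    // Symmetric format: nibbles are stored with a bias of 8.
    lo = ((b->qs[iqs] & 0x0F) - 8) * b->d;
    hi = ((b->qs[iqs] >> 4)   - 8) * b->d;
}

void dequantize_q4_1(const void * vx, int iqs, float & lo, float & hi) {
    const BlockQ4_1 * b = static_cast<const BlockQ4_1 *>(vx);
    // Asymmetric format: unsigned nibble times scale, plus minimum.
    lo = (b->qs[iqs] & 0x0F) * b->d + b->m;
    hi = (b->qs[iqs] >> 4)   * b->d + b->m;
}

// Generic per-block copy (hypothetical helper): the dequantizer and the
// block size are template parameters, so each new format only needs its
// own dequantize function.
template <dequantize_fn dequant, int qk>
void copy_block_q_f32(const void * src, float * dst) {
    for (int j = 0; j < qk / 2; ++j) {
        float lo, hi;
        dequant(src, j, lo, hi);
        dst[j]          = lo;  // low nibbles fill the first half
        dst[j + qk / 2] = hi;  // high nibbles fill the second half
    }
}
```

With this shape, supporting another format amounts to instantiating the same template with a different dequantizer, for example `copy_block_q_f32<dequantize_q4_1, QK>(&block, out)`.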

JohannesGaessler · 2025-02-21T11:07:20Z

ggml/src/ggml-cuda/cpy.cu

+static void ggml_cpy_q5_1_f32_cuda(
+    const char * cx, char * cdst, const int ne,
+    const int ne00, const int ne01, const int ne02,
+    const int nb00, const int nb01, const int nb02,
+    const int nb03, const int ne10, const int ne11, const int ne12,
+    const int nb10, const int nb11, const int nb12, const int nb13,
+    cudaStream_t stream) {
+    const int num_blocks = ne;
+    cpy_q_f32<cpy_blck_q_f32<dequantize_q5_1, QK5_1>, QK5_1><<<num_blocks, 1, 0, stream>>>(
+        cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03,
+        ne10, ne11, ne12, nb10, nb11, nb12, nb13);
+}
+
+static void ggml_cpy_q5_0_f32_cuda(
+    const char * cx, char * cdst, const int ne,
+    const int ne00, const int ne01, const int ne02,
+    const int nb00, const int nb01, const int nb02,
+    const int nb03, const int ne10, const int ne11, const int ne12,
+    const int nb10, const int nb11, const int nb12, const int nb13,
+    cudaStream_t stream) {
+    const int num_blocks = ne;
+    cpy_q_f32<cpy_blck_q_f32<dequantize_q5_0, QK5_0>, QK5_0><<<num_blocks, 1, 0, stream>>>(
+        cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03,
+        ne10, ne11, ne12, nb10, nb11, nb12, nb13);
+}
+
+static void ggml_cpy_q4_1_f32_cuda(
+    const char * cx, char * cdst, const int ne,
+    const int ne00, const int ne01, const int ne02,
+    const int nb00, const int nb01, const int nb02,
+    const int nb03, const int ne10, const int ne11, const int ne12,
+    const int nb10, const int nb11, const int nb12, const int nb13,
+    cudaStream_t stream) {
+    const int num_blocks = ne;
+    cpy_q_f32<cpy_blck_q_f32<dequantize_q4_1, QK4_1>, QK4_1><<<num_blocks, 1, 0, stream>>>(
+        cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03,
+        ne10, ne11, ne12, nb10, nb11, nb12, nb13);
+}
+
+static void ggml_cpy_q4_0_f32_cuda(
+    const char * cx, char * cdst, const int ne,
+    const int ne00, const int ne01, const int ne02,
+    const int nb00, const int nb01, const int nb02,
+    const int nb03, const int ne10, const int ne11, const int ne12,
+    const int nb10, const int nb11, const int nb12, const int nb13,
+    cudaStream_t stream) {
+    const int num_blocks = ne;
+    cpy_q_f32<cpy_blck_q_f32<dequantize_q4_0, QK4_0>, QK4_0><<<num_blocks, 1, 0, stream>>>(
+        cx, cdst, ne, ne00, ne01, ne02, nb00, nb01, nb02, nb03,
+        ne10, ne11, ne12, nb10, nb11, nb12, nb13);
+}
+


Please keep the order of quants consistent. The order that I usually use for CUDA code is q4_0, q4_1, q5_0, q5_1, q8_0.

Some of the existing code doesn't respect this order. I think it's better not to clean up the existing code in the same patch, though, as it would just add noise for reviewing. That can be done in a follow-up if you want.

gcp · 2025-02-21T23:55:20Z

@JohannesGaessler Incorporated your comments and reordered the newly added functions as requested. The order in the file is still not fully consistent, but as said above, I think it's better to address that in a follow-up.

github-actions bot added the labels "Nvidia GPU" (Issues specific to Nvidia GPUs) and "ggml" (changes relating to the ggml tensor library for machine learning) on Feb 21, 2025

JohannesGaessler reviewed Feb 21, 2025

gcp force-pushed the cpy_cuda_quants branch from c86c16d to fa6aabc on February 21, 2025 23:14

cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (ggml-…

295573f

…org#10976)

gcp force-pushed the cpy_cuda_quants branch from fa6aabc to 295573f on February 21, 2025 23:52

JohannesGaessler approved these changes Feb 22, 2025

JohannesGaessler merged commit d709084 into ggml-org:master Feb 22, 2025
46 checks passed

orca-zhang pushed a commit to orca-zhang/llama.cpp that referenced this pull request Feb 26, 2025

cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (ggml-…

ffb67d4

…org#12000)

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025

cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (ggml-…

6438b72

…org#12000)

mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025

cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (ggml-…

1f66d38

…org#12000)

mostlyuseful pushed a commit to mostlyuseful/llama.cpp that referenced this pull request May 12, 2025

cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (ggml-…

cc178c8

…org#12000)
