SYCL: Introducing memory host pool #11251


Merged · 6 commits · Jan 19, 2025
Conversation

@s-Nick (Collaborator) commented on Jan 15, 2025:

This patch adds a host memory pool dedicated to matrix_info_t. This struct is used when calling gemm_batch_impl and currently forces a host_task synchronization in order to free its memory. After this patch that synchronization is no longer necessary: the pool guarantees that the memory stays usable long enough for the kernel launch, and it is freed at the end of the program.

The pool's design is deliberately naive and for now it is usable only with the type reported above; further improvements are possible but not necessary at the moment.

Full test pass on A100 and Intel PVC.
Benchmark results for NVIDIA:

FP32

Current

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 759.05 ± 4.70 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 89.10 ± 0.16 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | pp512 | 72.93 ± 0.31 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | tg128 | 18.17 ± 0.02 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 743.26 ± 1.85 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 91.47 ± 0.07 |

build: a29f087 (4473)

This patch

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 969.44 ± 7.14 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 89.98 ± 0.18 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | pp512 | 104.87 ± 0.21 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | tg128 | 18.27 ± 0.01 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 957.31 ± 2.29 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 92.51 ± 0.05 |

build: e878c29 (4476)

FP16

Current

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 5557.98 ± 18.23 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 88.98 ± 0.19 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | pp512 | 694.68 ± 1.90 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | tg128 | 18.15 ± 0.02 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 5375.49 ± 23.61 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 91.56 ± 0.06 |

build: a29f087 (4473)

This patch

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 5606.51 ± 79.11 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 87.32 ± 6.34 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | pp512 | 707.81 ± 3.43 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | tg128 | 18.35 ± 0.06 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 5404.64 ± 16.42 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 92.62 ± 0.07 |

build: e878c29 (4476)

Creating a new memory pool on the host to store memory location for
matrix_info needed to launch gemm_batch from oneMKL/oneMath.
Removing complex support in gemm_batch since it is not used in llama.cpp

Signed-off-by: nscipione <[email protected]>
@s-Nick requested review from Alcpz and Rbiessy on January 15, 2025 at 10:25
```cpp
{
    if (scaling_type == library_data_t::real_float &&
```
@s-Nick (Collaborator, author) commented:
The complex type doesn't seem to be used anywhere, nor in any other backend, so I removed it to simplify and clean things up.

```diff
@@ -3363,6 +3449,7 @@ static void ggml_sycl_mul_mat_batched_sycl(ggml_backend_sycl_context & ctx,

     ggml_sycl_pool_alloc<const void *> ptrs_src(ctx.pool(), 2*ne23);
     ggml_sycl_pool_alloc<      void *> ptrs_dst(ctx.pool(), 1*ne23);
+    ggml_sycl_pool_alloc<matrix_info_t<float>> matrix_info(ctx.host_pool(), 1);
```
@s-Nick (Collaborator, author) commented:
The type passed to matrix_info_t is float because it can be cast to the other types needed to call the oneMath functions. Unfortunately it has to be fixed here, given the current design and the include files required.

The github-actions bot added the labels **ggml** (changes relating to the ggml tensor library for machine learning) and **SYCL** (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on Jan 15, 2025.
@Rbiessy (Collaborator) left a comment:
LGTM overall!

@Alcpz (Collaborator) left a comment:
Hey, thanks for the contribution! I think the host_pool is a good idea, since the use of host_tasks can introduce performance overheads that are not easy to pinpoint.

I left a comment to discuss a bit the design. Let me know your thoughts there.

@qnixsynapse (Collaborator) left a comment:
`device(device_)` should be initialized after `qptr(qptr_)` in `ggml_sycl_pool_host`, I think.

```
/home/runner/work/llama.cpp/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:1193:9: warning: field 'qptr' will be initialized after field 'device' [-Wreorder-ctor]
 1193 |         qptr(qptr_),
      |         ^~~~~~~~~~~
      |         device(device_)
 1194 |         device(device_) {
      |         ~~~~~~~~~~~~~~~
      |         qptr(qptr_)
```

s-Nick commented Jan 15, 2025

Thank you for the comment @qnixsynapse, I addressed it in 6b77639.

@s-Nick marked this pull request as ready for review on January 15, 2025 at 16:55
@qnixsynapse (Collaborator) commented:

@s-Nick Sorry for being a bit fussy about compiler warnings. There is one tiny issue left, which can be fixed by adding a `return nullptr;` before the closing bracket of the `alloc` function; I have left a comment pointing it out.

So far everything else LGTM.


s-Nick commented Jan 16, 2025

@qnixsynapse don't worry, more eyes are always better; thank you for reviewing it. I addressed it in 963b685.

@qnixsynapse (Collaborator) commented:

@s-Nick Thanks. LGTM!

@airMeng merged commit 99487b5 into ggml-org:master on Jan 19, 2025
48 checks passed
anagri pushed a commit to BodhiSearch/llama.cpp that referenced this pull request Jan 26, 2025
* Implement host pool for matrix_info

Creating a new memory pool on the host to store memory location for
matrix_info needed to launch gemm_batch from oneMKL/oneMath.
Removing complex support in gemm_batch since it is not used in llama.cpp

* Remove unnecessary headers and cast

* Reorder member variable to avoid warning on initialization

* Formatting

* Remove unused variable

* Address PR review feedback - remove warning

---------

Signed-off-by: nscipione <[email protected]>
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Feb 13, 2025
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Feb 26, 2025
mglambda pushed a commit to mglambda/llama.cpp that referenced this pull request Mar 8, 2025
Labels: ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language)
6 participants