
SYCL : Move to compile time oneMKL interface backend selection for NVIDIA backend #10584


Merged

merged 3 commits into ggml-org:master on Dec 4, 2024

Conversation

s-Nick (Collaborator) commented Nov 29, 2024

This patch moves the oneMKL interface gemm calls from run-time to compile-time backend selection for the NVIDIA backend, bringing improvements especially in text generation.

Tested on A100:
Current

| model | size | params | backend | ngl | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | none | 0 | pp512 | 705.51 ± 2.23 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | none | 0 | tg128 | 14.28 ± 0.05 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 5426.17 ± 29.64 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 81.36 ± 1.29 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | pp512 | 5592.87 ± 89.22 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | none | 0 | tg128 | 72.96 ± 0.91 |

build: 0f77aae (20)

With changes

| model | size | params | backend | ngl | threads | sm | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | pp512 | 720.68 ± 1.62 |
| llama 70B Q4_K - Small | 37.57 GiB | 70.55 B | SYCL | 99 | 8 | none | 0 | tg128 | 18.52 ± 0.07 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 5489.17 ± 30.44 |
| llama 8B Q4_K - Medium | 4.58 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 91.99 ± 0.05 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | pp512 | 5439.13 ± 216.28 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | SYCL | 99 | 8 | none | 0 | tg128 | 89.34 ± 0.05 |

build: ffd0a99 (4222)

… NVIDIA backend

Move to compile-time backend selection to avoid latency at run time.
Apply it to all MKL gemm calls, and only for the NVIDIA backend.

Signed-off-by: nscipione <[email protected]>
@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning SYCL https://en.wikipedia.org/wiki/SYCL - GPU programming language labels Nov 29, 2024
s-Nick (Collaborator, Author) commented Nov 29, 2024

@Alcpz Could you check it out?

Alcpz (Collaborator) commented Nov 29, 2024

@Rbiessy Feel free to give it a look, since you have experience working with the oneMKL interface.

NeoZhangJianyu (Collaborator) left a comment

Same comment applies to the other updates.

```diff
@@ -1690,8 +1690,12 @@ namespace dpct
         auto data_b = get_memory<const Tb>(b);
         auto data_c = get_memory<Tc>(c);
         oneapi::mkl::blas::column_major::gemm(
             q, a_trans, b_trans, m, n, k, alpha_value, data_a, lda,
             data_b, ldb, beta_value, data_c, ldc);
 #ifdef GGML_SYCL_NVIDIA
```
Collaborator commented:
The macro makes the code hard to understand. I suggest:

```cpp
#ifdef GGML_SYCL_NVIDIA
        oneapi::mkl::blas::column_major::gemm(
                oneapi::mkl::backend_selector<oneapi::mkl::backend::cublas>{ q },
                a_trans, b_trans, m, n, k, alpha_value, data_a, lda,
                data_b, ldb, beta_value, data_c, ldc);
#else
        oneapi::mkl::blas::column_major::gemm(
                q, a_trans, b_trans, m, n, k, alpha_value, data_a, lda,
                data_b, ldb, beta_value, data_c, ldc);
#endif
```

Collaborator commented:

If we start adding support for Intel GPUs as well, I think it would make more sense to have a helper function that returns either a backend_selector or a queue, based on the backend.
It would avoid duplicating the call to gemm, which I think is a risk.

Collaborator commented:

Please remember, the SYCL backend was originally created to support Intel GPUs. :)
Support for more vendor GPUs was added later.
The default code path should be optimized for Intel GPUs.

It's OK to set special queue for other vendor GPUs.

Collaborator Author commented:

Code updated for readability in f6e6fc4

NeoZhangJianyu (Collaborator) commented Dec 2, 2024

@s-Nick
I guess the same method could help for Intel GPUs.
Is it possible to test it for Intel GPUs too, e.g. with oneapi::mkl::backend::mklgpu?

s-Nick (Collaborator, Author) commented Dec 2, 2024

Thank you for your review @NeoZhangJianyu.
Currently the Intel GPU implementation uses the closed-source oneMKL library directly; it neither has nor needs a backend_selector, so these changes aren't required or useful there.

Rbiessy (Collaborator) left a comment

The oneMKL Interface changes look good to me.

NeoZhangJianyu (Collaborator) commented:

> Thank you for your review @NeoZhangJianyu Currently Intel GPU implementation uses oneMKL closed source library directly and it doesn't have nor need a backend_selector, therefore these changes aren't required or useful

OK, I see!

Alcpz (Collaborator) left a comment

Changes LGTM. Let's wait for the remaining thread to be resolved before merging.

@NeoZhangJianyu NeoZhangJianyu merged commit 40c6d79 into ggml-org:master Dec 4, 2024
44 checks passed
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
…IDIA backend (ggml-org#10584)

* [SYCL] Move to Compile Time backend selection on oneMKL Interface for NVIDIA backend

Move to compile-time backend selection to avoid latency at run time.
Apply it to all MKL gemm calls, and only for the NVIDIA backend.

Signed-off-by: nscipione <[email protected]>

* Formatting

* Address PR comments to increase readability

---------

Signed-off-by: nscipione <[email protected]>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
…IDIA backend (ggml-org#10584) (same commit message as above)
4 participants