[SYCL] Revert "use the correct SYCL context for host USM allocations" #7858

Closed
AidanBeltonS wants to merge 1 commit into ggerganov/llama.cpp from revert-7777-host-usm-context-fix

Conversation

AidanBeltonS
Contributor

Reverts #7777. That PR broke llama-bench and main: when pinned memory is allocated during model creation, the backend has not yet been initialized, so g_sycl_gpu_mgr is not constructed with the relevant devices. This causes a segfault because no devices exist within the manager.

I think we should try to reintroduce #7777 in a more suitable way that addresses this issue.
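
For illustration, below is a minimal, hypothetical C++/SYCL sketch of the failure mode described above. The names ggml_sycl_host_malloc and g_sycl_gpu_mgr come from the real code and backtrace, but the structure shown here is simplified, and the guard is only an assumption about how the crash could be avoided, not the actual llama.cpp implementation.

```cpp
// Hypothetical sketch only: simplified from the failure mode described above,
// not the actual llama.cpp code.
#include <sycl/sycl.hpp>
#include <cstdio>
#include <vector>

// In the real backend, the device manager is populated by ggml_init_sycl;
// until then it holds no devices or queues.
struct sycl_gpu_mgr_sketch {
    std::vector<sycl::device> devices;
    std::vector<sycl::queue>  queues;
};
static sycl_gpu_mgr_sketch* g_mgr = nullptr; // stands in for g_sycl_gpu_mgr

// Pinned-memory (host USM) allocation with the kind of guard the crash
// suggests is missing: if the manager was never initialized there is no valid
// queue/context to allocate against, so bail out instead of segfaulting in
// sycl::malloc_host -> queue::get_context().
void* host_malloc_sketch(size_t size) {
    if (g_mgr == nullptr || g_mgr->queues.empty()) {
        std::fprintf(stderr, "SYCL backend not initialized; cannot allocate pinned memory\n");
        return nullptr;
    }
    return sycl::malloc_host(size, g_mgr->queues.front());
}
```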

@github-actions bot added the SYCL label (https://en.wikipedia.org/wiki/SYCL - GPU programming language) on Jun 10, 2024
@AidanBeltonS
Contributor Author

Ping @bashbaug, @joeatodd

@bashbaug
Contributor

Sorry about that, how can I reproduce this issue?

@abhilash1910
Collaborator

LGTM !

@OuadiElfarouki
Contributor

OuadiElfarouki commented Jun 11, 2024

> Sorry about that, how can I reproduce this issue?

We've encountered this on Nvidia GPUs for both llama-bench & main. Instructions for building the SYCL backend for Nvidia devices can be found here: https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md#nvidia-gpu

@abhilash1910
Collaborator

@AidanBeltonS could you rebase to fix CI? Thanks

@mofosyne added the Review Complexity : Low label (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. a UI fix) on Jun 12, 2024
@airMeng
Collaborator

airMeng commented Jun 12, 2024

> We've encountered this on Nvidia GPUs for both llama-bench & main. Instructions for building the SYCL backend for Nvidia devices can be found here: https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md#nvidia-gpu

I can't reproduce this on an Intel GPU. Could you take a deeper look into why the issue only occurs on NVIDIA GPUs? Maybe an issue for the Intel SYCL team would be more appropriate.

cc some SYCL mates @Nuullll

@AidanBeltonS force-pushed the revert-7777-host-usm-context-fix branch from 4e4ff76 to a9cae48 on June 12, 2024 15:08
@AidanBeltonS reopened this Jun 12, 2024
@AidanBeltonS
Contributor Author

> We've encountered this on Nvidia GPUs for both llama-bench & main. Instructions for building the SYCL backend for Nvidia devices can be found here: https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md#nvidia-gpu

> I can't reproduce this on an Intel GPU. Could you take a deeper look into why the issue only occurs on NVIDIA GPUs? Maybe an issue for the Intel SYCL team would be more appropriate.

> cc some SYCL mates @Nuullll

Currently working on a reproducer. It requires a model which uses pinned memory; it should not be a backend/hardware-specific problem.

@AidanBeltonS
Contributor Author

@airMeng the problem also affects Intel devices. I have reproduced the error on a Data Center GPU Max 1100.

To reproduce:
./bin/llama-bench -m ~/llama_models/Llama-2-7b-chat-Q4_K.gguf -ngl 77 --mmap 0

Backtrace:

| model                          |       size |     params | backend    | ngl | mmap |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ---: | ------------: | ---------------: |
[SYCL] call ggml_init_sycl
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: yes
found 4 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|         Intel Data Center GPU Max 1100|    1.3|    448|    1024|   32| 51539M|            1.3.29138|
| 1|     [opencl:gpu:0]|         Intel Data Center GPU Max 1100|    3.0|    448|    1024|   32| 48946M|       24.13.29138.29|
| 2|     [opencl:cpu:0]|                  Intel Xeon Gold 5418Y|    3.0|      2|    8192|   64|201419M|2024.17.3.0.08_160000|
| 3|     [opencl:acc:0]|            Intel FPGA Emulation Device|    1.2|      2|67108864|   64|201419M|2024.17.3.0.08_160000|
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:448
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory

Thread 1 "llama-bench" received signal SIGSEGV, Segmentation fault.
0x00007fffead4e644 in sycl::_V1::queue::get_context() const () from /opt/slurm/intel/oneapi/2024.1.0.596/compiler/2024.1/lib/libsycl.so.7
(gdb) bt
#0  0x00007fffead4e644 in sycl::_V1::queue::get_context() const () from /opt/slurm/intel/oneapi/2024.1.0.596/compiler/2024.1/lib/libsycl.so.7
#1  0x00007fffeacfd46e in sycl::_V1::malloc_host(unsigned long, sycl::_V1::queue const&, sycl::_V1::detail::code_location const&) ()
   from /opt/slurm/intel/oneapi/2024.1.0.596/compiler/2024.1/lib/libsycl.so.7
#2  0x000000000055587a in ggml_sycl_host_malloc(unsigned long) ()
#3  0x00000000005e7f42 in ggml_backend_sycl_host_buffer_type_alloc_buffer(ggml_backend_buffer_type*, unsigned long) ()
#4  0x00000000006eeafa in alloc_tensor_range ()
#5  0x00000000006eea40 in ggml_backend_alloc_ctx_tensors_from_buft ()
#6  0x00000000006669bf in llm_load_tensors(llama_model_loader&, llama_model&, int, llama_split_mode, int, float const*, bool, bool (*)(float, void*), void*) ()
#7  0x0000000000636eb2 in llama_load_model_from_file ()
#8  0x000000000043768d in main ()
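
To make the failing call concrete, here is a small standalone SYCL snippet (not taken from the PR) showing the dependency the backtrace points at: sycl::malloc_host takes a queue and internally pulls the context out of it, which is the queue::get_context() frame that crashes when the queue was never backed by a properly initialized device.

```cpp
// Standalone illustration (not llama.cpp code): host USM allocation is tied to
// the context behind the queue, so sycl::malloc_host needs a queue created
// from a real, initialized device.
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    sycl::queue q{sycl::default_selector_v};   // a valid queue with a live context
    void* pinned = sycl::malloc_host(1024, q); // the same call as frame #1 above
    std::printf("host USM allocated at %p on %s\n", pinned,
                q.get_device().get_info<sycl::info::device::name>().c_str());
    sycl::free(pinned, q);
    return 0;
}
```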

@bashbaug
Contributor

> the problem also affects Intel devices. I have reproduced the error on a Data Center GPU Max 1100.

Thanks, I can reproduce the error with these steps on an A750 also. Looking now...

@bashbaug
Contributor

I suspect this change will fix the problem: #7909.

To be clear: I'm fine merging this PR (to revert #7777) if needed to get things moving again, especially if it's going to take some time to review #7909 - thanks!

joeatodd added a commit that referenced this pull request Jun 13, 2024
@airMeng closed this Jun 17, 2024
Alcpz pushed a commit to Alcpz/llama.cpp that referenced this pull request Jun 20, 2024