
Commit ee55b94

update guide (ggml-org#8909)

Co-authored-by: Neo Zhang <>

1 parent 5693595 commit ee55b94

File tree

1 file changed: +107 −38 lines changed

docs/backend/SYCL.md

Lines changed: 107 additions & 38 deletions
@@ -85,15 +85,22 @@ For CI and performance test summary, please refer to [llama.cpp CI for SYCL Back
 
 ### Intel GPU
 
-**Verified devices**
+The SYCL backend supports the Intel GPU family:
+
+- Intel Data Center Max Series
+- Intel Flex Series, Arc Series
+- Intel built-in Arc GPU
+- Intel iGPU in Core CPUs (11th Generation Core and newer; refer to [oneAPI supported GPU](https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html#inpage-nav-1-1)).
+
+#### Verified devices
 
 | Intel GPU                     | Status  | Verified Model                        |
 |-------------------------------|---------|---------------------------------------|
 | Intel Data Center Max Series  | Support | Max 1550, 1100                        |
 | Intel Data Center Flex Series | Support | Flex 170                              |
 | Intel Arc Series              | Support | Arc 770, 730M, Arc A750               |
 | Intel built-in Arc GPU        | Support | built-in Arc GPU in Meteor Lake       |
-| Intel iGPU                    | Support | iGPU in i5-1250P, i7-1260P, i7-1165G7 |
+| Intel iGPU                    | Support | iGPU in i7-13700K, i5-1250P, i7-1260P, i7-1165G7 |
 
 *Notes:*
 
@@ -242,6 +249,13 @@ Similarly, user targeting Nvidia GPUs should expect at least one SYCL-CUDA devic
 ### II. Build llama.cpp
 
 #### Intel GPU
+
+```
+./examples/sycl/build.sh
+```
+
+or
+
 ```sh
 # Export relevant ENV variables
 source /opt/intel/oneapi/setvars.sh
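
Whichever build path is taken, a quick sanity check is possible once the build finishes. A minimal sketch, assuming the default `build` directory used throughout this guide:

```sh
# Load the oneAPI environment, then list the SYCL devices the freshly
# built backend can see; both commands appear elsewhere in this guide.
source /opt/intel/oneapi/setvars.sh
./build/bin/llama-ls-sycl-device
```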
@@ -281,24 +295,27 @@ cmake --build build --config Release -j -v
 
 ### III. Run the inference
 
-1. Retrieve and prepare model
+#### Retrieve and prepare model
 
 You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or simply download the [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as an example.
 
-2. Enable oneAPI running environment
+##### Check device
+
+1. Enable oneAPI running environment
 
 ```sh
 source /opt/intel/oneapi/setvars.sh
 ```
 
-3. List devices information
+2. List devices information
 
 Similar to the native `sycl-ls`, available SYCL devices can be queried as follows:
 
 ```sh
 ./build/bin/llama-ls-sycl-device
 ```
 An example of such a log on a system with 1 *Intel CPU* and 1 *Intel GPU* can look like the following:
+
 ```
 found 6 SYCL devices:
 Part1:
@@ -327,7 +344,33 @@ Part2:
 | compute capability 1.3 | Level-zero driver/runtime, recommended      |
 | compute capability 3.0 | OpenCL driver/runtime, slower than level-zero in most cases |
 
-4. Launch inference
+#### Choose level-zero devices
+
+|Chosen Device ID|Setting|
+|-|-|
+|0|`export ONEAPI_DEVICE_SELECTOR="level_zero:0"` or no action|
+|1|`export ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
+|0 & 1|`export ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"`|
+
+#### Execute
+
+Choose one of the following methods to run.
+
+1. Script
+
+- Use device 0:
+
+```sh
+./examples/sycl/run_llama2.sh 0
+```
+- Use multiple devices:
+
+```sh
+./examples/sycl/run_llama2.sh
+```
+
+2. Command line
+
+Launch inference:
 
 There are two device selection modes:
 
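As a worked example of the device-selection table above (a sketch assuming the device listing shows at least two level-zero GPUs):

```sh
# Restrict the SYCL runtime to level-zero device 1, then re-list the
# devices to confirm the selection before launching inference.
export ONEAPI_DEVICE_SELECTOR="level_zero:1"
./build/bin/llama-ls-sycl-device
```
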
@@ -346,24 +389,13 @@ Examples:
 ```sh
 ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm none -mg 0
 ```
-or run by script:
-
-```sh
-./examples/sycl/run_llama2.sh 0
-```
 
 - Use multiple devices:
 
 ```sh
 ZES_ENABLE_SYSMAN=1 ./build/bin/llama-cli -m models/llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -sm layer
 ```
 
-Otherwise, you can run the script:
-
-```sh
-./examples/sycl/run_llama2.sh
-```
-
 *Notes:*
 
 - Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follows:
@@ -410,7 +442,7 @@ c. Verify installation
 In the oneAPI command line, run the following to print the available SYCL devices:
 
 ```
-sycl-ls
+sycl-ls.exe
 ```
 
 There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is an example of such output, detecting an *Intel Iris Xe* GPU as a Level-zero SYCL device:
@@ -431,6 +463,18 @@ b. The new Visual Studio will install Ninja as default. (If not, please install
 
 ### II. Build llama.cpp
 
+You can download the release package for Windows directly; it includes the binary files and the oneAPI DLLs they depend on.
+
+Choose one of the following methods to build from source code.
+
+1. Script
+
+```sh
+.\examples\sycl\win-build-sycl.bat
+```
+
+2. CMake
+
 On the oneAPI command line window, step into the llama.cpp main directory and run the following:
 
 ```
@@ -445,12 +489,8 @@ cmake -B build -G "Ninja" -DGGML_SYCL=ON -DCMAKE_C_COMPILER=cl -DCMAKE_CXX_COMPI
 cmake --build build --config Release -j
 ```
 
-Otherwise, run the `win-build-sycl.bat` wrapper which encapsulates the former instructions:
-```sh
-.\examples\sycl\win-build-sycl.bat
-```
-
 Or, use CMake presets to build:
+
 ```sh
 cmake --preset x64-windows-sycl-release
 cmake --build build-x64-windows-sycl-release -j --target llama-cli
@@ -462,31 +502,35 @@ cmake --preset x64-windows-sycl-debug
 cmake --build build-x64-windows-sycl-debug -j --target llama-cli
 ```
 
-Or, you can use Visual Studio to open llama.cpp folder as a CMake project. Choose the sycl CMake presets (`x64-windows-sycl-release` or `x64-windows-sycl-debug`) before you compile the project.
+3. Visual Studio
+
+You can use Visual Studio to open the llama.cpp folder as a CMake project. Choose the SYCL CMake presets (`x64-windows-sycl-release` or `x64-windows-sycl-debug`) before you compile the project.
 
 *Notes:*
 
 - In case of a minimal experimental setup, the user can build the inference executable only through `cmake --build build --config Release -j --target llama-cli`.
 
 ### III. Run the inference
 
-1. Retrieve and prepare model
+#### Retrieve and prepare model
+
+You can refer to the general [*Prepare and Quantize*](README.md#prepare-and-quantize) guide for model preparation, or simply download the [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as an example.
 
-You can refer to the general [*Prepare and Quantize*](README#prepare-and-quantize) guide for model prepration, or simply download [llama-2-7b.Q4_0.gguf](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf) model as example.
+##### Check device
 
-2. Enable oneAPI running environment
+1. Enable oneAPI running environment
 
 On the oneAPI command line window, run the following and step into the llama.cpp directory:
 ```
 "C:\Program Files (x86)\Intel\oneAPI\setvars.bat" intel64
 ```
 
-3. List devices information
+2. List devices information
 
 Similar to the native `sycl-ls`, available SYCL devices can be queried as follows:
 
 ```
-build\bin\ls-sycl-device.exe
+build\bin\llama-ls-sycl-device.exe
 ```
 
 The output of this command on a system with 1 *Intel CPU* and 1 *Intel GPU* would look like the following:
@@ -518,8 +562,27 @@ Part2:
 | compute capability 1.3 | Level-zero running time, recommended        |
 | compute capability 3.0 | OpenCL running time, slower than level-zero in most cases |
 
+#### Choose level-zero devices
+
+|Chosen Device ID|Setting|
+|-|-|
+|0|`set ONEAPI_DEVICE_SELECTOR="level_zero:0"` or no action|
+|1|`set ONEAPI_DEVICE_SELECTOR="level_zero:1"`|
+|0 & 1|`set ONEAPI_DEVICE_SELECTOR="level_zero:0;level_zero:1"`|
+
+#### Execute
+
+Choose one of the following methods to run.
+
+1. Script
+
+```
+examples\sycl\win-run-llama2.bat
+```
+
+2. Command line
 
-4. Launch inference
+Launch inference:
 
 There are two device selection modes:
 
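Mirroring the Linux example earlier, a minimal sketch for the Windows device-selection table above (assuming at least two level-zero GPUs are listed, and quoting the value as the table does):

```
REM Restrict the SYCL runtime to level-zero device 1, then re-list the
REM devices to confirm the selection before launching inference.
set ONEAPI_DEVICE_SELECTOR="level_zero:1"
build\bin\llama-ls-sycl-device.exe
```
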
@@ -544,11 +607,7 @@ build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website ca
 ```
 build\bin\llama-cli.exe -m models\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 -sm layer
 ```
-Otherwise, run the following wrapper script:
 
-```
-.\examples\sycl\win-run-llama2.bat
-```
 
 Note:
 
@@ -562,17 +621,18 @@ Or
 use 1 SYCL GPUs: [0] with Max compute units:512
 ```
 
+
 ## Environment Variable
 
 #### Build
 
 | Name               | Value                              | Function                                    |
 |--------------------|------------------------------------|---------------------------------------------|
-| GGML_SYCL          | ON (mandatory)                     | Enable build with SYCL code path.           |
+| GGML_SYCL          | ON (mandatory)                     | Enable build with SYCL code path.<br>The FP32 path is recommended for better performance than FP16 on quantized models. |
 | GGML_SYCL_TARGET   | INTEL *(default)* \| NVIDIA        | Set the SYCL target device type.            |
 | GGML_SYCL_F16      | OFF *(default)* \| ON *(optional)* | Enable FP16 build with SYCL code path.      |
-| CMAKE_C_COMPILER   | icx                                | Set *icx* compiler for SYCL code path.      |
-| CMAKE_CXX_COMPILER | icpx *(Linux)*, icx *(Windows)*    | Set `icpx/icx` compiler for SYCL code path. |
+| CMAKE_C_COMPILER   | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path.   |
+| CMAKE_CXX_COMPILER | `icpx` *(Linux)*, `icx` *(Windows)*   | Set `icpx/icx` compiler for SYCL code path. |
 
 #### Runtime
 
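To illustrate the build table above, a hedged sketch of a Linux configure line that enables the optional FP16 path (all flags are the documented ones; the `build` directory layout is assumed from the build section):

```sh
# Configure the SYCL backend for the default Intel target with the
# optional FP16 path enabled, then build.
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DGGML_SYCL_F16=ON \
      -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j
```
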
@@ -634,9 +694,18 @@ The parameters about device choose of llama.cpp works with SYCL backend rule to
 ```
 Otherwise, please double-check the GPU driver installation steps.
 
+- Can I report an Ollama issue on Intel GPU to the llama.cpp SYCL backend?
+
+  No. We can't support Ollama issues directly, because we aren't familiar with Ollama.
+
+  Please reproduce the issue on llama.cpp and report a similar issue to llama.cpp; we will support it there.
+
+  The same applies to other projects that include the llama.cpp SYCL backend.
+
 ### **GitHub contribution**:
 Please add the **[SYCL]** prefix/tag in issues/PRs titles to help the SYCL-team check/address them without delay.
 
 ## TODO
 
-- Support row layer split for multiple card runs.
+- NA