
Commit f9a6723

NeoZhangJianyu authored and tybalex committed
fix memcpy() crash, add missed cmd in guide, fix softmax (ggml-org#6622)
* disable mmap to fix memcpy crash, add missed cmd in guide, fix softmax
* refactor to disable mmap for SYCL backend
* fix compile error in other OS
* refactor the solution, use host buf to fix it, instead of disable mmap
* keep to support mmap()
* use host buff to reduce malloc times
* revert to malloc/free solution, for thread safety
1 parent 4f22851 commit f9a6723
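
The commit message's final bullets describe the actual fix in ggml_backend_sycl_buffer_set_tensor: instead of handing a possibly mmap()-ed source pointer straight to the SYCL runtime, the data is first staged through a temporary malloc()-ed host buffer, allocated and freed per call to stay thread safe. Below is a minimal C++/SYCL sketch of that staging pattern, for illustration only; the helper name and queue handling here are assumptions, not the llama.cpp code itself.

```cpp
// Sketch of the staging-copy pattern applied by this commit (hypothetical helper,
// not the llama.cpp function itself). 'dst_device' is assumed to point into a
// device allocation owned by the queue 'q'; 'src' may point into an mmap()-ed
// model file, which is what triggered the memcpy() crash on some systems.
#include <sycl/sycl.hpp>
#include <cstdlib>
#include <cstring>

static void set_tensor_staged(sycl::queue &q, void *dst_device,
                              const void *src, size_t size) {
    // Copy out of the mmap()-ed region into ordinary host memory first.
    char *host_buf = static_cast<char *>(malloc(size));
    memcpy(host_buf, src, size);
    // The device copy now reads from plain malloc()-ed memory.
    q.memcpy(dst_device, host_buf, size).wait();
    // A per-call malloc/free (rather than a shared cached buffer) keeps this thread safe.
    free(host_buf);
}
```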

File tree

4 files changed: +32 -48 lines changed

README-sycl.md

Lines changed: 16 additions & 42 deletions
@@ -68,7 +68,7 @@ It has the similar design of other llama.cpp BLAS-based paths such as *OpenBLAS,
 
 | Intel GPU | Status | Verified Model |
 |-------------------------------|---------|---------------------------------------|
-| Intel Data Center Max Series | Support | Max 1550 |
+| Intel Data Center Max Series | Support | Max 1550, 1100 |
 | Intel Data Center Flex Series | Support | Flex 170 |
 | Intel Arc Series | Support | Arc 770, 730M |
 | Intel built-in Arc GPU | Support | built-in Arc GPU in Meteor Lake |
@@ -84,8 +84,7 @@ It has the similar design of other llama.cpp BLAS-based paths such as *OpenBLAS,
 - **Execution Unit (EU)**
   - If the iGPU has less than 80 EUs, the inference speed will likely be too slow for practical use.
 
-### Nvidia GPU
-The BLAS acceleration on Nvidia GPU through oneAPI can be obtained using the Nvidia plugins for oneAPI and the cuBLAS backend of the upstream oneMKL library. Details and instructions on how to setup the runtime and library can be found in [this section](#i-setup-environment)
+### Other Vendor GPU
 
 **Verified devices**
 
@@ -94,14 +93,9 @@ The BLAS acceleration on Nvidia GPU through oneAPI can be obtained using the Nvi
 | Ampere Series | Support | A100, A4000 |
 | Ampere Series *(Mobile)* | Support | RTX 40 Series |
 
-*Notes:*
-  - Support for Nvidia targets through oneAPI is currently limited to Linux platforms.
-
-  - Please make sure the native oneAPI MKL *(dedicated to intel CPUs and GPUs)* is not "visible" at this stage to properly setup and use the built-from-source oneMKL with cuBLAS backend in llama.cpp for Nvidia GPUs.
-
-
 ## Docker
 The docker build option is currently limited to *intel GPU* targets.
+
 ### Build image
 ```sh
 # Using FP16
@@ -168,29 +162,10 @@ Platform #0: Intel(R) OpenCL HD Graphics
 - **Nvidia GPU**
 
   In order to target Nvidia GPUs through SYCL, please make sure the CUDA/CUBLAS native requirements *-found [here](README.md#cuda)-* are installed.
-  Installation can be verified by running the following:
-  ```sh
-  nvidia-smi
-  ```
-  Please make sure at least one CUDA device is available, which can be displayed like this *(here an A100-40GB Nvidia GPU)*:
-  ```
-  +---------------------------------------------------------------------------------------+
-  | NVIDIA-SMI 535.54.03 Driver Version: 535.54.03 CUDA Version: 12.2 |
-  |-----------------------------------------+----------------------+----------------------+
-  | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
-  | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
-  | | | MIG M. |
-  |=========================================+======================+======================|
-  | 0 NVIDIA A100-PCIE-40GB On | 00000000:8D:00.0 Off | 0 |
-  | N/A 36C P0 57W / 250W | 4MiB / 40960MiB | 0% Default |
-  | | | Disabled |
-  +-----------------------------------------+----------------------+----------------------+
-  ```
-
 
 2. **Install Intel® oneAPI Base toolkit**
 
-  - **Base installation**
+  - **For Intel GPU**
 
   The base toolkit can be obtained from the official [Intel® oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) page.
 
@@ -202,10 +177,10 @@ Upon a successful installation, SYCL is enabled for the available intel devices,
 
 - **Adding support to Nvidia GPUs**
 
-  **oneAPI**: In order to enable SYCL support on Nvidia GPUs, please install the [Codeplay oneAPI Plugin for Nvidia GPUs](https://developer.codeplay.com/products/oneapi/nvidia/download). User should also make sure the plugin version matches the installed base toolkit one *(previous step)* for a seamless "oneAPI on Nvidia GPU" setup.
+  **oneAPI Plugin**: In order to enable SYCL support on Nvidia GPUs, please install the [Codeplay oneAPI Plugin for Nvidia GPUs](https://developer.codeplay.com/products/oneapi/nvidia/download). User should also make sure the plugin version matches the installed base toolkit one *(previous step)* for a seamless "oneAPI on Nvidia GPU" setup.
 
 
-  **oneMKL**: The current oneMKL releases *(shipped with the oneAPI base-toolkit)* do not contain the cuBLAS backend. A build from source of the upstream [oneMKL](https://github.com/oneapi-src/oneMKL) with the *cuBLAS* backend enabled is thus required to run it on Nvidia GPUs.
+  **oneMKL for cuBlas**: The current oneMKL releases *(shipped with the oneAPI base-toolkit)* do not contain the cuBLAS backend. A build from source of the upstream [oneMKL](https://github.com/oneapi-src/oneMKL) with the *cuBLAS* backend enabled is thus required to run it on Nvidia GPUs.
 
 ```sh
 git clone https://github.com/oneapi-src/oneMKL
@@ -237,7 +212,7 @@ When targeting an intel GPU, the user should expect one or more level-zero devic
 
 - **Nvidia GPU**
 
-  Similarly, user targetting Nvidia GPUs should expect at least one SYCL-CUDA device [`ext_oneapi_cuda:gpu`] as bellow:
+  Similarly, user targeting Nvidia GPUs should expect at least one SYCL-CUDA device [`ext_oneapi_cuda:gpu`] as bellow:
 ```
 [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
 [opencl:cpu:1] Intel(R) OpenCL, Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
@@ -260,6 +235,9 @@ cmake --build .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icp
 
 # Option 2: Use FP32 by default
 cmake --build .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
+
+#build all binary
+cmake --build . --config Release -j -v
 ```
 
 #### Nvidia GPU
@@ -278,6 +256,10 @@ cmake --build .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=i
 
 # Option 2: Use FP32 by default
 cmake --build .. -DLLAMA_SYCL=ON -DLLAMA_SYCL_TARGET=NVIDIA -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
+
+#build all binary
+cmake --build . --config Release -j -v
+
 ```
 
 ### III. Run the inference
@@ -357,7 +339,6 @@ Otherwise, you can run the script:
 
 *Notes:*
 
-- By default, `mmap` is used to read the model file. In some cases, it causes runtime hang issues. Please disable it by passing `--no-mmap` to the `/bin/main` if faced with the issue.
 - Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
 
 ```sh
@@ -438,7 +419,7 @@ cd build
 
 cmake -G "MinGW Makefiles" .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icx -DCMAKE_BUILD_TYPE=Release -DLLAMA_SYCL_F16=ON
 
-make
+make -j
 ```
 
 Otherwise, run the `win-build-sycl.bat` wrapper which encapsulates the former instructions:
@@ -525,7 +506,6 @@ Otherwise, run the following wrapper script:
 
 Note:
 
-- By default, `mmap` is used to read the model file. In some cases, it causes runtime hang issues. Please disable it by passing `--no-mmap` to the `main.exe` if faced with the issue.
 - Upon execution, verify the selected device(s) ID(s) in the output log, which can for instance be displayed as follow:
 
 ```sh
@@ -557,12 +537,6 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 
 ## Known Issues
 
-- Hanging during startup
-
-  llama.cpp uses *mmap* as the default mode for reading the model file and copying it to the GPU. In some systems, `memcpy` might behave abnormally and therefore hang.
-
-  - **Solution**: add `--no-mmap` or `--mmap 0` flag to the `main` executable.
-
 - `Split-mode:[row]` is not supported.
 
 ## Q&A
@@ -574,7 +548,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 
 - General compiler error:
 
-  - Remove build folder or try a clean-build.
+  - Remove **build** folder or try a clean-build.
 
 - I can **not** see `[ext_oneapi_level_zero:gpu]` afer installing the GPU driver on Linux.
 
examples/sycl/build.sh

Lines changed: 1 addition & 1 deletion
@@ -20,4 +20,4 @@ cmake .. -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
 #cmake --build . --config Release --target llama-bench
 
 #build all binary
-cmake --build . --config Release -v
+cmake --build . --config Release -j -v

examples/sycl/run-llama2.sh

Lines changed: 1 addition & 0 deletions
@@ -12,6 +12,7 @@ if [ $# -gt 0 ]; then
     GGML_SYCL_SINGLE_GPU=1
 else
     GGML_SYCL_DEVICE=0
+    GGML_SYCL_SINGLE_GPU=0
 fi
 
 #export GGML_SYCL_DEBUG=1

ggml-sycl.cpp

Lines changed: 14 additions & 5 deletions
@@ -3154,7 +3154,6 @@ typedef float (*vec_dot_q_mul_mat_sycl_t)(
 #define SYCL_SCALE_BLOCK_SIZE 256
 #define SYCL_CLAMP_BLOCK_SIZE 256
 #define SYCL_ROPE_BLOCK_SIZE 256
-#define SYCL_SOFT_MAX_BLOCK_SIZE 1024
 #define SYCL_ALIBI_BLOCK_SIZE 32
 #define SYCL_DIAG_MASK_INF_BLOCK_SIZE 32
 #define SYCL_QUANTIZE_BLOCK_SIZE 256
@@ -13080,11 +13079,13 @@ static void soft_max_f32_sycl(const float * x, const float * mask, const float *
                               const int nrows_y, const float scale, const float max_bias,
                               dpct::queue_ptr stream) {
     int nth = WARP_SIZE;
-    while (nth < ncols_x && nth < SYCL_SOFT_MAX_BLOCK_SIZE) nth *= 2;
+    int max_block_size = g_work_group_size;
+    while (nth < ncols_x && nth < max_block_size) nth *= 2;
+    if (nth>max_block_size) nth = max_block_size;
+
     const sycl::range<3> block_dims(1, 1, nth);
     const sycl::range<3> block_nums(1, 1, nrows_x);
     const size_t n_local_scratch = (GGML_PAD(ncols_x, WARP_SIZE) + WARP_SIZE);
-    static_assert(SYCL_SOFT_MAX_BLOCK_SIZE == 1024, "These values need to be adjusted.");
 
     const uint32_t n_head_kv = nrows_x/nrows_y;
     const uint32_t n_head_log2 = 1u << (uint32_t) floorf(log2f((float) n_head_kv));
@@ -13094,6 +13095,12 @@ static void soft_max_f32_sycl(const float * x, const float * mask, const float *
 
     const size_t local_mem_size = stream->get_device().get_info<sycl::info::device::local_mem_size>();
     if (n_local_scratch*sizeof(float) < local_mem_size) {
+        if (ncols_x > max_block_size) {
+            soft_max_f32_submitter<true, 0, 0>(x, mask, pos, dst, ncols_x, nrows_y, scale,
+                                               max_bias, m0, m1, n_head_log2, block_nums,
+                                               block_dims, n_local_scratch, stream);
+            return;
+        }
         switch (ncols_x) {
             case 32:
                 soft_max_f32_submitter<true, 32, 32>(x, mask, pos, dst, ncols_x, nrows_y, scale,
@@ -16814,11 +16821,13 @@ static void ggml_backend_sycl_buffer_set_tensor(ggml_backend_buffer_t buffer,
     const dpct::queue_ptr stream = g_syclStreams[ctx->device][0];
     SYCL_CHECK(
         CHECK_TRY_ERROR(dpct::dev_mgr::instance().get_device(ctx->device).queues_wait_and_throw()));
-
+    char* host_buf = (char*)malloc(size);
+    memcpy(host_buf, data, size);
     SYCL_CHECK(
         CHECK_TRY_ERROR((*stream)
-                            .memcpy((char *)tensor->data + offset, data, size)
+                            .memcpy((char *)tensor->data + offset, host_buf, size)
                             .wait()));
+    free(host_buf);
 }
 catch (sycl::exception const &exc) {
     std::cerr << exc.what() << "Exception caught at file:" << __FILE__
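
The softmax hunks above replace the hard-coded SYCL_SOFT_MAX_BLOCK_SIZE (1024) with the runtime value g_work_group_size and fall back to the generic soft_max_f32_submitter<true, 0, 0> when a row is wider than that limit. Below is a small standalone sketch of how that block-size clamp behaves; it assumes g_work_group_size corresponds to the device's maximum work-group size, which is not stated in this diff.

```cpp
// Standalone sketch of the block-size selection used by the softmax fix.
// Assumption: g_work_group_size corresponds to the device's max work-group size.
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    sycl::queue q;  // default device
    const size_t max_wg =
        q.get_device().get_info<sycl::info::device::max_work_group_size>();

    const int ncols_x = 4096;  // example row width
    int nth = 32;              // WARP_SIZE in ggml-sycl.cpp
    // Same clamping logic as the patched soft_max_f32_sycl():
    while (nth < ncols_x && (size_t)nth < max_wg) nth *= 2;
    if ((size_t)nth > max_wg) nth = (int)max_wg;

    // Rows wider than max_wg would take the generic <true, 0, 0> submitter path.
    printf("max work-group size: %zu, chosen softmax block size: %d\n", max_wg, nth);
    return 0;
}
```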
