
Commit d1c6664

correct fix terms CBLAS, GPU; update README.md; remove assets

1 parent 9554527 commit d1c6664

30 files changed: +141 −2052 lines changed

examples/mulmat-tune/.gitignore (2 additions, 0 deletions)

@@ -0,0 +1,2 @@
+analyze/
+bench-out/

examples/mulmat-tune/README.md (93 additions, 82 deletions)
@@ -3,7 +3,7 @@
 To readers: this is a lengthy introduction to this big PR. I recommend you review it in
 the following steps:

-1. read this file first and browse the files in dirs bench-out and analyze.
+1. read this file first.
 2. pull and checkout the code, build and evaluate `mulmat-tune`.
 3. optionally run `tests/test-mulmat-tune`.
 4. place file `mulmat-tune.txt` (generated from `mulmat-tune`) into the parent
@@ -15,14 +15,14 @@ the following steps:
 GGML defines three task types (stages): INIT, COMPUTE, FINALIZE. All nodes have
 a COMPUTE stage, some have an INIT stage, and FINALIZE is never used.

-GGML supports pure CPU, CL, CUDA and optional CBLAS (Accelerate/OpenBLAS/BLIS).
+GGML supports pure CPU, CUDA, CL and BLAS (Accelerate/OpenBLAS/BLIS).
 In recent days the compute framework has kept evolving and introduced a lot of new
 things: backend, vendor. In this CL, I follow the `backend` definition, and
-defined new backend `CBLAS` for the Accelerate etc.
+defined the new backend `BLAS` for Accelerate etc.

 Generally speaking, as of my tests, it's very slow to run `cblas_sgemm` with multiple
 OS threads. From the master code I saw that `ggml_graph_compute` sets `node->n_tasks = 1`
-to when compute `mul_mat` in `GPU` or `CBLAS`.
+when computing `mul_mat` with `GPU` or `BLAS`.

 To speed up large prompts and avoid spinning, master sets `n_threads` to 1 when
 token size >= 32. `ggml_compute_forward_mul_mat_use_blas` checks:
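For readers new to the stage terminology above: this is roughly how ggml modeled the stages internally around this time (a reference sketch based on ggml.c, not part of this diff):

```c
// the three task stages as modeled internally by ggml (reference sketch)
enum ggml_task_type {
    GGML_TASK_INIT = 0,  // optional preparation, e.g. de-quantization
    GGML_TASK_COMPUTE,   // every node has this stage
    GGML_TASK_FINALIZE,  // defined but never used, as noted above
};
```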
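The hunk cuts off at the check itself; for context, the master-branch gate behaved roughly like this (a paraphrase assuming the `ggml_tensor` / `ggml_is_contiguous` API from ggml.h, not the verbatim function body):

```c
// rough paraphrase of ggml_compute_forward_mul_mat_use_blas in master;
// the real function may differ in detail (assumes ggml.h is included)
static bool mul_mat_use_blas(const struct ggml_tensor * src0,
                             const struct ggml_tensor * src1,
                             const struct ggml_tensor * dst) {
    // only large, contiguous matrices justify the single-threaded BLAS call;
    // 32 matches the token-size threshold mentioned above
    return ggml_is_contiguous(src0) &&
           ggml_is_contiguous(src1) &&
           dst->ne[0] >= 32 && dst->ne[1] >= 32 && src1->ne[0] >= 32;
}
```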
@@ -36,12 +36,12 @@ two simple pull requests because they are not solutions but noises. In the secon
 pull request, @gerganov hinted me at the `1-thread blas` problem, so I have followed this
 direction since then. My benchmarks on 7B/13B and all Qx_x show that: for all
 N/K (e.g., 4096x4096), Accelerate always runs faster than pure CPU at `M = 32`;
-the difference may vary up to `30%` as of 1 thread. I will show you image for this.
+the difference may vary up to `30%` as of 1 thread. <del>I will show you image for this.</del>

 When I observed that the de-quantization takes about half of the total time, I thought
 it was a good chance to start from here. At first I implemented the new threading
 framework that supports `wait/notify`, subtle and fragile to deadlock. I'm happy
-that it works. I had tried to bench online by comparing CPU/GPU time, finally I
+that it works. I had tried to bench online by comparing the time with and without BLAS; finally I
 replaced that with offline bench. To explicitly control details (how to parallelize,
 when to wait, how to select the best executing plan), I had to define task configs,
 task profiles. Finally, after over a month of busy coding, I got the demo solution.
@@ -77,18 +77,18 @@ In current `master` branch, the `mul_mat` codes run in several implicit profiles

 - pure cpu: INIT: very fast; COMPUTE: the computation time is proportional to M.
 - CUDA/CL: COMPUTE: de-quantization and mul_mat in GPU.
-- Accelerate/OpenBLAS/BLIS: COMPUTE: de-quantization in CPU, mul_mat in GPU.
+- CPU with BLAS (Accelerate/OpenBLAS/BLIS): COMPUTE: mul_mat with BLAS.

 I observed the following "facts" on Accelerate/OpenBLAS.

 - Whatever the M is, given N and K, the de-quantization time is constant (in theory).
-- The mul_mat time in GPU is heavy (tens to hundreds ms), goes up very slow when
+- The mul_mat time with BLAS is heavy (tens to hundreds of ms) and goes up very slowly when
 M doubles.
 - In the large M range, the de-quantization time accounts for a large proportion
 of the total calculation time. For example, for 7B, Q4_0, NxK=4096x4096, the
 proportion of de-quantization time exceeds or is near `50%` for the large M range (up
-to 128). Other NxK combinations have similar situation. You may look at dirs
-[bench-out](./bench-out/) and [analyze](./analyze/) for more examples.
+to 128). Other NxK combinations show a similar situation. <del>You may look at dirs
+[bench-out](./bench-out/) and [analyze](./analyze/) for more examples.</del>

 In theory, if we split the COMPUTE stage as INIT + COMPUTE, we MAY speed up prompt
 eval time a lot: up to 50% for the large M range.
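To see where the `up to 50%` bound comes from (a sketch in my notation, not the PR's): let $t_i$ be the de-quantization (INIT) time, $t_c$ the BLAS compute time, and $p$ the thread count. Splitting lets INIT parallelize while COMPUTE stays single-threaded:

$$
t_{\text{split}}(p) = \frac{t_i}{p} + t_c,
\qquad
\text{saving} = 1 - \frac{t_i/p + t_c}{t_i + t_c}
\;\longrightarrow\; \frac{t_i}{t_i + t_c} \approx 50\% \quad (p \to \infty,\ t_i \approx t_c).
$$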
@@ -103,40 +103,31 @@ selecting. The latter, although secondary, is necessary for time estimation.
 In conclusion, let me list the key points of how it works.

 1. Explicit task configs and profiles:
-   * define conf profiles (for example, init in CPU, compute in GPU);
+   * define conf profiles for controlling which part of the code to run,
+     for example, run the COMPUTE stage with or without BLAS.
    * define for any stage: compute in parallel or not, idle wait or not.
    * non-existing compute stages are not called.
 2. New threading framework: combine `spin` + `wait/notify`. Without wait, workers
    busy spinning may cause overheating and slow down the overall speed. The mul_mat
    compute time is long enough (often tens of ms), so the wait/notify overhead
    (at most tens of us) is OK.
-3. Update mul_mat BLAS codes to support the new task profile.
+3. Update mul_mat codes to support the new task profiles.
 4. A tune tool for benching. With bench data, given N/K and n_threads, we could
    estimate total computing time for any M (even if out of bench range), and thus
    select the fastest profile.
 5. On llama start, it loads the bench data from file (if it exists). Before computing a
-   node, we select the fastest profile. When compute, the `dst->task_conf` along
-   with `params` controls which part of the codes to run.
+   node, we select the fastest profile. When computing, both `dst->task_conf` and
+   `params` control which part of the code to run.

-Further more, if the de-quantization time in CUDA/CL COULD NOT compete multiple
-CPU threads on some devices, we can add profiles for them (just like Accelerate)
-to run de-quantization in CPU and run mul_mat in GPU.
-
-With explicit task config profiles and bench data, we are able to run any task
-stage in any backend. For example: for q4_0, we could run INIT in CUDA and COMPUTE
-in Accelerate -- if that makes sense.
-
-Too much changes to explain. Not enough time to write them in details at present
-when codes are still unstable, so just list changes in ggml.h here.

 ```c
 // ggml.h

 enum ggml_backend {
-    GGML_BACKEND_CPU = 0,
+    GGML_BACKEND = 0,
     GGML_BACKEND_CUDA = 1,
     GGML_BACKEND_CL = 2,
-    GGML_BACKEND_CBLAS = 3, // has API `cblas_sgemm`
+    GGML_BACKEND_BLAS = 3, // has API `cblas_sgemm`
 };

 struct ggml_tensor {
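Point 2's `spin` + `wait/notify` combination can be pictured with a minimal pthread sketch (illustrative only; the PR's actual threading code and names are not shown in this diff):

```c
// minimal sketch of "spin first, then wait/notify"; not the PR's code
#include <pthread.h>
#include <stdatomic.h>

struct worker_sync {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_int      n_ready; // tasks published by the main thread
};

// worker side: spin briefly (cheap for short gaps), then block on the
// condition variable so a long wait does not burn a core
static void wait_for_task(struct worker_sync * s) {
    for (int i = 0; i < 1000; i++) { // short spin phase
        if (atomic_load(&s->n_ready) > 0) return;
    }
    pthread_mutex_lock(&s->mutex);
    while (atomic_load(&s->n_ready) == 0) {
        pthread_cond_wait(&s->cond, &s->mutex); // sleep until notified
    }
    pthread_mutex_unlock(&s->mutex);
}

// main-thread side: publish a task and wake any sleeping workers
static void notify_task(struct worker_sync * s) {
    pthread_mutex_lock(&s->mutex);
    atomic_fetch_add(&s->n_ready, 1);
    pthread_cond_broadcast(&s->cond);
    pthread_mutex_unlock(&s->mutex);
}
```

The trade-off is exactly the one stated in point 2: the wait/notify overhead (tens of us) is negligible against mul_mat stages that run for tens of ms.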
@@ -152,38 +143,8 @@ void ggml_internal_compute_forward_mul_mat(
         const struct ggml_tensor * src0,
         const struct ggml_tensor * src1,
         struct ggml_tensor * dst);
-```
-
-## Misc Assets
-
-The `prompt.sh` is a tool for bench `main`, it can generates questions in various
-length. Run `./examples/mulmat-tune/prompt.sh -h` for help. I had run it with
-`./examples/mulmat-tune/prompt.sh -b -f./examples/mulmat-tune/prompt.sh -b -f`.
-
-The [bench-out dir](./bench-out) contains various bench result files generated
-on my device.
-
-The [analyze dir](./analyze/) contains various analysis files generated with
-`./mulmat-tune analyze <bench-file>`. I strongly recommend you have a look at them.
-
-Let me introduce them with [the image](./analyze/4096x4096_q4_0.png) which contains bench analysis for 7B/Q4_0/4096x4096.
-
-**Firstly**, I defined three task profiles:
-
-- #0: pure cpu. INIT in CPU with 1 thread, COMPUTE in GPU parallel.
-- #1: the `q_f32` CBLAS implementation in master (when defined either
-`GGML_USE_ACCELERATE` or `GGML_USE_OPENBLAS`)
-- #2: splt `#1` into `INIT` and `COMPUTE`. Where INIT in CPU parallel, COMPUTE
-in with Accelerate with 1 thread.
-
-The `#0_0` is read as "profile #0, stage 0 (INIT)", the `#0_1` is read as
-"profile #0 stage 1(COMPUTE)". `#2` is profile 2. `nth=x` is read as `run with `x`
-thread(s)`. With 1 thread, the overall time of `profile #1` is almost equal to
-the that of `profile #2`, so I did not draw `profile #1`.
-
-**Secondly**, I defined several `shape`s for attention, feed-forward and RoPE.
-
-```c
 // examples/mulmat-tune/mulmat-tune.h

 struct ggml_task_stage {
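The listing is truncated by the hunk; judging from the `stage_conf: backend parallel wait` grammar later in this README, the struct plausibly carries fields like these (an inference, not the PR's verbatim definition):

```c
// plausible completion of the truncated struct, inferred from the
// "stage_conf: backend parallel wait" file-format grammar below
#include <stdbool.h>

struct ggml_task_stage {
    enum ggml_backend backend;  // -1 UNKNOWN | 0 CPU | 1 CUDA | 2 CL | 3 BLAS
    bool              parallel; // may this stage run with N threads?
    bool              wait;     // idle-wait (wait/notify) instead of spinning
};
```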
@@ -232,22 +193,71 @@ Analyze:
 ./mulmat-tune analyze 7b.q4_0.txt
 ```

-Let's come back to the [the image](./analyze/4096x4096_q4_0.png). There is a table
-and five pictures in the image:
-
-- The table contains analysis data block that was copied from output of `./mulmat-tune analyze`.
-- The top right picture shows: with 1 thread, cpu INIT, and BLAS compute time and
-total time. From this, given n_threads, we can estimate total time.
-- The top second picture is used to compare the overall time between profile #0
-(`pure CPU`) with profile #2. The pure CPU INIT is very fast (so can be totally omitted),
-but the COMPUTE is heavy and scales almost linear with M. The BLAS compute grows
-slowly when M doubles. From this picture, we can see the location of intersection
-point between both lines: the M is less than 32 -- this is true for all shapes
-(with N/K >= 4096) on my device.
-- The last three pictures at bottom are estimated time for nth=2/4/8. We could
-see that the intersection point (M) grows with `n_threads`. Suppose given M/N/K
-src0_type and src1_type, we could find corresponding shape. With this shape,
-we could estimate overall time for every profile and choose the fastest profile.
+Example bench analyze output looks as follows; it contains 6 shapes (blocks):
+
+```
+N=4096,K=4096
+
+#M,1,2,4,8,16,32,64,128,256,512
+#0_0_nth=1, 0.002, 0.003, 0.004, 0.009, 0.018, 0.036, 0.072, 0.151, 0.344, 0.719
+#0_1_nth=1, 1.268, 2.172, 3.371, 6.502, 13.068, 25.508, 52.853, 107.543, 213.692, 427.260
+#0___nth=1, 1.270, 2.175, 3.375, 6.511, 13.086, 25.544, 52.925, 107.694, 214.036, 427.979
+#1_1_nth=1, 17.509, 18.774, 15.617, 16.059, 17.877, 16.456, 18.331, 21.935, 34.317, 63.208
+#1___nth=1, 17.509, 18.774, 15.617, 16.059, 17.877, 16.456, 18.331, 21.935, 34.317, 63.208
+#2_0_nth=1, 12.349, 13.309, 11.130, 10.999, 11.231, 10.987, 10.742, 11.003, 11.106, 10.851
+#2_1_nth=1, 2.646, 5.259, 4.252, 4.542, 5.857, 6.642, 7.239, 11.009, 23.582, 52.081
+#2___nth=1, 14.995, 18.568, 15.382, 15.541, 17.088, 17.629, 17.981, 22.012, 34.688, 62.932
+
+#0_1_nth=2, 0.634, 1.086, 1.685, 3.251, 6.534, 12.754, 26.426, 53.771, 106.846, 213.630
+#0___nth=2, 0.636, 1.089, 1.689, 3.260, 6.552, 12.790, 26.498, 53.922, 107.190, 214.349
+#2_0_nth=2, 6.174, 6.654, 5.565, 5.499, 5.615, 5.493, 5.371, 5.501, 5.553, 5.425
+#2___nth=2, 8.820, 11.913, 9.817, 10.041, 11.472, 12.135, 12.610, 16.510, 29.135, 57.506
+
+#0_1_nth=4, 0.317, 0.543, 0.842, 1.625, 3.267, 6.377, 13.213, 26.885, 53.423, 106.815
+#0___nth=4, 0.319, 0.546, 0.846, 1.634, 3.285, 6.413, 13.285, 27.036, 53.767, 107.534
+#2_0_nth=4, 3.087, 3.327, 2.782, 2.749, 2.807, 2.746, 2.685, 2.750, 2.776, 2.712
+#2___nth=4, 5.733, 8.586, 7.034, 7.291, 8.664, 9.388, 9.924, 13.759, 26.358, 54.793
+
+#0_1_nth=8, 0.158, 0.271, 0.421, 0.812, 1.633, 3.188, 6.606, 13.442, 26.711, 53.407
+#0___nth=8, 0.160, 0.274, 0.425, 0.821, 1.651, 3.224, 6.678, 13.593, 27.055, 54.126
+#2_0_nth=8, 1.543, 1.663, 1.391, 1.374, 1.403, 1.373, 1.342, 1.375, 1.388, 1.356
+#2___nth=8, 4.189, 6.922, 5.643, 5.916, 7.260, 8.015, 8.581, 12.384, 24.970, 53.437
+
+N=4096,K=11008
+
+...
+
+N=11008,K=4096
+
+...
+
+N=32000,K=4096
+
+...
+
+N=128,K=M
+
+...
+
+N=M,K=128
+
+...
+```
+
+Terms:
+
+- #0: pure cpu. INIT with 1 thread, COMPUTE with N threads.
+- #1: the `q_f32` BLAS implementation in master (when defined either
+  `GGML_USE_ACCELERATE` or `GGML_USE_OPENBLAS`)
+- #2: split `#1` into `INIT` and `COMPUTE`, where INIT runs de-quantization
+  with N threads and COMPUTE runs with Accelerate with 1 thread.
+
+The `#0_0` is read as "profile #0, stage 0 (INIT)", the `#0_1` is read as
+"profile #0 stage 1 (COMPUTE)". "#0__" is read as total time.
+
+`nth=x` is read as "run with x thread(s)". With the 1-thread time of every stage
+and the knowledge of whether each stage can be parallelized or not, we can
+estimate the time for N threads.
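As a concrete check of that estimation rule, take the `M=128` column of the `N=4096,K=4096` block above: profile #2's INIT parallelizes while its BLAS COMPUTE stays single-threaded, so the 8-thread estimate is

$$
t_{\#2}(nth{=}8) \approx \frac{11.003}{8} + 11.009 \approx 12.38 \text{ ms},
$$

which matches the measured `#2___nth=8` value of `12.384`.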

 ## Limitations

@@ -284,7 +294,7 @@ bench [-m MODEL] [-t TYPE] [-f FILE] [-y]
 --m_num M_NUM      number of M, the max M = 2^(M_NUM-1)
                    requires: in range [8, 12]
                    default 10
---backend BACKEND  blas backend: CUDA | CL | CBLAS
+--backend BACKEND  backend: CUDA | CL | BLAS
                    default: auto detect
 --n_pass           number of passes to run
                    default 3
@@ -293,7 +303,7 @@ bench [-m MODEL] [-t TYPE] [-f FILE] [-y]
                    default stdout
 -y                 always answer "yes" to all prompts

-Tips on how to build with various BLAS vendors:
+Tips on how to build with various backend vendors:

 CUDA:    make clean; LLAMA_CUBLAS=1 make
 ClBlast: make clean; LLAMA_CLBLAST=1 make
@@ -323,7 +333,7 @@ NOTE: to disable ACCELERATE, use LLAMA_NO_ACCELERATE=1
 ./mulmat-tune bench --n_pass 1

 # customized backend:
-./mulmat-tune bench --backend CBLAS
+./mulmat-tune bench --backend BLAS

 # save to file
 ./mulmat-tune bench --file mulmat-tune.txt
@@ -446,24 +456,25 @@ $ ./mulmat-tune bench
 512 33 2832 0 0 313 0
 ```

+<del>
 See example files in dir [bench-out](bench-out) for details.
+</del>

 **Informal Explanation**

 ```
 head
 groups+

-head := version model type type_name backend blas_vendor n_shapes
+head := version model type type_name backend backend_vendor n_shapes
 shape+

 # head
 version: 1
 model: "7B" | "13B" | "30B" | "65B"
 type: 2 | 3 | 8 | 9 | 7 | 0 | 1
 type_name: "Q4_0" | "Q4_1" | "Q5_0" | "Q5_1" | "Q8_0" | "F32" | "F16"
-backend: 1 (CUDA) | 2 (CL)| 3 (CBLAS)
-blas_vendor: "CUDA" | "CLBLAST" | "ACCELERATE" | "OPENBLAS" | "BLIS"
+backend_vendor: "CUDA" | "CLBLAST" | "ACCELERATE" | "OPENBLAS" | "BLIS"
 n_shapes: number of shapes

 shape := N K m_num n_profiles
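Putting the head grammar together, a head line from an Accelerate build might read (hypothetical values; the `2`/`Q4_0` pairing and the 6 shapes follow the examples above):

```
1 7B 2 Q4_0 3 ACCELERATE 6
```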
@@ -472,7 +483,7 @@ bench_item+

 task_conf_profile: stage_conf(init) stage_conf(compute) stage_conf(finalize)
 stage_conf: backend parallel wait
-backend: -1 (UNKNOWN) | 0 (CPU) | 1 (CUDA) | 2 (CL) | 3 (CBLAS)
+backend: -1 (UNKNOWN) | 0 (CPU) | 1 (CUDA) | 2 (CL) | 3 (BLAS)
 parallel: 0 | 1
 wait: 0 | 1
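A hypothetical `task_conf_profile` under this grammar (the encoding of a missing stage is my guess, mirroring the all-zeros convention noted for bench columns):

```
0 1 0  3 0 1  0 0 0
```

read as: INIT on CPU (parallel, no wait), COMPUTE on BLAS (not parallel, workers idle-wait), FINALIZE absent.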
@@ -488,10 +499,10 @@ Time unit is `us`. A column is all zeros when that stage does not exist.

 For Accelerate/OpenBLAS, mul_mat_q_f32, there are three profiles:

-- `pure CPU`: INIT in CPU, and COMPUTE in CPU (N threads).
-- `use BLAS 1`: COMPUTE (1 thread): (de-quantize) in CPU and mul_mat in GPU.
+- `pure CPU`: INIT in CPU, and COMPUTE without BLAS (N threads).
+- `use BLAS 1`: COMPUTE (1 thread): de-quantize in CPU and mul_mat with BLAS.
 - `use BLAS 2`: INIT (N threads): de-quantize in CPU; COMPUTE (1 thread):
-mul_mat in GPU.
+mul_mat with BLAS.

 For any thread number `nth`, when the INIT stage can only run with 1 thread but
 the COMPUTE stage can run with N threads, then:
