To readers: this is a lengthy introduction to this big PR. I recommend you review
it in the following steps:

1. Read this file first.
2. Pull and check out the code, build and evaluate `mulmat-tune`.
3. Optionally run `tests/test-mulmat-tune`.
4. Place the file `mulmat-tune.txt` (generated by `mulmat-tune`) into the parent directory.
GGML defines three task types (stages): INIT, COMPUTE, FINALIZE. All nodes have a
COMPUTE stage, some have an INIT stage, and FINALIZE is never used.
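For reference, these stages correspond to the task types in `ggml.h`, roughly as
below (paraphrased from the upstream header; the comments are mine):

```c
// ggml.h (upstream, paraphrased): the per-node task stages
enum ggml_task_type {
    GGML_TASK_INIT = 0, // optional preparation, e.g. de-quantization
    GGML_TASK_COMPUTE,  // the main computation, possibly multi-threaded
    GGML_TASK_FINALIZE, // defined but never used in practice
};
```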
GGML supports pure CPU, CUDA, CL and BLAS (Accelerate/OpenBLAS/BLIS).

In recent days the compute framework has kept evolving and has introduced a lot
of new concepts: backend, vendor. In this CL, I follow the `backend` definition
and define a new backend `BLAS` for Accelerate etc.
Generally speaking, as of my tests, it is very slow to run `cblas_sgemm` with
multiple OS threads. In the master code I saw that `ggml_graph_compute` sets
`node->n_tasks = 1` when computing `mul_mat` on `GPU` or with `BLAS`.

To speed up large prompts and avoid spinning, master sets `n_threads` to 1 when
the token size >= 32. `ggml_compute_forward_mul_mat_use_blas` checks:
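The condition is roughly the following (a paraphrase of the master code as I
read it; the exact thresholds may differ between versions):

```c
// Paraphrase of ggml.c (master): BLAS pays off only for big matrices.
static bool ggml_compute_forward_mul_mat_use_blas(
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
              struct ggml_tensor * dst) {
    const int64_t ne10 = src1->ne[0];
    const int64_t ne0  = dst->ne[0];
    const int64_t ne1  = dst->ne[1];

    return ggml_is_contiguous(src0) &&
           ggml_is_contiguous(src1) &&
           (ne0 >= 32 && ne1 >= 32 && ne10 >= 32);
}
```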
... two simple pull requests because they are not solutions but noise. In the
second pull request, @gerganov hinted at the `1-thread blas` problem, so I have
followed this direction since then. My benchmarks on 7B/13B and all Qx_x show
that for all N/K (e.g. 4096x4096), Accelerate always runs faster than pure CPU
at `M = 32`; the difference may vary by up to `30%` with 1 thread.
When I observed that de-quantization takes about half of the total time, I
thought it was a good place to start. At first I implemented the new threading
framework that supports `wait/notify`; it is subtle and fragile with respect to
deadlocks. I'm happy that it works. I had tried to bench online by comparing
the time with and without BLAS; finally I replaced that with an offline bench.
To explicitly control the details (how to parallelize, when to wait, how to
select the best execution plan), I had to define task configs and task
profiles. Finally, after over a month of busy coding, I got the demo solution.
In the current `master` branch, the `mul_mat` code runs in several implicit profiles:

- pure CPU: INIT is very fast; the COMPUTE time is proportional to M.
- CUDA/CL: COMPUTE: de-quantization and mul_mat run on GPU.
- CPU with BLAS (Accelerate/OpenBLAS/BLIS): COMPUTE: mul_mat with BLAS.
I observed the following "facts" with Accelerate/OpenBLAS:

- Whatever M is, given N and K, the de-quantization time is constant (in theory).
- The mul_mat time with BLAS is heavy (tens to hundreds of ms) and grows very
  slowly when M doubles.
- In the large M range, the de-quantization time accounts for a large proportion
  of the total calculation time. For example, for 7B, Q4_0, NxK=4096x4096, the
  proportion of de-quantization time exceeds or nears `50%` for the large M
  range (up to 128). Other NxK combinations show a similar situation.
In theory, if we split the COMPUTE stage into INIT + COMPUTE, we MAY speed up
prompt eval time a lot: up to 50% for the large M range.
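For example (an illustrative calculation, not a measurement): if de-quantization
is 50% of a COMPUTE-only time T, and the split-out INIT scales perfectly over 8
threads, the total becomes roughly 0.5T/8 + 0.5T ≈ 0.56T, i.e. about a 44%
reduction.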
... selecting. The latter, although secondary, is necessary for time estimation.

In conclusion, let me list the key points of how it works.
1. Explicit task configs and profiles:
   * define conf profiles to control which parts of the code run;
     for example, run the COMPUTE stage with or without BLAS.
   * define, for any stage, whether to compute in parallel and whether to idle-wait.
   * non-existent compute stages are not called.
2. New threading framework: combine `spin` + `wait/notify` (see the sketch after
   this list). Without wait, busy-spinning workers may cause overheating and
   slow down the overall speed. The mul_mat compute time is long enough (often
   tens of ms), so the wait/notify overhead (at most tens of us) is OK.
3. Update the mul_mat code to support the new task profiles.
4. A tune tool for benching. With bench data, given N/K and n_threads, we can
   estimate the total computing time for any M (even one outside the bench
   range), and thus select the fastest profile.
5. On llama start, it loads the bench data from file (if it exists). Before
   computing a node, we select the fastest profile. When computing, both
   `dst->task_conf` and `params` control which parts of the code run.
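Here is a minimal sketch of the `spin` + `wait/notify` combination from item 2.
All names are hypothetical; this is not the actual implementation in this PR:

```c
// Hybrid spin + wait/notify worker (hypothetical names, sketch only).
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mm_shared {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_int      n_ready;  // tasks ready for workers
    bool            use_wait; // stage conf says: sleep instead of spinning
};

// Spin briefly for low wakeup latency; if the stage is configured to wait,
// fall back to pthread_cond_wait so idle workers do not burn CPU. mul_mat
// often takes tens of ms, so tens of us of wakeup overhead is acceptable.
static void mm_wait_for_work(struct mm_shared * sh) {
    for (int i = 0; i < 1000; i++) {                  // spin phase
        if (atomic_load(&sh->n_ready) > 0) return;
    }
    if (!sh->use_wait) {                              // pure spin profile
        while (atomic_load(&sh->n_ready) == 0) { }
        return;
    }
    pthread_mutex_lock(&sh->mutex);                   // wait phase
    while (atomic_load(&sh->n_ready) == 0) {
        pthread_cond_wait(&sh->cond, &sh->mutex);
    }
    pthread_mutex_unlock(&sh->mutex);
}
```

The main thread pairs this with incrementing `n_ready` under the mutex and
calling `pthread_cond_broadcast` to wake sleeping workers.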
```c
// ggml.h

enum ggml_backend {
    GGML_BACKEND_CPU  = 0,
    GGML_BACKEND_CUDA = 1,
    GGML_BACKEND_CL   = 2,
    GGML_BACKEND_BLAS = 3, // has API `cblas_sgemm`
};

struct ggml_tensor {
    // ... existing fields plus the new task conf (referenced elsewhere as
    // `dst->task_conf`); the exact declaration is elided in this excerpt
};

void ggml_internal_compute_forward_mul_mat(
    const struct ggml_tensor * src0,
    const struct ggml_tensor * src1,
    struct ggml_tensor * dst);

// examples/mulmat-tune/mulmat-tune.h

struct ggml_task_stage {
    // ... fields elided; per the file format described below, a stage conf
    // carries: backend, parallel, wait
};

// ...
```

Analyze:

```
./mulmat-tune analyze 7b.q4_0.txt
```
Example bench analyze output looks as follows; it contains 6 shapes (blocks):
```
N=4096,K=4096

#M,1,2,4,8,16,32,64,128,256,512
#0_0_nth=1, 0.002, 0.003, 0.004, 0.009, 0.018, 0.036, 0.072, 0.151, 0.344, 0.719
#0_1_nth=1, 1.268, 2.172, 3.371, 6.502, 13.068, 25.508, 52.853, 107.543, 213.692, 427.260
#0___nth=1, 1.270, 2.175, 3.375, 6.511, 13.086, 25.544, 52.925, 107.694, 214.036, 427.979
#1_1_nth=1, 17.509, 18.774, 15.617, 16.059, 17.877, 16.456, 18.331, 21.935, 34.317, 63.208
#1___nth=1, 17.509, 18.774, 15.617, 16.059, 17.877, 16.456, 18.331, 21.935, 34.317, 63.208
#2_0_nth=1, 12.349, 13.309, 11.130, 10.999, 11.231, 10.987, 10.742, 11.003, 11.106, 10.851
#2_1_nth=1, 2.646, 5.259, 4.252, 4.542, 5.857, 6.642, 7.239, 11.009, 23.582, 52.081
#2___nth=1, 14.995, 18.568, 15.382, 15.541, 17.088, 17.629, 17.981, 22.012, 34.688, 62.932

#0_1_nth=2, 0.634, 1.086, 1.685, 3.251, 6.534, 12.754, 26.426, 53.771, 106.846, 213.630
#0___nth=2, 0.636, 1.089, 1.689, 3.260, 6.552, 12.790, 26.498, 53.922, 107.190, 214.349
#2_0_nth=2, 6.174, 6.654, 5.565, 5.499, 5.615, 5.493, 5.371, 5.501, 5.553, 5.425
#2___nth=2, 8.820, 11.913, 9.817, 10.041, 11.472, 12.135, 12.610, 16.510, 29.135, 57.506

#0_1_nth=4, 0.317, 0.543, 0.842, 1.625, 3.267, 6.377, 13.213, 26.885, 53.423, 106.815
#0___nth=4, 0.319, 0.546, 0.846, 1.634, 3.285, 6.413, 13.285, 27.036, 53.767, 107.534
#2_0_nth=4, 3.087, 3.327, 2.782, 2.749, 2.807, 2.746, 2.685, 2.750, 2.776, 2.712
#2___nth=4, 5.733, 8.586, 7.034, 7.291, 8.664, 9.388, 9.924, 13.759, 26.358, 54.793

#0_1_nth=8, 0.158, 0.271, 0.421, 0.812, 1.633, 3.188, 6.606, 13.442, 26.711, 53.407
#0___nth=8, 0.160, 0.274, 0.425, 0.821, 1.651, 3.224, 6.678, 13.593, 27.055, 54.126
#2_0_nth=8, 1.543, 1.663, 1.391, 1.374, 1.403, 1.373, 1.342, 1.375, 1.388, 1.356
#2___nth=8, 4.189, 6.922, 5.643, 5.916, 7.260, 8.015, 8.581, 12.384, 24.970, 53.437

N=4096,K=11008

...

N=11008,K=4096

...

N=32000,K=4096

...

N=128,K=M

...

N=M,K=128

...
```
Terms:

- #0: pure CPU. INIT with 1 thread, COMPUTE with N threads.
- #1: the `q_f32` BLAS implementation in master (when either
  `GGML_USE_ACCELERATE` or `GGML_USE_OPENBLAS` is defined).
- #2: `#1` split into `INIT` and `COMPUTE`, where INIT runs de-quantization
  with N threads and COMPUTE runs Accelerate with 1 thread.

`#0_0` is read as "profile #0, stage 0 (INIT)"; `#0_1` is read as "profile #0,
stage 1 (COMPUTE)"; `#0__` is read as the total time.

`nth=x` is read as "run with x thread(s)". When we know the 1-thread time of
every stage and whether each stage can be parallelized, we can estimate the
time for N threads.
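A sketch of that estimation rule (my reading of the idea, with hypothetical
names; the real tool would also interpolate over M):

```c
#include <stdbool.h>

// Estimate total time for nth threads from measured 1-thread stage times.
// t_stage[i]: 1-thread time (us) of stage i (INIT, COMPUTE, FINALIZE);
// 0 means the stage does not exist. parallel[i]: stage may use N threads.
static double estimate_total_us(const double t_stage[3],
                                const bool parallel[3], int nth) {
    double total = 0.0;
    for (int i = 0; i < 3; i++) {
        if (t_stage[i] <= 0.0) continue;           // stage absent
        total += parallel[i] ? t_stage[i] / nth    // assume linear scaling
                             : t_stage[i];         // serial stage
    }
    return total;
}
```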
## Limitations
```
bench [-m MODEL] [-t TYPE] [-f FILE] [-y]
...
--m_num M_NUM      number of M, the max M = 2^(M_NUM-1)
                   requires: in range [8, 12]
                   default 10
--backend BACKEND  backend: CUDA | CL | BLAS
                   default: auto detect
--n_pass           number of passes to run
                   default 3
...
                   default stdout
-y                 always answer "yes" to all prompts

Tips on how to build with various backend vendors:

CUDA:    make clean; LLAMA_CUBLAS=1  make
ClBlast: make clean; LLAMA_CLBLAST=1 make
...
NOTE: to disable ACCELERATE, use LLAMA_NO_ACCELERATE=1

./mulmat-tune bench --n_pass 1

# customized backend:
./mulmat-tune bench --backend BLAS

# save to file
./mulmat-tune bench --file mulmat-tune.txt
...
$ ./mulmat-tune bench
...
512 33 2832 0 0 313 0
```

**Informal Explanation**

```
head
groups+

head := version model type type_name backend backend_vendor n_shapes
shape+

# head
version: 1
model: "7B" | "13B" | "30B" | "65B"
type: 2 | 3 | 8 | 9 | 7 | 0 | 1
type_name: "Q4_0" | "Q4_1" | "Q5_0" | "Q5_1" | "Q8_0" | "F32" | "F16"
backend_vendor: "CUDA" | "CLBLAST" | "ACCELERATE" | "OPENBLAS" | "BLIS"
n_shapes: number of shapes

shape := N K m_num n_profiles
...
bench_item+

task_conf_profile: stage_conf(init) stage_conf(compute) stage_conf(finalize)
stage_conf: backend parallel wait
backend: -1 (UNKNOWN) | 0 (CPU) | 1 (CUDA) | 2 (CL) | 3 (BLAS)
parallel: 0 | 1
wait: 0 | 1
...
```

Time unit is `us`. A column is all zeros when that stage does not exist.
For Accelerate/OpenBLAS `mul_mat_q_f32`, there are three profiles:

- `pure CPU`: INIT in CPU, and COMPUTE without BLAS (N threads).
- `use BLAS 1`: COMPUTE (1 thread): de-quantize in CPU and mul_mat with BLAS.
- `use BLAS 2`: INIT (N threads): de-quantize in CPU; COMPUTE (1 thread):
  mul_mat with BLAS.
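Using the `stage_conf: backend parallel wait` format above, the three profiles
might be encoded as follows. This is my own illustrative guess at plausible
values, not a row copied from a real bench file:

```
# INIT      COMPUTE    FINALIZE    (each stage_conf is: backend parallel wait)
  0 0 0     0 1 0      -1 0 0      # pure CPU
 -1 0 0     3 0 1      -1 0 0      # use BLAS 1
  0 1 0     3 0 1      -1 0 0      # use BLAS 2
```

Setting `wait = 1` on the 1-thread BLAS COMPUTE stage matches the threading
idea above: idle workers sleep instead of spinning while one thread runs
`cblas_sgemm`.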
For any thread number `nth`, when the INIT stage can only run with 1 thread but
the COMPUTE stage can run with N threads, then:
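a plausible completion of this estimate (my reconstruction, consistent with the
estimation rule above) is:

```
total(nth) ≈ t_init(1) + t_compute(1) / nth
```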