# Fine Tune MUL_MAT with Bench
- ## Introduction
+ ## Background
GGML defines three task types (stages): INIT, COMPUTE, FINALIZE. All nodes have a
COMPUTE stage, some also have an INIT stage, and the FINALIZE stage is never used.
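For reference, the stages map onto the upstream `ggml` definitions roughly as sketched below. This is a simplified recollection of `ggml.h` (members and order may differ), shown only to fix terminology:

```c
// Simplified sketch of the upstream ggml task-stage definitions
// (may differ in detail from the actual ggml.h).
enum ggml_task_type {
    GGML_TASK_INIT = 0,
    GGML_TASK_COMPUTE,
    GGML_TASK_FINALIZE,
};

struct ggml_compute_params {
    enum ggml_task_type type; // which stage this call executes
    int ith;                  // index of the calling thread
    int nth;                  // number of threads working on this node
    size_t wsize;             // size of the shared scratch buffer
    void * wdata;             // shared scratch buffer
};
```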
@@ -9,15 +9,38 @@ Generally speaking, code that runs in GPU (BLAS) MAY not be suitable for running with multiple OS
threads -- it is sometimes very slow -- but CPU code is, and it scales well. So, to speed up
large prompts and avoid spinning, the master code forces 1 thread when (M >= 32).
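For context, the master rule being criticized looks roughly like the sketch below (a simplification, not a verbatim copy of the actual `ggml_compute_forward_mul_mat_use_blas`):

```c
// Simplified sketch of the master heuristic (not a verbatim copy). When it
// returns true, the node is scheduled with 1 thread and the whole mul_mat
// (including de-quantization) runs in the BLAS path.
static bool mul_mat_can_use_blas(int64_t M, int64_t N, int64_t K) {
    // M >= 32 is the part that forces the 1-thread BLAS path for large prompts
    return M >= 32 && N >= 32 && K >= 32;
}
```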
+ So, the problems to solve are:
+
+ 1. The `xxx_mul_mat_can_use_blas` rule is not accurate. We need a bench.
+ 2. With multiple threads, when running those heavy BLAS stages, we have to avoid
+ busy spinning.
+
+ I have been focused on the `threading` problem(s) since April this year. I dropped
+ two simple pull requests because they were not solutions but noise. In the second
+ pull request, @gerganov hinted at the `1-thread blas` problem to me, and I have
+ followed this direction since then.
+
+ When I observed that de-quantization takes about half of the total time, I thought
+ it was a good place to start. At first I implemented a new threading framework
+ that supports `wait/notify`; it is subtle and fragile to deadlock, and I'm happy
+ that it works. I had tried to bench online by comparing CPU/GPU time, but finally
+ replaced that with an offline bench. To explicitly control the details (how to
+ parallelize, when to wait, how to select the best execution plan), I had to define
+ task configs and task profiles. Finally I got the demo solution.
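The actual framework in this PR is more elaborate; the sketch below only illustrates the core `wait/notify` idea with pthreads -- workers sleep on a condition variable instead of busy spinning while a 1-thread BLAS stage runs. All names here (`task_queue`, `worker`) are illustrative, not the PR's API:

```c
#include <pthread.h>
#include <stdbool.h>

struct task_queue {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            has_work; // set by the main thread, cleared by a worker
    bool            done;     // tells workers to exit
};

static struct task_queue q = {
    PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, false, false
};

static void * worker(void * arg) {
    struct task_queue * tq = arg;
    pthread_mutex_lock(&tq->mutex);
    while (!tq->done) {
        while (!tq->has_work && !tq->done) {
            // sleep instead of busy spinning; woken by cond_signal/broadcast
            pthread_cond_wait(&tq->cond, &tq->mutex);
        }
        if (tq->has_work) {
            tq->has_work = false;
            pthread_mutex_unlock(&tq->mutex);
            // ... run this worker's share of the parallel stage here ...
            pthread_mutex_lock(&tq->mutex);
        }
    }
    pthread_mutex_unlock(&tq->mutex);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, &q);

    pthread_mutex_lock(&q.mutex);
    q.has_work = true;            // publish a parallel task
    pthread_cond_signal(&q.cond); // notify a sleeping worker
    pthread_mutex_unlock(&q.mutex);

    pthread_mutex_lock(&q.mutex);
    q.done = true;                // shut the worker down
    pthread_cond_broadcast(&q.cond);
    pthread_mutex_unlock(&q.mutex);
    pthread_join(t, NULL);
    return 0;
}
```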
+
+ Data files in the bench result dir were generated on a MacBook Pro 2018 with 32 GB
+ 2400 MHz DDR4 memory, a 2.6 GHz 6-Core Intel Core i7-8850H, and Intel UHD
+ Graphics 630 1536 MB.
+
+ ## Solution and Results
+
In the current `master` branch, the `mul_mat` code runs in several implicit profiles.
- pure CPU: INIT: very fast; COMPUTE: computation time is proportional to M.
- CUDA/CL: COMPUTE: de-quantization and mul_mat in GPU.
- - Accelerate/OpenBLAS: COMPUTE: de-quantization in CPU, mul_mat in GPU.
+ - Accelerate/OpenBLAS/BLIS: COMPUTE: de-quantization in CPU, mul_mat in GPU.
- I observed the following "facts" on Accelerate/OpenBLAS. The following data were
- generated on a MacBook Pro 2018 with: 32 GB 2400 MHz DDR4 memory, 2.6 GHz 6-Core
- Intel Core i7-8850H @2.60GHz, Intel UHD Graphics 630 1536 MB.
+ I observed the following "facts" on Accelerate/OpenBLAS.
- Whatever M is, given N and K, the de-quantization time is constant (in theory).
- The mul_mat time in GPU is heavy (tens to hundreds of ms) and goes up very slowly when
@@ -29,48 +52,7 @@ Intel Core i7-8850H @2.60GHz, Intel UHD Graphics 630 1536 MB.
large as NxK=4096x4096.
In theory, if we split the COMPUTE stage into INIT + COMPUTE, we MAY speed up prompt
- eval time a lot: up to 50% for a large M range (e.g. 32 - 128) when the `use GPU`
- profile competes with the `pure CPU` profile. The following diagram demonstrates the
- `use GPU` profile (7B/Q4_0/Accelerate, INIT in CPU, COMPUTE in GPU). We can see
- the trend of how the computing time changes with M.
-
- ![7b_q4_0_accelerate use GPU time](./images/7b_q4_0_accelerate.png)
-
- Apart from being a bit slower (10% or so) than Accelerate, OpenBLAS behaves similarly
- to Accelerate. But BLIS is quite slow on my device. I will not show the images for
- them; you may want to have a look at [bench-out](bench-out/).
-
- ClBlast is far slower than Accelerate on my device. I managed to make it
- run on my device, and split the COMPUTE stage into INIT + COMPUTE for demonstration
- purposes. Since the CPU de-quantization time is considerably smaller than the GPU time,
- the overall gain of running CPU INIT + GPU COMPUTE is small: no more than 20% for
- M in range \[32, 128\]. Anyway, let me show you the picture below.
-
- ![7b_q4_0_cl use GPU time](./images/7b_q4_0_cl.png)
-
- The next two pictures demonstrate how `n_threads` affects the overall time
- across two config profiles. `#0` is CPU INIT + CPU COMPUTE, `#1` is CPU INIT + GPU
- COMPUTE. From these diagrams, given M, we can easily recognize the best config
- profile.
-
- 4096x4096 and 4096x11008:
-
- ![n_threads 1](./images/7b_q4_0_accelerate_nth-1.png)
-
- 11008x4096 and 32000x4096:
-
- ![n_threads 2](./images/7b_q4_0_accelerate_nth-2.png)
-
- I have been focused on the `threading` problem(s) since April this year. I dropped
- two simple pull requests because they were not solutions but noise. In the second
- pull request, @gerganov hinted at the `1-thread blas` problem, so I have followed
- this direction since then.
-
- At first I implemented a new threading framework that supports `wait/notify`;
- it is subtle and fragile to deadlock, and I'm happy that it works. I had tried to
- bench online by comparing CPU/GPU time, but finally replaced that with an offline
- bench. To explicitly control the details (how to parallelize, when to wait, how to
- select the best execution plan), I had to define task configs and task profiles.
- Finally I got the demo solution as follows.
+ eval time a lot: up to 50% for a large M range (e.g. 32 - 128).
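To see where the "up to 50%" bound comes from, here is the rough arithmetic, assuming (as observed above) that de-quantization is about half of the total time of the BLAS profile:

```
T_blas  = T_dequant + T_mulmat                 (today: effectively 1 thread for both)
T_split = T_dequant / n_threads + T_mulmat     (INIT in CPU parallel, COMPUTE as before)

with T_dequant ~= T_mulmat ~= T_blas / 2:

T_split / T_blas ~= 1/2 + 1 / (2 * n_threads)  -> the saving approaches 50% as n_threads grows
```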
The eval time of a long prompt decreases a lot. For example, with `examples/chat.sh` and
4 threads, the prompt eval time of 99 tokens decreases by up to **40%** on my device.
@@ -79,25 +61,7 @@ Tests over a broad range of prompt sizes show speedups of `10% - 40%`.
The key factor for speeding up is parallelism, followed by more accurate profile
selection. The latter, although secondary, is necessary in the case of multithreading.
- Just like Accelerate/OpenBLAS, the de-quantization time in CUDA/CL MAY NOT
- compete with multiple CPU threads on some devices. In that case, we can add profiles
- for them to run de-quantization in CPU and mul_mat in GPU.
-
- With explicit task config profiles and bench data, I expect that we will be able
- to run any task stage in any backend. For example: for q4_0, we could run INIT in
- CUDA and COMPUTE in Accelerate -- if the overall speed competes with other profiles.
-
- Anyway, the current solution is at the demo stage and is incomplete for various
- reasons; you will read about them in the following sections.
-
- The mul_mat related code keeps changing; it's a bit hard for me to follow up and
- merge/rebase again and again. I think the overall changes can speak for themselves,
- so it's time to initiate a discussion or pull request.
-
- I'm new to machine learning this year and have little knowledge of AI. There must
- be a lot of problems with this pull request; please do not hesitate to advise.
-
- ## Solutions
+ In conclusion, let me list the key points of how it works.
1. Update the mul_mat BLAS code: allow de-quantizing in CPU or GPU (if possible).
2. Explicit task configs and profiles:
@@ -115,9 +79,15 @@ be a lot of problems with this pull request; please do not hesitate to advise.
node, we select the fastest profile. At compute time, the `dst->task_conf` along
with `params` controls which part of the code to run.
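As an illustration of "select the fastest profile", the sketch below estimates the per-stage time of each profile and picks the minimum. All names (`estimate_stage_time`, `stage_is_parallel`, `select_best_profile`) are hypothetical, and the toy cost functions only stand in for interpolation over the offline bench results:

```c
#include <stdbool.h>

enum { STAGE_INIT, STAGE_COMPUTE, STAGE_FINALIZE, N_STAGES };

// Toy stand-in for interpolating the offline bench data: pretend INIT
// (de-quantization) scales with N*K and COMPUTE (mul_mat) with M*N*K.
static double estimate_stage_time(int profile, int stage, int M, int N, int K) {
    (void) profile;
    if (stage == STAGE_INIT)    return 1e-6 * (double) N * K;
    if (stage == STAGE_COMPUTE) return 1e-8 * (double) M * N * K;
    return 0.0; // FINALIZE is never used
}

// Toy rule: in the pure-CPU profile every stage is parallel; in a BLAS
// profile only INIT (de-quantization) is.
static bool stage_is_parallel(int profile, int stage) {
    return profile == 0 || stage == STAGE_INIT;
}

static int select_best_profile(int n_profiles, int n_threads, int M, int N, int K) {
    int    best    = 0;
    double best_ms = -1.0;
    for (int p = 0; p < n_profiles; p++) {
        double ms = 0.0;
        for (int s = 0; s < N_STAGES; s++) {
            double t = estimate_stage_time(p, s, M, N, K);
            // a parallel stage is divided among threads; a 1-thread stage is not
            ms += stage_is_parallel(p, s) ? t / n_threads : t;
        }
        if (best_ms < 0.0 || ms < best_ms) { best_ms = ms; best = p; }
    }
    return best;
}
```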
- About how to select a profile, see section "**How To Estimate Execution Time**".
+ Furthermore, if the de-quantization time in CUDA/CL COULD NOT compete with
+ multiple CPU threads on some devices, we can add profiles for them (just like
+ Accelerate) to run de-quantization in CPU and mul_mat in GPU.
- **Explicitly configure task profiles**
+ With explicit task config profiles and bench data, we are able to run any task
+ stage in any backend. For example: for q4_0, we could run INIT in CUDA and COMPUTE
+ in Accelerate -- if that makes sense.
+
+ ## Task profile and task stage
```c
// ggml.h
@@ -148,7 +118,44 @@ struct ggml_tensor {
}
```
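The hunk above elides most of the actual definitions; the sketch below shows the general shape of what I mean by a task config profile. Field and type names here are illustrative, not the PR's real `ggml.h` declarations:

```c
#include <stdbool.h>

// Illustrative only: one config per stage, one profile per way of running
// the node. The real definitions live in ggml.h in this PR.
enum task_backend {
    TASK_BACKEND_CPU,
    TASK_BACKEND_CBLAS,
    TASK_BACKEND_CUDA,
    TASK_BACKEND_CL,
};

struct task_stage_conf {
    enum task_backend backend; // where this stage runs
    bool parallel;             // split this stage across n_threads?
    bool wait;                 // should idle workers wait (not spin) meanwhile?
};

struct task_conf_profile {
    struct task_stage_conf stages[3]; // INIT, COMPUTE, FINALIZE
};
```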
- ## Limitations and TODOs
+ ## Misc Assets
+
+ The `prompt.sh` script is a tool for benching `main`; it can generate questions of
+ various lengths. Run `./examples/mulmat-tune/prompt.sh -h` for help. I had run it
+ with `./examples/mulmat-tune/prompt.sh -b -f`.
+
+ The [bench-out dir](./bench-out) contains various bench result files generated
+ on my device.
+ BTW, the `.*.cl.txt` files were generated by an [early version](https://github.com/ggerganov/llama.cpp/compare/master...mqy:blas-n_threads-fix-10#diff-e40acc281787b19c5975346837f154e7e75351733a9f9575317c64d2dbe38799).
+ I solved the OOM problem there, but unfortunately I'm unable to catch up with the latest updates.
+
+ The [images dir](./images) contains images drawn from [bench results](./bench-out)
+ for N/K combinations from 7B and 13B. I compared two profiles for Q4_0.
+
+ - #0: pure CPU. INIT in CPU with 1 thread, COMPUTE in CPU in parallel.
+ - #2: use BLAS. INIT in CPU in parallel, COMPUTE with Accelerate with 1 thread.
+
+ The `#0_0` is read as "profile #0, stage 0 (INIT)", and `#0_1` is read as
+ "profile #0, stage 1 (COMPUTE)". `#2` is profile 2.
+
+ The data were generated like this:
+
+ Bench:
+
+ ```
+ ./mulmat-tune-tool bench 7B --file mulmat-tune.7b.q4_0.txt
+ ```
+
+ Analyze:
+
+ ```
+ ./mulmat-tune-tool analyze mulmat-tune.7b.q4_0.txt
+ ```
+
+ The m512 bench data were created by running the bench 3 times and manually
+ combining the results into one file.
+
+ ## Limitations
- Only tested models 7B and 13B.
- My OSes/devices cannot use CUDA, so I did not run benches for CUDA.
@@ -157,6 +164,13 @@ struct ggml_tensor {
- Validation of bench data is incomplete.
- and more ...
+ The mul_mat related code keeps changing; it's a bit hard for me to follow up and
+ merge/rebase again and again. I think the code and data can speak for themselves,
+ so it's time to initiate a discussion or pull request.
+
+ I'm new to machine learning this year and have little knowledge of AI. There must
+ be a lot of problems with this pull request; please do not hesitate to advise.
+
## How to Evaluate
**Build**
@@ -168,28 +182,37 @@ struct ggml_tensor {
**Bench**
```
- ./mulmat-tune -h
usage: ./mulmat-tune [bench ...] | [analyze FILE] | test | [-h | --help]
bench [-m MODEL] [-t TYPE] [-f FILE] [-y]
- --model MODEL     7B | 13B | 30B | 65B
-                   default 7B
- --type TYPE       Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F32 | F16
-                   default Q4_0
- --m_num M_NUM     number of M, max M = 16 * M_NUM
-                   requires: M_NUM in range [8, 16]
-                   default 8
- --file FILE       data file to write
-                   default stdout
- -y                always answer "yes" to all prompts
+ --model MODEL     7B | 13B | 30B | 65B
+                   default 7B
+ --type TYPE       Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F32 | F16
+                   default Q4_0
+ --m_start M_START start value of M
+                   requires: even number
+                   default 16
+ --m_step M_STEP   delta between adjacent M
+                   requires: even number, in range [2, 32]
+                   default 16
+ --m_num M_NUM     number of M, the max M = M_STEP * M_NUM
+                   requires: in range [2, 128]
+                   default 8
+ --backend BACKEND blas backend: CUDA | CL | CBLAS
+                   default: auto detect
+ --file FILE       data file to write
+                   default stdout
+ -y                always answer "yes" to all prompts
Tips on how to build with various BLAS vendors:
- * CUDA: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_CUBLAS=1 make
- * ClBlast: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_CLBLAST=1 make
- * Accelerate: make clean; LLAMA_NO_ACCELERATE= make
- * OpenBLAS: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_OPENBLAS=1 make
- * BLIS: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_BLIS=1 make
+ CUDA:       make clean; LLAMA_CUBLAS=1 make
+ ClBlast:    make clean; LLAMA_CLBLAST=1 make
+ Accelerate: make clean; LLAMA_NO_ACCELERATE= make
+ OpenBLAS:   make clean; LLAMA_NO_ACCELERATE=1 LLAMA_OPENBLAS=1 make
+ BLIS:       make clean; LLAMA_NO_ACCELERATE=1 LLAMA_BLIS=1 make
+
+ NOTE: to disable ACCELERATE, use LLAMA_NO_ACCELERATE=1
```
Examples:
@@ -198,11 +221,11 @@ Examples:
# run with default params (7B, Q4_0, ...)
./mulmat-tune
- # to run 13B and Q4_1 with always-yes
+ # run 13B and Q4_1 with always-yes
./mulmat-tune bench -model 13B --type Q4_1 -y
- # customized m_step (32 * 16)
- ./mulmat-tune bench -model 7B --m_step 32 -num_m 16
+ # customized m_start, m_step, m_num
+ ./mulmat-tune bench -model 7B --m_start 8 --m_step 8 --m_num 8
# save to file
./mulmat-tune bench -model 7B --file mulmat-tune.txt
@@ -230,7 +253,7 @@ The program will print a debug log whether or not it finds the file.
$ ./mulmat-tune bench --m_num 2
[BENCH] model: 7B, type: Q4_0, backend: CBLAS, BLAS vendor: ACCELERATE.
- 1 7B 2 Q4_0 3 ACCELERATE 4 16 2 3
+ 1 7B 2 Q4_0 3 ACCELERATE 4 16 3
0 0 0 0 1 0 -1 0 0
-1 0 0 3 0 1 -1 0 0
0 1 0 3 0 1 -1 0 0
@@ -257,7 +280,7 @@ See example files in dir [bench-out](bench-out) for details.
head
groups+
- head := version model type type_name backend blas_vendor n_shapes m_step num_m n_profiles
+ head := version model type type_name backend blas_vendor n_shapes m_num n_profiles
task_conf_profile+
shape
bench_item+
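As a worked reading of the format (my interpretation of the grammar, applied to the sample head line printed by `./mulmat-tune bench --m_num 2` above):

```
1 7B 2 Q4_0 3 ACCELERATE 4 16 3

1          -> version
7B         -> model
2          -> type
Q4_0       -> type_name
3          -> backend
ACCELERATE -> blas_vendor
4          -> n_shapes
16         -> m_num
3          -> n_profiles
```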