
Commit 022c370

mulmat-tune-tool: add --m_start and refactor (better analyze cmd); document

1 parent: 2ea239a


41 files changed: +678, -1413 lines

examples/mulmat-tune/README.md

Lines changed: 113 additions & 90 deletions
@@ -1,6 +1,6 @@
 # Fine Tune MUL_MAT with Bench
 
-## Introduction
+## Background
 
 GGML defines three task types (stages): INIT, COMPUTE, FINALIZE. All nodes have a
 COMPUTE stage, some have an INIT stage, and FINALIZE is never used.
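To make the stage split concrete, here is a minimal sketch of how a compute function can branch on the task stage (an editor's illustration, not code from this commit; `ggml_compute_params` and the `GGML_TASK_*` constants come from ggml's internals, everything else is hypothetical):

```c
// Sketch: dispatching one mul_mat node over ggml's task stages.
// params->ith / params->nth are the worker's thread index and thread count.
static void mul_mat_stage_sketch(const struct ggml_compute_params * params,
                                 const struct ggml_tensor * src0,
                                 const struct ggml_tensor * src1,
                                 struct ggml_tensor * dst) {
    switch (params->type) {
    case GGML_TASK_INIT:
        // e.g. de-quantize src0 rows into the shared scratch buffer
        // (params->wdata), split across params->nth CPU threads.
        break;
    case GGML_TASK_COMPUTE:
        // e.g. run the BLAS GEMM on the de-quantized buffer with 1 thread,
        // or the pure-CPU vector-dot path with all threads.
        break;
    case GGML_TASK_FINALIZE:
        // never used on master, as noted above.
        break;
    }
    (void)src0; (void)src1; (void)dst; // placeholders in this sketch
}
```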
@@ -9,15 +9,38 @@ Generally speaking, code that runs in GPU (BLAS) MAY not be suitable for multiple OS
 threads -- sometimes very slow, while the CPU can run it and scales well. So to speed up
 large prompts and avoid spinning, the master code forces 1 thread when (M >= 32).
 
+So, the problems to solve:
+
+1. The `xxx_mul_mat_can_use_blas` rule is not accurate. We need a bench.
+2. With multiple threads, when running those heavy BLAS stages, we have to avoid
+   busy spinning.
+
+I have been focused on the `threading` problem(s) since April this year. I dropped
+two simple pull requests because they were not solutions but noise. In the second
+pull request, @gerganov hinted to me about the `1-thread blas` problem, so I have
+followed this direction since then.
+
+When I observed that de-quantization takes about half of the total time, I thought
+it was a good place to start. At first I implemented the new threading framework
+that supports `wait/notify`; it is subtle and prone to deadlock, and I'm happy
+that it works. I had tried benching online by comparing CPU/GPU time, but finally
+replaced that with an offline bench. To explicitly control the details (how to
+parallelize, when to wait, how to select the best execution plan), I had to define
+task configs and task profiles. Finally I got the demo solution.
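The `wait/notify` idea can be sketched with POSIX condition variables (an editor's illustration; the actual framework in this commit is more involved): workers sleep instead of busy-spinning while a 1-thread BLAS stage runs.

```c
#include <pthread.h>
#include <stdbool.h>

// Minimal wait/notify sketch: workers block on a condition variable while a
// single thread runs a heavy BLAS stage, instead of busy-spinning.
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool has_work = false;

void worker_wait(void) {
    pthread_mutex_lock(&lock);
    while (!has_work) {            // loop guards against spurious wakeups
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

void notify_workers(void) {
    pthread_mutex_lock(&lock);
    has_work = true;
    pthread_cond_broadcast(&cond); // wake all sleeping workers
    pthread_mutex_unlock(&lock);
}
```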
+
+The data files in the bench result dir were generated on a MacBook Pro 2018 with
+32 GB 2400 MHz DDR4 memory, a 2.6 GHz 6-Core Intel Core i7-8850H, and Intel UHD
+Graphics 630 1536 MB.
+
+## Solution and Result
+
 In the current `master` branch, the `mul_mat` code runs in several implicit profiles.
 
 - pure cpu: INIT: very fast; COMPUTE: the computation time is proportional to M.
 - CUDA/CL: COMPUTE: de-quantization and mul_mat in GPU.
-- Accelerate/OpenBLAS: COMPUTE: de-quantization in CPU, mul_mat in GPU.
+- Accelerate/OpenBLAS/BLIS: COMPUTE: de-quantization in CPU, mul_mat in GPU.
 
-I observed the following "facts" on Accelerate/OpenBLAS. The following data were
-generated on a MacBook Pro 2018 with: 32 GB 2400 MHz DDR4 memory, 2.6 GHz 6-Core
-Intel Core i7-8850H, Intel UHD Graphics 630 1536 MB.
+I observed the following "facts" on Accelerate/OpenBLAS.
 
 - Whatever the M is, given N and K, the de-quantization time is constant (in theory).
 - The mul_mat time in GPU is heavy (tens to hundreds of ms), and goes up very slowly when
@@ -29,48 +52,7 @@ Intel Core i7-8850H, Intel UHD Graphics 630 1536 MB.
 large as NxK=4096x4096.
 
 In theory, if we split the COMPUTE stage as INIT + COMPUTE, we MAY speed up prompt
-eval time a lot: up to 50% for a large M range (e.g. 32 - 128) when the `use GPU`
-profile competes with the `pure CPU` profile. The following diagram demonstrates the
-`use GPU` profile (7B/Q4_0/Accelerate, INIT in CPU, COMPUTE in GPU). We can see
-the trends of how computing time changes with M.
-
-![7b_q4_0_accelerate use GPU time](./images/7b_q4_0_accelerate.png)
-
-Apart from being a bit slower (10% or so) than Accelerate, OpenBLAS behaves similarly
-to Accelerate. But BLIS is quite slow on my device. I will not show the images for
-them. You may want to have a look at [bench-out](bench-out/).
-
-ClBlast is far slower than Accelerate on my device. I had managed to make it
-run on my device, and split the COMPUTE stage into INIT + COMPUTE for demonstration
-purposes. Since the CPU de-quantization time is fairly small compared to the GPU time,
-the overall gain of running CPU INIT + GPU COMPUTE is small: no more than 20% for
-M in range \[32, 128\]. Anyway, let me show you the picture below.
-
-![7b_q4_0_cl use GPU time](./images/7b_q4_0_cl.png)
-
-The next two pictures demonstrate how `n_threads` affects the overall time
-across two config profiles. `#0` is CPU INIT + CPU COMPUTE, `#1` is CPU INIT + GPU
-COMPUTE. From these diagrams, given M, we can easily recognize the best config
-profile.
-
-4096x4096 and 4096x11008:
-
-![n_threads 1](./images/7b_q4_0_accelerate_nth-1.png)
-
-11008x4096 and 32000x4096:
-
-![n_threads 2](./images/7b_q4_0_accelerate_nth-2.png)
-
-I have been focused on the `threading` problem(s) since April this year. I dropped
-two simple pull requests because they were not solutions but noise. In the second
-pull request, @gerganov hinted to me about the `1-thread blas` problem, so I have
-followed this direction since then.
-
-At first I implemented the new threading framework that supports `wait/notify`;
-it is subtle and prone to deadlock. I'm happy that it works. I had tried benching
-online by comparing CPU/GPU time, but finally replaced that with an offline bench.
-To explicitly control the details (how to parallelize, when to wait, how to select
-the best execution plan), I had to define task configs and task profiles. Finally
-I got the demo solution as follows.
+eval time a lot: up to 50% for a large M range (e.g. 32 - 128).
 
 The eval time of long prompts decreases a lot. For example, with `examples/chat.sh` and
 4 threads, the prompt eval time of 99 tokens decreases by up to **-40%** on my device.
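A back-of-the-envelope check of those numbers (an editor's sketch; it assumes, per the "facts" above, that the BLAS profile's COMPUTE time is roughly de-quantization plus GEMM with the two about equal, and that only de-quantization parallelizes):

```c
// Fraction of time saved by moving de-quantization into a parallel INIT stage.
// With t_dequant ~= t_gemm this gives ~0.38 at nth = 4 and tends to 0.5 as
// nth grows, consistent with the "up to 50%" and "-40% with 4 threads" figures.
double split_saving(double t_dequant, double t_gemm, int nth) {
    double before = t_dequant + t_gemm;       // single-threaded BLAS profile
    double after  = t_dequant / nth + t_gemm; // parallel CPU INIT + BLAS COMPUTE
    return 1.0 - after / before;
}
```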
@@ -79,25 +61,7 @@ Tests across broad prompt sizes show a speedup of `10% - 40%`.
 The key factor for speeding up is parallelism, followed by more accurate profile
 selection. The latter, although secondary, is necessary in the case of multithreading.
 
-Just like Accelerate/OpenBLAS, the de-quantization time in CUDA/CL MAY NOT
-compete with multiple CPU threads on some devices. In that case, we can add profiles
-for them to run de-quantization in CPU and mul_mat in GPU.
-
-With explicit task config profiles and bench data, I expect that we will be able
-to run any task stage in any backend. For example: for q4_0, we could run INIT in
-CUDA and COMPUTE in Accelerate -- if the overall speed competes with other profiles.
-
-Anyway, the current solution is at the demo stage and is incomplete for various
-reasons; you will read about them in the following sections.
-
-The mul_mat related code keeps changing; it's a bit hard for me to follow up and
-merge/rebase again and again. I think the overall changes can speak for themselves,
-so it's time to initiate a discussion or pull request.
-
-I'm new to machine learning this year and have little knowledge of AI. There must
-be a lot of problems with this pull request, so please do not hesitate to advise.
-
-## Solutions
+In conclusion, let me list the key points of how it works.
 
 1. Update the mul_mat BLAS code: allow de-quantizing in CPU or GPU (if possible).
 2. Explicit task configs and profiles:
@@ -115,9 +79,15 @@ be a lot of problems with this pull request, please do not hesitate to advise.
 node, we select the fastest profile. When computing, the `dst->task_conf` along
 with `params` controls which part of the code runs.
 
-About how to select a profile, see the section "**How To Estimate Execution Time**".
+Furthermore, if the de-quantization time in CUDA/CL COULD NOT compete with multiple
+CPU threads on some devices, we can add profiles for them (just like Accelerate)
+to run de-quantization in CPU and mul_mat in GPU.
 
-**Explicitly configure task profiles**
+With explicit task config profiles and bench data, we are able to run any task
+stage in any backend. For example: for q4_0, we could run INIT in CUDA and COMPUTE
+in Accelerate -- if that makes sense.
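What "select the fastest profile" could look like, as an editor's sketch: per-profile stage times are benched at a few M values, interpolated to the node's actual M, and the minimum wins. All identifiers here are hypothetical, not this commit's actual API.

```c
// Hypothetical per-profile bench data: total times measured at n_m values of M.
struct profile_bench {
    int    n_m;
    int    m[128];     // benched M values, ascending
    double t_ms[128];  // total time (INIT + COMPUTE) at each M, in ms
};

// Linearly interpolate the benched times to an arbitrary M.
static double estimate_ms(const struct profile_bench *p, int M) {
    if (M <= p->m[0])          return p->t_ms[0];
    if (M >= p->m[p->n_m - 1]) return p->t_ms[p->n_m - 1];
    for (int i = 1; i < p->n_m; i++) {
        if (M <= p->m[i]) {
            double f = (double)(M - p->m[i - 1]) / (p->m[i] - p->m[i - 1]);
            return p->t_ms[i - 1] + f * (p->t_ms[i] - p->t_ms[i - 1]);
        }
    }
    return p->t_ms[p->n_m - 1];
}

// Pick the profile with the smallest estimated time for this node's M.
static int select_profile(const struct profile_bench *profiles, int n, int M) {
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (estimate_ms(&profiles[i], M) < estimate_ms(&profiles[best], M)) {
            best = i;
        }
    }
    return best;
}
```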
+
+## Task profile and task stage
 
 ```c
 // ggml.h
@@ -148,7 +118,44 @@ struct ggml_tensor {
 }
 ```
 
-## Limitations and TODOs
+## Misc Assets
+
+`prompt.sh` is a tool for benching `main`; it can generate questions of various
+lengths. Run `./examples/mulmat-tune/prompt.sh -h` for help. I had run it with
+`./examples/mulmat-tune/prompt.sh -b -f`.
+
+The [bench-out dir](./bench-out) contains various bench result files generated
+on my device.
+BTW, the `.*.cl.txt` files were generated by an [early version](https://github.com/ggerganov/llama.cpp/compare/master...mqy:blas-n_threads-fix-10#diff-e40acc281787b19c5975346837f154e7e75351733a9f9575317c64d2dbe38799).
+I solved the OOM problem, but unfortunately I'm unable to catch up with the latest updates.
+
+The [images dir](./images) contains images drawn from the [bench results](./bench-out)
+of N/K combinations from 7B and 13B. I compared two profiles for Q4_0.
+
+- #0: pure cpu. INIT in CPU with 1 thread, COMPUTE in CPU in parallel.
+- #2: use BLAS. INIT in CPU in parallel, COMPUTE with Accelerate with 1 thread.
+
+`#0_0` is read as "profile #0, stage 0 (INIT)"; `#0_1` is read as
+"profile #0, stage 1 (COMPUTE)". `#2` is profile #2.
+
+The data were generated like this:
+
+Bench:
+
+```
+./mulmat-tune-tool bench 7B --file mulmat-tune.7b.q4_0.txt
+```
+
+Analyze:
+
+```
+./mulmat-tune-tool analyze mulmat-tune.7b.q4_0.txt
+```
+
+The m512 bench data were created by running the bench 3 times and manually
+combining the results into one file.
+
+## Limitations
 
 - Only tested models 7B and 13B.
 - My OSes/devices cannot use CUDA, so I did not run benches for CUDA.
@@ -157,6 +164,13 @@ struct ggml_tensor {
 - Validation of bench data is incomplete.
 - and more ...
 
+The mul_mat related code keeps changing; it's a bit hard for me to follow up and
+merge/rebase again and again. I think the code and data can speak for themselves,
+so it's time to initiate a discussion or pull request.
+
+I'm new to machine learning this year and have little knowledge of AI. There must
+be a lot of problems with this pull request, so please do not hesitate to advise.
+
 ## How to Evaluate
 
 **Build**
@@ -168,28 +182,37 @@ struct ggml_tensor {
 **Bench**
 
 ```
-./mulmat-tune -h
 usage: ./mulmat-tune [bench ...] | [analyze FILE] | test | [-h | --help]
 
 bench [-m MODEL] [-t TYPE] [-f FILE] [-y]
---model MODEL     7B | 13B | 30B | 65B
-                  default 7B
---type TYPE       Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F32 | F16
-                  default Q4_0
---m_num M_NUM     number of M, max M = 16 * M_NUM
-                  requires: M_NUM in range [8, 16]
-                  default 8
---file FILE       data file to write
-                  default stdout
--y                always answer "yes" to all prompts
+--model MODEL     7B | 13B | 30B | 65B
+                  default 7B
+--type TYPE       Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F32 | F16
+                  default Q4_0
+--m_start M_START start value of M
+                  requires: even number
+                  default 16
+--m_step M_STEP   delta between adjacent M
+                  requires: even number, in range [2, 32]
+                  default 16
+--m_num M_NUM     number of M, max M = M_STEP * M_NUM
+                  requires: in range [2, 128]
+                  default 8
+--backend BACKEND blas backend: CUDA | CL | CBLAS
+                  default: auto detect
+--file FILE       data file to write
+                  default stdout
+-y                always answer "yes" to all prompts
 
 Tips on how to build with various BLAS vendors:
 
-* CUDA: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_CUBLAS=1 make
-* ClBlast: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_CLBLAST=1 make
-* Accelerate: make clean; LLAMA_NO_ACCELERATE= make
-* OpenBLAS: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_OPENBLAS=1 make
-* BLIS: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_BLIS=1 make
+CUDA:       make clean; LLAMA_CUBLAS=1 make
+ClBlast:    make clean; LLAMA_CLBLAST=1 make
+Accelerate: make clean; LLAMA_NO_ACCELERATE= make
+OpenBLAS:   make clean; LLAMA_NO_ACCELERATE=1 LLAMA_OPENBLAS=1 make
+BLIS:       make clean; LLAMA_NO_ACCELERATE=1 LLAMA_BLIS=1 make
+
+NOTE: to disable ACCELERATE, use LLAMA_NO_ACCELERATE=1
 ```
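As a quick sanity check of the M-related options (an editor's sketch; the assumption that M advances from `m_start` in steps of `m_step` is mine, though it matches `max M = M_STEP * M_NUM` for the defaults):

```c
#include <stdio.h>

// Print the M values a bench sweep would cover with the default options.
int main(void) {
    int m_start = 16, m_step = 16, m_num = 8; // defaults from the usage above
    for (int i = 0; i < m_num; i++) {
        printf("%d ", m_start + i * m_step);  // 16 32 48 64 80 96 112 128
    }
    printf("\n");
    return 0;
}
```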

Examples:
@@ -198,11 +221,11 @@ Examples:
 # run with default params (7B, Q4_0, ...)
 ./mulmat-tune
 
-# to run 13B and Q4_1 with always-yes
+# run 13B and Q4_1 with always-yes
 ./mulmat-tune bench --model 13B --type Q4_1 -y
 
-# customized m_step (32 * 16)
-./mulmat-tune bench --model 7B --m_step 32 --m_num 16
+# customized m_start, m_step, m_num
+./mulmat-tune bench --model 7B --m_start 8 --m_step 8 --m_num 8
 
 # save to file
 ./mulmat-tune bench --model 7B --file mulmat-tune.txt
@@ -230,7 +253,7 @@ The program will print a debug log whether or not it finds the file.
 $ ./mulmat-tune bench --m_num 2
 [BENCH] model: 7B, type: Q4_0, backend: CBLAS, BLAS vendor: ACCELERATE.
 
-1 7B 2 Q4_0 3 ACCELERATE 4 16 2 3
+1 7B 2 Q4_0 3 ACCELERATE 4 16 3
 0 0 0 0 1 0 -1 0 0
 -1 0 0 3 0 1 -1 0 0
 0 1 0 3 0 1 -1 0 0
@@ -257,7 +280,7 @@ See example files in dir [bench-out](bench-out) for details.
 head
 groups+
 
-head := version model type type_name backend blas_vendor n_shapes m_step num_m n_profiles
+head := version model type type_name backend blas_vendor n_shapes m_num n_profiles
 task_conf_profile+
 shape
 bench_item+
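To make the `head` line concrete, here is a minimal parse of the bench example shown earlier, following the grammar above (an editor's sketch; the buffer sizes and the parsing itself are illustrative, not the tool's actual reader):

```c
#include <stdio.h>

int main(void) {
    // The head line from the bench example above.
    const char *head = "1 7B 2 Q4_0 3 ACCELERATE 4 16 3";
    int  version, type, backend, n_shapes, m_num, n_profiles;
    char model[8], type_name[8], blas_vendor[16];
    int n = sscanf(head, "%d %7s %d %7s %d %15s %d %d %d",
                   &version, model, &type, type_name, &backend, blas_vendor,
                   &n_shapes, &m_num, &n_profiles);
    if (n == 9) {
        printf("version=%d model=%s type=%d (%s) backend=%d vendor=%s "
               "n_shapes=%d m_num=%d n_profiles=%d\n",
               version, model, type, type_name, backend, blas_vendor,
               n_shapes, m_num, n_profiles);
    }
    return 0;
}
```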
