
Commit 022c370

mulmat-tune-tool: add --m_start and refactor (better analyze cmd); document

1 parent: 2ea239a


41 files changed: +678, -1413 lines

examples/mulmat-tune/README.md

Lines changed: 113 additions & 90 deletions
@@ -1,6 +1,6 @@
 # Fine Tune MUL_MAT with Bench
 
-## Introduction
+## Background
 
 GGML defines three task types (stages): INIT, COMPUTE, FINALIZE. All nodes have a
 COMPUTE stage, some have an INIT stage, and FINALIZE is never used.
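To make the stage split concrete, here is a minimal sketch of how a compute function can branch on the task stage (an editor's illustration, not code from this commit; `ggml_compute_params` and the `GGML_TASK_*` constants come from ggml's internals, everything else is hypothetical):

```c
// Sketch: dispatching one mul_mat node over ggml's task stages.
// params->ith / params->nth are the worker's thread index and thread count.
static void mul_mat_stage_sketch(const struct ggml_compute_params * params,
                                 const struct ggml_tensor * src0,
                                 const struct ggml_tensor * src1,
                                 struct ggml_tensor * dst) {
    switch (params->type) {
    case GGML_TASK_INIT:
        // e.g. de-quantize src0 rows into the shared scratch buffer
        // (params->wdata), split across params->nth CPU threads.
        break;
    case GGML_TASK_COMPUTE:
        // e.g. run the BLAS GEMM on the de-quantized buffer with 1 thread,
        // or the pure-CPU vector-dot path with all threads.
        break;
    case GGML_TASK_FINALIZE:
        // never used on master, as noted above.
        break;
    }
    (void)src0; (void)src1; (void)dst; // placeholders in this sketch
}
```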
@@ -9,15 +9,38 @@ Generally speaking, code that runs in GPU (BLAS) MAY not be suitable for multiple OS
 threads -- sometimes very slow, while the CPU can run it and scales well. So to speed up
 large prompts and avoid spinning, the master code forces 1 thread when (M >= 32).
 
+So, the problems to solve:
+
+1. The `xxx_mul_mat_can_use_blas` rule is not accurate. We need a bench.
+2. With multiple threads, when running those heavy BLAS stages, we have to avoid
+   busy spinning.
+
+I have been focused on the `threading` problem(s) since April this year. I dropped
+two simple pull requests because they were not solutions but noise. In the second
+pull request, @gerganov hinted to me about the `1-thread blas` problem, so I have
+followed this direction since then.
+
+When I observed that de-quantization takes about half of the total time, I thought
+it was a good place to start. At first I implemented the new threading framework
+that supports `wait/notify`; it is subtle and prone to deadlock, and I'm happy
+that it works. I had tried benching online by comparing CPU/GPU time, but finally
+replaced that with an offline bench. To explicitly control the details (how to
+parallelize, when to wait, how to select the best execution plan), I had to define
+task configs and task profiles. Finally I got the demo solution.
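The `wait/notify` idea can be sketched with POSIX condition variables (an editor's illustration; the actual framework in this commit is more involved): workers sleep instead of busy-spinning while a 1-thread BLAS stage runs.

```c
#include <pthread.h>
#include <stdbool.h>

// Minimal wait/notify sketch: workers block on a condition variable while a
// single thread runs a heavy BLAS stage, instead of busy-spinning.
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool has_work = false;

void worker_wait(void) {
    pthread_mutex_lock(&lock);
    while (!has_work) {            // loop guards against spurious wakeups
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

void notify_workers(void) {
    pthread_mutex_lock(&lock);
    has_work = true;
    pthread_cond_broadcast(&cond); // wake all sleeping workers
    pthread_mutex_unlock(&lock);
}
```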
+
+The data files in the bench result dir were generated on a MacBook Pro 2018 with
+32 GB 2400 MHz DDR4 memory, a 2.6 GHz 6-Core Intel Core i7-8850H, and Intel UHD
+Graphics 630 1536 MB.
+
+## Solution and Result
+
 In the current `master` branch, the `mul_mat` code runs in several implicit profiles.
 
 - pure cpu: INIT: very fast; COMPUTE: the computation time is proportional to M.
 - CUDA/CL: COMPUTE: de-quantization and mul_mat in GPU.
-- Accelerate/OpenBLAS: COMPUTE: de-quantization in CPU, mul_mat in GPU.
+- Accelerate/OpenBLAS/BLIS: COMPUTE: de-quantization in CPU, mul_mat in GPU.
 
-I observed the following "facts" on Accelerate/OpenBLAS. The following data were
-generated on a MacBook Pro 2018 with: 32 GB 2400 MHz DDR4 memory, 2.6 GHz 6-Core
-Intel Core i7-8850H, Intel UHD Graphics 630 1536 MB.
+I observed the following "facts" on Accelerate/OpenBLAS.
 
 - Whatever the M is, given N and K, the de-quantization time is constant (in theory).
 - The mul_mat time in GPU is heavy (tens to hundreds of ms), and goes up very slowly when
@@ -29,48 +52,7 @@ Intel Core i7-8850H, Intel UHD Graphics 630 1536 MB.
 large as NxK=4096x4096.
 
 In theory, if we split the COMPUTE stage as INIT + COMPUTE, we MAY speed up prompt
-eval time a lot: up to 50% for a large M range (e.g. 32 - 128) when the `use GPU`
-profile competes with the `pure CPU` profile. The following diagram demonstrates the
-`use GPU` profile (7B/Q4_0/Accelerate, INIT in CPU, COMPUTE in GPU). We can see
-the trends of how computing time changes with M.
-
-![7b_q4_0_accelerate use GPU time](./images/7b_q4_0_accelerate.png)
-
-Apart from being a bit slower (10% or so) than Accelerate, OpenBLAS behaves similarly
-to Accelerate. But BLIS is quite slow on my device. I will not show the images for
-them. You may want to have a look at [bench-out](bench-out/).
-
-ClBlast is far slower than Accelerate on my device. I had managed to make it
-run on my device, and split the COMPUTE stage into INIT + COMPUTE for demonstration
-purposes. Since the CPU de-quantization time is fairly small compared to the GPU time,
-the overall gain of running CPU INIT + GPU COMPUTE is small: no more than 20% for
-M in range \[32, 128\]. Anyway, let me show you the picture below.
-
-![7b_q4_0_cl use GPU time](./images/7b_q4_0_cl.png)
-
-The next two pictures demonstrate how `n_threads` affects the overall time
-across two config profiles. `#0` is CPU INIT + CPU COMPUTE, `#1` is CPU INIT + GPU
-COMPUTE. From these diagrams, given M, we can easily recognize the best config
-profile.
-
-4096x4096 and 4096x11008:
-
-![n_threads 1](./images/7b_q4_0_accelerate_nth-1.png)
-
-11008x4096 and 32000x4096:
-
-![n_threads 2](./images/7b_q4_0_accelerate_nth-2.png)
-
-I have been focused on the `threading` problem(s) since April this year. I dropped
-two simple pull requests because they were not solutions but noise. In the second
-pull request, @gerganov hinted to me about the `1-thread blas` problem, so I have
-followed this direction since then.
-
-At first I implemented the new threading framework that supports `wait/notify`;
-it is subtle and prone to deadlock. I'm happy that it works. I had tried benching
-online by comparing CPU/GPU time, but finally replaced that with an offline bench.
-To explicitly control the details (how to parallelize, when to wait, how to select
-the best execution plan), I had to define task configs and task profiles. Finally
-I got the demo solution as follows.
+eval time a lot: up to 50% for a large M range (e.g. 32 - 128).
 
 The eval time of long prompts decreases a lot. For example, with `examples/chat.sh` and
 4 threads, the prompt eval time of 99 tokens decreases by up to **-40%** on my device.
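A back-of-the-envelope check of those numbers (an editor's sketch; it assumes, per the "facts" above, that the BLAS profile's COMPUTE time is roughly de-quantization plus GEMM with the two about equal, and that only de-quantization parallelizes):

```c
// Fraction of time saved by moving de-quantization into a parallel INIT stage.
// With t_dequant ~= t_gemm this gives ~0.38 at nth = 4 and tends to 0.5 as
// nth grows, consistent with the "up to 50%" and "-40% with 4 threads" figures.
double split_saving(double t_dequant, double t_gemm, int nth) {
    double before = t_dequant + t_gemm;       // single-threaded BLAS profile
    double after  = t_dequant / nth + t_gemm; // parallel CPU INIT + BLAS COMPUTE
    return 1.0 - after / before;
}
```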
@@ -79,25 +61,7 @@ Tests across broad prompt sizes show a speedup of `10% - 40%`.
 The key factor for speeding up is parallelism, followed by more accurate profile
 selection. The latter, although secondary, is necessary in the case of multithreading.
 
-Just like Accelerate/OpenBLAS, the de-quantization time in CUDA/CL MAY NOT
-compete with multiple CPU threads on some devices. In that case, we can add profiles
-for them to run de-quantization in CPU and mul_mat in GPU.
-
-With explicit task config profiles and bench data, I expect that we will be able
-to run any task stage in any backend. For example: for q4_0, we could run INIT in
-CUDA and COMPUTE in Accelerate -- if the overall speed competes with other profiles.
-
-Anyway, the current solution is at the demo stage and is incomplete for various
-reasons; you will read about them in the following sections.
-
-The mul_mat related code keeps changing; it's a bit hard for me to follow up and
-merge/rebase again and again. I think the overall changes can speak for themselves,
-so it's time to initiate a discussion or pull request.
-
-I'm new to machine learning this year and have little knowledge of AI. There must
-be a lot of problems with this pull request, so please do not hesitate to advise.
-
-## Solutions
+In conclusion, let me list the key points of how it works.
 
 1. Update the mul_mat BLAS code: allow de-quantizing in CPU or GPU (if possible).
 2. Explicit task configs and profiles:
@@ -115,9 +79,15 @@ be a lot of problems with this pull request, please do not hesitate to advise.
 node, we select the fastest profile. When computing, the `dst->task_conf` along
 with `params` controls which part of the code runs.
 
-About how to select a profile, see the section "**How To Estimate Execution Time**".
+Furthermore, if the de-quantization time in CUDA/CL COULD NOT compete with multiple
+CPU threads on some devices, we can add profiles for them (just like Accelerate)
+to run de-quantization in CPU and mul_mat in GPU.
 
-**Explicitly configure task profiles**
+With explicit task config profiles and bench data, we are able to run any task
+stage in any backend. For example: for q4_0, we could run INIT in CUDA and COMPUTE
+in Accelerate -- if that makes sense.
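What "select the fastest profile" could look like, as an editor's sketch: per-profile stage times are benched at a few M values, interpolated to the node's actual M, and the minimum wins. All identifiers here are hypothetical, not this commit's actual API.

```c
// Hypothetical per-profile bench data: total times measured at n_m values of M.
struct profile_bench {
    int    n_m;
    int    m[128];     // benched M values, ascending
    double t_ms[128];  // total time (INIT + COMPUTE) at each M, in ms
};

// Linearly interpolate the benched times to an arbitrary M.
static double estimate_ms(const struct profile_bench *p, int M) {
    if (M <= p->m[0])          return p->t_ms[0];
    if (M >= p->m[p->n_m - 1]) return p->t_ms[p->n_m - 1];
    for (int i = 1; i < p->n_m; i++) {
        if (M <= p->m[i]) {
            double f = (double)(M - p->m[i - 1]) / (p->m[i] - p->m[i - 1]);
            return p->t_ms[i - 1] + f * (p->t_ms[i] - p->t_ms[i - 1]);
        }
    }
    return p->t_ms[p->n_m - 1];
}

// Pick the profile with the smallest estimated time for this node's M.
static int select_profile(const struct profile_bench *profiles, int n, int M) {
    int best = 0;
    for (int i = 1; i < n; i++) {
        if (estimate_ms(&profiles[i], M) < estimate_ms(&profiles[best], M)) {
            best = i;
        }
    }
    return best;
}
```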
+
+## Task profile and task stage
 
 ```c
 // ggml.h
@@ -148,7 +118,44 @@ struct ggml_tensor {
 }
 ```
 
-## Limitations and TODOs
+## Misc Assets
+
+`prompt.sh` is a tool for benching `main`; it can generate questions of various
+lengths. Run `./examples/mulmat-tune/prompt.sh -h` for help. I had run it with
+`./examples/mulmat-tune/prompt.sh -b -f`.
+
+The [bench-out dir](./bench-out) contains various bench result files generated
+on my device.
+BTW, the `.*.cl.txt` files were generated by an [early version](https://github.com/ggerganov/llama.cpp/compare/master...mqy:blas-n_threads-fix-10#diff-e40acc281787b19c5975346837f154e7e75351733a9f9575317c64d2dbe38799).
+I solved the OOM problem, but unfortunately I'm unable to catch up with the latest updates.
+
+The [images dir](./images) contains images drawn from the [bench results](./bench-out)
+of N/K combinations from 7B and 13B. I compared two profiles for Q4_0.
+
+- #0: pure cpu. INIT in CPU with 1 thread, COMPUTE in CPU in parallel.
+- #2: use BLAS. INIT in CPU in parallel, COMPUTE with Accelerate with 1 thread.
+
+`#0_0` is read as "profile #0, stage 0 (INIT)"; `#0_1` is read as
+"profile #0, stage 1 (COMPUTE)". `#2` is profile #2.
+
+The data were generated like this:
+
+Bench:
+
+```
+./mulmat-tune-tool bench 7B --file mulmat-tune.7b.q4_0.txt
+```
+
+Analyze:
+
+```
+./mulmat-tune-tool analyze mulmat-tune.7b.q4_0.txt
+```
+
+The m512 bench data were created by running the bench 3 times and manually
+combining the results into one file.
+
+## Limitations
 
 - Only tested models 7B and 13B.
 - My OSes/devices cannot use CUDA, so I did not run benches for CUDA.
@@ -157,6 +164,13 @@ struct ggml_tensor {
 - Validation of bench data is incomplete.
 - and more ...
 
+The mul_mat related code keeps changing; it's a bit hard for me to follow up and
+merge/rebase again and again. I think the code and data can speak for themselves,
+so it's time to initiate a discussion or pull request.
+
+I'm new to machine learning this year and have little knowledge of AI. There must
+be a lot of problems with this pull request, so please do not hesitate to advise.
+
 ## How to Evaluate
 
 **Build**
@@ -168,28 +182,37 @@ struct ggml_tensor {
 **Bench**
 
 ```
-./mulmat-tune -h
 usage: ./mulmat-tune [bench ...] | [analyze FILE] | test | [-h | --help]
 
 bench [-m MODEL] [-t TYPE] [-f FILE] [-y]
---model MODEL     7B | 13B | 30B | 65B
-                  default 7B
---type TYPE       Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F32 | F16
-                  default Q4_0
---m_num M_NUM     number of M, max M = 16 * M_NUM
-                  requires: M_NUM in range [8, 16]
-                  default 8
---file FILE       data file to write
-                  default stdout
--y                always answer "yes" to all prompts
+--model MODEL     7B | 13B | 30B | 65B
+                  default 7B
+--type TYPE       Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F32 | F16
+                  default Q4_0
+--m_start M_START start value of M
+                  requires: even number
+                  default 16
+--m_step M_STEP   delta between adjacent M
+                  requires: even number, in range [2, 32]
+                  default 16
+--m_num M_NUM     number of M, max M = M_STEP * M_NUM
+                  requires: in range [2, 128]
+                  default 8
+--backend BACKEND blas backend: CUDA | CL | CBLAS
+                  default: auto detect
+--file FILE       data file to write
+                  default stdout
+-y                always answer "yes" to all prompts
 
 Tips on how to build with various BLAS vendors:
 
-* CUDA: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_CUBLAS=1 make
-* ClBlast: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_CLBLAST=1 make
-* Accelerate: make clean; LLAMA_NO_ACCELERATE= make
-* OpenBLAS: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_OPENBLAS=1 make
-* BLIS: make clean; LLAMA_NO_ACCELERATE=1 LLAMA_BLIS=1 make
+CUDA:       make clean; LLAMA_CUBLAS=1 make
+ClBlast:    make clean; LLAMA_CLBLAST=1 make
+Accelerate: make clean; LLAMA_NO_ACCELERATE= make
+OpenBLAS:   make clean; LLAMA_NO_ACCELERATE=1 LLAMA_OPENBLAS=1 make
+BLIS:       make clean; LLAMA_NO_ACCELERATE=1 LLAMA_BLIS=1 make
+
+NOTE: to disable ACCELERATE, use LLAMA_NO_ACCELERATE=1
 ```
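As a quick sanity check of the M-related options (an editor's sketch; the assumption that M advances from `m_start` in steps of `m_step` is mine, though it matches `max M = M_STEP * M_NUM` for the defaults):

```c
#include <stdio.h>

// Print the M values a bench sweep would cover with the default options.
int main(void) {
    int m_start = 16, m_step = 16, m_num = 8; // defaults from the usage above
    for (int i = 0; i < m_num; i++) {
        printf("%d ", m_start + i * m_step);  // 16 32 48 64 80 96 112 128
    }
    printf("\n");
    return 0;
}
```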

Examples:
@@ -198,11 +221,11 @@ Examples:
 # run with default params (7B, Q4_0, ...)
 ./mulmat-tune
 
-# to run 13B and Q4_1 with always-yes
+# run 13B and Q4_1 with always-yes
 ./mulmat-tune bench --model 13B --type Q4_1 -y
 
-# customized m_step (32 * 16)
-./mulmat-tune bench --model 7B --m_step 32 --m_num 16
+# customized m_start, m_step, m_num
+./mulmat-tune bench --model 7B --m_start 8 --m_step 8 --m_num 8
 
 # save to file
 ./mulmat-tune bench --model 7B --file mulmat-tune.txt
@@ -230,7 +253,7 @@ The program will print a debug log whether or not it finds the file.
 $ ./mulmat-tune bench --m_num 2
 [BENCH] model: 7B, type: Q4_0, backend: CBLAS, BLAS vendor: ACCELERATE.
 
-1 7B 2 Q4_0 3 ACCELERATE 4 16 2 3
+1 7B 2 Q4_0 3 ACCELERATE 4 16 3
 0 0 0 0 1 0 -1 0 0
 -1 0 0 3 0 1 -1 0 0
 0 1 0 3 0 1 -1 0 0
@@ -257,7 +280,7 @@ See example files in dir [bench-out](bench-out) for details.
 head
 groups+
 
-head := version model type type_name backend blas_vendor n_shapes m_step num_m n_profiles
+head := version model type type_name backend blas_vendor n_shapes m_num n_profiles
 task_conf_profile+
 shape
 bench_item+
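To make the `head` line concrete, here is a minimal parse of the bench example shown earlier, following the grammar above (an editor's sketch; the buffer sizes and the parsing itself are illustrative, not the tool's actual reader):

```c
#include <stdio.h>

int main(void) {
    // The head line from the bench example above.
    const char *head = "1 7B 2 Q4_0 3 ACCELERATE 4 16 3";
    int  version, type, backend, n_shapes, m_num, n_profiles;
    char model[8], type_name[8], blas_vendor[16];
    int n = sscanf(head, "%d %7s %d %7s %d %15s %d %d %d",
                   &version, model, &type, type_name, &backend, blas_vendor,
                   &n_shapes, &m_num, &n_profiles);
    if (n == 9) {
        printf("version=%d model=%s type=%d (%s) backend=%d vendor=%s "
               "n_shapes=%d m_num=%d n_profiles=%d\n",
               version, model, type, type_name, backend, blas_vendor,
               n_shapes, m_num, n_profiles);
    }
    return 0;
}
```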
