To readers: this is a lengthy introduction to this big PR. I recommend you review
it in the following steps:

1. Read this file first.
2. Pull and check out the code, build and evaluate `mulmat-tune`.
3. Optionally run `tests/test-mulmat-tune`.
4. Place the file `mulmat-tune.txt` (generated by `mulmat-tune`) into the parent directory.
GGML defines three task types (stages): INIT, COMPUTE, FINALIZE. All nodes have a
COMPUTE stage, some have an INIT stage, and FINALIZE is never used.
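For reference, these stages correspond to the task types in `ggml.h`, roughly as
below (paraphrased from the upstream header; the comments are mine):

```c
// ggml.h (upstream, paraphrased): the per-node task stages
enum ggml_task_type {
    GGML_TASK_INIT = 0, // optional preparation, e.g. de-quantization
    GGML_TASK_COMPUTE,  // the main computation, possibly multi-threaded
    GGML_TASK_FINALIZE, // defined but never used in practice
};
```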
GGML supports pure CPU, CUDA, CL and BLAS (Accelerate/OpenBLAS/BLIS).

In recent days the compute framework has kept evolving and has introduced a lot
of new concepts: backend, vendor. In this CL, I follow the `backend` definition
and define a new backend `BLAS` for Accelerate etc.
Generally speaking, as of my tests, it is very slow to run `cblas_sgemm` with
multiple OS threads. In the master code I saw that `ggml_graph_compute` sets
`node->n_tasks = 1` when computing `mul_mat` on `GPU` or with `BLAS`.

To speed up large prompts and avoid spinning, master sets `n_threads` to 1 when
the token size >= 32. `ggml_compute_forward_mul_mat_use_blas` checks:
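The condition is roughly the following (a paraphrase of the master code as I
read it; the exact thresholds may differ between versions):

```c
// Paraphrase of ggml.c (master): BLAS pays off only for big matrices.
static bool ggml_compute_forward_mul_mat_use_blas(
        const struct ggml_tensor * src0,
        const struct ggml_tensor * src1,
              struct ggml_tensor * dst) {
    const int64_t ne10 = src1->ne[0];
    const int64_t ne0  = dst->ne[0];
    const int64_t ne1  = dst->ne[1];

    return ggml_is_contiguous(src0) &&
           ggml_is_contiguous(src1) &&
           (ne0 >= 32 && ne1 >= 32 && ne10 >= 32);
}
```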
... two simple pull requests because they are not solutions but noise. In the
second pull request, @gerganov hinted at the `1-thread blas` problem, so I have
followed this direction since then. My benchmarks on 7B/13B and all Qx_x show
that for all N/K (e.g. 4096x4096), Accelerate always runs faster than pure CPU
at `M = 32`; the difference may vary by up to `30%` with 1 thread.
When I observed that de-quantization takes about half of the total time, I
thought it was a good place to start. At first I implemented the new threading
framework that supports `wait/notify`; it is subtle and fragile with respect to
deadlocks. I'm happy that it works. I had tried to bench online by comparing
the time with and without BLAS; finally I replaced that with an offline bench.
To explicitly control the details (how to parallelize, when to wait, how to
select the best execution plan), I had to define task configs and task
profiles. Finally, after over a month of busy coding, I got the demo solution.
In the current `master` branch, the `mul_mat` code runs in several implicit profiles:

- pure CPU: INIT is very fast; the COMPUTE time is proportional to M.
- CUDA/CL: COMPUTE: de-quantization and mul_mat run on GPU.
- CPU with BLAS (Accelerate/OpenBLAS/BLIS): COMPUTE: mul_mat with BLAS.
I observed the following "facts" with Accelerate/OpenBLAS:

- Whatever M is, given N and K, the de-quantization time is constant (in theory).
- The mul_mat time with BLAS is heavy (tens to hundreds of ms) and grows very
  slowly when M doubles.
- In the large M range, the de-quantization time accounts for a large proportion
  of the total calculation time. For example, for 7B, Q4_0, NxK=4096x4096, the
  proportion of de-quantization time exceeds or nears `50%` for the large M
  range (up to 128). Other NxK combinations show a similar situation.
In theory, if we split the COMPUTE stage into INIT + COMPUTE, we MAY speed up
prompt eval time a lot: up to 50% for the large M range.
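For example (an illustrative calculation, not a measurement): if de-quantization
is 50% of a COMPUTE-only time T, and the split-out INIT scales perfectly over 8
threads, the total becomes roughly 0.5T/8 + 0.5T ≈ 0.56T, i.e. about a 44%
reduction.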
... selecting. The latter, although secondary, is necessary for time estimation.

In conclusion, let me list the key points of how it works.
1. Explicit task configs and profiles:
   * define conf profiles to control which parts of the code run;
     for example, run the COMPUTE stage with or without BLAS.
   * define, for any stage, whether to compute in parallel and whether to idle-wait.
   * non-existent compute stages are not called.
2. New threading framework: combine `spin` + `wait/notify` (see the sketch after
   this list). Without wait, busy-spinning workers may cause overheating and
   slow down the overall speed. The mul_mat compute time is long enough (often
   tens of ms), so the wait/notify overhead (at most tens of us) is OK.
3. Update the mul_mat code to support the new task profiles.
4. A tune tool for benching. With bench data, given N/K and n_threads, we can
   estimate the total computing time for any M (even one outside the bench
   range), and thus select the fastest profile.
5. On llama start, it loads the bench data from file (if it exists). Before
   computing a node, we select the fastest profile. When computing, both
   `dst->task_conf` and `params` control which parts of the code run.
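Here is a minimal sketch of the `spin` + `wait/notify` combination from item 2.
All names are hypothetical; this is not the actual implementation in this PR:

```c
// Hybrid spin + wait/notify worker (hypothetical names, sketch only).
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mm_shared {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    atomic_int      n_ready;  // tasks ready for workers
    bool            use_wait; // stage conf says: sleep instead of spinning
};

// Spin briefly for low wakeup latency; if the stage is configured to wait,
// fall back to pthread_cond_wait so idle workers do not burn CPU. mul_mat
// often takes tens of ms, so tens of us of wakeup overhead is acceptable.
static void mm_wait_for_work(struct mm_shared * sh) {
    for (int i = 0; i < 1000; i++) {                  // spin phase
        if (atomic_load(&sh->n_ready) > 0) return;
    }
    if (!sh->use_wait) {                              // pure spin profile
        while (atomic_load(&sh->n_ready) == 0) { }
        return;
    }
    pthread_mutex_lock(&sh->mutex);                   // wait phase
    while (atomic_load(&sh->n_ready) == 0) {
        pthread_cond_wait(&sh->cond, &sh->mutex);
    }
    pthread_mutex_unlock(&sh->mutex);
}
```

The main thread pairs this with incrementing `n_ready` under the mutex and
calling `pthread_cond_broadcast` to wake sleeping workers.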
```c
// ggml.h

enum ggml_backend {
    GGML_BACKEND_CPU  = 0,
    GGML_BACKEND_CUDA = 1,
    GGML_BACKEND_CL   = 2,
    GGML_BACKEND_BLAS = 3, // has API `cblas_sgemm`
};

struct ggml_tensor {
    // ... existing fields plus the new task conf (referenced elsewhere as
    // `dst->task_conf`); the exact declaration is elided in this excerpt
};

void ggml_internal_compute_forward_mul_mat(
    const struct ggml_tensor * src0,
    const struct ggml_tensor * src1,
    struct ggml_tensor * dst);

// examples/mulmat-tune/mulmat-tune.h

struct ggml_task_stage {
    // ... fields elided; per the file format described below, a stage conf
    // carries: backend, parallel, wait
};

// ...
```

Analyze:

```
./mulmat-tune analyze 7b.q4_0.txt
```
Example bench analyze output looks as follows; it contains 6 shapes (blocks):
```
N=4096,K=4096

#M,1,2,4,8,16,32,64,128,256,512
#0_0_nth=1, 0.002, 0.003, 0.004, 0.009, 0.018, 0.036, 0.072, 0.151, 0.344, 0.719
#0_1_nth=1, 1.268, 2.172, 3.371, 6.502, 13.068, 25.508, 52.853, 107.543, 213.692, 427.260
#0___nth=1, 1.270, 2.175, 3.375, 6.511, 13.086, 25.544, 52.925, 107.694, 214.036, 427.979
#1_1_nth=1, 17.509, 18.774, 15.617, 16.059, 17.877, 16.456, 18.331, 21.935, 34.317, 63.208
#1___nth=1, 17.509, 18.774, 15.617, 16.059, 17.877, 16.456, 18.331, 21.935, 34.317, 63.208
#2_0_nth=1, 12.349, 13.309, 11.130, 10.999, 11.231, 10.987, 10.742, 11.003, 11.106, 10.851
#2_1_nth=1, 2.646, 5.259, 4.252, 4.542, 5.857, 6.642, 7.239, 11.009, 23.582, 52.081
#2___nth=1, 14.995, 18.568, 15.382, 15.541, 17.088, 17.629, 17.981, 22.012, 34.688, 62.932

#0_1_nth=2, 0.634, 1.086, 1.685, 3.251, 6.534, 12.754, 26.426, 53.771, 106.846, 213.630
#0___nth=2, 0.636, 1.089, 1.689, 3.260, 6.552, 12.790, 26.498, 53.922, 107.190, 214.349
#2_0_nth=2, 6.174, 6.654, 5.565, 5.499, 5.615, 5.493, 5.371, 5.501, 5.553, 5.425
#2___nth=2, 8.820, 11.913, 9.817, 10.041, 11.472, 12.135, 12.610, 16.510, 29.135, 57.506

#0_1_nth=4, 0.317, 0.543, 0.842, 1.625, 3.267, 6.377, 13.213, 26.885, 53.423, 106.815
#0___nth=4, 0.319, 0.546, 0.846, 1.634, 3.285, 6.413, 13.285, 27.036, 53.767, 107.534
#2_0_nth=4, 3.087, 3.327, 2.782, 2.749, 2.807, 2.746, 2.685, 2.750, 2.776, 2.712
#2___nth=4, 5.733, 8.586, 7.034, 7.291, 8.664, 9.388, 9.924, 13.759, 26.358, 54.793

#0_1_nth=8, 0.158, 0.271, 0.421, 0.812, 1.633, 3.188, 6.606, 13.442, 26.711, 53.407
#0___nth=8, 0.160, 0.274, 0.425, 0.821, 1.651, 3.224, 6.678, 13.593, 27.055, 54.126
#2_0_nth=8, 1.543, 1.663, 1.391, 1.374, 1.403, 1.373, 1.342, 1.375, 1.388, 1.356
#2___nth=8, 4.189, 6.922, 5.643, 5.916, 7.260, 8.015, 8.581, 12.384, 24.970, 53.437

N=4096,K=11008

...

N=11008,K=4096

...

N=32000,K=4096

...

N=128,K=M

...

N=M,K=128

...
```
Terms:

- #0: pure CPU. INIT with 1 thread, COMPUTE with N threads.
- #1: the `q_f32` BLAS implementation in master (when either
  `GGML_USE_ACCELERATE` or `GGML_USE_OPENBLAS` is defined).
- #2: `#1` split into `INIT` and `COMPUTE`, where INIT runs de-quantization
  with N threads and COMPUTE runs Accelerate with 1 thread.

`#0_0` is read as "profile #0, stage 0 (INIT)"; `#0_1` is read as "profile #0,
stage 1 (COMPUTE)"; `#0__` is read as the total time.

`nth=x` is read as "run with x thread(s)". When we know the 1-thread time of
every stage and whether each stage can be parallelized, we can estimate the
time for N threads.
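A sketch of that estimation rule (my reading of the idea, with hypothetical
names; the real tool would also interpolate over M):

```c
#include <stdbool.h>

// Estimate total time for nth threads from measured 1-thread stage times.
// t_stage[i]: 1-thread time (us) of stage i (INIT, COMPUTE, FINALIZE);
// 0 means the stage does not exist. parallel[i]: stage may use N threads.
static double estimate_total_us(const double t_stage[3],
                                const bool parallel[3], int nth) {
    double total = 0.0;
    for (int i = 0; i < 3; i++) {
        if (t_stage[i] <= 0.0) continue;           // stage absent
        total += parallel[i] ? t_stage[i] / nth    // assume linear scaling
                             : t_stage[i];         // serial stage
    }
    return total;
}
```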
## Limitations
```
bench [-m MODEL] [-t TYPE] [-f FILE] [-y]
...
--m_num M_NUM      number of M, the max M = 2^(M_NUM-1)
                   requires: in range [8, 12]
                   default 10
--backend BACKEND  backend: CUDA | CL | BLAS
                   default: auto detect
--n_pass           number of passes to run
                   default 3
...
                   default stdout
-y                 always answer "yes" to all prompts

Tips on how to build with various backend vendors:

CUDA:    make clean; LLAMA_CUBLAS=1  make
ClBlast: make clean; LLAMA_CLBLAST=1 make
...
NOTE: to disable ACCELERATE, use LLAMA_NO_ACCELERATE=1

./mulmat-tune bench --n_pass 1

# customized backend:
./mulmat-tune bench --backend BLAS

# save to file
./mulmat-tune bench --file mulmat-tune.txt
...
$ ./mulmat-tune bench
...
512 33 2832 0 0 313 0
```

**Informal Explanation**

```
head
groups+

head := version model type type_name backend backend_vendor n_shapes
shape+

# head
version: 1
model: "7B" | "13B" | "30B" | "65B"
type: 2 | 3 | 8 | 9 | 7 | 0 | 1
type_name: "Q4_0" | "Q4_1" | "Q5_0" | "Q5_1" | "Q8_0" | "F32" | "F16"
backend_vendor: "CUDA" | "CLBLAST" | "ACCELERATE" | "OPENBLAS" | "BLIS"
n_shapes: number of shapes

shape := N K m_num n_profiles
...
bench_item+

task_conf_profile: stage_conf(init) stage_conf(compute) stage_conf(finalize)
stage_conf: backend parallel wait
backend: -1 (UNKNOWN) | 0 (CPU) | 1 (CUDA) | 2 (CL) | 3 (BLAS)
parallel: 0 | 1
wait: 0 | 1
...
```

Time unit is `us`. A column is all zeros when that stage does not exist.
For Accelerate/OpenBLAS `mul_mat_q_f32`, there are three profiles:

- `pure CPU`: INIT in CPU, and COMPUTE without BLAS (N threads).
- `use BLAS 1`: COMPUTE (1 thread): de-quantize in CPU and mul_mat with BLAS.
- `use BLAS 2`: INIT (N threads): de-quantize in CPU; COMPUTE (1 thread):
  mul_mat with BLAS.
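Using the `stage_conf: backend parallel wait` format above, the three profiles
might be encoded as follows. This is my own illustrative guess at plausible
values, not a row copied from a real bench file:

```
# INIT      COMPUTE    FINALIZE    (each stage_conf is: backend parallel wait)
  0 0 0     0 1 0      -1 0 0      # pure CPU
 -1 0 0     3 0 1      -1 0 0      # use BLAS 1
  0 1 0     3 0 1      -1 0 0      # use BLAS 2
```

Setting `wait = 1` on the 1-thread BLAS COMPUTE stage matches the threading
idea above: idle workers sleep instead of spinning while one thread runs
`cblas_sgemm`.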
For any thread number `nth`, when the INIT stage can only run with 1 thread but
the COMPUTE stage can run with N threads, then:
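a plausible completion of this estimate (my reconstruction, consistent with the
estimation rule above) is:

```
total(nth) ≈ t_init(1) + t_compute(1) / nth
```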