whisper : fix bench regression #1275

Merged · 2 commits into master · Sep 12, 2023

Conversation

ggerganov (Member) commented Sep 12, 2023

close: #1272 #1273

@bobqianic @dereklll @nchudleigh

Can you try this branch and show me the results of the following commands, compared to 6780c98:

make clean
make -j
./extra/bench-all.sh

Here is mine on M2 Ultra:

$ ./extra/bench-all.sh
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 52.23 GB/s (1 thread)
sum:    -536871564.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     9.1 GFLOPS (128 runs) | Q4_1     9.0 GFLOPS (128 runs)
  64 x   64: Q5_0     9.4 GFLOPS (128 runs) | Q5_1    11.4 GFLOPS (128 runs) | Q8_0    16.8 GFLOPS (128 runs)
  64 x   64: F16     16.0 GFLOPS (128 runs) | F32     16.4 GFLOPS (128 runs)
 128 x  128: Q4_0   105.3 GFLOPS (128 runs) | Q4_1    96.3 GFLOPS (128 runs)
 128 x  128: Q5_0   103.9 GFLOPS (128 runs) | Q5_1   103.1 GFLOPS (128 runs) | Q8_0   107.8 GFLOPS (128 runs)
 128 x  128: F16    109.3 GFLOPS (128 runs) | F32    113.8 GFLOPS (128 runs)
 256 x  256: Q4_0   451.1 GFLOPS (128 runs) | Q4_1   396.9 GFLOPS (128 runs)
 256 x  256: Q5_0   442.1 GFLOPS (128 runs) | Q5_1   450.8 GFLOPS (128 runs) | Q8_0   469.8 GFLOPS (128 runs)
 256 x  256: F16    490.6 GFLOPS (128 runs) | F32    508.1 GFLOPS (128 runs)
 512 x  512: Q4_0  1021.9 GFLOPS (128 runs) | Q4_1   936.9 GFLOPS (128 runs)
 512 x  512: Q5_0  1138.6 GFLOPS (128 runs) | Q5_1  1135.5 GFLOPS (128 runs) | Q8_0  1019.1 GFLOPS (128 runs)
 512 x  512: F16   1198.5 GFLOPS (128 runs) | F32   1530.6 GFLOPS (128 runs)
1024 x 1024: Q4_0  1817.3 GFLOPS (128 runs) | Q4_1  1628.3 GFLOPS (128 runs)
1024 x 1024: Q5_0  1816.7 GFLOPS (128 runs) | Q5_1  1899.6 GFLOPS (128 runs) | Q8_0  1772.0 GFLOPS (128 runs)
1024 x 1024: F16   2044.1 GFLOPS (128 runs) | F32   2410.2 GFLOPS (128 runs)
2048 x 2048: Q4_0  3302.0 GFLOPS (128 runs) | Q4_1  2883.1 GFLOPS (128 runs)
2048 x 2048: Q5_0  3076.9 GFLOPS (128 runs) | Q5_1  3178.2 GFLOPS (128 runs) | Q8_0  3397.5 GFLOPS (128 runs)
2048 x 2048: F16   3308.1 GFLOPS (128 runs) | F32   3420.7 GFLOPS (128 runs)
4096 x 4096: Q4_0  3391.7 GFLOPS ( 25 runs) | Q4_1  3163.5 GFLOPS ( 24 runs)
4096 x 4096: Q5_0  3339.2 GFLOPS ( 25 runs) | Q5_1  3355.0 GFLOPS ( 25 runs) | Q8_0  3282.0 GFLOPS ( 24 runs)
4096 x 4096: F16   3453.9 GFLOPS ( 26 runs) | F32   3545.1 GFLOPS ( 26 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON BLAS | tiny | 4 | 45 | 161 | 6780c98 |
| <todo> | <todo> |  NEON BLAS | base | 4 | 61 | 266 | 6780c98 |
| <todo> | <todo> |  NEON BLAS | small | 4 | 146 | 849 | 6780c98 |
| <todo> | <todo> |  NEON BLAS | medium | 4 | 423 | 2193 | 6780c98 |
| <todo> | <todo> |  NEON BLAS | large | 4 | 842 | 3617 | 6780c98 |
  • this PR
$ ./extra/bench-all.sh
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 52.33 GB/s (1 thread)
sum:    -536871564.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     8.0 GFLOPS (128 runs) | Q4_1     7.9 GFLOPS (128 runs)
  64 x   64: Q5_0     8.0 GFLOPS (128 runs) | Q5_1    13.0 GFLOPS (128 runs) | Q8_0    14.3 GFLOPS (128 runs)
  64 x   64: F16     14.5 GFLOPS (128 runs) | F32     14.9 GFLOPS (128 runs)
 128 x  128: Q4_0    86.0 GFLOPS (128 runs) | Q4_1    89.3 GFLOPS (128 runs)
 128 x  128: Q5_0    93.4 GFLOPS (128 runs) | Q5_1    96.7 GFLOPS (128 runs) | Q8_0    78.2 GFLOPS (128 runs)
 128 x  128: F16    102.4 GFLOPS (128 runs) | F32    102.2 GFLOPS (128 runs)
 256 x  256: Q4_0   433.2 GFLOPS (128 runs) | Q4_1   386.7 GFLOPS (128 runs)
 256 x  256: Q5_0   421.1 GFLOPS (128 runs) | Q5_1   430.3 GFLOPS (128 runs) | Q8_0   447.3 GFLOPS (128 runs)
 256 x  256: F16    471.6 GFLOPS (128 runs) | F32    485.0 GFLOPS (128 runs)
 512 x  512: Q4_0   681.7 GFLOPS (128 runs) | Q4_1   769.8 GFLOPS (128 runs)
 512 x  512: Q5_0   797.1 GFLOPS (128 runs) | Q5_1   808.3 GFLOPS (128 runs) | Q8_0   680.8 GFLOPS (128 runs)
 512 x  512: F16    780.8 GFLOPS (128 runs) | F32   1066.9 GFLOPS (128 runs)
1024 x 1024: Q4_0  1130.5 GFLOPS (128 runs) | Q4_1   910.1 GFLOPS (128 runs)
1024 x 1024: Q5_0  1176.1 GFLOPS (128 runs) | Q5_1  1175.4 GFLOPS (128 runs) | Q8_0   994.8 GFLOPS (128 runs)
1024 x 1024: F16   1579.1 GFLOPS (128 runs) | F32   1799.8 GFLOPS (128 runs)
2048 x 2048: Q4_0  3193.6 GFLOPS (128 runs) | Q4_1  2765.2 GFLOPS (128 runs)
2048 x 2048: Q5_0  3020.0 GFLOPS (128 runs) | Q5_1  3094.0 GFLOPS (128 runs) | Q8_0  3188.2 GFLOPS (128 runs)
2048 x 2048: F16   3399.7 GFLOPS (128 runs) | F32   3406.8 GFLOPS (128 runs)
4096 x 4096: Q4_0  3345.7 GFLOPS ( 25 runs) | Q4_1  3088.3 GFLOPS ( 23 runs)
4096 x 4096: Q5_0  3256.4 GFLOPS ( 24 runs) | Q5_1  3309.9 GFLOPS ( 25 runs) | Q8_0  3348.8 GFLOPS ( 25 runs)
4096 x 4096: F16   3371.8 GFLOPS ( 25 runs) | F32   3473.8 GFLOPS ( 26 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON BLAS | tiny | 4 | 39 | 104 | 39c4fc5 |
| <todo> | <todo> |  NEON BLAS | base | 4 | 57 | 200 | 39c4fc5 |
| <todo> | <todo> |  NEON BLAS | small | 4 | 147 | 594 | 39c4fc5 |
| <todo> | <todo> |  NEON BLAS | medium | 4 | 425 | 1649 | 39c4fc5 |
| <todo> | <todo> |  NEON BLAS | large | 4 | 846 | 2827 | 39c4fc5 |
  • after 09a6325de56856490ae9046bf0030ceedc04028a
  64 x   64: Q4_0     9.1 GFLOPS (128 runs) | Q4_1     9.1 GFLOPS (128 runs)
  64 x   64: Q5_0     9.2 GFLOPS (128 runs) | Q5_1    11.8 GFLOPS (128 runs) | Q8_0    16.5 GFLOPS (128 runs)
  64 x   64: F16     16.9 GFLOPS (128 runs) | F32     16.3 GFLOPS (128 runs)
 128 x  128: Q4_0   102.0 GFLOPS (128 runs) | Q4_1    96.9 GFLOPS (128 runs)
 128 x  128: Q5_0   101.1 GFLOPS (128 runs) | Q5_1   104.8 GFLOPS (128 runs) | Q8_0   107.3 GFLOPS (128 runs)
 128 x  128: F16    110.0 GFLOPS (128 runs) | F32    112.2 GFLOPS (128 runs)
 256 x  256: Q4_0   456.4 GFLOPS (128 runs) | Q4_1   399.4 GFLOPS (128 runs)
 256 x  256: Q5_0   446.2 GFLOPS (128 runs) | Q5_1   446.3 GFLOPS (128 runs) | Q8_0   473.0 GFLOPS (128 runs)
 256 x  256: F16    490.1 GFLOPS (128 runs) | F32    508.5 GFLOPS (128 runs)
 512 x  512: Q4_0   963.2 GFLOPS (128 runs) | Q4_1   893.9 GFLOPS (128 runs)
 512 x  512: Q5_0  1092.0 GFLOPS (128 runs) | Q5_1  1067.5 GFLOPS (128 runs) | Q8_0   960.0 GFLOPS (128 runs)
 512 x  512: F16   1125.6 GFLOPS (128 runs) | F32   1493.3 GFLOPS (128 runs)
1024 x 1024: Q4_0  1816.4 GFLOPS (128 runs) | Q4_1  1550.5 GFLOPS (128 runs)
1024 x 1024: Q5_0  1799.5 GFLOPS (128 runs) | Q5_1  1895.6 GFLOPS (128 runs) | Q8_0  1722.0 GFLOPS (128 runs)
1024 x 1024: F16   2000.7 GFLOPS (128 runs) | F32   2349.3 GFLOPS (128 runs)
2048 x 2048: Q4_0  3290.1 GFLOPS (128 runs) | Q4_1  2850.6 GFLOPS (128 runs)
2048 x 2048: Q5_0  3127.4 GFLOPS (128 runs) | Q5_1  3181.2 GFLOPS (128 runs) | Q8_0  3300.4 GFLOPS (128 runs)
2048 x 2048: F16   3330.3 GFLOPS (128 runs) | F32   3458.5 GFLOPS (128 runs)
4096 x 4096: Q4_0  3065.2 GFLOPS ( 23 runs) | Q4_1  3011.1 GFLOPS ( 22 runs)
4096 x 4096: Q5_0  3243.4 GFLOPS ( 24 runs) | Q5_1  3183.4 GFLOPS ( 24 runs) | Q8_0  3260.7 GFLOPS ( 24 runs)
4096 x 4096: F16   3297.0 GFLOPS ( 24 runs) | F32   3418.8 GFLOPS ( 25 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  NEON BLAS | tiny | 4 | 38 | 106 | 09a6325 |
| <todo> | <todo> |  NEON BLAS | base | 4 | 57 | 203 | 09a6325 |
| <todo> | <todo> |  NEON BLAS | small | 4 | 147 | 593 | 09a6325 |
| <todo> | <todo> |  NEON BLAS | medium | 4 | 425 | 1733 | 09a6325 |
| <todo> | <todo> |  NEON BLAS | large | 4 | 843 | 2895 | 09a6325 |

bobqianic (Collaborator) commented Sep 12, 2023

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  8
  On-line CPU(s) list:   0-7
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 7 5700X 8-Core Processor
    CPU family:          25
    Model:               33
    Thread(s) per core:  2
    Core(s) per socket:  4
    Socket(s):           1
    Stepping:            2
    BogoMIPS:            6800.08
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm
Virtualization features: 
  Hypervisor vendor:     Microsoft
  Virtualization type:   full
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)
  L3:                    32 MiB (1 instance)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-7
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

39c4fc5

root@ubuntu:/home/test/whisper.cpp-39c4fc59dd72cd941bbbdeb470162fb42edaea15# ./extra/bench-all.sh
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 18.99 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     5.9 GFLOPS (128 runs) | Q4_1     6.8 GFLOPS (128 runs)
  64 x   64: Q5_0     6.9 GFLOPS (128 runs) | Q5_1     6.6 GFLOPS (128 runs) | Q8_0     7.0 GFLOPS (128 runs)
  64 x   64: F16      7.1 GFLOPS (128 runs) | F32      7.1 GFLOPS (128 runs)
 128 x  128: Q4_0    40.3 GFLOPS (128 runs) | Q4_1    38.7 GFLOPS (128 runs)
 128 x  128: Q5_0    38.6 GFLOPS (128 runs) | Q5_1    36.5 GFLOPS (128 runs) | Q8_0    44.1 GFLOPS (128 runs)
 128 x  128: F16     43.4 GFLOPS (128 runs) | F32     43.5 GFLOPS (128 runs)
 256 x  256: Q4_0   112.6 GFLOPS (128 runs) | Q4_1   108.3 GFLOPS (128 runs)
 256 x  256: Q5_0    96.2 GFLOPS (128 runs) | Q5_1    87.3 GFLOPS (128 runs) | Q8_0   133.5 GFLOPS (128 runs)
 256 x  256: F16    132.9 GFLOPS (128 runs) | F32    131.1 GFLOPS (128 runs)
 512 x  512: Q4_0   161.1 GFLOPS (128 runs) | Q4_1   158.0 GFLOPS (128 runs)
 512 x  512: Q5_0   137.7 GFLOPS (128 runs) | Q5_1   129.2 GFLOPS (128 runs) | Q8_0   193.3 GFLOPS (128 runs)
 512 x  512: F16    187.0 GFLOPS (128 runs) | F32    166.4 GFLOPS (128 runs)
1024 x 1024: Q4_0   182.5 GFLOPS ( 85 runs) | Q4_1   185.8 GFLOPS ( 87 runs)
1024 x 1024: Q5_0   150.3 GFLOPS ( 71 runs) | Q5_1   143.2 GFLOPS ( 67 runs) | Q8_0   223.5 GFLOPS (105 runs)
1024 x 1024: F16    214.7 GFLOPS (100 runs) | F32    169.5 GFLOPS ( 79 runs)
2048 x 2048: Q4_0   182.2 GFLOPS ( 11 runs) | Q4_1   181.8 GFLOPS ( 11 runs)
2048 x 2048: Q5_0   142.8 GFLOPS (  9 runs) | Q5_1   143.8 GFLOPS (  9 runs) | Q8_0   234.5 GFLOPS ( 14 runs)
2048 x 2048: F16    188.6 GFLOPS ( 12 runs) | F32    164.2 GFLOPS ( 10 runs)
4096 x 4096: Q4_0   177.7 GFLOPS (  3 runs) | Q4_1   167.1 GFLOPS (  3 runs)
4096 x 4096: Q5_0   147.4 GFLOPS (  3 runs) | Q5_1   126.7 GFLOPS (  3 runs) | Q8_0   190.6 GFLOPS (  3 runs)
4096 x 4096: F16    182.1 GFLOPS (  3 runs) | F32    111.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  AVX2 | tiny | 4 | 47 | 389 | 39c4fc5 |
| <todo> | <todo> |  AVX2 | base | 4 | 57 | 756 | 39c4fc5 |
| <todo> | <todo> |  AVX2 | small | 4 | 149 | 3352 | 39c4fc5 |
| <todo> | <todo> |  AVX2 | medium | 4 | 421 | 9026 | 39c4fc5 |
| <todo> | <todo> |  AVX2 | large | 4 | 837 | 17317 | 39c4fc5 |

6780c98

root@ubuntu:/home/test/whisper.cpp-6780c98e193c19decb99157496c74046dd0e4aac# ./extra/bench-all.sh
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 18.93 GB/s (1 thread)
sum:    -536869898.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     6.2 GFLOPS (128 runs) | Q4_1     6.7 GFLOPS (128 runs)
  64 x   64: Q5_0     7.2 GFLOPS (128 runs) | Q5_1     6.8 GFLOPS (128 runs) | Q8_0     7.1 GFLOPS (128 runs)
  64 x   64: F16      7.3 GFLOPS (128 runs) | F32      7.5 GFLOPS (128 runs)
 128 x  128: Q4_0    41.9 GFLOPS (128 runs) | Q4_1    42.5 GFLOPS (128 runs)
 128 x  128: Q5_0    40.0 GFLOPS (128 runs) | Q5_1    39.3 GFLOPS (128 runs) | Q8_0    45.9 GFLOPS (128 runs)
 128 x  128: F16     43.6 GFLOPS (128 runs) | F32     44.7 GFLOPS (128 runs)
 256 x  256: Q4_0   120.6 GFLOPS (128 runs) | Q4_1   120.1 GFLOPS (128 runs)
 256 x  256: Q5_0    99.0 GFLOPS (128 runs) | Q5_1    92.6 GFLOPS (128 runs) | Q8_0   138.2 GFLOPS (128 runs)
 256 x  256: F16    128.0 GFLOPS (128 runs) | F32    114.9 GFLOPS (128 runs)
 512 x  512: Q4_0   158.2 GFLOPS (128 runs) | Q4_1   162.2 GFLOPS (128 runs)
 512 x  512: Q5_0   140.5 GFLOPS (128 runs) | Q5_1   132.3 GFLOPS (128 runs) | Q8_0   204.1 GFLOPS (128 runs)
 512 x  512: F16    177.7 GFLOPS (128 runs) | F32    135.5 GFLOPS (128 runs)
1024 x 1024: Q4_0   170.2 GFLOPS ( 80 runs) | Q4_1   180.1 GFLOPS ( 84 runs)
1024 x 1024: Q5_0   145.6 GFLOPS ( 68 runs) | Q5_1   143.3 GFLOPS ( 67 runs) | Q8_0   211.4 GFLOPS ( 99 runs)
1024 x 1024: F16    199.5 GFLOPS ( 93 runs) | F32    142.4 GFLOPS ( 67 runs)
2048 x 2048: Q4_0   158.7 GFLOPS ( 10 runs) | Q4_1   159.7 GFLOPS ( 10 runs)
2048 x 2048: Q5_0   124.3 GFLOPS (  8 runs) | Q5_1   127.7 GFLOPS (  8 runs) | Q8_0   217.7 GFLOPS ( 13 runs)
2048 x 2048: F16    191.9 GFLOPS ( 12 runs) | F32    133.1 GFLOPS (  8 runs)
4096 x 4096: Q4_0   172.2 GFLOPS (  3 runs) | Q4_1   164.4 GFLOPS (  3 runs)
4096 x 4096: Q5_0   114.6 GFLOPS (  3 runs) | Q5_1   129.6 GFLOPS (  3 runs) | Q8_0   179.1 GFLOPS (  3 runs)
4096 x 4096: F16    119.7 GFLOPS (  3 runs) | F32     50.1 GFLOPS (  3 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> |  AVX2 | tiny | 4 | 42 | 414 | 6780c98 |
| <todo> | <todo> |  AVX2 | base | 4 | 62 | 860 | 6780c98 |
| <todo> | <todo> |  AVX2 | small | 4 | 148 | 3048 | 6780c98 |
| <todo> | <todo> |  AVX2 | medium | 4 | 424 | 10190 | 6780c98 |
| <todo> | <todo> |  AVX2 | large | 4 | 839 | 18234 | 6780c98 |

dereklll commented Sep 12, 2023

39c4fc5:

./extra/bench-all.sh                     
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 51.55 GB/s (1 thread)
sum:    -536871564.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     8.3 GFLOPS (128 runs) | Q4_1     8.8 GFLOPS (128 runs)
  64 x   64: Q5_0     8.7 GFLOPS (128 runs) | Q5_1    10.1 GFLOPS (128 runs) | Q8_0    14.2 GFLOPS (128 runs)
  64 x   64: F16     13.8 GFLOPS (128 runs) | F32     14.2 GFLOPS (128 runs)
 128 x  128: Q4_0    93.3 GFLOPS (128 runs) | Q4_1    86.5 GFLOPS (128 runs)
 128 x  128: Q5_0    92.4 GFLOPS (128 runs) | Q5_1    92.4 GFLOPS (128 runs) | Q8_0    96.1 GFLOPS (128 runs)
 128 x  128: F16     98.8 GFLOPS (128 runs) | F32    101.9 GFLOPS (128 runs)
 256 x  256: Q4_0   421.7 GFLOPS (128 runs) | Q4_1   375.7 GFLOPS (128 runs)
 256 x  256: Q5_0   409.0 GFLOPS (128 runs) | Q5_1   421.4 GFLOPS (128 runs) | Q8_0   436.2 GFLOPS (128 runs)
 256 x  256: F16    458.4 GFLOPS (128 runs) | F32    477.8 GFLOPS (128 runs)
 512 x  512: Q4_0   813.5 GFLOPS (128 runs) | Q4_1   781.8 GFLOPS (128 runs)
 512 x  512: Q5_0   915.9 GFLOPS (128 runs) | Q5_1   926.3 GFLOPS (128 runs) | Q8_0   771.6 GFLOPS (128 runs)
 512 x  512: F16    890.6 GFLOPS (128 runs) | F32   1211.5 GFLOPS (128 runs)
1024 x 1024: Q4_0  1352.4 GFLOPS (128 runs) | Q4_1   905.6 GFLOPS (128 runs)
1024 x 1024: Q5_0  1328.9 GFLOPS (128 runs) | Q5_1  1436.4 GFLOPS (128 runs) | Q8_0  1087.4 GFLOPS (128 runs)
1024 x 1024: F16   1364.2 GFLOPS (128 runs) | F32   1740.5 GFLOPS (128 runs)
2048 x 2048: Q4_0  3228.5 GFLOPS (128 runs) | Q4_1  2799.6 GFLOPS (128 runs)
2048 x 2048: Q5_0  2968.0 GFLOPS (128 runs) | Q5_1  3094.5 GFLOPS (128 runs) | Q8_0  3214.1 GFLOPS (128 runs)
2048 x 2048: F16   3410.8 GFLOPS (128 runs) | F32   3516.9 GFLOPS (128 runs)
4096 x 4096: Q4_0  3350.7 GFLOPS ( 25 runs) | Q4_1  3132.2 GFLOPS ( 23 runs)
4096 x 4096: Q5_0  3292.5 GFLOPS ( 24 runs) | Q5_1  3289.1 GFLOPS ( 24 runs) | Q8_0  3346.0 GFLOPS ( 25 runs)
4096 x 4096: F16   3368.1 GFLOPS ( 25 runs) | F32   3477.8 GFLOPS ( 26 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON BLAS | base | 4 | 58 | 205 | 39c4fc5 |

6780c98:

 ./extra/bench-all.sh     
Usage: ./bench.sh [n_threads] [encoder-only]

Running memcpy benchmark

memcpy: 51.88 GB/s (1 thread)
sum:    -536871564.000000

Running ggml_mul_mat benchmark with 4 threads

  64 x   64: Q4_0     9.3 GFLOPS (128 runs) | Q4_1     9.6 GFLOPS (128 runs)
  64 x   64: Q5_0     9.9 GFLOPS (128 runs) | Q5_1     9.8 GFLOPS (128 runs) | Q8_0    13.7 GFLOPS (128 runs)
  64 x   64: F16     16.4 GFLOPS (128 runs) | F32     16.4 GFLOPS (128 runs)
 128 x  128: Q4_0   100.0 GFLOPS (128 runs) | Q4_1    75.4 GFLOPS (128 runs)
 128 x  128: Q5_0    87.4 GFLOPS (128 runs) | Q5_1   103.7 GFLOPS (128 runs) | Q8_0   107.5 GFLOPS (128 runs)
 128 x  128: F16    110.4 GFLOPS (128 runs) | F32    112.3 GFLOPS (128 runs)
 256 x  256: Q4_0   442.1 GFLOPS (128 runs) | Q4_1   387.2 GFLOPS (128 runs)
 256 x  256: Q5_0   433.0 GFLOPS (128 runs) | Q5_1   445.9 GFLOPS (128 runs) | Q8_0   463.9 GFLOPS (128 runs)
 256 x  256: F16    476.2 GFLOPS (128 runs) | F32    501.2 GFLOPS (128 runs)
 512 x  512: Q4_0  1005.7 GFLOPS (128 runs) | Q4_1   925.9 GFLOPS (128 runs)
 512 x  512: Q5_0  1094.3 GFLOPS (128 runs) | Q5_1  1101.4 GFLOPS (128 runs) | Q8_0   988.5 GFLOPS (128 runs)
 512 x  512: F16   1211.2 GFLOPS (128 runs) | F32   1561.7 GFLOPS (128 runs)
1024 x 1024: Q4_0  1774.2 GFLOPS (128 runs) | Q4_1  1669.6 GFLOPS (128 runs)
1024 x 1024: Q5_0  1980.2 GFLOPS (128 runs) | Q5_1  1870.5 GFLOPS (128 runs) | Q8_0  1738.3 GFLOPS (128 runs)
1024 x 1024: F16   2071.7 GFLOPS (128 runs) | F32   2358.5 GFLOPS (128 runs)
2048 x 2048: Q4_0  3357.1 GFLOPS (128 runs) | Q4_1  3004.0 GFLOPS (128 runs)
2048 x 2048: Q5_0  3254.1 GFLOPS (128 runs) | Q5_1  3268.6 GFLOPS (128 runs) | Q8_0  3365.0 GFLOPS (128 runs)
2048 x 2048: F16   3356.5 GFLOPS (128 runs) | F32   3702.7 GFLOPS (128 runs)
4096 x 4096: Q4_0  3394.4 GFLOPS ( 25 runs) | Q4_1  3134.9 GFLOPS ( 23 runs)
4096 x 4096: Q5_0  3292.7 GFLOPS ( 24 runs) | Q5_1  3345.6 GFLOPS ( 25 runs) | Q8_0  3334.6 GFLOPS ( 25 runs)
4096 x 4096: F16   3389.3 GFLOPS ( 25 runs) | F32   3450.2 GFLOPS ( 26 runs)

Running benchmark for all models
This can take a while!

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | NEON BLAS | base | 4 | 61 | 272 | 6780c98 |

ggerganov (Member, Author) commented:

Thank you for the data. After commit 09a6325, which I just pushed, the performance with OpenBLAS should be restored.

bobqianic (Collaborator) commented:

> Thank you for the data. After commit 09a6325, which I just pushed, the performance with OpenBLAS should be restored.

Thanks!

ggerganov merged commit 3fec211 into master on Sep 12, 2023
bdonkey added a commit to bdonkey/whisper.cpp that referenced this pull request Sep 13, 2023
* master: (96 commits)
  whisper : fix bench regression + fix performance when using CPU BLAS (ggml-org#1275)
  whisper : faster beam_search sampling via reduced KV cache copies (ggml-org#1243)
  java : fixed signing of java artifact using gradle (ggml-org#1267)
  ci : try to fix gradle action (ggml-org#1265)
  gitignore : update
  sync : ggml (HBM + Metal + style) (ggml-org#1264)
  ci : upgrade gradle to 2.4.2 (ggml-org#1263)
  sync : ggml (CUDA faster rope)
  cmake : noramlize case (ggml-org#1129)
  build : do not use _GNU_SOURCE gratuitously (ggml-org#1129)
  examples : fix build + compile warnings (close ggml-org#1256)
  models : add quantum models to download-ggml-model.sh (ggml-org#1235)
  whisper.android : bump gradle plugin and dependencies + a lint pass (ggml-org#1255)
  sign jar for Maven Central repo
  whisper.android : address ARM's big.LITTLE arch by checking cpu info (ggml-org#1254)
  make : fix detection of AVX2 on macOS (ggml-org#1250)
  ggml : posixify pagesize (ggml-org#1251)
  configured publishing.repositories
  ggml : sync latest llama.cpp (view_src + alloc improvements) (ggml-org#1247)
  make : improve cpuinfo handling on x86 hosts (ggml-org#1238)
  ...
didzis pushed a commit to didzis/whisper.cpp that referenced this pull request Sep 30, 2023
jacobwu-b pushed commits to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023
vonstring pushed a commit to vonstring/whisper.cpp that referenced this pull request Nov 7, 2023
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024

Each of these referencing commits carries the same message:

…gml-org#1275)

* whisper : fix bench regression

* ggml : use sched_yield when using BLAS + add comment

Successfully merging this pull request may close these issues.

ggml_new_object: not enough space in the context's memory pool