whisper : fix bench regression #1275
Merged
Conversation
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 7 5700X 8-Core Processor
CPU family: 25
Model: 33
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Stepping: 2
BogoMIPS: 6800.08
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm
Virtualization features:
Hypervisor vendor: Microsoft
Virtualization type: full
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
L3: 32 MiB (1 instance)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-7
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected

root@ubuntu:/home/test/whisper.cpp-39c4fc59dd72cd941bbbdeb470162fb42edaea15# ./extra/bench-all.sh
Usage: ./bench.sh [n_threads] [encoder-only]
Running memcpy benchmark
memcpy: 18.99 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 5.9 GFLOPS (128 runs) | Q4_1 6.8 GFLOPS (128 runs)
64 x 64: Q5_0 6.9 GFLOPS (128 runs) | Q5_1 6.6 GFLOPS (128 runs) | Q8_0 7.0 GFLOPS (128 runs)
64 x 64: F16 7.1 GFLOPS (128 runs) | F32 7.1 GFLOPS (128 runs)
128 x 128: Q4_0 40.3 GFLOPS (128 runs) | Q4_1 38.7 GFLOPS (128 runs)
128 x 128: Q5_0 38.6 GFLOPS (128 runs) | Q5_1 36.5 GFLOPS (128 runs) | Q8_0 44.1 GFLOPS (128 runs)
128 x 128: F16 43.4 GFLOPS (128 runs) | F32 43.5 GFLOPS (128 runs)
256 x 256: Q4_0 112.6 GFLOPS (128 runs) | Q4_1 108.3 GFLOPS (128 runs)
256 x 256: Q5_0 96.2 GFLOPS (128 runs) | Q5_1 87.3 GFLOPS (128 runs) | Q8_0 133.5 GFLOPS (128 runs)
256 x 256: F16 132.9 GFLOPS (128 runs) | F32 131.1 GFLOPS (128 runs)
512 x 512: Q4_0 161.1 GFLOPS (128 runs) | Q4_1 158.0 GFLOPS (128 runs)
512 x 512: Q5_0 137.7 GFLOPS (128 runs) | Q5_1 129.2 GFLOPS (128 runs) | Q8_0 193.3 GFLOPS (128 runs)
512 x 512: F16 187.0 GFLOPS (128 runs) | F32 166.4 GFLOPS (128 runs)
1024 x 1024: Q4_0 182.5 GFLOPS ( 85 runs) | Q4_1 185.8 GFLOPS ( 87 runs)
1024 x 1024: Q5_0 150.3 GFLOPS ( 71 runs) | Q5_1 143.2 GFLOPS ( 67 runs) | Q8_0 223.5 GFLOPS (105 runs)
1024 x 1024: F16 214.7 GFLOPS (100 runs) | F32 169.5 GFLOPS ( 79 runs)
2048 x 2048: Q4_0 182.2 GFLOPS ( 11 runs) | Q4_1 181.8 GFLOPS ( 11 runs)
2048 x 2048: Q5_0 142.8 GFLOPS ( 9 runs) | Q5_1 143.8 GFLOPS ( 9 runs) | Q8_0 234.5 GFLOPS ( 14 runs)
2048 x 2048: F16 188.6 GFLOPS ( 12 runs) | F32 164.2 GFLOPS ( 10 runs)
4096 x 4096: Q4_0 177.7 GFLOPS ( 3 runs) | Q4_1 167.1 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 147.4 GFLOPS ( 3 runs) | Q5_1 126.7 GFLOPS ( 3 runs) | Q8_0 190.6 GFLOPS ( 3 runs)
4096 x 4096: F16 182.1 GFLOPS ( 3 runs) | F32 111.1 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | AVX2 | tiny | 4 | 47 | 389 | 39c4fc5 |
| <todo> | <todo> | AVX2 | base | 4 | 57 | 756 | 39c4fc5 |
| <todo> | <todo> | AVX2 | small | 4 | 149 | 3352 | 39c4fc5 |
| <todo> | <todo> | AVX2 | medium | 4 | 421 | 9026 | 39c4fc5 |
| <todo> | <todo> | AVX2 | large | 4 | 837 | 17317 | 39c4fc5 |

root@ubuntu:/home/test/whisper.cpp-6780c98e193c19decb99157496c74046dd0e4aac# ./extra/bench-all.sh
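For reference, the GFLOPS figures printed by the ggml_mul_mat benchmark correspond to the standard 2·N³ floating-point operation count of an N×N matrix multiplication (one multiply plus one add per inner-product term). A minimal sketch of that conversion — a hypothetical helper, not the actual bench code:

```c
// Hypothetical helper (not the actual bench code): convert the wall-clock
// time of one N x N x N matrix multiplication into GFLOPS, assuming the
// standard 2*N^3 operation count (one multiply + one add per term).
static double matmul_gflops(int n, double seconds) {
    return 2.0 * (double)n * n * n / seconds / 1e9;
}
```

As a sanity check, a 1024 x 1024 multiply finishing in 10 ms works out to about 214.7 GFLOPS, the same magnitude as the 1024-size rows above.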
Usage: ./bench.sh [n_threads] [encoder-only]
Running memcpy benchmark
memcpy: 18.93 GB/s (1 thread)
sum: -536869898.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 6.2 GFLOPS (128 runs) | Q4_1 6.7 GFLOPS (128 runs)
64 x 64: Q5_0 7.2 GFLOPS (128 runs) | Q5_1 6.8 GFLOPS (128 runs) | Q8_0 7.1 GFLOPS (128 runs)
64 x 64: F16 7.3 GFLOPS (128 runs) | F32 7.5 GFLOPS (128 runs)
128 x 128: Q4_0 41.9 GFLOPS (128 runs) | Q4_1 42.5 GFLOPS (128 runs)
128 x 128: Q5_0 40.0 GFLOPS (128 runs) | Q5_1 39.3 GFLOPS (128 runs) | Q8_0 45.9 GFLOPS (128 runs)
128 x 128: F16 43.6 GFLOPS (128 runs) | F32 44.7 GFLOPS (128 runs)
256 x 256: Q4_0 120.6 GFLOPS (128 runs) | Q4_1 120.1 GFLOPS (128 runs)
256 x 256: Q5_0 99.0 GFLOPS (128 runs) | Q5_1 92.6 GFLOPS (128 runs) | Q8_0 138.2 GFLOPS (128 runs)
256 x 256: F16 128.0 GFLOPS (128 runs) | F32 114.9 GFLOPS (128 runs)
512 x 512: Q4_0 158.2 GFLOPS (128 runs) | Q4_1 162.2 GFLOPS (128 runs)
512 x 512: Q5_0 140.5 GFLOPS (128 runs) | Q5_1 132.3 GFLOPS (128 runs) | Q8_0 204.1 GFLOPS (128 runs)
512 x 512: F16 177.7 GFLOPS (128 runs) | F32 135.5 GFLOPS (128 runs)
1024 x 1024: Q4_0 170.2 GFLOPS ( 80 runs) | Q4_1 180.1 GFLOPS ( 84 runs)
1024 x 1024: Q5_0 145.6 GFLOPS ( 68 runs) | Q5_1 143.3 GFLOPS ( 67 runs) | Q8_0 211.4 GFLOPS ( 99 runs)
1024 x 1024: F16 199.5 GFLOPS ( 93 runs) | F32 142.4 GFLOPS ( 67 runs)
2048 x 2048: Q4_0 158.7 GFLOPS ( 10 runs) | Q4_1 159.7 GFLOPS ( 10 runs)
2048 x 2048: Q5_0 124.3 GFLOPS ( 8 runs) | Q5_1 127.7 GFLOPS ( 8 runs) | Q8_0 217.7 GFLOPS ( 13 runs)
2048 x 2048: F16 191.9 GFLOPS ( 12 runs) | F32 133.1 GFLOPS ( 8 runs)
4096 x 4096: Q4_0 172.2 GFLOPS ( 3 runs) | Q4_1 164.4 GFLOPS ( 3 runs)
4096 x 4096: Q5_0 114.6 GFLOPS ( 3 runs) | Q5_1 129.6 GFLOPS ( 3 runs) | Q8_0 179.1 GFLOPS ( 3 runs)
4096 x 4096: F16 119.7 GFLOPS ( 3 runs) | F32 50.1 GFLOPS ( 3 runs)
Running benchmark for all models
This can take a while!
| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | -- | ------ | ----- | -- | ---- | ---- | ------ |
| <todo> | <todo> | AVX2 | tiny | 4 | 42 | 414 | 6780c98 |
| <todo> | <todo> | AVX2 | base | 4 | 62 | 860 | 6780c98 |
| <todo> | <todo> | AVX2 | small | 4 | 148 | 3048 | 6780c98 |
| <todo> | <todo> | AVX2 | medium | 4 | 424 | 10190 | 6780c98 |
| <todo> | <todo> | AVX2 | large | 4 | 839 | 18234 | 6780c98 |
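Putting the two Enc. columns side by side makes the regression visible; the following is a quick sanity check of the per-model change (encoder times copied from the two tables above, 6780c98 relative to 39c4fc5):

```c
#include <stdio.h>

// Encoder times (ms) from the two benchmark tables above
// (4 threads, AVX2): commit 39c4fc5 vs commit 6780c98.
static const char  *model_name[5]  = { "tiny", "base", "small", "medium", "large" };
static const double enc_39c4fc5[5] = {  389,  756, 3352,  9026, 17317 };
static const double enc_6780c98[5] = {  414,  860, 3048, 10190, 18234 };

// Percentage change of 'after' relative to 'before'.
static double enc_delta_pct(double before, double after) {
    return (after - before) / before * 100.0;
}

static void print_deltas(void) {
    // medium regresses ~+12.9% and base ~+13.8%, while small improves ~-9.1%
    for (int i = 0; i < 5; i++) {
        printf("%-6s %+5.1f%%\n", model_name[i],
               enc_delta_pct(enc_39c4fc5[i], enc_6780c98[i]));
    }
}
```

The medium and base models show the clearest slowdown on 6780c98, while small is actually faster there, which matches the noisy picture typical of short encoder benchmarks.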
./extra/bench-all.sh
Usage: ./bench.sh [n_threads] [encoder-only]
Running memcpy benchmark
memcpy: 51.55 GB/s (1 thread)
sum: -536871564.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 8.3 GFLOPS (128 runs) | Q4_1 8.8 GFLOPS (128 runs)
64 x 64: Q5_0 8.7 GFLOPS (128 runs) | Q5_1 10.1 GFLOPS (128 runs) | Q8_0 14.2 GFLOPS (128 runs)
64 x 64: F16 13.8 GFLOPS (128 runs) | F32 14.2 GFLOPS (128 runs)
128 x 128: Q4_0 93.3 GFLOPS (128 runs) | Q4_1 86.5 GFLOPS (128 runs)
128 x 128: Q5_0 92.4 GFLOPS (128 runs) | Q5_1 92.4 GFLOPS (128 runs) | Q8_0 96.1 GFLOPS (128 runs)
128 x 128: F16 98.8 GFLOPS (128 runs) | F32 101.9 GFLOPS (128 runs)
256 x 256: Q4_0 421.7 GFLOPS (128 runs) | Q4_1 375.7 GFLOPS (128 runs)
256 x 256: Q5_0 409.0 GFLOPS (128 runs) | Q5_1 421.4 GFLOPS (128 runs) | Q8_0 436.2 GFLOPS (128 runs)
256 x 256: F16 458.4 GFLOPS (128 runs) | F32 477.8 GFLOPS (128 runs)
512 x 512: Q4_0 813.5 GFLOPS (128 runs) | Q4_1 781.8 GFLOPS (128 runs)
512 x 512: Q5_0 915.9 GFLOPS (128 runs) | Q5_1 926.3 GFLOPS (128 runs) | Q8_0 771.6 GFLOPS (128 runs)
512 x 512: F16 890.6 GFLOPS (128 runs) | F32 1211.5 GFLOPS (128 runs)
1024 x 1024: Q4_0 1352.4 GFLOPS (128 runs) | Q4_1 905.6 GFLOPS (128 runs)
1024 x 1024: Q5_0 1328.9 GFLOPS (128 runs) | Q5_1 1436.4 GFLOPS (128 runs) | Q8_0 1087.4 GFLOPS (128 runs)
1024 x 1024: F16 1364.2 GFLOPS (128 runs) | F32 1740.5 GFLOPS (128 runs)
2048 x 2048: Q4_0 3228.5 GFLOPS (128 runs) | Q4_1 2799.6 GFLOPS (128 runs)
2048 x 2048: Q5_0 2968.0 GFLOPS (128 runs) | Q5_1 3094.5 GFLOPS (128 runs) | Q8_0 3214.1 GFLOPS (128 runs)
2048 x 2048: F16 3410.8 GFLOPS (128 runs) | F32 3516.9 GFLOPS (128 runs)
4096 x 4096: Q4_0 3350.7 GFLOPS ( 25 runs) | Q4_1 3132.2 GFLOPS ( 23 runs)
4096 x 4096: Q5_0 3292.5 GFLOPS ( 24 runs) | Q5_1 3289.1 GFLOPS ( 24 runs) | Q8_0 3346.0 GFLOPS ( 25 runs)
4096 x 4096: F16 3368.1 GFLOPS ( 25 runs) | F32 3477.8 GFLOPS ( 26 runs)
Running benchmark for all models
./extra/bench-all.sh
Usage: ./bench.sh [n_threads] [encoder-only]
Running memcpy benchmark
memcpy: 51.88 GB/s (1 thread)
sum: -536871564.000000
Running ggml_mul_mat benchmark with 4 threads
64 x 64: Q4_0 9.3 GFLOPS (128 runs) | Q4_1 9.6 GFLOPS (128 runs)
64 x 64: Q5_0 9.9 GFLOPS (128 runs) | Q5_1 9.8 GFLOPS (128 runs) | Q8_0 13.7 GFLOPS (128 runs)
64 x 64: F16 16.4 GFLOPS (128 runs) | F32 16.4 GFLOPS (128 runs)
128 x 128: Q4_0 100.0 GFLOPS (128 runs) | Q4_1 75.4 GFLOPS (128 runs)
128 x 128: Q5_0 87.4 GFLOPS (128 runs) | Q5_1 103.7 GFLOPS (128 runs) | Q8_0 107.5 GFLOPS (128 runs)
128 x 128: F16 110.4 GFLOPS (128 runs) | F32 112.3 GFLOPS (128 runs)
256 x 256: Q4_0 442.1 GFLOPS (128 runs) | Q4_1 387.2 GFLOPS (128 runs)
256 x 256: Q5_0 433.0 GFLOPS (128 runs) | Q5_1 445.9 GFLOPS (128 runs) | Q8_0 463.9 GFLOPS (128 runs)
256 x 256: F16 476.2 GFLOPS (128 runs) | F32 501.2 GFLOPS (128 runs)
512 x 512: Q4_0 1005.7 GFLOPS (128 runs) | Q4_1 925.9 GFLOPS (128 runs)
512 x 512: Q5_0 1094.3 GFLOPS (128 runs) | Q5_1 1101.4 GFLOPS (128 runs) | Q8_0 988.5 GFLOPS (128 runs)
512 x 512: F16 1211.2 GFLOPS (128 runs) | F32 1561.7 GFLOPS (128 runs)
1024 x 1024: Q4_0 1774.2 GFLOPS (128 runs) | Q4_1 1669.6 GFLOPS (128 runs)
1024 x 1024: Q5_0 1980.2 GFLOPS (128 runs) | Q5_1 1870.5 GFLOPS (128 runs) | Q8_0 1738.3 GFLOPS (128 runs)
1024 x 1024: F16 2071.7 GFLOPS (128 runs) | F32 2358.5 GFLOPS (128 runs)
2048 x 2048: Q4_0 3357.1 GFLOPS (128 runs) | Q4_1 3004.0 GFLOPS (128 runs)
2048 x 2048: Q5_0 3254.1 GFLOPS (128 runs) | Q5_1 3268.6 GFLOPS (128 runs) | Q8_0 3365.0 GFLOPS (128 runs)
2048 x 2048: F16 3356.5 GFLOPS (128 runs) | F32 3702.7 GFLOPS (128 runs)
4096 x 4096: Q4_0 3394.4 GFLOPS ( 25 runs) | Q4_1 3134.9 GFLOPS ( 23 runs)
4096 x 4096: Q5_0 3292.7 GFLOPS ( 24 runs) | Q5_1 3345.6 GFLOPS ( 25 runs) | Q8_0 3334.6 GFLOPS ( 25 runs)
4096 x 4096: F16 3389.3 GFLOPS ( 25 runs) | F32 3450.2 GFLOPS ( 26 runs)
Running benchmark for all models
Thank you for the data. After commit 09a6325 that I just pushed, the performance with OpenBLAS should now be restored.
Thanks!
bdonkey added a commit to bdonkey/whisper.cpp that referenced this pull request on Sep 13, 2023

* master: (96 commits)
whisper : fix bench regression + fix performance when using CPU BLAS (ggml-org#1275)
whisper : faster beam_search sampling via reduced KV cache copies (ggml-org#1243)
java : fixed signing of java artifact using gradle (ggml-org#1267)
ci : try to fix gradle action (ggml-org#1265)
gitignore : update
sync : ggml (HBM + Metal + style) (ggml-org#1264)
ci : upgrade gradle to 2.4.2 (ggml-org#1263)
sync : ggml (CUDA faster rope)
cmake : noramlize case (ggml-org#1129)
build : do not use _GNU_SOURCE gratuitously (ggml-org#1129)
examples : fix build + compile warnings (close ggml-org#1256)
models : add quantum models to download-ggml-model.sh (ggml-org#1235)
whisper.android : bump gradle plugin and dependencies + a lint pass (ggml-org#1255)
sign jar for Maven Central repo
whisper.android : address ARM's big.LITTLE arch by checking cpu info (ggml-org#1254)
make : fix detection of AVX2 on macOS (ggml-org#1250)
ggml : posixify pagesize (ggml-org#1251)
configured publishing.repositories
ggml : sync latest llama.cpp (view_src + alloc improvements) (ggml-org#1247)
make : improve cpuinfo handling on x86 hosts (ggml-org#1238)
...
didzis pushed a commit to didzis/whisper.cpp that referenced this pull request on Sep 30, 2023

…gml-org#1275)
* whisper : fix bench regression
* ggml : use sched_yield when using BLAS + add comment
jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request on Oct 24, 2023

…gml-org#1275)
* whisper : fix bench regression
* ggml : use sched_yield when using BLAS + add comment
vonstring pushed a commit to vonstring/whisper.cpp that referenced this pull request on Nov 7, 2023

…gml-org#1275)
* whisper : fix bench regression
* ggml : use sched_yield when using BLAS + add comment
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request on Dec 16, 2023

…gml-org#1275)
* whisper : fix bench regression
* ggml : use sched_yield when using BLAS + add comment
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request on Sep 23, 2024

…gml-org#1275)
* whisper : fix bench regression
* ggml : use sched_yield when using BLAS + add comment
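The second bullet in these commit messages ("ggml : use sched_yield when using BLAS") names the mechanism behind the fix: worker threads that spin-wait while BLAS runs the multiply on its own threads should yield the CPU instead of burning it. A minimal illustration of the idea, with hypothetical names — not the actual ggml code:

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

// Hypothetical sketch (not the actual ggml code): while BLAS runs the
// heavy sgemm on its own threads, ggml worker threads busy-spin on a
// completion flag, stealing CPU time from BLAS. Calling sched_yield()
// inside the wait loop hands the timeslice back to the BLAS threads.
static atomic_bool node_done = false;

static void wait_for_node(void) {
    while (!atomic_load(&node_done)) {
        sched_yield();  // yield instead of a pure busy-spin
    }
}

// Stand-in for the BLAS call completing on another thread.
static void *fake_blas_worker(void *arg) {
    (void)arg;
    usleep(1000);  // pretend to do the matrix multiply
    atomic_store(&node_done, true);
    return NULL;
}
```

On an oversubscribed machine this kind of yield can make the difference between BLAS getting full cores and every thread fighting for timeslices, which is consistent with the OpenBLAS numbers being restored after the fix.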
Closes #1272, #1273
@bobqianic @dereklll @nchudleigh
Can you try this branch and show me the results of the following command compared to 6780c98:
Here is mine on M2 Ultra:
09a6325de56856490ae9046bf0030ceedc04028a