'-Ofast' and '-march=native' provide significant speedup #252

ttsiodras · 2022-12-10T11:57:37Z

'-Ofast' and '-march=native' give a 2x speedup on machines with SSE (but no AVX) instructions. They should help other platforms, too.


ttsiodras · 2022-12-10T11:58:34Z

See #251 for details.

luke-jr · 2022-12-10T20:09:02Z

I get:

c++: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?

idk why compilers can't standardise this stuff, but I guess it should be arch-conditional.

ttsiodras · 2022-12-10T20:14:57Z

> I get:
> c++: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?
> idk why compilers can't standardise this stuff, but I guess it should be arch-conditional.

From the official GCC documentation ( https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html ):

-mcpu=cpu-type

A deprecated synonym for -mtune=cpu-type

So the compiler you tried this on, Luke, is probably a rather old version of GCC.

In fact, when I try -mcpu=native on my machine, I get:

cc  -I.              -O3 -std=c11   -fPIC  -Ofast -mcpu=native -pthread   -c ggml.c -o ggml.o
cc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead

Specs of my test: Arch Linux on a Celeron N5095, with GCC 12.2. The deprecation of -mcpu goes quite a way back, GCC-version-wise. As for the speed difference between -march=native and -mtune=native, on my machine there is none.

luke-jr · 2022-12-10T20:22:04Z

That was with GCC 11.3.0 for ppc64le. There is no -march on this platform at all.
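To make "arch-conditional" concrete, here is a minimal sketch in C that maps the target architecture to the flag spelling seen in this thread. It relies only on the compiler's predefined architecture macros. The function name `native_cpu_flag` is hypothetical, and the mapping covers just the two cases discussed above (x86 and PowerPC); anything else is deliberately left unknown rather than guessed.

```c
#include <stddef.h>

/* Hypothetical helper: returns the "optimise for this CPU" flag for the
   architecture being compiled for, or NULL when this sketch does not know. */
const char *native_cpu_flag(void) {
#if defined(__x86_64__) || defined(__i386__)
    return "-march=native";   /* x86: the spelling used in this patch */
#elif defined(__powerpc__) || defined(__powerpc64__) || defined(__PPC__)
    return "-mcpu=native";    /* PowerPC: the spelling GCC suggested above */
#else
    return NULL;              /* not covered here: add no flag rather than guess */
#endif
}
```

A build script could compile and run a probe like this, then append the returned string to CFLAGS only when it is not NULL.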

ttsiodras · 2022-12-10T20:37:31Z

Well, like all things, this is a balancing act...

The usual autoconf/automake machinery can be used to have "./configure" emit a Makefile that uses whatever options apply best to the current machine. I can do that for whisper.cpp, if @ggerganov is OK with the involved complexity.

But as-is, -Ofast -march=native will work on all Intel/AMD/ARM machines, even with a decade-old GCC. A quick Google search shows that -mcpu has been deprecated since 2004! ( https://forums.gentoo.org/viewtopic-t-222477-start-0.html )

ggerganov · 2022-12-11T11:34:24Z

So I'm not 100% sure what to do here.
Btw, I've already done experiments with -ffast-math and -march flags:

https://github.com/ggerganov/whisper.cpp/blob/ea38ad6e70e2b4bd0c1a79f3e2cbfd99ad9393c3/CMakeLists.txt#L77-L80

On my MacBook, stock clang does not recognise the -march=native flag:

$ make
clang: error: the clang compiler does not support '-march=native'

$ clang -v
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.1.0
Thread model: posix

Using -Ofast is essentially equivalent to -O3 -ffast-math.
Using -ffast-math does bring a ~10% performance gain even on my machine. I am also aware of what -ffast-math does and what its side effects on the computation are, and at the moment I don't think adding it really hurts. It will become a problem if someday we want whisper.cpp to produce exactly the same results across different CPUs - this is not the case today.
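One of the side effects in question: -ffast-math lets the compiler regroup floating-point additions as if they were associative, which they are not. The sketch below is only an illustration (the function names are made up, and it is not taken from ggml.c). It forces the two groupings by hand so that the difference shows up regardless of the flags used to build it.

```c
#include <stdio.h>

/* Sum three doubles as (a + b) + c. */
double sum_left(double a, double b, double c) {
    volatile double t = a + b;   /* volatile pins this grouping in place */
    return t + c;
}

/* Sum the same three doubles as a + (b + c). */
double sum_right(double a, double b, double c) {
    volatile double t = b + c;   /* 1.0 is lost here when b is huge */
    return a + t;
}

/* Print both groupings for inputs where rounding makes them differ. */
void show_reassociation(void) {
    double a = 1e20, b = -1e20, c = 1.0;
    printf("(a + b) + c = %g\n", sum_left(a, b, c));
    printf("a + (b + c) = %g\n", sum_right(a, b, c));
}
```

With these inputs the first grouping yields 1 and the second yields 0. Under -ffast-math the compiler may pick either grouping for an ordinary loop, and the choice can vary with the vector width of the target CPU, which is why bit-identical results across machines are no longer guaranteed.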

Let's think about this some more. Maybe we can hear more points of view on this topic and get better insight.

ttsiodras · 2022-12-11T15:16:39Z

> So I'm not 100% sure what to do here.

Well, this is what autoconf/automake were built for: to pick the best compilation options possible for the specific target we are building on. IMHO it's a shame to leave a 2x speedup on the table...

I could write the necessary configure.ac/Makefile.am (the sources for autoconf/automake-based builds). We would then automatically get a configure script that tries a series of compilation options and builds a Makefile tailor-made for the machine we are working on. Would that be acceptable? Or are you opposed to autoconf/automake?

ttsiodras · 2022-12-11T16:22:06Z

Just one more note: in GCC land, you can ask the compiler to emit instruction-set-specific versions of a function and dispatch to the appropriate one at run time, based on the machine the program runs on ( https://github.com/ttsiodras/MandelbrotSSE/blob/master/src/xaos.cc#L31 ). I used that to get maximum flexibility there, and it worked like a charm. I don't know if clang supports that, though.
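For readers who have not seen this, one GCC mechanism of this kind is the `target_clones` function attribute. The sketch below is a hedged illustration, not a copy of the linked code, and whether that code uses this exact attribute is not confirmed here. The function `dot` is hypothetical, and the guard assumes GCC on x86-64 with glibc (the attribute needs ifunc support); everywhere else the macro expands to nothing and a single portable version is built.

```c
#include <stddef.h>
#include <stdio.h>   /* on glibc systems this also makes __GLIBC__ visible */

/* Assumption: target_clones is only enabled for GCC on x86-64 with glibc. */
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define MULTIVERSION __attribute__((target_clones("avx2", "sse4.2", "default")))
#else
#define MULTIVERSION /* single portable build of the function */
#endif

/* Hypothetical hot loop: with the attribute active, GCC emits one clone
   per listed target and a resolver picks one at run time for this CPU. */
MULTIVERSION
float dot(const float *x, const float *y, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}
```

Callers just call `dot(x, y, n)`; no -march flag is needed on the command line, so the same binary still runs on CPUs without AVX2.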

ttsiodras · 2022-12-11T16:31:40Z

In order for you to have more information on the autoconf/automake decision, I just pushed a few commits - you can try it out and decide for yourself.

For now, I only implemented SSE-checking ( https://github.com/ttsiodras/whisper.cpp/blob/master/configure.ac#L83 ) but I hope the pattern is clear enough. To add support for any other instructions you want, you just add the relevant assembly check, and then emit the compiler option you want. You also get a #define inside the auto-generated config.h, so you can make more compile-time decisions with #ifdef in your C/C++ code.
I also added SDL2 checks, since I saw 2 of your binaries needed it. They detect and use SDL2 fine in my tests here.
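As an illustration of the compile-time decisions mentioned above, here is a small sketch of a function that switches on `USE_SSE`. It assumes the macro arrives via the generated config.h (or a -DUSE_SSE on the command line); the function `sum_f32` is made up for this example and is not part of ggml.c.

```c
#include <stddef.h>

/* USE_SSE is assumed to come from the generated config.h; the extra
   __SSE__ test keeps the intrinsics out of builds that cannot use them. */
#if defined(USE_SSE) && defined(__SSE__)
#include <xmmintrin.h>

float sum_f32(const float *x, size_t n) {
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));   /* 4 lanes per step */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++)
        total += x[i];                                /* leftover elements */
    return total;
}
#else
/* Portable fallback used when USE_SSE is not defined. */
float sum_f32(const float *x, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += x[i];
    return total;
}
#endif
```

Both branches expose the same signature, so the rest of the code does not need to know which one was compiled in.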

To see for yourself: after you clone my version of the repo, launch ./configure and make.

To modify the logic, edit configure.ac and/or Makefile.am - then launch ./bootstrap. This simply invokes autoreconf and automake, creating an updated version of the ./configure machinery.

luke-jr · 2022-12-11T17:54:42Z

+1 to autotools. That would also make it simpler to libtoolise the library and make the examples link to it.

ggerganov · 2022-12-16T18:12:04Z

@ttsiodras
Thanks for the effort, but the automake stuff is not for this project - it's too complicated.

I did a few tests with and without -Ofast -march=native on different machines and here are the results:

-O3

CPU	OS	Config	Model	Th	Load	Enc.	Commit
Ryzen 9 5950X	Ubuntu 22.04	AVX2	tiny	4	137	297	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	base	4	183	665	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	small	4	373	2328	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	medium	4	923	7346	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	large	4	1681	14053	b8065d9
---
Ryzen 9 3900X	Ubuntu 20.04	AVX2	tiny	4	122	572	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	base	4	153	1303	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	small	4	305	4844	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	medium	4	750	16117	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	large	4	1331	37618	b8065d9
---
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	tiny	4	68	170	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	base	4	97	327	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	small	4	221	1069	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	medium	4	581	2873	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	large	4	1170	5173	b8065d9

-Ofast -march=native

CPU	OS	Config	Model	Th	Load	Enc.	Commit
Ryzen 9 5950X	Ubuntu 22.04	AVX2	tiny	4	137	320	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	base	4	180	721	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	small	4	366	2554	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	medium	4	900	8181	b8065d9
Ryzen 9 5950X	Ubuntu 22.04	AVX2	large	4	1614	15679	b8065d9
---
Ryzen 9 3900X	Ubuntu 20.04	AVX2	tiny	4	123	558	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	base	4	154	1289	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	small	4	308	4775	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	medium	4	749	16576	b8065d9
Ryzen 9 3900X	Ubuntu 20.04	AVX2	large	4	1320	30650	b8065d9
---
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	tiny	4	69	154	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	base	4	94	291	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	small	4	219	948	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	medium	4	610	2582	b8065d9
MacBook M1 Pro	MacOS 13.0.1	NEON BLAS	large	4	1205	4692	b8065d9

Lower Enc. is better.

On Ryzen 9 5950X these flags actually make the performance worse by ~10%
On Ryzen 9 3900X there is ~20% improvement on the large model and almost no improvement on the other models
On MacBook M1 Pro there is ~10% improvement across all models
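The rounded percentages above can be re-derived from the Enc. columns of the two tables. A minimal sketch, using the "large" rows as input (the helper names are made up for this example):

```c
#include <stdio.h>

/* Relative change of the Enc. value, in percent.
   Negative means -Ofast -march=native was faster than -O3. */
double enc_change_percent(double o3_enc, double ofast_enc) {
    return 100.0 * (ofast_enc - o3_enc) / o3_enc;
}

/* Print the change for the "large" model rows of the two tables. */
void print_large_model_changes(void) {
    printf("Ryzen 9 5950X:  %+.1f%%\n", enc_change_percent(14053, 15679));
    printf("Ryzen 9 3900X:  %+.1f%%\n", enc_change_percent(37618, 30650));
    printf("MacBook M1 Pro: %+.1f%%\n", enc_change_percent(5173, 4692));
}
```

For these rows the change works out to roughly +11.6%, -18.5% and -9.3% respectively, in line with the ~10% slower, ~20% faster and ~10% faster summary.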

Given these results, I don't think it is crucial to have these flags. Sometimes they help, sometimes they don't.
Even if the benefit on a no-AVX CPU is as big as 2 times, I still don't think it is necessary to add them in general.

So I think for now, I will leave the existing Makefile as it is.

'-Ofast' and '-march=native' cause 2x-speedup in machines with SSE (but no AVX) instructions. Should help other platforms, too.

49b2f9b

ttsiodras mentioned this pull request Dec 10, 2022

Suboptimal performance in SSE-enabled CPUs that don't have AVX #251

Closed

Showcase autoconf/automake version.

2613ac5

ttsiodras added 3 commits December 11, 2022 17:35

Update with ggerganov contact details.

6db21e2

Remnant cleanup.

71175e4

Also, showcase automatically made #define USE_SSE

a794b6b

ggerganov closed this Dec 16, 2022

ggerganov mentioned this pull request Aug 1, 2023

CUDA: fixed LLAMA_FAST compilation option ggml-org/llama.cpp#2473

Merged
