'-Ofast' and '-march=native' provide significant speedup #252

Closed
wants to merge 5 commits into from

ttsiodras

'-Ofast' and '-march=native' yield a 2x speedup on machines with SSE (but no AVX) instructions. They should help on other platforms, too.

@ttsiodras
Author

See #251 for details.

@luke-jr

luke-jr commented Dec 10, 2022

I get:

c++: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?

idk why compilers can't standardise this stuff, but I guess it should be arch-conditional.

@ttsiodras
Author

ttsiodras commented Dec 10, 2022

> I get:
>
> c++: error: unrecognized command-line option ‘-march=native’; did you mean ‘-mcpu=native’?
>
> idk why compilers can't standardise this stuff, but I guess it should be arch-conditional.

From the official GCC documentation ( https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html ):

-mcpu=cpu-type

A deprecated synonym for -mtune=cpu-type

So the compiler you tried this on, Luke, is probably a rather old version of GCC.

In fact, when I try -mcpu=native on my machine, I get:

cc  -I.              -O3 -std=c11   -fPIC  -Ofast -mcpu=native -pthread   -c ggml.c -o ggml.o
cc: warning: ‘-mcpu=’ is deprecated; use ‘-mtune=’ or ‘-march=’ instead

Specs of my test: Arch Linux on a Celeron N5095, with GCC 12.2. The deprecation of -mcpu goes back quite a few GCC versions. As for the speed difference between -march=native and -mtune=native, on my machine there is none.

@luke-jr

luke-jr commented Dec 10, 2022

That was with GCC 11.3.0 for ppc64le. There is no -march on this platform at all.
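A minimal sketch of what "arch-conditional" could look like, written as a shell helper rather than as the project's actual Makefile logic. The function name `native_flag` is hypothetical, and the mapping is an assumption to check against each compiler's documentation: the ppc64le entry simply follows the compiler's own suggestion in the error message above, and unknown targets get no extra flag.

```shell
# Sketch: choose a "tune for this machine" flag per architecture.
# The mapping below is an assumption, not something the project ships.
native_flag() {
    arch=${1:-$(uname -m)}
    case "$arch" in
        x86_64|i?86)   echo "-march=native" ;;
        ppc64le|ppc64) echo "-mcpu=native" ;;
        *)             echo "" ;;   # unknown target: add no flag
    esac
}

CFLAGS="-O3 $(native_flag)"
echo "CFLAGS = $CFLAGS"
```

Passing an explicit argument (for example `native_flag ppc64le`) shows what would be chosen for a different target than the build host.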

@ttsiodras
Author

ttsiodras commented Dec 10, 2022

Well, like all things, this is a balancing act...

The usual autoconf/automake machinery can be used to have "./configure" emit a Makefile that uses whatever options work best on the current machine. I can do that for whisper.cpp, if @ggerganov is OK with the complexity involved.

But as-is, -Ofast -march=native will work on all Intel/AMD/ARM machines with a decade-old GCC. A quick Google search shows that -mcpu has been deprecated since 2004! ( https://forums.gentoo.org/viewtopic-t-222477-start-0.html )

@ggerganov
Member

ggerganov commented Dec 11, 2022

So I'm not 100% sure what to do here.
Btw, I've already done experiments with -ffast-math and -march flags:

https://github.com/ggerganov/whisper.cpp/blob/ea38ad6e70e2b4bd0c1a79f3e2cbfd99ad9393c3/CMakeLists.txt#L77-L80

On my MacBook, building with stock clang, the -march=native flag is not recognised:

$ make
clang: error: the clang compiler does not support '-march=native'

$ clang -v
Apple clang version 14.0.0 (clang-1400.0.29.202)
Target: arm64-apple-darwin22.1.0
Thread model: posix

Using -Ofast is essentially equivalent to -O3 -ffast-math.
Using -ffast-math does bring a ~10% performance gain even on my machine. I am aware of what -ffast-math does and of its side effects on the computation, and at the moment I don't think adding it really hurts. It would become a problem if someday we wanted whisper.cpp to produce exactly the same results across different CPUs, which is not the case today.

Let's think about this some more. Maybe we can hear more points of view on this topic and get better insight.

@ttsiodras
Author

> So I'm not 100% sure what to do here.

Well, this is what autoconf/automake were built for: to pick the best compilation options possible for the specific target we are building on. IMHO it's a shame to leave a 2x speedup on the table...

I could write the necessary configure.ac/Makefile.am (the sources for autoconf/automake-based builds). We would then automatically get a configure script that tries a series of compilation options and builds a Makefile tailor-made for the machine we are building on. Would that be acceptable? Or are you opposed to autoconf/automake?
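As a rough illustration of the probing such a configure script performs, the sketch below test-compiles a trivial program with each candidate flag and keeps the first one the compiler accepts. It is a hand-written stand-in, not autoconf output; `cc_accepts_flag` and `pick_flag` are hypothetical names, and the compiler is taken from `$CC` (falling back to `cc`).

```shell
# Return success if the compiler named by $CC accepts the given flag.
# -Werror turns "unknown/unused option" warnings into failures.
cc_accepts_flag() {
    printf 'int main(void) { return 0; }\n' |
        "${CC:-cc}" -Werror "$1" -x c -c -o /dev/null - >/dev/null 2>&1
}

# Print the first candidate flag the compiler accepts; fail if none is.
pick_flag() {
    for flag in "$@"; do
        if cc_accepts_flag "$flag"; then
            echo "$flag"
            return 0
        fi
    done
    return 1
}

NATIVE=$(pick_flag -march=native -mcpu=native) || NATIVE=""
echo "extra flags: ${NATIVE:-none}"
```

The same idea fits in a few lines of a plain Makefile or CMake check, which is one way to get per-machine flags without adopting the full autotools stack.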

@ttsiodras
Author

Just one more note: in GCC land, you can ask the compiler to emit instruction-set-specific versions of a function and dispatch to the appropriate one at run time, based on the machine the program runs on: https://github.com/ttsiodras/MandelbrotSSE/blob/master/src/xaos.cc#L31 . I used that to get maximum flexibility there, and it worked like a charm. I don't know if clang supports that, though.

@ttsiodras
Author

ttsiodras commented Dec 11, 2022

In order for you to have more information on the autoconf/automake decision, I just pushed a few commits - you can try it out and decide for yourself.

  • For now, I only implemented SSE-checking ( https://github.com/ttsiodras/whisper.cpp/blob/master/configure.ac#L83 ) but I hope the pattern is clear enough. To add support for any other instructions you want, you just add the relevant assembly check, and then emit the compiler option you want. You also get a #define inside the auto-generated config.h, so you can make more compile-time decisions with #ifdef in your C/C++ code.

  • I also added SDL2 checks, since I saw 2 of your binaries needed it. They detect and use SDL2 fine in my tests here.

To see for yourself: after you clone my version of the repo, launch ./configure and make.

To modify the logic, edit configure.ac and/or Makefile.am - then launch ./bootstrap. This simply invokes autoreconf and automake, creating an updated version of the ./configure machinery.

@luke-jr

luke-jr commented Dec 11, 2022

+1 to autotools. That would also make it simpler to libtoolise the library and make the examples link to it.

@ggerganov
Member

@ttsiodras
Thanks for the effort, but the automake approach is not for this project - it's too complicated.

I did a few tests with and without -Ofast -march=native on different machines and here are the results:

-O3

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | tiny | 4 | 137 | 297 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | base | 4 | 183 | 665 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | small | 4 | 373 | 2328 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | medium | 4 | 923 | 7346 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | large | 4 | 1681 | 14053 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | tiny | 4 | 122 | 572 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | base | 4 | 153 | 1303 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | small | 4 | 305 | 4844 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | medium | 4 | 750 | 16117 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | large | 4 | 1331 | 37618 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 4 | 68 | 170 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 4 | 97 | 327 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 4 | 221 | 1069 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 4 | 581 | 2873 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 4 | 1170 | 5173 | b8065d9 |

-Ofast -march=native

| CPU | OS | Config | Model | Th | Load | Enc. | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | tiny | 4 | 137 | 320 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | base | 4 | 180 | 721 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | small | 4 | 366 | 2554 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | medium | 4 | 900 | 8181 | b8065d9 |
| Ryzen 9 5950X | Ubuntu 22.04 | AVX2 | large | 4 | 1614 | 15679 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | tiny | 4 | 123 | 558 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | base | 4 | 154 | 1289 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | small | 4 | 308 | 4775 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | medium | 4 | 749 | 16576 | b8065d9 |
| Ryzen 9 3900X | Ubuntu 20.04 | AVX2 | large | 4 | 1320 | 30650 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | tiny | 4 | 69 | 154 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | base | 4 | 94 | 291 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | small | 4 | 219 | 948 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | medium | 4 | 610 | 2582 | b8065d9 |
| MacBook M1 Pro | MacOS 13.0.1 | NEON BLAS | large | 4 | 1205 | 4692 | b8065d9 |

Lower Enc. is better.

  • On Ryzen 9 5950X these flags actually make the performance worse by ~10%
  • On Ryzen 9 3900X there is ~20% improvement on the large model and almost no improvement on the other models
  • On MacBook M1 Pro there is ~10% improvement across all models

Given these results, I don't think it is crucial to have these flags. Sometimes they help, sometimes they don't.
Even if the benefit on a no-AVX CPU is as big as 2 times, I still don't think it is necessary to add them in general.

So I think for now, I will leave the existing Makefile as it is.
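The percentages quoted above can be reproduced from the Enc. columns of the two tables. A small sketch, where `pct_change` is a hypothetical helper and only the large-model rows are shown:

```shell
# Percent change in Enc. going from -O3 to -Ofast -march=native.
# Positive means slower, negative means faster.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (b - a) / a * 100 }'
}

pct_change 14053 15679   # Ryzen 9 5950X, large  -> 11.6
pct_change 37618 30650   # Ryzen 9 3900X, large  -> -18.5
pct_change 5173 4692     # MacBook M1 Pro, large -> -9.3
```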
