-
Notifications
You must be signed in to change notification settings - Fork 349
Add dav1d as potential external project #182
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for putting this up! Just tried this on ARM64 macOS and it works as expected!
External/dav1d/CMakeLists.txt
Outdated
string(REGEX REPLACE "^-Wno-" "" warning ${warning}) | ||
string(TOUPPER ${warning} warning_name) | ||
string(REGEX REPLACE "[-=]" "_" warning_name ${warning_name}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? Same for above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this bit probably requires a bit comments.
When testing for support for options that disable warnings, e.g. -Wno-foo
, we actually need to check for the positive form of the option, e.g. -Wfoo
. Because if you pass an unrecognized -Wno-foo
to GCC, it will silently accept it (I guess the logic is, as it doesn't know about -Wfoo
, it surely won't print any such warning, so -Wno-foo
will be fulfilled automatically). However, if GCC then prints any other warning, it will also add a note saying "unrecognized command-line option '-Wno-foo' may have been intended to silence earlier diagnostics".
So this is convenience helpers for testing whether -Wfoo
is supported, when we want to add -Wno-foo
.
Plus convenience to create a reasonably named CMake variable for the result (to let it be cached across reruns) to keep the result, without the caller having to manually name it. I.e. so we can just do check_disable_warning(-Wno-maybe-uninitialized)
rather than e.g. check_disable_warning(SUPPORTS_WMAYBE_UNINITIALIZED -Wno-maybe-uninitialized)
or similar.
We used to have the same pattern in the main LLVM CMake, see e.g. https://github.com/llvm/llvm-project/blob/llvmorg-18.1.0/llvm/cmake/modules/HandleLLVMOptions.cmake#L817-L818, but there it's spelled out manually (both -Wno-foo
and -Wfoo
written manually, plus a corresponding CMake variable name). Same thing in LLDB, https://github.com/llvm/llvm-project/blob/llvmorg-18.1.4/lldb/cmake/modules/LLDBConfig.cmake#L189-L190. But lately a bunch of those checks have been removed or reduced into checks just for the compiler versions, to speed up the cmake configuration process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the context, would be good to just add a quick comment to document
External/dav1d/CMakeLists.txt
Outdated
|
||
if (CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64" OR CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64") | ||
enable_language(ASM) | ||
set(CMAKE_SYSTEM_PROCESSOR "aarch64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to adjust this? If it is just needed locally in the file, better use a new variable instead of (re-)setting an existing one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's only used locally within the file.
As CMake isn't very consistent about CMAKE_SYSTEM_PROCESSOR
(you'd have different values for aarch64/arm64 on different OSes, likewise for x86), I tried to normalize it here (to a value which is a potential value for CMAKE_SYSTEM_PROCESSOR
anyway), to simplify the later checks.
Originally I had prototyped the CMake wrapping with CMakeLists.txt within the dav1d (and ffmpeg) source tree, with one CMakeLists.txt per subdirectory, so there it felt more natural to just use the standard variable, but assuming normalization of it. But when we have it all within one single file, I guess it's just as easy and clear to just use new variables for it as well.
Yes, great stuff, thanks! |
Nice! Yes, I've tested it across x86 and arm/aarch64, on Linux, macOS and Windows (both mingw style environments and MSVC style, but assuming Clang, not MSVC itself). Also, the note about running benchmarks requiring |
Can we add a README with some descriptions of the benchmark, how to work with it, what to look for? For example, for the auto vectorisation use case, how we could compare the hand-written and optimised assembly functions with the C reference implementation. The description of this merge request would be a very good start for that. |
And does it deserve a mention in the test-suite user-guide? We only mention SPEC in the External Suites section, but do we want to mention dav1d? |
It works for me too, so that is a great start! |
Sure, I can do that.
Yep, it surely does. But as that's a separate repo, I can't include those changes in this PR, so I'd wait until this is complete before starting to make any changes there.
Nice, that's great to hear!
As mentioned in the PR description, and as I'll try to expand on in a README file, benchmarking the total runtime of checkasm might be reasonable on one specific system, but do note that the actual amount of work that is being done varies depending on the number of CPU features that are supported on the host. First the tool does one round through all tests, doing a dummy comparison of the C version of each function against itself. Then if the CPU e.g. supports NEON, then it does another round through all functions, and if enabling the NEON flag exposes a different function than the previous round, then that function is executed and compared with the C version of the same. Then after that, similar rounds are done for later CPU features (dotprod, i8mm, SVE, SVE2). So on a CPU that supports i8mm and dotprod, running checkasm will run more code than on one that doesn't. (In practice, the coverage of i8mm, dotprod and SVE is mostly focused on one function right now, but e.g. on x86, there's quite complete separate versions of most functions for SSSE3, AVX2 and AVX512.) (I'll try to summarize this in the README as well.) |
I think it is fine to first add the version focusing on functional correctness, then go from there. Once it is in, we can easily iterate on it |
Ok, I certainly don't mind that approach of course. That means there are two steps:
What do we exactly get with the patch in its current form? The functional correctness check? Is this based on a crc or golden reference check, or perhaps comparing the outcome of C reference version with the hand-written assembly? |
With the patch in the current form, you get a functional correctness check, by comparing the outcome of the C reference version with the handwritten assembly. The checkasm tool is primarily intended to check the other way around, assuming that the C code is correct and comparing whether the handwritten assembly matches, but the check obviously goes both ways. This correctness check obviously only covers the functions that have assembly; for x86, arm and aarch64, it should be quite complete though. (But building the x86 assembly requires the nasm tool - without it, no assembly gets built and you'd have no correctness check.) Dav1d does contain a small amount of riscv assembly as well, but nowhere near as complete as for x86, arm and aarch64 - I didn't hook that up in the current CMake adaptation, but it shouldn't be hard to add in the same way if someone wants to do it. For architectures without assembly, you don't get any correctness check. Even without the correctness check, just compiling all the code can help catch issues, if issues appear as failing asserts in the compiler. At least with ffmpeg, which is similar code but a much larger codebase, I run into committed changes in LLVM that trigger failed asserts when compiling ffmpeg, maybe once a month, or every other month. So having that easily accessible to compile should make it easier to catch these issues sooner, or easier to reproduce. The checkasm tool also does allow you to evaluate auto-vectorization performance. As the tool primarily is designed for developing assembly, it's perhaps not ideally ergonomical for this purpose, but should be quite usable still. You can run the tool, testing all functions, and get a total runtime for that, and compare if the total runtime goes up or down. But you can also micro-benchmark individual functions. By doing one benchmark run with autovectorization disabled, and one with it enabled, you can easily measure and see which functions benefit from the autovectorization (and how much), and which ones actually get slower due to the autovectorization. |
For evaluating autovectorization performance, I posted some such measurements in https://discourse.llvm.org/t/autovectorization-measurements-from-dav1d-and-ffmpeg/82623 - but let's walk through one such case. Do two separate configurations of llvm-test-suite, one default (with autovectorization enabled), and one with autovectorization disabled. E.g. Build For the case with vectorization disabled, we get this as part of the output:
vs with vectorization enabled:
So the handwritten assembly version (the neon one) has got exactly the same runtime, modulo measurement noise, while the compiler generated version now takes 11.3 timer units, while the non-vectorized one only took 7.4 timer units. So this particular case is a case where the autovectorization is harmful, for this particular case of inputs. As part of the earlier discourse post, I compiled one list at https://martin.st/temp/dav1d-autovectorize/speedup-clang-vect.txt; this contains the "speedup" of autovectorized code vs non-vectorized code. In that list, we find (That listing uses When looking at the actual benchmarks for this function, we can do e.g.
And this for the vectorized case:
So in this measurement, it wasn't 17x faster (I benchmarked on a different machine this time, with a different compiler version than last time), but at least 10x faster. Still not as fast as the handwritten case, but quite close (that one is a quite trivial function). Unfortunately, the mapping between the test areas (e.g. Finding which exact C function gets executed for each test case isn't trivial either, unfortunately. For this particular case, we have e.g. As for adjusting the tool to make it more ergonomical for autovectorization benchmarking? As long as this is integrated as an external project, we have to live with what the tool looks like upstream, we can only apply different build configurations to it, more or less. If the whole upstream source would be integrated into llvm-test-suite, we would have the option to tweak the tool further to make it more convenient and ergonomical for the benchmarks you'd like to do. |
Thank you very much for explaining and elaborating, @mstorsjo ! |
To add the external source, run "git clone -b 1.5.0 https://code.videolan.org/videolan/dav1d.git" to clone the dav1d sources, e.g. in the llvm-test-suite/test-suite-externals directory (which may not exist but can be created manually), or configure with TEST_SUITE_DAV1D_ROOT pointing at a directory containing this source. This builds two targets; the dav1d command line executable (which isn't executed as a test) and the dav1d checkasm test executable (which runs compiler generated functions and compares them with handwritten assembly versions of them). The checkasm execuable can also be run manually to microbenchmark functions, e.g. "External/dav1d/dav1d_checkasm --bench --test=mc_8bpc --function=warp*". It is not very meaningful to benchmark the execution of the whole checkasm executable, as it runs a different numbers of functions depending on the number of SIMD extensions available on the target CPU. (Benchmarking on aarch64 currently requires direct access to the pmccntr_el0 register. To currently use a different timer register, edit dav1d/tests/checkasm/checkasm.h and change pmccntr_el0 into cntvct_el0.) This uses a static configuration of the project (when building the upstream project with their own build system, there's a number of build options that can be configured). Assembly is hooked up and enabled on i386, x86_64, arm and aarch64. For architectures other than those, the checkasm test won't have any reference for detecting e.g. miscompilations of functions, but building it can still be meaningful (for testing compilation, or benchmarking the execution of the C version of functions).
Now I've addressed the comments:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM!
Please wait a day in case @fhahn has some more comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks
Since llvm/llvm-test-suite#182 and llvm/llvm-test-suite#188, these projects can now be added as external projects within llvm-test-suite.
Since llvm/llvm-test-suite#182 and llvm/llvm-test-suite#188, these projects can now be added as external projects within llvm-test-suite.
Since llvm/llvm-test-suite#182 and llvm/llvm-test-suite#188, these projects can now be added as external projects within llvm-test-suite.
Since llvm/llvm-test-suite#182 and llvm/llvm-test-suite#188, these projects can now be added as external projects within llvm-test-suite.
To add the external source, run
"git clone -b 1.5.0 https://code.videolan.org/videolan/dav1d.git" to clone the dav1d sources, e.g. in the
llvm-test-suite/test-suite-externals directory (which may not exist but can be created manually), or configure with
TEST_SUITE_DAV1D_ROOT pointing at a directory containing this source.
This builds two targets; the dav1d command line executable (which isn't executed as a test) and the dav1d checkasm test executable (which runs compiler generated functions and compares them with handwritten assembly versions of them). The checkasm execuable can also be run manually to microbenchmark functions, e.g. "External/dav1d/dav1d_checkasm --bench --test=mc_8bpc --function=warp*".
It is not very meaningful to benchmark the execution of the whole checkasm executable, as it runs a different numbers of functions depending on the number of SIMD extensions available on the target CPU, and each individual function only gets executed a couple of times each, so any changes within only a couple of functions would drown in the measurement noise.
(Microbenchmarking on aarch64 currently requires direct access to the pmccntr_el0 register. To currently use a different timer register, edit dav1d/tests/checkasm/checkasm.h and change pmccntr_el0 into cntvct_el0.)
This uses a static configuration of the project (when building the upstream project with their own build system, there's a number of build options that can be configured). Assembly is hooked up and enabled on i386, x86_64, arm and aarch64.
For architectures other than those, the checkasm test won't have any reference for detecting e.g. miscompilations of functions, but building it can still be meaningful (for testing compilation, or benchmarking the execution of the C version of functions).