Add dav1d as potential external project #182

Merged: 1 commit into llvm:main on Dec 9, 2024

Conversation

@mstorsjo (Member) commented Nov 27, 2024

To add the external source, run
"git clone -b 1.5.0 https://code.videolan.org/videolan/dav1d.git" to clone the dav1d sources, e.g. in the
llvm-test-suite/test-suite-externals directory (which may not exist but can be created manually), or configure with
TEST_SUITE_DAV1D_ROOT pointing at a directory containing this source.
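For example, the setup could look roughly like this (paths, the generator and the compiler choice are illustrative, not mandated by the patch):

# Option 1: clone into the externals directory inside the test-suite checkout.
mkdir -p llvm-test-suite/test-suite-externals
git clone -b 1.5.0 https://code.videolan.org/videolan/dav1d.git \
    llvm-test-suite/test-suite-externals/dav1d

# Option 2: point the configuration at an existing checkout instead.
cmake -G Ninja -DCMAKE_C_COMPILER=clang \
    -DTEST_SUITE_DAV1D_ROOT=/path/to/dav1d \
    /path/to/llvm-test-suite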

This builds two targets: the dav1d command line executable (which isn't executed as a test) and the dav1d checkasm test executable (which runs compiler-generated functions and compares them with handwritten assembly versions of them). The checkasm executable can also be run manually to microbenchmark functions, e.g. "External/dav1d/dav1d_checkasm --bench --test=mc_8bpc --function=warp*".

It is not very meaningful to benchmark the execution of the whole checkasm executable, as it runs a different number of functions depending on the number of SIMD extensions available on the target CPU, and each individual function only gets executed a couple of times, so any changes within only a couple of functions would drown in the measurement noise.

(Microbenchmarking on aarch64 currently requires direct access to the pmccntr_el0 register. To use a different timer register for now, edit dav1d/tests/checkasm/checkasm.h and change pmccntr_el0 into cntvct_el0.)
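That edit is a plain text substitution; one possible way to do it from a shell (sed -i.bak is used here so it works with both GNU and BSD sed):

sed -i.bak 's/pmccntr_el0/cntvct_el0/g' dav1d/tests/checkasm/checkasm.h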

This uses a static configuration of the project (when building the upstream project with their own build system, there are a number of build options that can be configured). Assembly is hooked up and enabled on i386, x86_64, arm and aarch64.

For architectures other than those, the checkasm test won't have any reference for detecting e.g. miscompilations of functions, but building it can still be meaningful (for testing compilation, or benchmarking the execution of the C version of functions).

@fhahn (Contributor) left a comment

Thanks for putting this up! Just tried this on ARM64 macOS and it works as expected!

Comment on lines 42 to 44
string(REGEX REPLACE "^-Wno-" "" warning ${warning})
string(TOUPPER ${warning} warning_name)
string(REGEX REPLACE "[-=]" "_" warning_name ${warning_name})
Contributor

Why do we need this? Same for above

@mstorsjo (Member, Author)

Right, this bit probably requires a few comments.

When testing for support for options that disable warnings, e.g. -Wno-foo, we actually need to check for the positive form of the option, e.g. -Wfoo, because if you pass an unrecognized -Wno-foo to GCC, it will silently accept it (I guess the logic is that, as it doesn't know about -Wfoo, it surely won't print any such warning, so -Wno-foo is fulfilled automatically). However, if GCC then prints any other warning, it will also add a note saying "unrecognized command-line option '-Wno-foo' may have been intended to silence earlier diagnostics".

So these are convenience helpers for testing whether -Wfoo is supported when we want to add -Wno-foo.

Plus, it's convenient to create a reasonably named CMake variable for the result (so it can be cached across reruns), without the caller having to name it manually. I.e. we can just do check_disable_warning(-Wno-maybe-uninitialized) rather than e.g. check_disable_warning(SUPPORTS_WMAYBE_UNINITIALIZED -Wno-maybe-uninitialized) or similar.
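To illustrate the idea, here's a minimal sketch of such a helper (not the exact code from the patch; the function and variable names are just for illustration):

include(CheckCCompilerFlag)
# check_disable_warning(-Wno-foo): check whether the positive form -Wfoo is
# supported (GCC silently accepts an unknown -Wno-foo), and only then add -Wno-foo.
function(check_disable_warning warning)
  # Strip the "-Wno-" prefix, leaving e.g. "maybe-uninitialized".
  string(REGEX REPLACE "^-Wno-" "" warning ${warning})
  # Derive a cache variable name, e.g. SUPPORTS_WMAYBE_UNINITIALIZED.
  string(TOUPPER ${warning} warning_name)
  string(REGEX REPLACE "[-=]" "_" warning_name ${warning_name})
  # Test the positive form of the option; the result is cached across reruns.
  check_c_compiler_flag("-W${warning}" SUPPORTS_W${warning_name})
  if (SUPPORTS_W${warning_name})
    add_compile_options("-Wno-${warning}")
  endif()
endfunction()

check_disable_warning(-Wno-maybe-uninitialized)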

We used to have the same pattern in the main LLVM CMake, see e.g. https://github.com/llvm/llvm-project/blob/llvmorg-18.1.0/llvm/cmake/modules/HandleLLVMOptions.cmake#L817-L818, but there it's spelled out manually (both -Wno-foo and -Wfoo written manually, plus a corresponding CMake variable name). Same thing in LLDB, https://github.com/llvm/llvm-project/blob/llvmorg-18.1.4/lldb/cmake/modules/LLDBConfig.cmake#L189-L190. But lately a bunch of those checks have been removed or reduced into checks just for the compiler versions, to speed up the cmake configuration process.

Contributor

Thanks for the context, it would be good to just add a quick comment to document this.


if (CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64" OR CMAKE_SYSTEM_PROCESSOR STREQUAL "arm64")
enable_language(ASM)
set(CMAKE_SYSTEM_PROCESSOR "aarch64")
Contributor

Why do we need to adjust this? If it is just needed locally in the file, better use a new variable instead of (re-)setting an existing one.

@mstorsjo (Member, Author)

Yeah, it's only used locally within the file.

As CMake isn't very consistent about CMAKE_SYSTEM_PROCESSOR (you'd have different values for aarch64/arm64 on different OSes, likewise for x86), I tried to normalize it here (to a value which is a potential value for CMAKE_SYSTEM_PROCESSOR anyway), to simplify the later checks.

Originally I had prototyped the CMake wrapping with CMakeLists.txt within the dav1d (and ffmpeg) source tree, with one CMakeLists.txt per subdirectory, so there it felt more natural to just use the standard variable, but assuming normalization of it. But when we have it all within one single file, I guess it's just as easy and clear to just use new variables for it as well.
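For illustration, the local-variable variant could look something like this (DAV1D_ARCH is just an illustrative name, not necessarily what the patch ends up using):

# Normalize into a file-local variable instead of overwriting
# CMAKE_SYSTEM_PROCESSOR; later checks in the file then test DAV1D_ARCH.
if (CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64|arm64|ARM64)$")
  enable_language(ASM)
  set(DAV1D_ARCH "aarch64")
elseif (CMAKE_SYSTEM_PROCESSOR MATCHES "^(x86_64|AMD64)$")
  set(DAV1D_ARCH "x86_64")
endif()
# ...with similar branches for arm and i386.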

@sjoerdmeijer (Contributor)

Yes, great stuff, thanks!
I am going to try this too on our platforms, will let you know tomorrow.

@mstorsjo (Member, Author)

Thanks for putting this up! Just tried this on ARM64 macOS and it works as expected!

Nice! Yes, I've tested it across x86 and arm/aarch64, on Linux, macOS and Windows (both mingw style environments and MSVC style, but assuming Clang, not MSVC itself).

Also, the note about running benchmarks requiring pmccntr_el0 is only relevant on Linux (where it requires using a kernel module to make the register accessible to user space) and Windows (where that register is documented to always be accessible); on macOS we just use mach_absolute_time() instead, which IIRC has the same precision as cntvct_el0.

@sjoerdmeijer (Contributor) commented Nov 30, 2024

Can we add a README with some descriptions of the benchmark, how to work with it, what to look for? For example, for the auto vectorisation use case, how we could compare the hand-written and optimised assembly functions with the C reference implementation.

The description of this merge request would be a very good start for that.

@sjoerdmeijer (Contributor)

And does it deserve a mention in the test-suite user-guide? We only mention SPEC in the External Suites section, but do we want to mention dav1d?

@sjoerdmeijer (Contributor)

It works for me too, so that is a great start!
I am now going to look how to work with it, hence my previous questions, and see how I can use it for performance analysis and evaluation. For example, I noticed that the total runtime is ~2.3 seconds on my system, and going to see how stable the numbers are, and what we are running/testing.

@mstorsjo (Member, Author) commented Dec 5, 2024

Can we add a README with some descriptions of the benchmark, how to work with it, what to look for? For example, for the auto vectorisation use case, how we could compare the hand-written and optimised assembly functions with the C reference implementation.

Sure, I can do that.

And does it deserve a mention in the test-suite user-guide? We only mention SPEC in the External Suites section, but do we want to mention dav1d?

Yep, it surely does. But as that's a separate repo, I can't include those changes in this PR, so I'd wait until this is complete before starting to make any changes there.

It works for me too, so that is a great start!

Nice, that's great to hear!

I am now going to look how to work with it, hence my previous questions, and see how I can use it for performance analysis and evaluation. For example, I noticed that the total runtime is ~2.3 seconds on my system, and going to see how stable the numbers are, and what we are running/testing.

As mentioned in the PR description, and as I'll try to expand on in a README file, benchmarking the total runtime of checkasm might be reasonable on one specific system, but do note that the actual amount of work that is being done varies depending on the number of CPU features that are supported on the host.

First the tool does one round through all tests, doing a dummy comparison of the C version of each function against itself. Then, if the CPU supports e.g. NEON, it does another round through all functions, and if enabling the NEON flag exposes a different function than in the previous round, that function is executed and compared with the C version of the same. After that, similar rounds are done for later CPU features (dotprod, i8mm, SVE, SVE2). So on a CPU that supports i8mm and dotprod, running checkasm will run more code than on one that doesn't. (In practice, the coverage of i8mm, dotprod and SVE is mostly focused on one function right now, but e.g. on x86, there are quite complete separate versions of most functions for SSSE3, AVX2 and AVX512.)

(I'll try to summarize this in the README as well.)

@fhahn (Contributor) commented Dec 6, 2024

I think it is fine to first add the version focusing on functional correctness, then go from there. Once it is in, we can easily iterate on it

@sjoerdmeijer (Contributor)

I think it is fine to first add the version focusing on functional correctness, then go from there. Once it is in, we can easily iterate on it

Ok, I certainly don't mind that approach of course. That means there are two steps:

  • we get it running first, a.k.a. we test functional correctness,
  • evaluating auto-vec performance.

What exactly do we get with the patch in its current form? The functional correctness check? Is this based on a crc or golden reference check, or perhaps comparing the outcome of the C reference version with the hand-written assembly?
As you can tell from these questions, I have not looked at all the details... I have browsed the source code a bit, but there's quite a lot to see.

@mstorsjo (Member, Author) commented Dec 6, 2024

I think it is fine to first add the version focusing on functional correctness, then go from there. Once it is in, we can easily iterate on it

Ok, I certainly don't mind that approach of course. That means there are two steps:

  • we get it running first, a.k.a. we test functional correctness,
  • evaluating auto-vec performance.

What exactly do we get with the patch in its current form? The functional correctness check? Is this based on a crc or golden reference check, or perhaps comparing the outcome of the C reference version with the hand-written assembly? As you can tell from these questions, I have not looked at all the details... I have browsed the source code a bit, but there's quite a lot to see.

With the patch in the current form, you get a functional correctness check, by comparing the outcome of the C reference version with the handwritten assembly. The checkasm tool is primarily intended to check the other way around, assuming that the C code is correct and comparing whether the handwritten assembly matches, but the check obviously goes both ways. This correctness check obviously only covers the functions that have assembly; for x86, arm and aarch64, it should be quite complete though. (But building the x86 assembly requires the nasm tool - without it, no assembly gets built and you'd have no correctness check.)

Dav1d does contain a small amount of riscv assembly as well, but nowhere near as complete as for x86, arm and aarch64 - I didn't hook that up in the current CMake adaptation, but it shouldn't be hard to add in the same way if someone wants to do it.

For architectures without assembly, you don't get any correctness check.


Even without the correctness check, just compiling all the code can help catch issues, if issues appear as failing asserts in the compiler. At least with ffmpeg, which is similar code but a much larger codebase, I run into committed changes in LLVM that trigger failed asserts when compiling ffmpeg, maybe once a month, or every other month. So having that easily accessible to compile should make it easier to catch these issues sooner, or easier to reproduce.


The checkasm tool also allows you to evaluate auto-vectorization performance. As the tool is primarily designed for developing assembly, it's perhaps not ideally ergonomic for this purpose, but it should still be quite usable. You can run the tool, testing all functions, get a total runtime for that, and compare whether that total goes up or down. But you can also micro-benchmark individual functions. By doing one benchmark run with autovectorization disabled and one with it enabled, you can easily see which functions benefit from the autovectorization (and by how much), and which ones actually get slower because of it.

@mstorsjo (Member, Author) commented Dec 6, 2024

For evaluating autovectorization performance, I posted some such measurements in https://discourse.llvm.org/t/autovectorization-measurements-from-dav1d-and-ffmpeg/82623 - but let's walk through one such case.

Do two separate configurations of llvm-test-suite, one default (with autovectorization enabled), and one with autovectorization disabled. E.g. -DCMAKE_C_FLAGS_RELEASE="-O3" vs -DCMAKE_C_FLAGS_RELEASE="-O3 -fno-vectorize -fno-slp-vectorize". (The same would go for CXX, but dav1d is C only, so keeping it short and simple here.)
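Concretely, the two configurations could be set up something like this (build directory names, the generator, the compiler path and the explicit build type are illustrative):

# Default configuration, autovectorization enabled:
cmake -G Ninja -S llvm-test-suite -B build-vec \
    -DCMAKE_C_COMPILER=clang -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_FLAGS_RELEASE="-O3"
# Same, but with autovectorization disabled:
cmake -G Ninja -S llvm-test-suite -B build-novec \
    -DCMAKE_C_COMPILER=clang -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_FLAGS_RELEASE="-O3 -fno-vectorize -fno-slp-vectorize"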

Build dav1d_checkasm and run e.g. ./External/dav1d/dav1d_checkasm --bench --test=cdef_8bpc 0. (This restricts the execution to just one set of functions, as benchmarking it all takes a while, and sets a fixed random seed for the random input data.)

For the case with vectorization disabled, we get this as part of the output:

cdef_filter_4x8_10_8bpc_c:         7.4 ( 1.00x)
cdef_filter_4x8_10_8bpc_neon:      1.6 ( 4.51x)

vs with vectorization enabled:

cdef_filter_4x8_10_8bpc_c:        11.3 ( 1.00x)
cdef_filter_4x8_10_8bpc_neon:      1.7 ( 6.84x)

So the handwritten assembly version (the neon one) has exactly the same runtime, modulo measurement noise, while the compiler-generated version now takes 11.3 timer units where the non-vectorized one only took 7.4. So this is a case where autovectorization is harmful, for this particular set of inputs.


As part of the earlier discourse post, I compiled one list at https://martin.st/temp/dav1d-autovectorize/speedup-clang-vect.txt; this contains the "speedup" of autovectorized code vs non-vectorized code. In that list, we find cdef_filter_4x8_10_8bpc_vect: 0.47, i.e. in that measurement round I also saw compiler autovectorization make this case 0.47x as fast as it was before - roughly similar to this case. At the other end of the spectrum, we have mct_8tap_regular_w128_0_8bpc_vect: 17.06, where autovectorization does help make the code 17x faster.

(That listing uses _vect as suffix for autovectorized C code, as opposed to non-vectorized C code, as part of manual extra postprocessing/summarizing of the results - see the discourse post and shell scripts in the same directory.)

When looking at the actual benchmarks for this function, we can do e.g. ./External/dav1d/dav1d_checkasm --bench --test=mc_8bpc --function=mct_8tap_regular_w128_0_8bpc 0, and we get this for the nonvectorized case:

mct_8tap_regular_w128_0_8bpc_c:          180.9 ( 1.00x)
mct_8tap_regular_w128_0_8bpc_neon:        10.8 (16.69x)
mct_8tap_regular_w128_0_8bpc_dotprod:     10.8 (16.74x)

And this for the vectorized case:

mct_8tap_regular_w128_0_8bpc_c:           18.1 ( 1.00x)
mct_8tap_regular_w128_0_8bpc_neon:        10.8 ( 1.68x)
mct_8tap_regular_w128_0_8bpc_dotprod:     10.8 ( 1.67x)

So in this measurement, it wasn't 17x faster (I benchmarked on a different machine this time, with a different compiler version than last time), but at least 10x faster. Still not as fast as the handwritten case, but quite close (that one is a quite trivial function).


Unfortunately, the mapping between the test areas (e.g. cdef_8bpc or mc_8bpc) and test case names (cdef_filter_4x8_10_8bpc or mct_8tap_regular_w128_0_8bpc) and the actual function names in the code isn't always systematic. But you can leave out the --test= parameter, especially if benchmarking with a specific --function parameter - it just takes a couple of seconds longer to run then.

Finding which exact C function gets executed for each test case isn't trivial either, unfortunately. For this particular case, we have e.g. dav1d/tests/checkasm/mc.c which sets up a test of if (check_func(c->mct[filter], "mct_%s_w%d_%s_%dbpc", filter_names[filter], w, mxy_names[mxy], BITDEPTH)). This tests a function which is hooked up in dav1d/src/mc_tmpl.c, where c->mct[] is assigned e.g. prep_8tap_regular_c, which calls prep_8tap_c, which for the _0_ case (mx=0, my=0) calls prep_c; the codegen for this function can be seen in External/dav1d/CMakeFiles/dav1d_bitdepth_8.dir/__/__/test-suite-externals/dav1d/src/mc_tmpl.c.o.


As for adjusting the tool to make it more ergonomic for autovectorization benchmarking: as long as this is integrated as an external project, we have to live with what the tool looks like upstream; we can only apply different build configurations to it, more or less. If the whole upstream source were integrated into llvm-test-suite, we would have the option to tweak the tool further to make it more convenient and ergonomic for the benchmarks you'd like to do.

@sjoerdmeijer (Contributor)

Thank you very much for explaining and elaborating, @mstorsjo!
Your last two comments are the README content I was hoping for. If you copy the essence of these two comments into a file, we are good to go with this patch in my opinion. It will then contain the essential information on how the correctness check is performed (asm vs. C reference implementation), and the instructions for tinkerers who would like to compare autovec (the limitations are fine).

@mstorsjo (Member, Author) commented Dec 8, 2024

Now I've addressed the comments:

  • The CMake functions for testing for warning options have been merged into one single function, and it now contains a lot of comments explaining what it does and why
  • The CMake setup no longer tries to normalize CMAKE_SYSTEM_PROCESSOR, but instead sets a local variable to indicate which case is chosen
  • I've added a README.md where I've tried to summarize things that I've explained above, and a bit more. It's a fair bit to read, but it should hopefully work as a tour guide into the checkasm tool, from a compiler developer's point of view.

@sjoerdmeijer (Contributor) left a comment

Thanks, LGTM!

Please wait a day in case @fhahn has some more comments.

@fhahn (Contributor) left a comment

LGTM, thanks

@mstorsjo merged commit 90be4b7 into llvm:main on Dec 9, 2024 (1 check passed).
@mstorsjo deleted the dav1d branch on December 9, 2024 at 21:01.
mstorsjo added a commit to mstorsjo/llvm-project that referenced this pull request Dec 19, 2024
Since llvm/llvm-test-suite#182 and
llvm/llvm-test-suite#188, these projects
can now be added as external projects within llvm-test-suite.
mstorsjo added a commit to mstorsjo/llvm-project that referenced this pull request Dec 19, 2024
Since llvm/llvm-test-suite#182 and
llvm/llvm-test-suite#188, these projects
can now be added as external projects within llvm-test-suite.
mstorsjo added a commit to llvm/llvm-project that referenced this pull request Dec 20, 2024
Since llvm/llvm-test-suite#182 and
llvm/llvm-test-suite#188, these projects can now
be added as external projects within llvm-test-suite.
github-actions bot pushed a commit to arm/arm-toolchain that referenced this pull request Jan 10, 2025
Since llvm/llvm-test-suite#182 and
llvm/llvm-test-suite#188, these projects can now
be added as external projects within llvm-test-suite.