Implement faster memcmp for x86_64 #467

Merged (8 commits) on May 31, 2022

@Demindiro (Contributor) commented May 27, 2022

On x86_64 it is generally fast to read unaligned words, so there is no need to do byte-by-byte comparisons.

The new algorithm compares 32 bytes at a time. If any of those bytes differ, it effectively does a binary search within the block to find the first non-matching byte.
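For readers who want the idea in code, here is a minimal sketch of the approach described above. It is not the code from this PR: it works on safe slices of equal length, and the names `memcmp_sketch`, `load16` and `first_mismatch` are made up for illustration. Each 32-byte block is read as two 16-byte words with no alignment requirement, and a block that differs is halved repeatedly until a single byte is left.

```rust
/// Hypothetical helper: load 16 bytes starting at `off` as one word.
/// The slice does not need any particular alignment.
fn load16(s: &[u8], off: usize) -> u128 {
    let mut buf = [0u8; 16];
    buf.copy_from_slice(&s[off..off + 16]);
    u128::from_ne_bytes(buf)
}

/// Hypothetical helper: difference of the first mismatching bytes.
/// Precondition: `a` and `b` have the same length and are not equal.
fn first_mismatch(mut a: &[u8], mut b: &[u8]) -> i32 {
    // Binary search: keep the half that contains the first difference.
    while a.len() > 1 {
        let half = a.len() / 2;
        if a[..half] != b[..half] {
            a = &a[..half];
            b = &b[..half];
        } else {
            a = &a[half..];
            b = &b[half..];
        }
    }
    i32::from(a[0]) - i32::from(b[0])
}

/// Sketch of a memcmp-style comparison over two equal-length slices.
/// Returns a negative, zero, or positive value like C's `memcmp`.
pub fn memcmp_sketch(a: &[u8], b: &[u8]) -> i32 {
    assert_eq!(a.len(), b.len());
    let n = a.len();
    let mut i = 0;
    // Wide phase: 32 bytes per iteration, as two 16-byte word compares.
    while i + 32 <= n {
        let lo_eq = load16(a, i) == load16(b, i);
        let hi_eq = load16(a, i + 16) == load16(b, i + 16);
        if !(lo_eq && hi_eq) {
            return first_mismatch(&a[i..i + 32], &b[i..i + 32]);
        }
        i += 32;
    }
    // Tail: fewer than 32 bytes remain.
    if a[i..] != b[i..] {
        first_mismatch(&a[i..], &b[i..])
    } else {
        0
    }
}
```

The slice comparisons inside `first_mismatch` stand in for the narrower word compares a real implementation would presumably use at each halving step; the sketch only aims to show the control flow.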

Benchmark results on a Ryzen 7 2700X:

master (with extra benchmarks)
test memcmp_rust_1048576              ... bench:     293,975 ns/iter (+/- 3,188) = 3566 MB/s
test memcmp_rust_16                   ... bench:           7 ns/iter (+/- 0) = 2285 MB/s
test memcmp_rust_32                   ... bench:          11 ns/iter (+/- 1) = 2909 MB/s
test memcmp_rust_4096                 ... bench:       1,049 ns/iter (+/- 203) = 3904 MB/s
test memcmp_rust_64                   ... bench:          32 ns/iter (+/- 4) = 2000 MB/s
test memcmp_rust_8                    ... bench:           4 ns/iter (+/- 1) = 2000 MB/s
test memcmp_rust_unaligned_1048575    ... bench:     276,654 ns/iter (+/- 78,924) = 3790 MB/s
test memcmp_rust_unaligned_15         ... bench:           8 ns/iter (+/- 0) = 2000 MB/s
test memcmp_rust_unaligned_31         ... bench:          11 ns/iter (+/- 2) = 2909 MB/s
test memcmp_rust_unaligned_4095       ... bench:       1,234 ns/iter (+/- 301) = 3319 MB/s
test memcmp_rust_unaligned_63         ... bench:          33 ns/iter (+/- 11) = 1939 MB/s
test memcmp_rust_unaligned_7          ... bench:           4 ns/iter (+/- 0) = 2000 MB/s

memcmp-x86_64
test memcmp_rust_1048576              ... bench:      24,682 ns/iter (+/- 240) = 42483 MB/s
test memcmp_rust_16                   ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_32                   ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_4096                 ... bench:         113 ns/iter (+/- 1) = 36247 MB/s
test memcmp_rust_64                   ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_8                    ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test memcmp_rust_unaligned_1048575    ... bench:      27,049 ns/iter (+/- 6,989) = 38765 MB/s
test memcmp_rust_unaligned_15         ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_unaligned_31         ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_unaligned_4095       ... bench:         100 ns/iter (+/- 6) = 40960 MB/s
test memcmp_rust_unaligned_63         ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_unaligned_7          ... bench:           3 ns/iter (+/- 0) = 2666 MB/s

builtin
test memcmp_builtin_1048576           ... bench:      23,202 ns/iter (+/- 1,283) = 45193 MB/s
test memcmp_builtin_16                ... bench:           4 ns/iter (+/- 0) = 4000 MB/s
test memcmp_builtin_32                ... bench:           3 ns/iter (+/- 1) = 10666 MB/s
test memcmp_builtin_4096              ... bench:          73 ns/iter (+/- 16) = 56109 MB/s
test memcmp_builtin_64                ... bench:           4 ns/iter (+/- 1) = 16000 MB/s
test memcmp_builtin_8                 ... bench:           4 ns/iter (+/- 1) = 2000 MB/s
test memcmp_builtin_unaligned_1048575 ... bench:      25,353 ns/iter (+/- 793) = 41359 MB/s
test memcmp_builtin_unaligned_15      ... bench:           5 ns/iter (+/- 0) = 3200 MB/s
test memcmp_builtin_unaligned_31      ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_builtin_unaligned_4095    ... bench:          85 ns/iter (+/- 4) = 48188 MB/s
test memcmp_builtin_unaligned_63      ... bench:           4 ns/iter (+/- 0) = 16000 MB/s
test memcmp_builtin_unaligned_7       ... bench:           4 ns/iter (+/- 0) = 2000 MB/s

Note that the results can vary a lot between runs, e.g. memcmp_rust_unaligned_1048575 can be as slow as 26933 MB/s at times.
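As a reading aid for the numbers above: the MB/s column is consistent with bytes × 1000 / ns per iteration, truncated to an integer. The sketch below reproduces a few of the aligned rows under that assumption and computes the speedup for the 1 MiB case. The function names `throughput_mb_s` and `speedup` are made up for illustration.

```rust
/// Throughput in MB/s (1 MB = 10^6 bytes) from a payload size in bytes
/// and a time per iteration in nanoseconds, truncated to an integer.
/// Assumption: this mirrors how the MB/s column above was derived.
fn throughput_mb_s(bytes: u64, ns_per_iter: u64) -> u64 {
    bytes * 1000 / ns_per_iter.max(1)
}

/// Ratio of the old time to the new time (values above 1.0 are faster).
fn speedup(before_ns: u64, after_ns: u64) -> f64 {
    before_ns as f64 / after_ns as f64
}

// throughput_mb_s(1_048_576, 293_975) -> 3566   (master, memcmp_rust_1048576)
// throughput_mb_s(1_048_576, 24_682)  -> 42483  (memcmp-x86_64, same test)
// throughput_mb_s(4096, 113)          -> 36247  (memcmp-x86_64, memcmp_rust_4096)
// speedup(293_975, 24_682)            -> roughly 11.9
```

For the 1 MiB aligned case this works out to roughly a 12x improvement over master, which puts the new code in the same range as the builtin numbers.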

I also added some extra tests, though I don't think there are enough yet. I am not sure which edge cases should be considered.
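One possible way to cover more edge cases, sketched here as a suggestion and not as the tests in this PR, is to compare an implementation against a byte-by-byte reference for every length up to a few block sizes, every single-byte mismatch position, and several starting offsets. The names `memcmp_naive` and `check_against_naive` are hypothetical. Only the sign of the result is compared, since that is all `memcmp` specifies.

```rust
/// Reference implementation: plain byte-by-byte comparison.
fn memcmp_naive(a: &[u8], b: &[u8]) -> i32 {
    for (x, y) in a.iter().zip(b) {
        if x != y {
            return i32::from(*x) - i32::from(*y);
        }
    }
    0
}

/// Returns true if `f` agrees in sign with `memcmp_naive` for all lengths
/// in 0..=max_len, all mismatch positions, and start offsets 0..8.
fn check_against_naive<F: Fn(&[u8], &[u8]) -> i32>(f: F, max_len: usize) -> bool {
    for len in 0..=max_len {
        for off in 0..8 {
            // Place `a` at a varying offset inside a larger buffer so that
            // its start address changes relative to word boundaries.
            let buf: Vec<u8> = (0..len + off).map(|i| (i % 251) as u8).collect();
            let a = &buf[off..];
            let same = a.to_vec();
            if f(a, same.as_slice()) != 0 {
                return false;
            }
            for pos in 0..len {
                // Make b[pos] larger (+1) or smaller (+255 wraps to -1).
                for &delta in &[1u8, 255u8] {
                    let mut b = a.to_vec();
                    b[pos] = b[pos].wrapping_add(delta);
                    // Add a later, opposite difference so that only the
                    // first mismatch may decide the result.
                    if pos + 1 < len {
                        let last = len - 1;
                        b[last] = b[last].wrapping_sub(delta);
                    }
                    let want = memcmp_naive(a, b.as_slice()).signum();
                    if f(a, b.as_slice()).signum() != want
                        || f(b.as_slice(), a).signum() != -want
                    {
                        return false;
                    }
                }
            }
        }
    }
    true
}
```

Calling `check_against_naive(some_memcmp, 96)` exercises lengths on both sides of the 16-, 32- and 64-byte boundaries; a wrapper around the function under test can be passed as a closure.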

Demindiro added 7 commits May 27, 2022 21:58
x86_64 can load unaligned words within a single cache line as fast as
aligned words. Even when a read crosses a cache-line or page boundary,
one unaligned word read is just as fast as multiple byte reads.

Also add a couple more tests & benchmarks.

I don't know why it isn't being optimized out though, which worries
me.

It only seems to save a single instruction at first sight, yet the
effects are significant.
@Amanieu Amanieu merged commit 6b96d90 into rust-lang:master May 31, 2022
@Demindiro Demindiro deleted the memcmp-x86_64 branch May 31, 2022 16:19