Implement faster memcmp for x86_64 #467

Demindiro · 2022-05-27T20:32:23Z

On x86_64 it is generally fast to read unaligned words, so there is no need to do byte-by-byte comparisons.

The new algorithm compares 32 bytes at a time. If a 32-byte block doesn't match, it effectively does a binary search within that block to find the first non-matching byte.
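To illustrate the idea, here is a minimal sketch in safe Rust. It is not the code from this PR: the names `memcmp_sketch` and `narrow` are made up for the example, slice comparisons stand in for the unaligned word loads, and returning the byte difference is an assumption about the sign convention.

```rust
/// Hypothetical helper: narrow a block down to its first differing byte
/// by repeatedly halving it. Returns 0 if the two blocks are equal.
fn narrow(mut a: &[u8], mut b: &[u8]) -> i32 {
    if a.is_empty() {
        return 0;
    }
    while a.len() > 1 {
        let half = a.len() / 2;
        if a[..half] != b[..half] {
            // The first mismatch is in the lower half.
            a = &a[..half];
            b = &b[..half];
        } else {
            // The lower half is equal, so any mismatch is in the upper half.
            a = &a[half..];
            b = &b[half..];
        }
    }
    i32::from(a[0]) - i32::from(b[0])
}

/// Hypothetical sketch: compare two equal-length buffers 32 bytes at a time.
fn memcmp_sketch(a: &[u8], b: &[u8]) -> i32 {
    assert_eq!(a.len(), b.len());
    const BLOCK: usize = 32;
    let mut blocks_a = a.chunks_exact(BLOCK);
    let mut blocks_b = b.chunks_exact(BLOCK);
    for (x, y) in blocks_a.by_ref().zip(blocks_b.by_ref()) {
        if x != y {
            return narrow(x, y);
        }
    }
    // Fewer than 32 bytes remain; the same halving handles the tail.
    narrow(blocks_a.remainder(), blocks_b.remainder())
}
```

For two 100-byte buffers that first differ at index 70, the main loop skips two equal blocks, and `narrow` then needs five halving steps (32, 16, 8, 4, 2) to reach the differing byte.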

Benchmark results on a Ryzen 7 2700X:

master (with extra benchmarks)

test memcmp_rust_1048576              ... bench:     293,975 ns/iter (+/- 3,188) = 3566 MB/s
test memcmp_rust_16                   ... bench:           7 ns/iter (+/- 0) = 2285 MB/s
test memcmp_rust_32                   ... bench:          11 ns/iter (+/- 1) = 2909 MB/s
test memcmp_rust_4096                 ... bench:       1,049 ns/iter (+/- 203) = 3904 MB/s
test memcmp_rust_64                   ... bench:          32 ns/iter (+/- 4) = 2000 MB/s
test memcmp_rust_8                    ... bench:           4 ns/iter (+/- 1) = 2000 MB/s
test memcmp_rust_unaligned_1048575    ... bench:     276,654 ns/iter (+/- 78,924) = 3790 MB/s
test memcmp_rust_unaligned_15         ... bench:           8 ns/iter (+/- 0) = 2000 MB/s
test memcmp_rust_unaligned_31         ... bench:          11 ns/iter (+/- 2) = 2909 MB/s
test memcmp_rust_unaligned_4095       ... bench:       1,234 ns/iter (+/- 301) = 3319 MB/s
test memcmp_rust_unaligned_63         ... bench:          33 ns/iter (+/- 11) = 1939 MB/s
test memcmp_rust_unaligned_7          ... bench:           4 ns/iter (+/- 0) = 2000 MB/s

memcmp-x86_64

test memcmp_rust_1048576              ... bench:      24,682 ns/iter (+/- 240) = 42483 MB/s
test memcmp_rust_16                   ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_32                   ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_4096                 ... bench:         113 ns/iter (+/- 1) = 36247 MB/s
test memcmp_rust_64                   ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_8                    ... bench:           3 ns/iter (+/- 0) = 2666 MB/s
test memcmp_rust_unaligned_1048575    ... bench:      27,049 ns/iter (+/- 6,989) = 38765 MB/s
test memcmp_rust_unaligned_15         ... bench:           3 ns/iter (+/- 0) = 5333 MB/s
test memcmp_rust_unaligned_31         ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_rust_unaligned_4095       ... bench:         100 ns/iter (+/- 6) = 40960 MB/s
test memcmp_rust_unaligned_63         ... bench:           5 ns/iter (+/- 0) = 12800 MB/s
test memcmp_rust_unaligned_7          ... bench:           3 ns/iter (+/- 0) = 2666 MB/s

builtin

test memcmp_builtin_1048576           ... bench:      23,202 ns/iter (+/- 1,283) = 45193 MB/s
test memcmp_builtin_16                ... bench:           4 ns/iter (+/- 0) = 4000 MB/s
test memcmp_builtin_32                ... bench:           3 ns/iter (+/- 1) = 10666 MB/s
test memcmp_builtin_4096              ... bench:          73 ns/iter (+/- 16) = 56109 MB/s
test memcmp_builtin_64                ... bench:           4 ns/iter (+/- 1) = 16000 MB/s
test memcmp_builtin_8                 ... bench:           4 ns/iter (+/- 1) = 2000 MB/s
test memcmp_builtin_unaligned_1048575 ... bench:      25,353 ns/iter (+/- 793) = 41359 MB/s
test memcmp_builtin_unaligned_15      ... bench:           5 ns/iter (+/- 0) = 3200 MB/s
test memcmp_builtin_unaligned_31      ... bench:           4 ns/iter (+/- 0) = 8000 MB/s
test memcmp_builtin_unaligned_4095    ... bench:          85 ns/iter (+/- 4) = 48188 MB/s
test memcmp_builtin_unaligned_63      ... bench:           4 ns/iter (+/- 0) = 16000 MB/s
test memcmp_builtin_unaligned_7       ... bench:           4 ns/iter (+/- 0) = 2000 MB/s

Note that the results vary considerably between runs; for example, memcmp_rust_unaligned_1048575 sometimes drops to 26933 MB/s.
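For reading the listings above, the MB/s column can be reproduced from the ns/iter column. The following is a small sketch, assuming the harness computes throughput with integer arithmetic as bytes * 1000 / ns; the function name `mb_per_s` is hypothetical.

```rust
/// Hypothetical helper: throughput in MB/s from a byte count and a
/// per-iteration time in nanoseconds, using truncating integer division.
fn mb_per_s(bytes: u64, ns_per_iter: u64) -> u64 {
    // Guard against a zero denominator for very fast iterations.
    bytes * 1000 / ns_per_iter.max(1)
}
```

With this formula, 1048576 bytes at 293,975 ns/iter gives 3566 MB/s and 16 bytes at 7 ns/iter gives 2285 MB/s, matching the master results. The unaligned rows only match when the rounded-up size is used (for example, 16 bytes at 8 ns/iter gives the 2000 MB/s shown for memcmp_rust_unaligned_15), which suggests the benchmarks count the full buffer size; that is an inference from the numbers, not something stated here. At 3 to 5 ns/iter, a 1 ns difference changes the reported throughput by 20% or more, so the small-size rows are best read as rough indications.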

I also added some extra tests, though I don't think there are enough yet, and I am not sure which edge cases should be considered.


Demindiro added 7 commits May 27, 2022 21:58

Implement faster memcmp for x86_64

f18ce3c

x86_64 can load unaligned words within a single cache line as fast as aligned words. Even when crossing cache line or page boundaries, a single unaligned word read is at least as fast as multiple byte reads. Also add a couple more tests & benchmarks.

Fix formatting

83b4edd

Fix CI, better memcmp tests

5110338

Always inline compare_bytes::cmp

6c1aded

Fix panic not being optimized out.

03c8beb

I don't know why it isn't being optimized out though, which worries me.

Fix rustfmt sillyness

ae069f1

Slightly optimize main (32b) memcmp loop

f15f99f

It only seems to save a single instruction at first sight yet the effects are significant.

Amanieu reviewed May 31, 2022

Two review comments on src/mem/x86_64.rs (outdated, resolved)

Use unchecked_div/rem

22c06e4

Amanieu merged commit 6b96d90 into rust-lang:master May 31, 2022

Demindiro deleted the memcmp-x86_64 branch May 31, 2022 16:19
