Always inline byteswap functions #11280

Merged · 3 commits into rust-lang:master on Jan 4, 2014
Conversation

@c-a (Contributor) commented Jan 2, 2014

While writing some benchmarks for ebml::reader::vuint_at(), I noticed that LLVM doesn't inline the from_be32 function, even though on x86_64 it is just a call to the bswap32 intrinsic. Marking these functions with #[inline(always)] fixes that and seems like a reasonable thing to do. I got the following measurements in my vuint_at() benchmarks:

  • Before
    test ebml::bench::vuint_at_A_aligned ... bench: 1075 ns/iter (+/- 58)
    test ebml::bench::vuint_at_A_unaligned ... bench: 1073 ns/iter (+/- 5)
    test ebml::bench::vuint_at_D_aligned ... bench: 1150 ns/iter (+/- 5)
    test ebml::bench::vuint_at_D_unaligned ... bench: 1151 ns/iter (+/- 6)
  • Inline from_be32
    test ebml::bench::vuint_at_A_aligned ... bench: 769 ns/iter (+/- 9)
    test ebml::bench::vuint_at_A_unaligned ... bench: 795 ns/iter (+/- 6)
    test ebml::bench::vuint_at_D_aligned ... bench: 758 ns/iter (+/- 8)
    test ebml::bench::vuint_at_D_unaligned ... bench: 759 ns/iter (+/- 8)
  • Using vuint_at_slow()
    test ebml::bench::vuint_at_A_aligned ... bench: 646 ns/iter (+/- 7)
    test ebml::bench::vuint_at_A_unaligned ... bench: 645 ns/iter (+/- 3)
    test ebml::bench::vuint_at_D_aligned ... bench: 907 ns/iter (+/- 4)
    test ebml::bench::vuint_at_D_unaligned ... bench: 1085 ns/iter (+/- 16)

As expected, inlining from_be32() gave a considerable speedup.
I also tested how the "slow" version fared against the optimized version and noticed that it's actually a bit faster for small class A integers (which use only two bytes) but slower for big class D integers (which use four bytes).
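For reference, here is a minimal sketch of the kind of wrapper being discussed, written as present-day Rust. The name from_be32 comes from the PR; the body is an assumption on my part, since the std code of the time called the bswap32 intrinsic directly rather than going through u32::swap_bytes:

```rust
/// Convert a 32-bit integer from big-endian to the host's byte order.
/// Sketch only: `u32::swap_bytes` lowers to the same bswap instruction
/// that the intrinsic mentioned above produces on x86_64.
#[inline(always)] // force inlining, even for callers in other crates
pub fn from_be32(x: u32) -> u32 {
    if cfg!(target_endian = "big") {
        x // already big-endian: nothing to do
    } else {
        x.swap_bytes() // a single bswap instruction on x86_64
    }
}
```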

@thestinger (Contributor) commented:

They're not inlined because cross-crate inlining without link-time optimization requires an explicit #[inline] or #[inline(always)] in Rust.
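To illustrate the rule @thestinger describes, here is a hypothetical two-function library crate (both function names are made up for this sketch):

```rust
/// No hint: without LTO, downstream crates see only this function's
/// signature, so LLVM can emit a call to it but never inline it.
pub fn swap_no_hint(x: u32) -> u32 {
    x.swap_bytes()
}

/// With #[inline], the body is also made available to downstream
/// crates, so LLVM may inline it at the call site when optimizing;
/// #[inline(always)] goes further and asks it to always do so.
#[inline]
pub fn swap_hinted(x: u32) -> u32 {
    x.swap_bytes()
}
```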

@alexcrichton (Member) commented:

We generally try to shy away from #[inline(always)]. I believe the previous slowness was due to what @thestinger mentioned, so can you check whether you get the same speedups if you mark these functions with #[inline] instead of #[inline(always)]?

@c-a closed this Jan 2, 2014
@c-a reopened this Jan 2, 2014
@c-a (Contributor, Author) commented Jan 2, 2014

> We generally try to shy away from #[inline(always)]. I believe the previous slowness was due to what @thestinger mentioned, so can you check whether you get the same speedups if you mark these functions with #[inline] instead of #[inline(always)]?

Yep, it seems to work with #[inline] too. I pushed a fixup commit that changes it to use #[inline] instead. (As a bonus, it also fixes the code to not exceed 100 columns.)
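The fixup then amounts to relaxing the attribute, roughly (same hypothetical sketch as above; only the hard requirement remains, namely that the body be available across crates):

```rust
#[inline] // hint rather than mandate: the body is emitted cross-crate,
          // but LLVM decides whether inlining is actually worthwhile
pub fn from_be32(x: u32) -> u32 {
    if cfg!(target_endian = "big") { x } else { x.swap_bytes() }
}
```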

bors added a commit that referenced this pull request Jan 4, 2014
@bors closed this Jan 4, 2014
@bors merged commit a82f32b into rust-lang:master Jan 4, 2014