Skip to content

Add AVX 512f gather instructions #862

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 30 commits into from

Conversation

Daniel-B-Smith
Copy link
Contributor

@Daniel-B-Smith Daniel-B-Smith commented May 30, 2020

Adds the gather intrinsics for the AVX 512f instruction set.

@rust-highfive
Copy link

r? @gnzlbg

(rust_highfive has picked a reviewer for you, use r? to override)

#[target_feature(enable = "avx512f")]
#[cfg_attr(test, assert_instr(vpgatherdq))]
pub unsafe fn _mm512_i32gather_epi64(offsets: __m256i, slice: *const u8, scale: i32) -> __m512i {
let zero = _mm512_setzero_si512().as_i64x8();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should use _mm512_undefined here instead to match what Clang is doing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm actually it seems that Clang defines _mm512_undefined as zero-initialization, so it doesn't matter either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I see it defined as a particular builtin, but _mm512_setzero is explicitly defined as zero initialization. I'm not sure of the behavior of __builtin_ia32_undef512, however.

https://github.com/llvm/llvm-project/blob/1b02db52b79e01f038775f59193a49850a34184d/clang/lib/Headers/avx512fintrin.h#L189
https://github.com/llvm/llvm-project/blob/a3dc9490004ce1601fb1bc67cf218b86a6fdf652/clang/include/clang/Basic/BuiltinsX86.def#L40
https://github.com/llvm/llvm-project/blob/1b02db52b79e01f038775f59193a49850a34184d/clang/lib/Headers/avx512fintrin.h#L259
https://github.com/llvm/llvm-project/blob/1b02db52b79e01f038775f59193a49850a34184d/clang/lib/Headers/avx512fintrin.h#L253

LLVM should be able to optimize away the dead store, but I'm happy to change the code regardless. I'm not quite sure how/if I can implement _mm512_undefined since my reading of the std::mem::MaybeUninit is that I couldn't create an unitialized __m512i without inviting UB. Assuming the calling convention allows it, I should be able to create a MaybeUninit<__m512i> and pass that to vpgatherdq.

@Daniel-B-Smith
Copy link
Contributor Author

Do you have any suggestions about the CI failure? I see the following assembler repeated over and over:

kxnorw %k0,%k0,%k1
vpxor %xmm1,%xmm1,%xmm1
vpgatherdq (%eax,%ymm0,2),%zmm1{%k1}
vmovdqa64 %zmm1,%zmm0
ret

I can't tell if I'm doing something wrong, hit a compiler bug, or if vpgatherdq just isn't supported on 32 bit systems.

@Amanieu
Copy link
Member

Amanieu commented May 31, 2020

You need to provide a value for imm8 in the assert_instr macro. This avoids expanding the entire const_imm8 for the disassembly. Search the code for other uses of assert_instr.

@Daniel-B-Smith
Copy link
Contributor Author

I've finished all 12 AVX512f gather intrinsics and the intrinsics needed to test them. Unfortunately, the __m512d comparison assertion is not using a cmpeq intrinsic like all of the other assertions: https://github.com/rust-lang/stdarch/pull/862/files#diff-927c7e8bc826b00593557eb6928a092eR149 I did it that way because _mm512_cmpeq_pd_mask requires Knights Corner instructions. It seems that KNCNI is more complicated than just adding another feature detection, so I punted on adding that.

@@ -74,6 +74,7 @@ features! {
/// * `"avx512bitalg"`
/// * `"avx512bf16"`
/// * `"avx512vp2intersect"`
/// * `"knc"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line if you're not adding knc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Fixed.

vgatherdpd(zero, slice, offsets, neg_one, $imm8)
};
}
let r = constify_imm8!(scale, call);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the intrinsic documentation, only 1, 2, 4, 8 are valid values for the scale. You should use a custom constify macro that handles this.

Also you should use #[rustc_args_required_const(<arg index of scale>)] since scale is required to be a compile-time constant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

The new macro panics if it gets a value outside of 1, 2, 4, 8. Arguably, it should catch errors at compile time, but the one thing I tried std::compile_error would not compile at all. If you have any suggestions, I'm more than happy to fix this.

I've added the new macro to src/x86/macros.rs because we should probably change the other existing gather intrinsics to use the new macro as well. That would be a backwards incompatible change, but it would only affect broken code. I don't know what the official policy is in cases like this. I'm happy to make the change in a separate PR if you would like.

Copy link
Contributor Author

@Daniel-B-Smith Daniel-B-Smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll work on the macro fix later this week.

#[target_feature(enable = "avx512f")]
#[cfg_attr(test, assert_instr(vpgatherdq))]
pub unsafe fn _mm512_i32gather_epi64(offsets: __m256i, slice: *const u8, scale: i32) -> __m512i {
let zero = _mm512_setzero_si512().as_i64x8();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure? I see it defined as a particular builtin, but _mm512_setzero is explicitly defined as zero initialization. I'm not sure of the behavior of __builtin_ia32_undef512, however.

https://github.com/llvm/llvm-project/blob/1b02db52b79e01f038775f59193a49850a34184d/clang/lib/Headers/avx512fintrin.h#L189
https://github.com/llvm/llvm-project/blob/a3dc9490004ce1601fb1bc67cf218b86a6fdf652/clang/include/clang/Basic/BuiltinsX86.def#L40
https://github.com/llvm/llvm-project/blob/1b02db52b79e01f038775f59193a49850a34184d/clang/lib/Headers/avx512fintrin.h#L259
https://github.com/llvm/llvm-project/blob/1b02db52b79e01f038775f59193a49850a34184d/clang/lib/Headers/avx512fintrin.h#L253

LLVM should be able to optimize away the dead store, but I'm happy to change the code regardless. I'm not quite sure how/if I can implement _mm512_undefined since my reading of the std::mem::MaybeUninit is that I couldn't create an unitialized __m512i without inviting UB. Assuming the calling convention allows it, I should be able to create a MaybeUninit<__m512i> and pass that to vpgatherdq.

@@ -74,6 +74,7 @@ features! {
/// * `"avx512bitalg"`
/// * `"avx512bf16"`
/// * `"avx512vp2intersect"`
/// * `"knc"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Fixed.

@Amanieu
Copy link
Member

Amanieu commented Jun 1, 2020

undef512 is defined here. As you can see clang defines it to zero-initialize.

@Daniel-B-Smith
Copy link
Contributor Author

Interesting, thanks! The discussion at llvm.org/PR32176 was also very informative. GitHub search only does exact match (by default at least), so my search for __builtin_ia32_undef512 didn't bring up that line.

Daniel Smith and others added 2 commits June 6, 2020 12:15
@Daniel-B-Smith
Copy link
Contributor Author

This should be ready for review.

@Amanieu
Copy link
Member

Amanieu commented Jun 9, 2020

You missing these 4 gather intrinsics that are part of AVX512F:

  • _mm512_i32gather_ps
  • _mm512_mask_i32gather_ps
  • _mm512_i32gather_epi32
  • _mm512_mask_i32gather_epi32

@Daniel-B-Smith
Copy link
Contributor Author

Added those four intrinsics and some of the helpers needed. Even though (as you pointed out elsewhere), we can use the AVX512F/KNCNI intrinsics with just AVX512F CPUs, I didn't do the TODO to make the assert_eq_m512d more correct. After I finish the three open PRs, I will implement the floating point comparison intrinsics and fix that TODO.

@Daniel-B-Smith
Copy link
Contributor Author

Closing since #866 also contains these changes.

@Daniel-B-Smith Daniel-B-Smith deleted the avx-512-cmp branch June 14, 2020 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants