CUDA: fix race condition in FA vector kernels #13742
Merged
Fixes #13733.
Looking at the code I added in #13584 again, I think I accidentally introduced a race condition. The mask is being written to shared memory anyways, so the synchronization between warps is achieved by each warp just checking all of the mask values, and then reducing `skip` within the warp. Each warp will come to the same conclusion regarding whether or not to execute the `continue`. However, warps are not guaranteed to execute the `continue` at the same time, and after they do they will write new values to `maskf_shared`, which can in turn influence whether other warps will execute the `continue`, potentially causing the warps to become desynchronized.
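To make the hazard concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the kernel touched by this PR: the kernel name `count_processed_tiles`, the constants `TILE` and `NWARPS`, and the test data are all hypothetical, and the barrier placement shown is one possible remedy rather than necessarily the change made here. Only `maskf_shared`, `skip`, and the `continue` come from the description.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;
constexpr int TILE      = 128;               // mask values per iteration (hypothetical)
constexpr int NWARPS    = TILE / WARP_SIZE;  // 4 warps per block

// Hypothetical kernel: each warp counts how many tiles it did NOT skip.
__global__ void count_processed_tiles(const float * mask, int ntiles, int * processed) {
    __shared__ float maskf_shared[TILE];
    const int tid  = threadIdx.x;
    const int lane = tid % WARP_SIZE;
    const int warp = tid / WARP_SIZE;
    int n = 0;

    for (int t = 0; t < ntiles; ++t) {
        // Barrier 1: wait until every warp has finished reading the previous
        // tile. Without it, a warp that takes the `continue` early could
        // overwrite maskf_shared while a slower warp is still computing `skip`.
        __syncthreads();

        maskf_shared[tid] = mask[t*TILE + tid];

        // Barrier 2: make the new mask values visible to all warps.
        __syncthreads();

        // Each warp checks all mask values, then reduces `skip` within the warp.
        bool skip = true;
        for (int i = lane; i < TILE; i += WARP_SIZE) {
            skip = skip && maskf_shared[i] == -INFINITY;
        }
        if (__all_sync(0xFFFFFFFF, skip)) {
            continue; // safe: the next write to maskf_shared sits behind barrier 1
        }
        ++n;
    }

    if (lane == 0) {
        processed[warp] = n;
    }
}

int main() {
    const int ntiles = 8;

    // Test data (assumption): even tiles are fully masked, odd tiles are not.
    std::vector<float> h_mask(ntiles*TILE);
    for (int t = 0; t < ntiles; ++t) {
        for (int i = 0; i < TILE; ++i) {
            h_mask[t*TILE + i] = (t % 2 == 0) ? -INFINITY : 0.0f;
        }
    }

    float * d_mask      = nullptr;
    int   * d_processed = nullptr;
    cudaMalloc(&d_mask, h_mask.size()*sizeof(float));
    cudaMalloc(&d_processed, NWARPS*sizeof(int));
    cudaMemcpy(d_mask, h_mask.data(), h_mask.size()*sizeof(float), cudaMemcpyHostToDevice);

    count_processed_tiles<<<1, TILE>>>(d_mask, ntiles, d_processed);

    int h_processed[NWARPS];
    cudaMemcpy(h_processed, d_processed, sizeof(h_processed), cudaMemcpyDeviceToHost);

    // Every warp should agree on the number of non-skipped tiles.
    bool ok = true;
    for (int w = 0; w < NWARPS; ++w) {
        ok = ok && h_processed[w] == ntiles/2;
    }
    printf("%s\n", ok ? "warps agree" : "warps disagree");

    cudaFree(d_mask);
    cudaFree(d_processed);
    return ok ? 0 : 1;
}
```

Both barriers sit outside any branch, so every thread in the block reaches them on every iteration regardless of whether its warp skips the tile. Removing barrier 1 reproduces the structure of the race described above, although whether it actually misbehaves on a given run depends on warp scheduling.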