[ESIMD] Fix perf regression caused by assumed align in block_load(usm) #11850
Element-size address alignment is valid from a correctness point of view, but implicitly assuming 1-byte and 2-byte alignment causes a performance regression for block_load(const int8_t *, ...) and block_load(const int16_t *, ...), because the GPU backend has to generate a slower GATHER instead of the more efficient BLOCK-LOAD. Without this fix, block_load() causes up to a 44% slowdown on some apps that relied on the alignment assumptions in effect before block_load(usm, ..., compile_time_props) was implemented.
The reasoning for raising the expected/assumed alignment from the element size to 4 bytes for byte and word vectors is as follows: the point of a block_load() call (as opposed to a gather() call) is to get an efficient block load, so the assumed alignment is chosen to be one that allows a block load to be generated. This is a bit trickier for the user, but it is how the block_load/store API has always worked: block loads had restrictions that needed to be honored.
To be on the safe side, the user can always pass the guaranteed alignment explicitly.
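A minimal sketch of passing the guaranteed alignment explicitly, assuming the compile-time-properties overload of block_load() from the ESIMD extension (the function and buffer names here are illustrative; this fragment is device code and needs the DPC++ toolchain):

```cpp
#include <sycl/ext/intel/esimd.hpp>

using namespace sycl::ext::intel::esimd;
namespace oneapi = sycl::ext::oneapi::experimental;

// If the pointer is only guaranteed to be 1-byte aligned, say so
// explicitly; the compiler then emits a gather instead of assuming
// the 4-byte alignment required for a block load.
SYCL_ESIMD_FUNCTION simd<int8_t, 32> load_bytes(const int8_t *ptr) {
  return block_load<int8_t, 32>(ptr, oneapi::properties{alignment<1>});
}
```

Conversely, passing `alignment<4>` (or higher) keeps the efficient block-load code path even for byte and word element types.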