[ESIMD] Fix perf regression caused by assumed align in block_load(usm) #11850

Merged

v-klochkov

Assuming element-size address alignment is valid from a correctness point of view, but implicitly assuming 1-byte and 2-byte alignment causes a performance regression for block_load(const int8_t *, ...) and block_load(const int16_t *, ...), because the GPU back end has to generate a slower GATHER instead of the more efficient BLOCK-LOAD. Without this fix, block_load() is up to 44% slower in some apps that relied on the alignment assumptions in effect before block_load(usm, ..., compile_time_props) was implemented.

The reasoning for changing the expected/assumed alignment from element size to 4 bytes for byte and word vectors is as follows:
The point of calling block_load() (as opposed to gather()) is to get an
efficient block load, so the assumed alignment is one that allows a
block load to be generated. This is a bit trickier for the user,
but it is how the block_load/store API has always worked: block-load
had restrictions that needed to be honored.
To be on the safe side, the user can always pass the guaranteed alignment.
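
The rule described above can be sketched in plain C++. This is only an illustration, not the ESIMD implementation: `assumed_block_load_alignment`, `meets_assumed_alignment` and `example_usage` are hypothetical names introduced here. The sketch also assumes that element types of 4 bytes or more keep their element-size alignment, which the description implies but does not state outright.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of ESIMD): the alignment, in bytes, that
// block_load() is described as assuming when the caller passes none.
// Byte and word elements are bumped to 4 bytes; assumption: larger
// element types keep their element-size alignment.
template <typename T>
constexpr std::size_t assumed_block_load_alignment() {
  return sizeof(T) < 4 ? 4 : sizeof(T);
}

// Hypothetical helper: true if `ptr` satisfies the assumed default, i.e. a
// caller could rely on it without passing an explicit alignment.
template <typename T>
bool meets_assumed_alignment(const T *ptr) {
  return reinterpret_cast<std::uintptr_t>(ptr) %
             assumed_block_load_alignment<T>() == 0;
}

static_assert(assumed_block_load_alignment<std::int8_t>() == 4,
              "byte vectors: 4-byte alignment assumed");
static_assert(assumed_block_load_alignment<std::int16_t>() == 4,
              "word vectors: 4-byte alignment assumed");
static_assert(assumed_block_load_alignment<std::int32_t>() == 4,
              "dword vectors: element size");
static_assert(assumed_block_load_alignment<std::int64_t>() == 8,
              "qword vectors: element size");

// Example usage: a 4-byte-aligned byte buffer meets the assumption, while a
// pointer one byte into it does not and would need an explicit alignment.
inline void example_usage() {
  alignas(4) std::int8_t buf[8] = {};
  assert(meets_assumed_alignment(buf));
  assert(!meets_assumed_alignment(buf + 1));
}
```

A caller whose pointer fails such a check is in the situation the last sentence above covers: it should pass the alignment it can actually guarantee rather than rely on the default.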

Signed-off-by: Klochkov, Vyacheslav N <[email protected]>
@v-klochkov v-klochkov requested a review from a team as a code owner November 9, 2023 23:58
Signed-off-by: Klochkov, Vyacheslav N <[email protected]>

@turinevgeny turinevgeny left a comment

Makes sense.

@v-klochkov v-klochkov merged commit c6362a0 into intel:sycl Nov 10, 2023
@v-klochkov v-klochkov deleted the esimd_fix_block_load_usm_perf_alignment branch November 10, 2023 03:56
2 participants