Skip to content

[SYCL] Fix memory leak in reduction resources #5653

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Mar 18, 2022

Conversation

steffenlarsen
Copy link
Contributor

Reductions that require additional resources, such as buffers, can currently create a circular dependency between the resources and the commands issued by the reductions. These changes clear up this dependence in a similar way to how streams are transferred by transferring ownership of the resources to the commands and ensuring release when cleaning up the commands.

Reductions that require additional resources, such as buffers, can
currently create a circular dependency between the resources and the
commands issued by the reductions. These changes clear up this
dependence in a similar way to how streams are transferred by
transferring ownership of the resources to the commands and ensuring
release when cleaning up the commands.

Signed-off-by: Steffen Larsen <[email protected]>
@steffenlarsen steffenlarsen requested a review from a team as a code owner February 24, 2022 13:41
@steffenlarsen
Copy link
Contributor Author

This is a reopening of #5162. This is the strictest fix to the reduction leak issue and is needed for an upcoming optimization patch. This makes #5588 obsolete.

@steffenlarsen
Copy link
Contributor Author

Solution has been discussed with @sergey-semenov offline (w.r.t #5162 (comment)) and given it is very similar to streams, it should not add much more work to making it support post-enqueue cleanup.

@steffenlarsen
Copy link
Contributor Author

/verify with intel/llvm-test-suite#624

Signed-off-by: Steffen Larsen <[email protected]>
Signed-off-by: Steffen Larsen <[email protected]>
sergey-semenov
sergey-semenov previously approved these changes Feb 25, 2022
Copy link
Contributor

@sergey-semenov sergey-semenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Steffen Larsen <[email protected]>
sergey-semenov
sergey-semenov previously approved these changes Feb 25, 2022
@bader
Copy link
Contributor

bader commented Feb 25, 2022

@alexbatashev, it looks like GHA uses outdated driver version. Am I right? Can you help to fix this, please?

@alexbatashev
Copy link
Contributor

@alexbatashev, it looks like GHA uses outdated driver version. Am I right? Can you help to fix this, please?

Patch: #5679

@bader
Copy link
Contributor

bader commented Mar 5, 2022

@steffenlarsen, is it ready for merge? I'm not sure because I see failures in pre-commit validation and Sergey requested some additional changes to make in #5120.

@steffenlarsen
Copy link
Contributor Author

@bader This is ready for merge, but I think we may have to disable the failing tests temporarily. The precommit failures are caused by the compute-runtime and newer versions are believed to have fixed this issue.

#5120 was reverted, so I will make the required common change in the re-opened PR for it (assuming this is merged first.)

@bader
Copy link
Contributor

bader commented Mar 7, 2022

The precommit failures are caused by the compute-runtime and newer versions are believed to have fixed this issue.

I suggest we update the compute-runtime and re-test this change to get the status for pre-commit tests. We should be able to merge #5742 in a few hours if tests pass. Based on updated results we either merge this PR as is or update with tests if needed.
Does it sound okay to you?

@steffenlarsen
Copy link
Contributor Author

I suggest we update the compute-runtime and re-test this change to get the status for pre-commit tests. We should be able to merge #5742 in a few hours if tests pass. Based on updated results we either merge this PR as is or update with tests if needed. Does it sound okay to you?

Yes, that sounds like the best solution to me, thanks!

@bader
Copy link
Contributor

bader commented Mar 7, 2022

@steffenlarsen, #5742 is merged. To get updated test status, could you merge with the tip of the sycl branch, please?

@steffenlarsen
Copy link
Contributor Author

@bader Tests still ran with the old drivers. Do you know why?

@bader
Copy link
Contributor

bader commented Mar 8, 2022

@bader Tests still ran with the old drivers. Do you know why?

Sorry, my fault. #5742 updated the driver for Jenkins CI only. You can see that all Linux tests pass in Jenkins, but there are Windows driver yet, so we still have to update the test status on Windows.
#5751 updates the driver version for GitHub actions config. Once it's merged, GitHub will create a docker container with updated driver and will use it in CI.

@bader
Copy link
Contributor

bader commented Mar 8, 2022

Update: we need to wait for #5752 to be merged. It updates the image we use in pre-commit. #5751 updated the image we use for nightly testing.

@bader
Copy link
Contributor

bader commented Mar 8, 2022

@steffenlarsen, GitHub Actions pre-commit passed. Could you create a PR for llvm-test-suite update on Windows, please?

@steffenlarsen
Copy link
Contributor Author

Failing precommit tests are on Windows because CI drivers for Windows haven't been uplifted yet. @bader do you prefer we disable the sporadic failures (majority of reduction tests) on Windows or wait for the Windows drivers to be there?

@bader
Copy link
Contributor

bader commented Mar 11, 2022

AFAIK, @tfzhu is already updating Windows drivers. I think it should be done soon and we can pass pre-commit for this change.

@bader
Copy link
Contributor

bader commented Mar 16, 2022

/verify with intel/llvm-test-suite#624

Signed-off-by: Steffen Larsen <[email protected]>
@steffenlarsen
Copy link
Contributor Author

There was a bug in the reduction implementation where auxiliary resources would initialize with pointers to data kept in another (non-buffer) auxiliary resource but not prevent an unnecessary copy-back. With the leak fixed, the initializer resources would die before the buffer resource it was used to initialize, which would cause the copy-back to copy to unexpected memory. This is fixed by disabling the unnecessary copy-back on affected auxiliary reduction resources.

@steffenlarsen
Copy link
Contributor Author

/verify with intel/llvm-test-suite#624

1 similar comment
@steffenlarsen
Copy link
Contributor Author

/verify with intel/llvm-test-suite#624

@steffenlarsen
Copy link
Contributor Author

Test has been fixed. @sergey-semenov could you please have another look?

@bader bader requested a review from sergey-semenov March 17, 2022 16:59
@steffenlarsen
Copy link
Contributor Author

@intel/llvm-gatekeepers - This should be ready for merge.

@bader bader merged commit 9aefea0 into intel:sycl Mar 18, 2022
bso-intel added a commit to bso-intel/llvm-test-suite that referenced this pull request Mar 22, 2022
PR intel/llvm#5653 fixed memory leaks when ZE_DEBUG=4 is set

Signed-off-by: Byoungro So <[email protected]>
vladimirlaz pushed a commit to intel/llvm-test-suite that referenced this pull request Mar 23, 2022
PR intel/llvm#5653 fixed memory leaks when ZE_DEBUG=4 is set

Signed-off-by: Byoungro So <[email protected]>
aelovikov-intel pushed a commit to aelovikov-intel/llvm that referenced this pull request Mar 27, 2023
…e#930)

PR intel#5653 fixed memory leaks when ZE_DEBUG=4 is set

Signed-off-by: Byoungro So <[email protected]>
@steffenlarsen steffenlarsen deleted the steffen/fix_reduction_resource_leak branch December 6, 2023 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants