Skip to content

[SYCL][Fusion] Improve circular dependency detection #8148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

sommerlukas
Copy link
Contributor

Kernel fusion can potentially create cycles in the SYCL dependency graph. These potential cycles must be detected and fusion must be cancelled, if it would give rise to such a cycle.

This PR improves the detection process for cycles to traverse the dependency graph. As this is more effort than the old solution, cycle detection has been moved from queue::submit, where it would be happening for each kernel submission to a queue in fusion mode, to fusion_wrapper::complete_fusion, where it happens at most once per fusion.

Signed-off-by: Lukas Sommer [email protected]

@sommerlukas sommerlukas self-assigned this Jan 30, 2023
@sommerlukas sommerlukas requested a review from a team as a code owner January 30, 2023 13:08
@sommerlukas
Copy link
Contributor Author

For more background, we can have a look at this test case.

Without fusion, the dependency graph looks like this:

graph TD
  A[KernelTwo] --> B[KernelThree]
  B --> C[KernelOne]
Loading

After fusion, this would result in the following dependency graph:

graph TD
  A[Fused from KernelOne and KernelTwo] -->B[KernelThree]
  B --> A
Loading

As cycles in the dependency graph are not allowed, fusion needs to be cancelled in this case.

So far, this scenario would be detected when KernelThree is submitted and fusion would be cancelled, by simply checking the edge between KernelThree and KernelOne.

However, in reality the dependency graph can be more complicated:

graph TD
  A[KernelTwo] --> B[...]
  B --> C[Map memory]
  C --> D[KernelThree]
  D --> E[...]
  E --> F[Map memory]
  F --> G[KernelOne]
Loading

In this case, it is not sufficient to check the direct edge of KernelThree to detect the other ongoing fusion. Instead, the graph needs to traversed until finding another in-flight fusion or ending at the root of graph (allocation nodes/commands).

As this traversal is too expensive to perform it on every queue::submit (where the check happened so far), the new check has been moved to fusion_wrapper::complete_fusion, to only perform it once and still detect potential cycles before completing fusion.

@sommerlukas
Copy link
Contributor Author

The necessary updates to the tests are in intel/llvm-test-suite#1557.

@sommerlukas
Copy link
Contributor Author

/verify with intel/llvm-test-suite#1557

@sommerlukas sommerlukas temporarily deployed to aws January 30, 2023 13:41 — with GitHub Actions Inactive
@sommerlukas sommerlukas temporarily deployed to aws January 30, 2023 14:15 — with GitHub Actions Inactive
@sommerlukas sommerlukas temporarily deployed to aws January 30, 2023 14:36 — with GitHub Actions Inactive
@sommerlukas sommerlukas temporarily deployed to aws January 30, 2023 14:38 — with GitHub Actions Inactive
@sommerlukas sommerlukas temporarily deployed to aws January 30, 2023 15:53 — with GitHub Actions Inactive
@sommerlukas sommerlukas temporarily deployed to aws January 30, 2023 16:39 — with GitHub Actions Inactive
@sommerlukas sommerlukas force-pushed the kernel-fusion/detect-circular-dep branch from d04fa48 to e6f9925 Compare January 30, 2023 17:49
@sommerlukas sommerlukas requested a review from Naghasan January 30, 2023 17:49
@sommerlukas sommerlukas temporarily deployed to aws January 30, 2023 18:54 — with GitHub Actions Inactive
@sommerlukas sommerlukas temporarily deployed to aws January 30, 2023 19:57 — with GitHub Actions Inactive
@sommerlukas
Copy link
Contributor Author

sommerlukas commented Jan 31, 2023

@sergey-semenov: The failure in the checks seems to be due to CI using the wrong version of intel/llvm-test-suite.

The version of sync_two_queues_requirement.cpp in intel/llvm-test-suite#1557 doesn't even have an assert in line 63, as reported by the failed check.

I put the /verify with comment on this PR, is there something else I need to do?

@sergey-semenov
Copy link
Contributor

@sommerlukas The failing jobs are always run against the current version of llvm-test-suite, so that's expected. Your verify command should have launched a "Jenkins/llvm-test-suite" alongside "Jenkins/Precommit", but it looks like that didn't work for some reason. I've reported this issue, your verify results from the llvm-test-suite PR should be enough here.

@sommerlukas
Copy link
Contributor Author

@sergey-semenov: Thanks for the information!

FWIW, "Jenkins/llvm-test-suite" was successfully executed as part of the test-suite PR: intel/llvm-test-suite#1557

The other CI pipelines in the test-suite PR were executed against the latest nightly, so it is expected that they fail, as they do not include the new logic from this PR.

@bader
Copy link
Contributor

bader commented Jan 31, 2023

Your verify command should have launched a "Jenkins/llvm-test-suite" alongside "Jenkins/Precommit", but it looks like that didn't work for some reason. I've reported this issue, your verify results from the llvm-test-suite PR should be enough here.

The job has launched and passed successfully. See results for this commit: d04fa48. Unfortunately, it's hard to find because @sommerlukas did rebase. @sommerlukas, please, use git merge and do not use git rebase.

Copy link
Contributor

@sergey-semenov sergey-semenov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just some minor stylistic suggestions.

Signed-off-by: Lukas Sommer <[email protected]>
@sommerlukas
Copy link
Contributor Author

@sergey-semenov Thanks for your feedback, addressed it in the latest commit.

@bader: Sorry for the CI confusion, I amended and force-pushed the branch when addressing @Naghasan's feedback. For the new feedback, I've created a separate commit to avoid the issue.

@sommerlukas sommerlukas temporarily deployed to aws February 1, 2023 16:03 — with GitHub Actions Inactive
@sommerlukas sommerlukas temporarily deployed to aws February 1, 2023 17:08 — with GitHub Actions Inactive
@sommerlukas
Copy link
Contributor Author

The checks for OpenCL and LevelZero failed expectedly, as they execute against intel/llvm-test-suite and do not include the changes made in intel/llvm-test-suite#1557.

Checks for AMD and CUDA failed due to some unrelated test failure for ESIMD:

Failed Tests (1):
  SYCL :: ESIMD/dword_atomic_smoke_scalar_off.cpp

@bader @sergey-semenov: Can someone please merge, I'm not authorized.

@sergey-semenov
Copy link
Contributor

The failing test has already been disabled on CUDA and HIP in intel/llvm-test-suite#1564

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants