Skip to content

[SYCL][CUDA] support launch bounds #9772

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

[SYCL][CUDA] support launch bounds #9772

wants to merge 2 commits into from

Conversation

jinz2014
Copy link
Contributor

@jinz2014 jinz2014 commented Jun 7, 2023

@steffenlarsen @npmiller

Many users have requested the feature. I tried to get started. Help is needed to finish the definition and implementation of the kernel attribute for "minBlocksPerMultiprocessor" to support launch bounds in CUDA. Thanks.

def SYCLMinWorkGroupsPerComputeUnit : InheritableAttr {
  let Spellings = [CXX11<"sycl", "min_workgroups_per_cu">];
  let Args = [ExprArgument<"Value">];
  let LangOpts = [SYCLIsDevice, SilentlyIgnoreSYCLIsHost];
  let Subjects = SubjectList<[Function], ErrorDiag>;
  let Documentation = [Undocumented];
  let SupportsNonconformingLambdaSyntax = 1;
}

@jinz2014 jinz2014 requested a review from a team as a code owner June 7, 2023 14:10
Copy link
Contributor

@elizabethandrews elizabethandrews left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please update the PR description to describe this change and add a test?

@steffenlarsen
Copy link
Contributor

Interesting! Does this need to be an attribute or could we make it a compile-time kernel property instead?

@steffenlarsen
Copy link
Contributor

@jinz2014 jinz2014 temporarily deployed to aws June 7, 2023 16:32 — with GitHub Actions Inactive
@jinz2014 jinz2014 temporarily deployed to aws June 7, 2023 18:40 — with GitHub Actions Inactive
@jinz2014
Copy link
Contributor Author

jinz2014 commented Jun 9, 2023

When a sycl::local_accessor is needed for a SYCL kernel, the Clang compiler will create some metadata for the local accessor. The metadata, somehow, overrides the metadata for the bound value (e.g. max threads of a block). In the end, the bound value is not written to the NVIDIA PTX assembly file even though it is added by the function "addNVVMMetadata" in clang/lib/CodeGen/TargetInfo.cpp. Do you have more clues about the issue ?

addNVVMMetadata(F, "maxntidx", bound value);

@jinz2014 jinz2014 marked this pull request as draft June 12, 2023 21:36
@jchlanda
Copy link
Contributor

When a sycl::local_accessor is needed for a SYCL kernel, the Clang compiler will create some metadata for the local accessor. The metadata, somehow, overrides the metadata for the bound value (e.g. max threads of a block). In the end, the bound value is not written to the NVIDIA PTX assembly file even though it is added by the function "addNVVMMetadata" in clang/lib/CodeGen/TargetInfo.cpp. Do you have more clues about the issue ?

addNVVMMetadata(F, "maxntidx", bound value);

Do you happen to have a repro for it, I'd like to have a look at it?

@jinz2014
Copy link
Contributor Author

Any SYCL program using a SYCL local accessor is a reproducer. Thanks.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <random>
#include <sycl/sycl.hpp>

int main(int argc, char* argv[]) {

  if (argc != 2) {
    printf("Usage: ./%s <iterations>\n", argv[0]);
    return 1;
  }

  // specify the number of test cases
  const int iteration = atoi(argv[1]);

  // number of elements to reverse
  const int len = 256;

  // save device result
  int test[len];

  // save expected results after performing reverse operations even/odd times
  int error = 0;
  int gold_odd[len];
  int gold_even[len];

  for (int i = 0; i < len; i++) {
    gold_odd[i] = len-i-1;
    gold_even[i] = i;
  }

#ifdef USE_GPU
  sycl::queue q(sycl::gpu_selector_v, sycl::property::queue::in_order());
#else
  sycl::queue q(sycl::cpu_selector_v, sycl::property::queue::in_order());
#endif

  int *d_test = sycl::malloc_device<int>(len, q);
  sycl::range<1> gws (len);
  sycl::range<1> lws (len);

  std::default_random_engine generator (123);
  // bound the number of reverse operations
  std::uniform_int_distribution<int> distribution(100,9999);

  for (int i = 0; i < iteration; i++) {
    const int count = distribution(generator);

    q.memcpy(d_test, gold_even, sizeof(int) * len);

    q.wait();

    for (int j = 0; j < count; j++) {
      q.submit([&](sycl::handler &cgh) {
        sycl::local_accessor <int, 1> s (lws, cgh);
        cgh.parallel_for<class blockReverse>(
          sycl::nd_range<1>(gws, lws), [=](sycl::nd_item<1> item) [[intel::max_work_group_size(1,1,256)]] {
          int t = item.get_local_id(0);
          s[t] = d_test[t];
          item.barrier(sycl::access::fence_space::local_space);
          d_test[t] = s[len-t-1];
        });
      });
    }
   q.wait();
      q.memcpy(test, d_test, sizeof(int) * len).wait();

    if (count % 2 == 0)
      error = memcmp(test, gold_even, len*sizeof(int));
    else
      error = memcmp(test, gold_odd, len*sizeof(int));

    if (error) break;
  }

  printf("%s\n", error ? "FAIL" : "PASS");

  free(d_test, q);

  return 0;
}


@jinz2014
Copy link
Contributor Author

jinz2014 commented Aug 1, 2023

@jchlanda Do you know the cause after taking a look ?

@jchlanda
Copy link
Contributor

jchlanda commented Aug 2, 2023

@jchlanda Do you know the cause after taking a look ?

This has slipped of my radar completely, sorry. Let me have a look.

@jchlanda
Copy link
Contributor

jchlanda commented Aug 4, 2023

I can reproduce the error and I think I know what's going on. The metadata is correctly generated and the addition of maxntid is in the module, however, for local accessors, the link between the kernel and the launch bound is lost and the asm printer ignores it, see:

!0 = distinct !{ptr @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_E12blockReverse, !"kernel", i32 1}
!1 = distinct !{null, !"maxntidx", i32 256}

For it to be valid, maxntidx should be link with the kernel:

!1 = distinct !{ptr @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_E12blockReverse, !"maxntidx", i32 256}

Let me find who's misbehaving here.

@jchlanda
Copy link
Contributor

jchlanda commented Aug 8, 2023

@jinz2014 I've got a patch to fix this issue here: #9772
Please let me know if it works for you.

@jchlanda
Copy link
Contributor

@jinz2014 FYI: #11192

@jinz2014 jinz2014 closed this Dec 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants