[SYCL][CUDA] support launch bounds #9772

jinz2014 · 2023-06-07T14:10:18Z

Many users have requested the feature. I tried to get started. Help is needed to finish the definition and implementation of the kernel attribute for "minBlocksPerMultiprocessor" to support launch bounds in CUDA. Thanks.

def SYCLMinWorkGroupsPerComputeUnit : InheritableAttr {
  let Spellings = [CXX11<"sycl", "min_workgroups_per_cu">];
  let Args = [ExprArgument<"Value">];
  let LangOpts = [SYCLIsDevice, SilentlyIgnoreSYCLIsHost];
  let Subjects = SubjectList<[Function], ErrorDiag>;
  let Documentation = [Undocumented];
  let SupportsNonconformingLambdaSyntax = 1;
}

elizabethandrews

Can you please update the PR description to describe this change and add a test?

steffenlarsen · 2023-06-07T15:08:48Z

Interesting! Does this need to be an attribute or could we make it a compile-time kernel property instead?

steffenlarsen · 2023-06-07T15:11:40Z

@jchlanda

jinz2014 · 2023-06-09T17:19:59Z

When a sycl::local_accessor is needed for a SYCL kernel, the Clang compiler will create some metadata for the local accessor. The metadata, somehow, overrides the metadata for the bound value (e.g. max threads of a block). In the end, the bound value is not written to the NVIDIA PTX assembly file even though it is added by the function "addNVVMMetadata" in clang/lib/CodeGen/TargetInfo.cpp. Do you have more clues about the issue ?

addNVVMMetadata(F, "maxntidx", bound value);

jchlanda · 2023-06-13T09:51:33Z

When a sycl::local_accessor is needed for a SYCL kernel, the Clang compiler will create some metadata for the local accessor. The metadata, somehow, overrides the metadata for the bound value (e.g. max threads of a block). In the end, the bound value is not written to the NVIDIA PTX assembly file even though it is added by the function "addNVVMMetadata" in clang/lib/CodeGen/TargetInfo.cpp. Do you have more clues about the issue ?
addNVVMMetadata(F, "maxntidx", bound value);

Do you happen to have a repro for it, I'd like to have a look at it?

jinz2014 · 2023-06-14T13:41:46Z

Any SYCL program using a SYCL local accessor is a reproducer. Thanks.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <random>
#include <sycl/sycl.hpp>

int main(int argc, char* argv[]) {

  if (argc != 2) {
    printf("Usage: ./%s <iterations>\n", argv[0]);
    return 1;
  }

  // specify the number of test cases
  const int iteration = atoi(argv[1]);

  // number of elements to reverse
  const int len = 256;

  // save device result
  int test[len];

  // save expected results after performing reverse operations even/odd times
  int error = 0;
  int gold_odd[len];
  int gold_even[len];

  for (int i = 0; i < len; i++) {
    gold_odd[i] = len-i-1;
    gold_even[i] = i;
  }

#ifdef USE_GPU
  sycl::queue q(sycl::gpu_selector_v, sycl::property::queue::in_order());
#else
  sycl::queue q(sycl::cpu_selector_v, sycl::property::queue::in_order());
#endif

  int *d_test = sycl::malloc_device<int>(len, q);
  sycl::range<1> gws (len);
  sycl::range<1> lws (len);

  std::default_random_engine generator (123);
  // bound the number of reverse operations
  std::uniform_int_distribution<int> distribution(100,9999);

  for (int i = 0; i < iteration; i++) {
    const int count = distribution(generator);

    q.memcpy(d_test, gold_even, sizeof(int) * len);

    q.wait();

    for (int j = 0; j < count; j++) {
      q.submit([&](sycl::handler &cgh) {
        sycl::local_accessor <int, 1> s (lws, cgh);
        cgh.parallel_for<class blockReverse>(
          sycl::nd_range<1>(gws, lws), [=](sycl::nd_item<1> item) [[intel::max_work_group_size(1,1,256)]] {
          int t = item.get_local_id(0);
          s[t] = d_test[t];
          item.barrier(sycl::access::fence_space::local_space);
          d_test[t] = s[len-t-1];
        });
      });
    }
   q.wait();
      q.memcpy(test, d_test, sizeof(int) * len).wait();

    if (count % 2 == 0)
      error = memcmp(test, gold_even, len*sizeof(int));
    else
      error = memcmp(test, gold_odd, len*sizeof(int));

    if (error) break;
  }

  printf("%s\n", error ? "FAIL" : "PASS");

  free(d_test, q);

  return 0;
}

jinz2014 · 2023-08-01T20:42:02Z

@jchlanda Do you know the cause after taking a look ?

jchlanda · 2023-08-02T08:52:46Z

@jchlanda Do you know the cause after taking a look ?

This has slipped of my radar completely, sorry. Let me have a look.

jchlanda · 2023-08-04T07:48:17Z

I can reproduce the error and I think I know what's going on. The metadata is correctly generated and the addition of maxntid is in the module, however, for local accessors, the link between the kernel and the launch bound is lost and the asm printer ignores it, see:

!0 = distinct !{ptr @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_E12blockReverse, !"kernel", i32 1}
!1 = distinct !{null, !"maxntidx", i32 256}

For it to be valid, maxntidx should be link with the kernel:

!1 = distinct !{ptr @_ZTSZZ4mainENKUlRN4sycl3_V17handlerEE_clES2_E12blockReverse, !"maxntidx", i32 256}

Let me find who's misbehaving here.

jchlanda · 2023-08-08T08:53:24Z

@jinz2014 I've got a patch to fix this issue here: #9772
Please let me know if it works for you.

jchlanda · 2023-09-15T12:41:10Z

@jinz2014 FYI: #11192

zjin-lcf added 2 commits June 7, 2023 06:54

[SYCL][CUDA] To support launch bounds

69d04ef

[SYCL][CUDA] update the formats

440e64f

jinz2014 requested a review from a team as a code owner June 7, 2023 14:10

elizabethandrews reviewed Jun 7, 2023

View reviewed changes

jinz2014 temporarily deployed to aws June 7, 2023 16:32 — with GitHub Actions Inactive

jinz2014 temporarily deployed to aws June 7, 2023 18:40 — with GitHub Actions Inactive

jinz2014 marked this pull request as draft June 12, 2023 21:36

This was referenced Aug 8, 2023

[SYCL] Fix error ‘CHAR_BIT’ was not declared in this scope in sycl/source/stream.cpp #10726

Merged

[SYCL] Update kernel function metadata after local accessor pass #10727

Merged

jinz2014 closed this Dec 12, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[SYCL][CUDA] support launch bounds #9772

[SYCL][CUDA] support launch bounds #9772

Uh oh!

jinz2014 commented Jun 7, 2023

Uh oh!

elizabethandrews left a comment

Uh oh!

steffenlarsen commented Jun 7, 2023

Uh oh!

steffenlarsen commented Jun 7, 2023

Uh oh!

jinz2014 commented Jun 9, 2023

Uh oh!

jchlanda commented Jun 13, 2023

Uh oh!

jinz2014 commented Jun 14, 2023

Uh oh!

jinz2014 commented Aug 1, 2023

Uh oh!

jchlanda commented Aug 2, 2023

Uh oh!

jchlanda commented Aug 4, 2023

Uh oh!

jchlanda commented Aug 8, 2023

Uh oh!

jchlanda commented Sep 15, 2023

Uh oh!

Uh oh!

[SYCL][CUDA] support launch bounds #9772

[SYCL][CUDA] support launch bounds #9772

Uh oh!

Conversation

jinz2014 commented Jun 7, 2023

Uh oh!

elizabethandrews left a comment

Choose a reason for hiding this comment

Uh oh!

steffenlarsen commented Jun 7, 2023

Uh oh!

steffenlarsen commented Jun 7, 2023

Uh oh!

jinz2014 commented Jun 9, 2023

Uh oh!

jchlanda commented Jun 13, 2023

Uh oh!

jinz2014 commented Jun 14, 2023

Uh oh!

jinz2014 commented Aug 1, 2023

Uh oh!

jchlanda commented Aug 2, 2023

Uh oh!

jchlanda commented Aug 4, 2023

Uh oh!

jchlanda commented Aug 8, 2023

Uh oh!

jchlanda commented Sep 15, 2023

Uh oh!

Uh oh!