Skip to content

add Level Zero memory provider #237

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 4, 2024

Conversation

bratpiorka
Copy link
Contributor

@bratpiorka bratpiorka commented Feb 14, 2024

This PR adds a Level Zero memory provider that supports #225 enhancement. TODO:

  • Make Level Zero provider a separate lib that dynamically loads Level Zero functions from ze_loader
  • Create a separate workflow for this with required deps (Level Zero) that will run on GPU
  • Add basic test with simple allocation/deallocation
  • Update README and docs

Please note that shared memory would be supported in separate PR

@igchor
Copy link
Member

igchor commented Feb 14, 2024

@bratpiorka can you describe what's your plan regarding integration with UR? Do we want to have a duplicate provider in UMF for now or are you aiming to use this provider in UR? If it's the latter, there are more features you need to support: see the original implementation here https://github.com/oneapi-src/unified-runtime/blob/main/source/adapters/level_zero/usm.cpp#L170

@igchor
Copy link
Member

igchor commented Feb 14, 2024

Also, the L0 provider should be a separate static library: see #225

@bratpiorka
Copy link
Contributor Author

@igchor thanks for the links. What do you think about creating a basic functionality first and then adding more features (in separate PRs) so it could finally replace the providers used in UR?

@igchor
Copy link
Member

igchor commented Feb 14, 2024

@igchor thanks for the links. What do you think about creating a basic functionality first and then adding more features (in separate PRs) so it could finally replace the providers used in UR?

Sure, that sounds reasonable, I just wanted to highlight that there might be some gaps with the current implementation.

@bratpiorka bratpiorka force-pushed the rrudnick_ze_provider branch 2 times, most recently from 55a88e9 to eddecb1 Compare February 15, 2024 09:18
@vinser52
Copy link
Contributor

@igchor thanks for the links. What do you think about creating a basic functionality first and then adding more features (in separate PRs) so it could finally replace the providers used in UR?

But what is the issue of migrating the whole UR implementation as is? Probably with some minor tweaks

@vinser52
Copy link
Contributor

Also, the L0 provider should be a separate static library: see #225

@igchor did you mean that L0 provider should be only static library or both: static and dynamic?

@bratpiorka
Copy link
Contributor Author

@igchor thanks for the links. What do you think about creating a basic functionality first and then adding more features (in separate PRs) so it could finally replace the providers used in UR?

But what is the issue of migrating the whole UR implementation as is? Probably with some minor tweaks

This version of the Level Zero provider is very simple and is intended to be used by oneCCL, but does not contain any oneCCL-specific code. On the other hand, the UR provider contains a lot of UR-specific code and is quite complex. Making it generic would require a lot of work. Also keep in mind that most of the contents of this PR are additional files needed by umf: workflows, cmake files, def/map etc. So even with this PR porting is still an option - it wouldn't change a lot of code.

@igchor
Copy link
Member

igchor commented Feb 16, 2024

Also, the L0 provider should be a separate static library: see #225

@igchor did you mean that L0 provider should be only static library or both: static and dynamic?

I think we need the possibility to build it as a static library for UR. As for dynamic library, I don't know at this moment.

@vinser52
Copy link
Contributor

@igchor thanks for the links. What do you think about creating a basic functionality first and then adding more features (in separate PRs) so it could finally replace the providers used in UR?

But what is the issue of migrating the whole UR implementation as is? Probably with some minor tweaks

This version of the Level Zero provider is very simple and is intended to be used by oneCCL, but does not contain any oneCCL-specific code. On the other hand, the UR provider contains a lot of UR-specific code and is quite complex. Making it generic would require a lot of work. Also keep in mind that most of the contents of this PR are additional files needed by umf: workflows, cmake files, def/map etc. So even with this PR porting is still an option - it wouldn't change a lot of code.

My point is since UR already uses UMF and we agreed to move L0 provider impl from UR repo to UMF we should migrate it first (of course UR specific code should be cleaned up) and make sure we are not breaking UR flow. And then we can test if it fits for oneCCL needs.

@vinser52
Copy link
Contributor

Also, the L0 provider should be a separate static library: see #225

@igchor did you mean that L0 provider should be only static library or both: static and dynamic?

I think we need the possibility to build it as a static library for UR. As for dynamic library, I don't know at this moment.

Will it introduce any runtime dependencies to some L0 components? It is important in context of MPI, today they have no link dependency to L0 they try to load L0 at runtime (using dlopen) if I am correct. I will double-check.

@bratpiorka
Copy link
Contributor Author

@igchor thanks for the links. What do you think about creating a basic functionality first and then adding more features (in separate PRs) so it could finally replace the providers used in UR?

But what is the issue of migrating the whole UR implementation as is? Probably with some minor tweaks

This version of the Level Zero provider is very simple and is intended to be used by oneCCL, but does not contain any oneCCL-specific code. On the other hand, the UR provider contains a lot of UR-specific code and is quite complex. Making it generic would require a lot of work. Also keep in mind that most of the contents of this PR are additional files needed by umf: workflows, cmake files, def/map etc. So even with this PR porting is still an option - it wouldn't change a lot of code.

My point is since UR already uses UMF and we agreed to move L0 provider impl from UR repo to UMF we should migrate it first (of course UR specific code should be cleaned up) and make sure we are not breaking UR flow. And then we can test if it fits for oneCCL needs.

Also, the L0 provider should be a separate static library: see #225

@igchor did you mean that L0 provider should be only static library or both: static and dynamic?

I think we need the possibility to build it as a static library for UR. As for dynamic library, I don't know at this moment.

Will it introduce any runtime dependencies to some L0 components? It is important in context of MPI, today they have no link dependency to L0 they try to load L0 at runtime (using dlopen) if I am correct. I will double-check.

oneCCL also loads L0 at runtime

@igchor
Copy link
Member

igchor commented Feb 16, 2024

Also, the L0 provider should be a separate static library: see #225

@igchor did you mean that L0 provider should be only static library or both: static and dynamic?

I think we need the possibility to build it as a static library for UR. As for dynamic library, I don't know at this moment.

Will it introduce any runtime dependencies to some L0 components? It is important in context of MPI, today they have no link dependency to L0 they try to load L0 at runtime (using dlopen) if I am correct. I will double-check.

@vinser52 Okay, then in that case we'll need a dynamic version I guess. Can you add this info to the issue so that we don't miss it?

@bratpiorka
Copy link
Contributor Author

@igchor thanks for the links. What do you think about creating a basic functionality first and then adding more features (in separate PRs) so it could finally replace the providers used in UR?

But what is the issue of migrating the whole UR implementation as is? Probably with some minor tweaks

This version of the Level Zero provider is very simple and is intended to be used by oneCCL, but does not contain any oneCCL-specific code. On the other hand, the UR provider contains a lot of UR-specific code and is quite complex. Making it generic would require a lot of work. Also keep in mind that most of the contents of this PR are additional files needed by umf: workflows, cmake files, def/map etc. So even with this PR porting is still an option - it wouldn't change a lot of code.

My point is since UR already uses UMF and we agreed to move L0 provider impl from UR repo to UMF we should migrate it first (of course UR specific code should be cleaned up) and make sure we are not breaking UR flow. And then we can test if it fits for oneCCL needs.

I looked into the code more deeply + consulted with @igchor and we could break the existing L0 provider in UR into two providers: a UR-specific top-level provider (implemented in UR) and a low-level generic L0 provider that just translates UMF provider API to ZE calls. With this approach, the provider implementation in this PR should be really simple and generic so it could be used by both UR and oneCCL

@vinser52
Copy link
Contributor

@igchor thanks for the links. What do you think about creating a basic functionality first and then adding more features (in separate PRs) so it could finally replace the providers used in UR?

But what is the issue of migrating the whole UR implementation as is? Probably with some minor tweaks

This version of the Level Zero provider is very simple and is intended to be used by oneCCL, but does not contain any oneCCL-specific code. On the other hand, the UR provider contains a lot of UR-specific code and is quite complex. Making it generic would require a lot of work. Also keep in mind that most of the contents of this PR are additional files needed by umf: workflows, cmake files, def/map etc. So even with this PR porting is still an option - it wouldn't change a lot of code.

My point is since UR already uses UMF and we agreed to move L0 provider impl from UR repo to UMF we should migrate it first (of course UR specific code should be cleaned up) and make sure we are not breaking UR flow. And then we can test if it fits for oneCCL needs.

I looked into the code more deeply + consulted with @igchor and we could break the existing L0 provider in UR into two providers: a UR-specific top-level provider (implemented in UR) and a low-level generic L0 provider that just translates UMF provider API to ZE calls. With this approach, the provider implementation in this PR should be really simple and generic so it could be used by both UR and oneCCL

I have the same idea and wanted to discuss it in our sync today. Looks like we are on the same page.

@vinser52
Copy link
Contributor

Also, the L0 provider should be a separate static library: see #225

@igchor did you mean that L0 provider should be only static library or both: static and dynamic?

I think we need the possibility to build it as a static library for UR. As for dynamic library, I don't know at this moment.

Will it introduce any runtime dependencies to some L0 components? It is important in context of MPI, today they have no link dependency to L0 they try to load L0 at runtime (using dlopen) if I am correct. I will double-check.

@vinser52 Okay, then in that case we'll need a dynamic version I guess. Can you add this info to the issue so that we don't miss it?

Sure, I will update the issue. But, the dynamic version can be omitted if our static version will try to dlopen/dlsym libze_loader.so - In that case we are not making any hard dependency.

@bratpiorka bratpiorka force-pushed the rrudnick_ze_provider branch 10 times, most recently from 9ab1e48 to 15e3522 Compare February 21, 2024 08:02
@bratpiorka bratpiorka force-pushed the rrudnick_ze_provider branch 2 times, most recently from cf34132 to e66b0bc Compare February 26, 2024 18:32
@bratpiorka
Copy link
Contributor Author

@igchor @vinser52 @pbalcer - please review. I removed dlopen() from the Level Zero provider at all. The test I added to the CI loads libze_loader dynamically using dlopen(RTLD_GLOBAL) and the Level Zero provider reuses these symbols without any problems.

@igchor
Copy link
Member

igchor commented Feb 27, 2024

@igchor @vinser52 @pbalcer - please review. I removed dlopen() from the Level Zero provider at all. The test I added to the CI loads libze_loader dynamically using dlopen(RTLD_GLOBAL) and the Level Zero provider reuses these symbols without any problems.

Can we also have a similar test for when the application links with Level Zero normally (not using dlopen)? I.e. just have a second test that just assign all those functions pointers in libze_ops to &zeInit, &zeDriverGet, etc?

Copy link
Contributor

@lukaszstolarczuk lukaszstolarczuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@bratpiorka
Copy link
Contributor Author

Can we also have a similar test for when the application links with Level Zero normally (not using dlopen)? I.e. just have a second test that just assign all those functions pointers in libze_ops to &zeInit, &zeDriverGet, etc?

@igchor done

@bratpiorka bratpiorka force-pushed the rrudnick_ze_provider branch 2 times, most recently from d8e07c3 to 799e4fc Compare March 4, 2024 13:54
@bratpiorka
Copy link
Contributor Author

@vinser52 I have resolved the discussions started by you, as we agreed

@bratpiorka bratpiorka force-pushed the rrudnick_ze_provider branch from 799e4fc to 5330b56 Compare March 4, 2024 14:10
Copy link
Contributor

@ldorau ldorau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not build on WSL Ubuntu:

/home/ldorau/work/unified-memory-framework/src/provider/provider_level_zero.c: In function ‘init_ze_global_state’:
/home/ldorau/work/unified-memory-framework/src/provider/provider_level_zero.c:48:48: error: ‘RTLD_DEFAULT’ undeclared (first use in this function)
   48 |     *(void **)&g_ze_ops.zeMemAllocHost = dlsym(RTLD_DEFAULT, "zeMemAllocHost");
      |                                                ^~~~~~~~~~~~
/home/ldorau/work/unified-memory-framework/src/provider/provider_level_zero.c:48:48: note: each undeclared identifier is reported only once for each function it appears in
make[2]: *** [src/CMakeFiles/umf.dir/build.make:328: src/CMakeFiles/umf.dir/provider/provider_level_zero.c.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1153: src/CMakeFiles/umf.dir/all] Error 2
make: *** [Makefile:146: all] Error 2

@bratpiorka bratpiorka force-pushed the rrudnick_ze_provider branch from 5330b56 to d5c014e Compare March 4, 2024 14:26
@bratpiorka
Copy link
Contributor Author

@ldorau please check now. BTW we definitely should add WSL to our CI

@bratpiorka bratpiorka force-pushed the rrudnick_ze_provider branch from d5c014e to cdf96c4 Compare March 4, 2024 14:29
@bratpiorka bratpiorka force-pushed the rrudnick_ze_provider branch from cdf96c4 to ba84ef3 Compare March 4, 2024 18:30
@bratpiorka bratpiorka merged commit 0d83bd0 into oneapi-src:main Mar 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants