[mlir] Document GPU dialect layering to capture discussions from a PR #95812

Merged: 4 commits, Jun 20, 2024.
42 changes: 39 additions & 3 deletions mlir/docs/Dialects/GPU.md
manipulations to launch a GPU kernel and provide a simple path towards GPU
execution from MLIR. It may be targeted, for example, by DSLs using MLIR. The
dialect uses `gpu` as its canonical prefix.

This dialect also abstracts away primitives commonly available in GPU code, such
as `gpu.thread_id` (an operation that returns the ID of a thread within
a thread block/workgroup along a given dimension). While the compilation
pipelines documented below expect such code to live inside a `gpu.module` and
`gpu.func`, these intrinsic wrappers may also be used outside of that context.

Intrinsic-wrapping operations should not assume that they have a parent of type
`gpu.func`. However, operations that deal with compiling and launching GPU
functions, such as `gpu.launch_func` or `gpu.binary`, may assume that the
dialect's full layering is in use.
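
For instance, an intrinsic wrapper such as `gpu.thread_id` can appear in an
ordinary `func.func` without any surrounding `gpu.module` (a sketch; the
function name is illustrative):

```mlir
// gpu.thread_id used outside of gpu.module/gpu.func. The op returns the
// thread's ID along the requested dimension as an index value.
func.func @thread_id_x() -> index {
  %tx = gpu.thread_id x
  return %tx : index
}
```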

[TOC]

## GPU address spaces

The GPU dialect exposes the `gpu.address_space` attribute, which currently has
three values: `global`, `workgroup`, and `private`.

These address spaces represent the kinds of buffers commonly seen in GPU
compilation. `global` memory resides in the GPU's global memory. `workgroup`
memory is a limited, per-workgroup resource: all threads in a workgroup/thread
block access the same values in `workgroup` memory. Finally, `private` memory
represents `alloca`-like buffers that are private to a single thread/workitem.

These address spaces may be used as the `memorySpace` attribute on `memref` values.
The `gpu.module`/`gpu.func` compilation pipeline will lower such memory space
usages to the correct address spaces on target platforms. Memory attributions should be
created with the correct memory space on the memref.
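
As a sketch (function and argument names are illustrative), these attributes
appear as the memory space on `memref` types:

```mlir
// Illustrative use of gpu.address_space as the memref memory space:
// a global buffer and a per-workgroup tile passed to the same function.
func.func @copy_tile(
    %src: memref<16xf32, #gpu.address_space<global>>,
    %tile: memref<16xf32, #gpu.address_space<workgroup>>) {
  return
}
```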

## Memory attribution

Memory buffers are defined at the function level, either in "gpu.launch" or in
"gpu.func" ops.
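
A hedged sketch of such attributions on a `gpu.func` (the module, kernel, and
buffer names are illustrative):

```mlir
// Workgroup and private memory attributions declared on the function itself,
// each carrying the matching gpu.address_space on its memref type.
gpu.module @kernels {
  gpu.func @kernel(%arg0: memref<32xf32>)
      workgroup(%buf: memref<32xf32, #gpu.address_space<workgroup>>)
      private(%tmp: memref<4xf32, #gpu.address_space<private>>)
      kernel {
    gpu.return
  }
}
```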
```
mlir-translate example-nvvm.mlir \
  -o example.ll
```

This compilation process expects all GPU code to live in a `gpu.module` and
all kernels to be `gpu.func` operations. Non-kernel functions, such as device
library functions, may be defined using `func.func` or other non-GPU dialect
operations. This permits downstream systems to use these wrappers without
requiring the GPU dialect's function operations, which might not carry
information those systems want to have as intrinsic values on their functions.
It also allows `func.func` to be used for device-side library functions inside
`gpu.module`s.
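
The layout described above can be sketched as follows (module and function
names are illustrative): a kernel written as a `gpu.func` alongside a
device-side helper kept as a `func.func`, both in the same `gpu.module`:

```mlir
// A gpu.func kernel calling a non-kernel helper defined with func.func,
// both nested inside one gpu.module.
gpu.module @device_code {
  func.func @device_helper(%x: f32) -> f32 {
    return %x : f32
  }
  gpu.func @kernel(%arg0: f32) kernel {
    %0 = func.call @device_helper(%arg0) : (f32) -> f32
    gpu.return
  }
}
```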

### Default NVVM Compilation Pipeline: gpu-lower-to-nvvm-pipeline

The `gpu-lower-to-nvvm-pipeline` compilation pipeline serves as the default way
within GPU code execution:
```mlir
func.func @main() {
    %c2 = arith.constant 2 : index
    %c1 = arith.constant 1 : index
    gpu.launch
        blocks(%0, %1, %2) in (%3 = %c1, %4 = %c1, %5 = %c1)
        threads(%6, %7, %8) in (%9 = %c2, %10 = %c1, %11 = %c1) {
        gpu.printf "Hello from %d\n" %6 : index
        gpu.terminator
    }
    return
}
```