[MLIR][XeGPU] Update XeGPU doc #136155

Merged · 2 commits · Apr 17, 2025
95 changes: 48 additions & 47 deletions mlir/include/mlir/Dialect/XeGPU/IR/XeGPUAttrs.td
@@ -183,53 +183,54 @@ def XeGPU_LayoutAttr : XeGPUAttr<"Layout", "layout"> {
1-dimensional layout. The first dimension in the order list is the fastest-changing dimension. If it
is not present, the default value is [1, 0].

Examples:

1. Subgroup level layout:
```mlir
#xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1]>
```
In this example, there are 16 work-items per subgroup, organized as
[[0, 1, 2, ..., 7], [8, 9, ..., 15]]. The distribution unit is 1x1.

2. Subgroup level layout with order:
```mlir
#xegpu.layout<lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
```
In this example, there are 16 work-items per subgroup, organized as
[[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]]. The distribution unit is 1x1.

3. Subgroup level layout with inst_data:
```mlir
#xegpu.layout<inst_data = [8, 16], lane_layout = [2, 8], lane_data = [2, 2]>
```
In this example, the original problem size is partitioned into smaller subproblems of dimensions [8, 16],
which are then distributed among 16 work-items arranged as [[0, 1, 2, ..., 7], [8, 9, ..., 15]]. Each
work-item is assigned four 2x2 blocks in a round-robin manner.

4. Workgroup level layout:
```mlir
#xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1]>
```
In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
arranged as [[0, 1, 2, 3], [4, 5, 6, 7]]. Each subgroup accesses a 16x16 block per instruction, which
is further distributed to 16 work-items organized as [[0, 1, 2, ..., 7], [8, 9, ..., 15]].

5. Workgroup level layout with order:
```mlir
#xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], lane_layout = [2, 8], lane_data = [1, 1], order = [0, 1]>
```
In this example, the layout represents a workgroup distribution. A workgroup consists of 8 subgroups
arranged as [[0, 2, 4, 6], [1, 3, 5, 7]]. Each subgroup accesses a 16x16 block per instruction, which
is further distributed to 16 work-items organized as [[0, 2, 4, ..., 14], [1, 3, 5, ..., 15]].

6. Workgroup level layout with inst_data:
```mlir
#xegpu.layout<sg_layout = [2, 4], sg_data = [16, 16], inst_data = [8, 16], lane_layout = [2, 8], lane_data = [1, 1]>
```
This example is similar to the previous ones, but the `inst_data` parameter divides `sg_data` into two instructions,
each processing an 8x16 block. These blocks are further distributed across 16 work-items with a distribution unit of 1x1.
Unlike the 2x2 distribution unit in example 3, which results in accessing contiguous 2x2 blocks, the 1x1 distribution
unit may result in non-contiguous access.
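The lane arrangements in the examples above can be reproduced with a short sketch. This is a hypothetical helper for illustration, not part of the XeGPU API: it decomposes a linear lane ID along the dimensions listed in `order`, with the first listed dimension varying fastest (the default `[1, 0]` fills rows left to right).

```python
def lane_arrangement(lane_layout, order=(1, 0)):
    """Return the 2D grid of lane IDs for a given lane_layout.

    `order` lists dimensions from fastest- to slowest-changing;
    the default (1, 0) makes the column dimension vary fastest.
    """
    rows, cols = lane_layout
    grid = [[0] * cols for _ in range(rows)]
    for lane in range(rows * cols):
        # Decompose the linear lane ID: the first dimension in
        # `order` absorbs the fastest-changing part of the ID.
        idx = [0, 0]
        rem = lane
        for dim in order:
            idx[dim] = rem % lane_layout[dim]
            rem //= lane_layout[dim]
        grid[idx[0]][idx[1]] = lane
    return grid

# Example 1: default order [1, 0] -> [[0..7], [8..15]].
print(lane_arrangement([2, 8]))
# Example 2: order [0, 1] -> [[0, 2, ..., 14], [1, 3, ..., 15]].
print(lane_arrangement([2, 8], order=(0, 1)))
```

The same decomposition applies to `sg_layout` at the workgroup level, which is why examples 4 and 5 show the analogous subgroup arrangements.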
}];

let parameters = (ins
19 changes: 14 additions & 5 deletions mlir/include/mlir/Dialect/XeGPU/IR/XeGPUDialect.td
@@ -16,11 +16,20 @@ def XeGPU_Dialect : Dialect {
let cppNamespace = "::mlir::xegpu";
let summary = "The XeGPU dialect that models Intel GPU's ISA";
let description = [{
The XeGPU dialect models Intel Xe ISA semantics but operates on vector and
TensorDesc data types. It provides 1:1 mappings to Xe instructions
such as DPAS and 2D block load. The matrix size processed at this level
exactly matches the hardware instructions or the intrinsics supported by
the lower-level GPU compiler.
The XeGPU dialect closely models a subset of the Xe GPU's ISA, providing an
abstraction to support high-performance GEMM code generation. It serves as a
bridge dialect in the MLIR gradual lowering process, working with MLIR memref
and vector types, and complements the Arith, Math, Vector, and Memref dialects.
XeGPU operations are introduced for special Xe instructions not modeled by the
LLVM/SPIR-V dialect, such as DPAS and 2D block load and store.

It supports a tile-based programming model, decomposing the GEMM kernel into
large predefined tile sizes at the subgroup and workgroup levels. XeGPU allows
the high-level GEMM algorithm to be easily expressed. Underneath, it uses
target-specific recipes and hardware features to achieve optimal performance
on specific hardware. By decomposing GEMM at submatrix granularity and mapping it
to registers, it naturally supports optimizations like fusing with neighboring
operations.
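The tile-based decomposition described above can be sketched as follows; this is an illustrative helper (names are ours, not XeGPU API), assuming the default row-major subgroup order, where each subgroup in an `sg_layout` grid owns one `sg_data`-sized tile of the workgroup-level problem.

```python
def subgroup_tile_offset(sg_id, sg_layout, sg_data):
    """Row/col offset of the tile owned by linear subgroup `sg_id`,
    assuming the default row-major order [1, 0]."""
    rows, cols = sg_layout
    r, c = divmod(sg_id, cols)  # position of this subgroup in the grid
    return (r * sg_data[0], c * sg_data[1])

# With sg_layout = [2, 4] and sg_data = [16, 16], subgroup 5 sits at
# grid position (1, 1) and owns the 16x16 tile starting at (16, 16).
print(subgroup_tile_offset(5, [2, 4], [16, 16]))
```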
}];

let dependentDialects = ["arith::ArithDialect"];