[mlir][xegpu] XeGPU distribution patterns for load_nd, store_nd, and create_nd_tdesc. #119783


Closed

Conversation


@charithaintc charithaintc commented Dec 12, 2024

This introduces SIMT distribution patterns for XeGPU load_nd, store_nd, and create_nd_tdesc operations. For these operations, TensorDescType is not distributed. This PR is based on @kurapov-peter's earlier draft #112945.
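The distribution idea can be sketched in IR form. The shapes and op mix below are illustrative only, adapted from the snippets quoted later in this review, not the exact code in the patch:

```mlir
// Before distribution: a subgroup-level load of a 4x8 tile.
//   %td  = xegpu.create_nd_tdesc %src[0, 0]
//        : memref<4x8xf32> -> !xegpu.tensor_desc<4x8xf32>
//   %val = xegpu.load_nd %td
//        : !xegpu.tensor_desc<4x8xf32> -> vector<4x8xf32>
//
// After distribution across 8 lanes: the vector type is distributed
// (each work item holds a 4x1 slice), while the tensor descriptor
// remains uniform and is not distributed.
//   %r:2 = gpu.warp_execute_on_lane_0(%laneid)[8]
//        -> (!xegpu.tensor_desc<4x8xf32>, vector<4x1xf32>) { ... }
//   xegpu.store_nd %r#1, %r#0
//        : vector<4x1xf32>, !xegpu.tensor_desc<4x8xf32>
```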

@charithaintc charithaintc changed the title [MLIR][XeGPU] Xegpu distribution patterns for load_nd, store_nd, and create_nd_tdesc. [MLIR][XeGPU][Draft] Xegpu distribution patterns for load_nd, store_nd, and create_nd_tdesc. Dec 12, 2024

github-actions bot commented Dec 12, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@charithaintc charithaintc changed the title [MLIR][XeGPU][Draft] Xegpu distribution patterns for load_nd, store_nd, and create_nd_tdesc. [MLIR][XeGPU][Draft] XeGPU distribution patterns for load_nd, store_nd, and create_nd_tdesc. Dec 13, 2024
@charithaintc charithaintc changed the title [MLIR][XeGPU][Draft] XeGPU distribution patterns for load_nd, store_nd, and create_nd_tdesc. [mlir][xegpu] XeGPU distribution patterns for load_nd, store_nd, and create_nd_tdesc. Feb 4, 2025
@charithaintc charithaintc marked this pull request as ready for review February 4, 2025 20:44
@charithaintc
Contributor Author

@kurapov-peter @chencha3 @adam-smnk @Jianhui-Li Could you please review this PR.

Contributor

@kurapov-peter kurapov-peter left a comment

Thanks for reviving this!

/// %view = memref.subview %r#0[0, %laneid] [4, 1] [1, 1]
/// : memref<4x8xf32> to memref<4x1xf32>
/// %td = xegpu.create_nd_tdesc %view[0, 0]: memref<4x1xf32>
/// -> !xegpu.tensor_desc<4x1xf32>
Contributor

The comments need to be changed as well, as we don't need memref.subview.

Contributor Author

fixed now.

using namespace mlir;

namespace {
bool divisible(APInt lhs, APInt rhs) { return !lhs.urem(rhs); }
Contributor

@Jianhui-Li Jianhui-Li Feb 5, 2025

How about changing the file name to XeGPUSubgroupDistribute.cpp, to be more explicit? Since we also have a notion of "workgroup distribute".

Contributor Author

Everything under our control is changed to "Subgroup", along with class names, pass names, and test cases.

@@ -0,0 +1,79 @@
// RUN: mlir-opt -test-xegpu-distribute -split-input-file %s | FileCheck %s
Contributor

@Jianhui-Li Jianhui-Li Feb 5, 2025

How about changing the file name to xegpu-subgroup-distribute.cpp, to be more explicit? Since we also have a notion of "workgroup distribute".

Contributor Author

Renamed to subgroup-distribute.mlir. Since we are already in the XeGPU directory, I think the xegpu prefix is redundant here. I don't see other dialects using the same naming convention. We should also rename some existing test cases, like xegpu-fold-alias-ops.mlir to just fold-alias-ops.mlir.

/// !xegpu.tensor_desc<4x8xf32>, vector<4x8xf32>
/// vector.yield %arg0, %arg1
/// }
/// xegpu.store_nd %r#0, %r#1: vector<4x1xf32>,
Contributor

this should be load_nd?

Contributor Author

Yes, fixed it. Sorry I missed it.

}

LogicalResult
WarpOpTensorDescOp::matchAndRewrite(gpu::WarpExecuteOnLane0Op warpOp,
Contributor

Better to stick to the "subgroup" prefix since XeGPU uses "subgroup" terminology, which is the counterpart of warp.

Contributor Author

fixed.

@charithaintc
Contributor Author

@adam-smnk Could you please help with reviewing and merging this?

Comment on lines +19 to +20
#define DEBUG_TYPE "xegpu-distribute"
#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE "]: ")
Contributor

nit: looks unused so could be removed

auto layout = sgMap.getWiLayout();
auto shape = originalT.getShape();
for (const auto [l, o] : llvm::zip_equal(layout, shape)) {
if (!divisible(APInt(64, o), APInt(64, l)))
Contributor

Why do we need to go through APInt for this?

/// still contain the original op that will not be used by the yield op (and
/// should be cleaned up later with dce). The yield op will bypass the
/// create_nd_tdesc's arguments. Tensor descriptor is not distributed because it
/// is a uniform value accorss all work items within the subgroup.
Contributor

Suggested change
/// is a uniform value accorss all work items within the subgroup.
/// is a uniform value across all work items within the subgroup.

llvm::SmallVector<int64_t, 2> distributedShape;
auto layout = sgMap.getWiLayout();
auto shape = originalT.getShape();
for (const auto [l, o] : llvm::zip_equal(layout, shape)) {
Contributor

nit: could you use more descriptive variable names?

Contributor

Could you also add a few invalid test cases? Like 1D with sg_map, no map, etc.

MLIRXeGPUTransforms
MLIRXeGPUDialect
MLIRSupport
)
Contributor

nit: missing newline

"Failed to distribute the type");
VectorType newVectorType = distributedTypeOrFailure.value();

auto distributedDescTypeOrFailure = getDistributedTensorDescType(
Contributor

Do we need this at all? I think the TensorDesc can never be scattered for nd ops.

return rewriter.notifyMatchFailure(
descOp, "the tensor descriptor lacks sg_map attribute");

auto distributedDescTypeOrFailure = getDistributedTensorDescType(
Contributor

Could you add a test case for this?

@@ -0,0 +1,79 @@
// RUN: mlir-opt -test-xegpu-subgroup-distribute -split-input-file %s | FileCheck %s

#sg_map_16 = #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]>
Contributor

Just a sanity check: does the distribution work fine with wi_data different from 1?

FailureOr<xegpu::TensorDescType>
getDistributedTensorDescType(xegpu::TensorDescType originalT,
xegpu::SGMapAttr sgMap,
xegpu::MemorySpace memSpace) {
Contributor

I take it that memSpace is there to propagate it to the newly created type. If so, it's unused at the moment.
Anyway, does it need to be a separate argument at all, or could it be taken directly from originalT?

@charithaintc
Contributor Author

@adam-smnk Thanks very much for the review. It looks like the PR needs more changes to support corner cases like wi_data = [1, 2]. I am working on it now.
