[mlir][xegpu] XeGPU distribution patterns for load_nd, store_nd, and create_nd_tdesc. #119783
Conversation
✅ With the latest revision this PR passed the C/C++ code formatter.
@kurapov-peter @chencha3 @adam-smnk @Jianhui-Li Could you please review this PR?
Thanks for reviving this!
/// %view = memref.subview %r#0[0, %laneid] [4, 1] [1, 1]
///     : memref<4x8xf32> to memref<4x1xf32>
/// %td = xegpu.create_nd_tdesc %view[0, 0]: memref<4x1xf32>
///     -> !xegpu.tensor_desc<4x1xf32>
The comments need to be changed as well, as we don't need memref.subview.
fixed now.
using namespace mlir;

namespace {
bool divisible(APInt lhs, APInt rhs) { return !lhs.urem(rhs); } |
How about changing the file name to XeGPUSubgroupDistribute.cpp to be more explicit, since we also have a notion of "workgroup distribute"?
Everything under our control is changed to Subgroup, along with class names, pass names, and test cases.
@@ -0,0 +1,79 @@
// RUN: mlir-opt -test-xegpu-distribute -split-input-file %s | FileCheck %s |
How about changing the file name to xegpu-subgroup-distribute.cpp to be more explicit, since we also have a notion of "workgroup distribute"?
Renamed to subgroup-distribute.mlir. Since we are already in the XeGPU directory, I think the xegpu prefix is redundant here. I don't see other dialects using the same naming convention. We need to fix some of the existing test case names, like xegpu-fold-alias-ops.mlir to just fold-alias-ops.mlir.
///     !xegpu.tensor_desc<4x8xf32>, vector<4x8xf32>
///   vector.yield %arg0, %arg1
/// }
/// xegpu.store_nd %r#0, %r#1: vector<4x1xf32>, |
this should be load_nd?
Yes, fixed it. Sorry I missed it.
}

LogicalResult
WarpOpTensorDescOp::matchAndRewrite(gpu::WarpExecuteOnLane0Op warpOp, |
Better to stick to the "subgroup" prefix, since XeGPU uses "subgroup" terminology, which is the counterpart of warp.
fixed.
@adam-smnk Can you please help with reviewing and merging this?
#define DEBUG_TYPE "xegpu-distribute"
#define DBGS() (llvm::dbgs() << "[" DEBUG_TYPE "]: ") |
nit: looks unused so could be removed
auto layout = sgMap.getWiLayout();
auto shape = originalT.getShape();
for (const auto [l, o] : llvm::zip_equal(layout, shape)) {
  if (!divisible(APInt(64, o), APInt(64, l)))
Why do we need to go through APInt for this?
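For comparison, a minimal plain-integer version of the check the question seems to point at (a hypothetical helper, not the code in this PR):

```cpp
#include <cstdint>

// Hypothetical alternative: the dimension sizes and layout entries are plain
// int64_t values, so the divisibility test does not need to build APInts.
static bool divisible(int64_t value, int64_t divisor) {
  return divisor != 0 && value % divisor == 0;
}
```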
/// still contain the original op that will not be used by the yield op (and
/// should be cleaned up later with dce). The yield op will bypass the
/// create_nd_tdesc's arguments. Tensor descriptor is not distributed because it
/// is a uniform value accorss all work items within the subgroup. |
Suggested change:
- /// is a uniform value accorss all work items within the subgroup.
+ /// is a uniform value across all work items within the subgroup.
llvm::SmallVector<int64_t, 2> distributedShape;
auto layout = sgMap.getWiLayout();
auto shape = originalT.getShape();
for (const auto [l, o] : llvm::zip_equal(layout, shape)) { |
nit: could you use more descriptive variable names?
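As an illustration of the nit only, the loop could read roughly like this with more descriptive names (a standalone sketch; the names and signature are not from the PR):

```cpp
#include <cstdint>

#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallVector.h"

// Divide each tensor dimension by the number of work items laid out along it;
// fail if a dimension is not evenly divisible across the subgroup.
static bool computeDistributedShape(llvm::ArrayRef<int64_t> tensorShape,
                                    llvm::ArrayRef<int64_t> wiLayout,
                                    llvm::SmallVectorImpl<int64_t> &outShape) {
  for (auto [lanesInDim, dimSize] : llvm::zip_equal(wiLayout, tensorShape)) {
    if (lanesInDim == 0 || dimSize % lanesInDim != 0)
      return false;
    outShape.push_back(dimSize / lanesInDim);
  }
  return true;
}
```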
Could you also add a few invalid test cases? Like 1D with sg_map, no map, etc.
MLIRXeGPUTransforms
MLIRXeGPUDialect
MLIRSupport
) |
nit: missing newline
"Failed to distribute the type"); | ||
VectorType newVectorType = distributedTypeOrFailure.value(); | ||
|
||
auto distributedDescTypeOrFailure = getDistributedTensorDescType( |
Do we need this at all? I think TensorDesc can't ever be scattered for nd ops.
return rewriter.notifyMatchFailure(
    descOp, "the tensor descriptor lacks sg_map attribute");

auto distributedDescTypeOrFailure = getDistributedTensorDescType( |
Could you add a test case for this?
@@ -0,0 +1,79 @@
// RUN: mlir-opt -test-xegpu-subgroup-distribute -split-input-file %s | FileCheck %s

#sg_map_16 = #xegpu.sg_map<wi_layout = [1, 16], wi_data = [1, 1]> |
Just a sanity check: does the distribution work fine with wi_data different from 1?
FailureOr<xegpu::TensorDescType>
getDistributedTensorDescType(xegpu::TensorDescType originalT,
                             xegpu::SGMapAttr sgMap,
                             xegpu::MemorySpace memSpace) {
I take it that memSpace is there to propagate it to the newly created type. If so, it's unused atm. Anyway, does it need to be a separate argument at all, or could it be taken directly from originalT?
@adam-smnk Thanks very much for the review. It looks like the PR needs more changes to support corner cases like wi_data = [1, 2]. I am working on it now.
This introduces SIMT distribution patterns for the XeGPU load_nd, store_nd, and create_nd_tdesc operations. For these operations, TensorDescType is not distributed. This PR is based on @kurapov-peter's earlier draft #112945.
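To make that concrete, here is a small worked sketch of the distribution rule implied by the quoted doc comments, assuming wi_data = [1, 1] and a wi_layout of [1, 8] chosen to match the 4x1 per-lane vectors shown earlier (illustrative values, not taken from the PR):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  // A 4x8xf32 tile owned by a subgroup, with one work item per column
  // (assumed wi_layout = [1, 8], wi_data = [1, 1]).
  const int64_t shape[2] = {4, 8};
  const int64_t wiLayout[2] = {1, 8};
  int64_t perLane[2];
  for (int i = 0; i < 2; ++i)
    perLane[i] = shape[i] / wiLayout[i];
  // Vectors are distributed: vector<4x8xf32> becomes vector<4x1xf32> per lane.
  // The tensor descriptor is not distributed and keeps its 4x8 shape, since it
  // is a uniform value across all work items in the subgroup.
  std::printf("per-lane vector shape: %lld x %lld\n",
              static_cast<long long>(perLane[0]),
              static_cast<long long>(perLane[1]));
  return 0;
}
```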