[inductor][Autotune] Add matrix_instr_nonkdim to triton_meta (pytorch#122852)
Summary: Previous work (`https://github.com/pytorch/pytorch/pull/120742`) enabled `matrix_instr_nonkdim` only for autotuner benchmarking; it did not pass the parameter through Triton meta for real runs. Unlike kernel parameters such as `BLOCK_N`, which can simply be set inside the kernel itself, `matrix_instr_nonkdim` must be visible to the compiler driver so that it can set up the optimization pipeline.
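The distinction between kernel-level parameters and compiler-visible options can be illustrated with a minimal sketch. This is not the Inductor implementation: `COMPILER_VISIBLE_KEYS`, `split_config_kwargs`, and `build_triton_meta` are hypothetical names introduced only for illustration, and the set of compiler-visible keys is an assumption.

```python
# Hypothetical sketch, NOT the actual Inductor code path.
# Assumed for illustration: the only compiler-visible option is
# `matrix_instr_nonkdim`; everything else is a kernel constant.
COMPILER_VISIBLE_KEYS = {"matrix_instr_nonkdim"}


def split_config_kwargs(config_kwargs):
    """Split config kwargs into (kernel_constants, compiler_options)."""
    kernel_constants = {}
    compiler_options = {}
    for key, value in config_kwargs.items():
        if key in COMPILER_VISIBLE_KEYS:
            compiler_options[key] = value
        else:
            kernel_constants[key] = value
    return kernel_constants, compiler_options


def build_triton_meta(base_meta, config_kwargs):
    """Return a copy of base_meta with compiler-visible options merged in."""
    _, compiler_options = split_config_kwargs(config_kwargs)
    meta = dict(base_meta)  # shallow copy so the caller's dict is untouched
    meta.update(compiler_options)
    return meta


base_meta = {"device": 0, "device_type": "cuda", "constants": {}}
config_kwargs = {"BLOCK_M": 64, "BLOCK_N": 64, "matrix_instr_nonkdim": 16}

triton_meta = build_triton_meta(base_meta, config_kwargs)
kernel_constants, compiler_options = split_config_kwargs(config_kwargs)
print(triton_meta)
```

In this sketch, `BLOCK_M` and `BLOCK_N` stay with the kernel, while `matrix_instr_nonkdim` is copied into the meta dictionary so that a compiler driver reading that dictionary could see it.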
Test Plan:
P1201466917
triton_heuristics.template(
    num_stages=1,
    num_warps=4,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [instance_descriptor(divisible_by_16=(0, 1, 2), equal_to_1=(), ids_of_folded_args=(), divisible_by_8=())], 'matrix_instr_nonkdim': 16},
    inductor_meta={'kernel_name': 'triton_tem_fused_mm_0', 'backend_hash': None},
)
Perf:
Before: 1.693ms 0.134GB 79.28GB/s
After: 1.577ms 0.134GB 85.12GB/s
Differential Revision: D55456401
Pull Request resolved: pytorch#122852
Approved by: https://github.com/xw285cornell