@@ -432,6 +432,143 @@ to left-shift the found bit into the most-significant bit position, otherwise
432
432
the result is the shift amount needed to right-shift the found bit into the
433
433
least-significant bit position. 0xffffffff is returned if no 1 bit is found.
434
434
435
+ TMA family of Intrinsics
436
+ ------------------------
437
+
438
+ '``llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d ``'
439
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
440
+
441
+ Syntax:
442
+ """""""
443
+
444
+ .. code-block :: llvm
445
+
446
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.1d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)
447
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.2d(..., i32 %d0, i32 %d1, ...)
448
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
449
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
450
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
451
+
452
+ Overview:
453
+ """""""""
454
+
455
+ The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.tile.[1-5]d ``' intrinsics
456
+ correspond to the ``cp.async.bulk.tensor.[1-5]d.* `` set of PTX instructions.
457
+ These instructions initiate an asynchronous copy of tensor data from
458
+ global memory to shared::cluster memory (indicated by the ``g2s `` prefix)
459
+ in ``tile `` mode. In tile mode, the multi-dimensional layout of the
460
+ source tensor is preserved at the destination. The dimension of the
461
+ tensor data ranges from 1d to 5d with the coordinates specified
462
+ by the ``i32 %d0 ... i32 %d4 `` arguments.
463
+
464
+ * The last two arguments to these intrinsics are boolean flags
465
+ indicating support for cache_hint and/or multicast modifiers.
466
+ These flag arguments must be compile-time constants. The backend
467
+ looks through these flags and lowers the intrinsics appropriately.
468
+
469
+ * The Nth argument (denoted by ``i1 flag_ch ``) when set, indicates
470
+ a valid cache_hint (``i64 %ch ``) and generates the ``.L2::cache_hint ``
471
+ variant of the PTX instruction.
472
+
473
+ * The [N-1]th argument (denoted by ``i1 flag_mc ``) when set, indicates
474
+ the presence of a multicast mask (``i16 %mc ``) and generates the PTX
475
+ instruction with the ``.multicast::cluster `` modifier.
476
+
477
+ For more information, refer PTX ISA
478
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor >`_.
479
+
480
+ '``llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d ``'
481
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
482
+
483
+ Syntax:
484
+ """""""
485
+
486
+ .. code-block :: llvm
487
+
488
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.3d(ptr addrspace(3) %dst, ptr addrspace(3) %bar, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i16 %im2col0, i16 %mc, i64 %ch, i1 %flag_mc, i1 %flag_ch)
489
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i16 %im2col0, i16 %im2col1, ...)
490
+ declare void @llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, i16 %im2col0, i16 %im2col1, i16 %im2col2, ...)
491
+
492
+ Overview:
493
+ """""""""
494
+
495
+ The '``@llvm.nvvm.cp.async.bulk.tensor.g2s.im2col.[3-5]d ``' intrinsics
496
+ correspond to the ``cp.async.bulk.tensor.[1-5]d.* `` set of PTX instructions.
497
+ These instructions initiate an asynchronous copy of tensor data from
498
+ global memory to shared::cluster memory (indicated by the ``g2s `` prefix)
499
+ in ``im2col `` mode. In im2col mode, some dimensions of the source tensor
500
+ are unrolled into a single dimensional column at the destination. In this
501
+ mode, the tensor has to be at least three-dimensional. Along with the tensor
502
+ coordinates, im2col offsets are also specified (denoted by
503
+ ``i16 im2col0...i16 %im2col2 ``). The number of im2col offsets is two less
504
+ than the number of dimensions of the tensor operation. The last two arguments
505
+ to these intrinsics are boolean flags, with the same functionality as described
506
+ in the ``tile `` mode intrinsics above.
507
+
508
+ For more information, refer PTX ISA
509
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor >`_.
510
+
511
+ '``llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d ``'
512
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
513
+
514
+ Syntax:
515
+ """""""
516
+
517
+ .. code-block :: llvm
518
+
519
+ declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.1d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i64 %ch, i1 %flag_ch)
520
+ declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.2d(..., i32 %d0, i32 %d1, ...)
521
+ declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.3d(..., i32 %d0, i32 %d1, i32 %d2, ...)
522
+ declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
523
+ declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.tile.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
524
+
525
+ Overview:
526
+ """""""""
527
+
528
+ The '``@llvm.nvvm.cp.async.bulk.tensor.s2g.tile.[1-5]d ``' intrinsics
529
+ correspond to the ``cp.async.bulk.tensor.[1-5]d.* `` set of PTX instructions.
530
+ These instructions initiate an asynchronous copy of tensor data from
531
+ shared::cta to global memory (indicated by the ``s2g `` prefix)
532
+ in ``tile `` mode. The dimension of the tensor data ranges from 1d to 5d
533
+ with the coordinates specified by the ``i32 %d0 ... i32 %d4 `` arguments.
534
+
535
+ * The last argument to these intrinsics is a boolean flag
536
+ indicating support for cache_hint. This flag argument must
537
+ be a compile-time constant. When set, it indicates a valid
538
+ cache_hint (``i64 %ch ``) and generates the ``.L2::cache_hint ``
539
+ variant of the PTX instruction.
540
+
541
+ For more information, refer PTX ISA
542
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor >`_.
543
+
544
+ '``llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[3-5]d ``'
545
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
546
+
547
+ Syntax:
548
+ """""""
549
+
550
+ .. code-block :: llvm
551
+
552
+ declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.3d(ptr addrspace(3) %src, ptr %tensor_map, i32 %d0, i32 %d1, i32 %d2, i64 %ch, i1 %flag_ch)
553
+ declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.4d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, ...)
554
+ declare void @llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.5d(..., i32 %d0, i32 %d1, i32 %d2, i32 %d3, i32 %d4, ...)
555
+
556
+ Overview:
557
+ """""""""
558
+
559
+ The '``@llvm.nvvm.cp.async.bulk.tensor.s2g.im2col.[1-5]d ``' intrinsics
560
+ correspond to the ``cp.async.bulk.tensor.[1-5]d.* `` set of PTX instructions.
561
+ These instructions initiate an asynchronous copy of tensor data from
562
+ shared::cta to global memory (indicated by the ``s2g `` prefix)
563
+ in ``im2col `` mode. In this mode, the tensor has to be at least
564
+ three-dimensional. Unlike the ``g2s `` variants, there are no
565
+ im2col_offsets for these intrinsics. The last argument to these
566
+ intrinsics is a boolean flag, with the same functionality as
567
+ described in the ``s2g.tile `` mode intrinsics above.
568
+
569
+ For more information, refer PTX ISA
570
+ `<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#data-movement-and-conversion-instructions-cp-async-bulk-tensor >`_.
571
+
435
572
Other Intrinsics
436
573
----------------
437
574
0 commit comments