
Commit c2f6f84

Added CUDA graph, Tensor Core, and core pinning explanation

1 parent a58f40f commit c2f6f84

File tree

1 file changed: +28 −0 lines changed

recipes_source/recipes/tuning_guide.py

Lines changed: 28 additions & 0 deletions
@@ -213,6 +213,7 @@ def gelu(x):
 
 ###############################################################################
 # Typically, the following environment variables are used to set CPU affinity with the GNU OpenMP implementation. ``OMP_PROC_BIND`` specifies whether threads may be moved between processors. Setting it to CLOSE keeps OpenMP threads close to the primary thread in contiguous place partitions. ``OMP_SCHEDULE`` determines how OpenMP threads are scheduled. ``GOMP_CPU_AFFINITY`` binds threads to specific CPUs.
+# An important tuning parameter is core pinning, which prevents threads from migrating between CPUs, enhancing data locality and minimizing inter-core communication.
 #
 # .. code-block:: sh
 #
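Beyond environment variables, core pinning can also be done from within Python. The following is a minimal sketch, assuming a Linux host (os.sched_setaffinity is Linux-only) and arbitrarily chosen CPU indices:

import os

# Restrict the current process (and all of its threads) to the first four
# logical CPUs, so threads cannot migrate beyond this set of cores.
os.sched_setaffinity(0, {0, 1, 2, 3})

# Confirm which CPUs the process may now run on.
print(os.sched_getaffinity(0))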
@@ -318,6 +319,33 @@ def gelu(x):
 # GPU specific optimizations
 # --------------------------
 
+###############################################################################
+# Enable Tensor Cores
+# ~~~~~~~~~~~~~~~~~~~~~~~
+# Tensor Cores are specialized hardware units for computing matrix-matrix
+# multiplication operations, which neural network operations can take
+# advantage of.
+#
+# Tensor Core operations tend to use a different floating point format,
+# which sacrifices some precision in exchange for speed gains.
+
+torch.backends.cuda.matmul.allow_tf32 = True
+
+# Prior to PyTorch 1.12 this was enabled by default, but since that version
+# it must be set explicitly, as it can conflict with some operations that do
+# not benefit from Tensor Core computations.
+
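For context, a minimal sketch of how these TF32 switches might be used, assuming an NVIDIA Ampere-or-newer GPU (TF32 has no effect on older hardware); the matrix sizes are arbitrary:

import torch

# Allow TF32 for matmuls and for cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# float32 matmuls can now execute on Tensor Cores in the TF32 format.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b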
+
+###############################################################################
+# Use CUDA Graphs
+# ~~~~~~~~~~~~~~~~~~~~~~~
+# When using a GPU, work must first be launched from the CPU, and in some
+# cases the context switch between the CPU and the GPU can lead to poor
+# resource utilization. CUDA graphs are a way to keep computation within the
+# GPU without paying the extra cost of kernel launches and host
+# synchronization.
+#
+# They can be enabled using the `torch.compile <https://pytorch.org/docs/stable/generated/torch.compile.html>`_ "reduce-overhead" and "max-autotune" modes.
+# Special care must be taken when using CUDA graphs, as they can lead to increased memory consumption and some models might not compile.
+
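A minimal sketch of enabling CUDA graphs through torch.compile, assuming PyTorch 2.x on a CUDA-capable machine; the toy model is only for illustration:

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 512),
).cuda()

# "reduce-overhead" lets torch.compile use CUDA graphs to cut kernel launch
# overhead: initial calls compile and warm up, subsequent calls can replay
# the captured graphs.
compiled_model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(64, 512, device="cuda")
y = compiled_model(x)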
 ###############################################################################
 # Enable cuDNN auto-tuner
 # ~~~~~~~~~~~~~~~~~~~~~~~
