[SYCL] Use dim_loop to unroll loops in reduce_over_group in cuda backend. (#7948)

JackAKirk · JackAKirk · web-flow · commit c7bb4c1272d8 · 2023-02-22T13:46:50.000-08:00
A performance regression was reported when using `reduce_over_group` with `sycl::vec`. The cause was a loop that calls the scalar `reduce_over_group` once per `sycl::vec` component: the loop was not unrolled, which led to register spills even at -O3.

Initially the regression could be fixed by adding `#pragma unroll` to the loop and declaring `reduce_over_group` with `__attribute__((always_inline))`. However, the `SYCL_UNROLL` macro that wrapped `#pragma unroll` has been removed in favour of `dim_loop` (#6939), so I have used `dim_loop` to unroll the loop instead.

In the cuda backend, using `dim_loop` this way on its own actually makes performance worse, because `dim_loop` introduces new non-inlined function calls that lead to register spills. The fix for this coincides with the fix for several user reports that the cuda backend is not aggressive enough with inlining, so this PR also increases the inlining threshold multiplier from 5 to 11. See https://reviews.llvm.org/D142232/new/ for the corresponding upstream PR (for the inlining threshold change), which includes much more detail on benchmarking dpc++ cuda with this change. In short, for dpc++ cuda the only downside is a very small increase in compile time in some cases, while there is a massive benefit to increasing the inlining threshold across a large number of applications.

Testing with the opencl cpu backend shows that this code change has no effect on that backend. The change is required for the cuda backend but should have no performance effect on other backends.

Fixes #6583.

---------

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>
Co-authored-by: JackAKirk <chezjakirk@gmail.com>
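For readers unfamiliar with `dim_loop`, the following is a minimal host-only sketch of how the helper unrolls a per-component loop at compile time using a C++17 fold expression. The `dim_loop` and `dim_loop_impl` definitions mirror the ones in this patch; `scalar_op` and `per_component` are hypothetical stand-ins for the scalar and vector `reduce_over_group` overloads, and `std::array` is used in place of `sycl::vec` so the sketch compiles without a SYCL toolchain.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Same shape as the helper in sycl/detail/helpers.hpp: the fold expression
// expands to f(0), f(1), ..., f(count - 1), so there is no runtime loop.
template <std::size_t... Inds, class F>
void dim_loop_impl(std::integer_sequence<std::size_t, Inds...>, F &&f) {
  (f(Inds), ...);
}

template <std::size_t count, class F> void dim_loop(F &&f) {
  dim_loop_impl(std::make_index_sequence<count>{}, std::forward<F>(f));
}

// Hypothetical stand-in for the scalar reduce_over_group overload.
inline int scalar_op(int x) { return 2 * x; }

// Hypothetical stand-in for the vector overload: applies the scalar
// operation to each component, with the component count N known at
// compile time so dim_loop can expand the calls.
template <std::size_t N>
std::array<int, N> per_component(const std::array<int, N> &x) {
  std::array<int, N> result{};
  dim_loop<N>([&](std::size_t s) { result[s] = scalar_op(x[s]); });
  return result;
}
```

Calling `per_component` on a four-element array produces four separate calls to `scalar_op`, one per index. Whether those calls (and the lambda itself) are then inlined is up to the compiler, which is why the patch pairs `dim_loop` with a higher inlining threshold multiplier for NVPTX.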
diff --git a/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h b/llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h
@@ -90,9 +90,9 @@ class NVPTXTTIImpl : public BasicTTIImplBase<NVPTXTTIImpl> {
     return true;
   }
 
-  // Increase the inlining cost threshold by a factor of 5, reflecting that
+  // Increase the inlining cost threshold by a factor of 11, reflecting that
   // calls are particularly expensive in NVPTX.
-  unsigned getInliningThresholdMultiplier() { return 5; }
+  unsigned getInliningThresholdMultiplier() { return 11; }
 
   InstructionCost getArithmeticInstrCost(
       unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,
diff --git a/sycl/include/sycl/accessor.hpp b/sycl/include/sycl/accessor.hpp
@@ -241,16 +241,6 @@ struct AccHostDataT {
   void *Reserved = nullptr;
 };
 
-// To ensure loop unrolling is done when processing dimensions.
-template <size_t... Inds, class F>
-void dim_loop_impl(std::integer_sequence<size_t, Inds...>, F &&f) {
-  (f(Inds), ...);
-}
-
-template <size_t count, class F> void dim_loop(F &&f) {
-  dim_loop_impl(std::make_index_sequence<count>{}, std::forward<F>(f));
-}
-
 void __SYCL_EXPORT constructorNotification(void *BufferObj, void *AccessorObj,
                                            access::target Target,
                                            access::mode Mode,
diff --git a/sycl/include/sycl/detail/helpers.hpp b/sycl/include/sycl/detail/helpers.hpp
@@ -245,6 +245,16 @@ getSPIRVMemorySemanticsMask(const access::fence_space AccessSpace,
                  LocalScopeMask);
 }
 
+// To ensure loop unrolling is done when processing dimensions.
+template <size_t... Inds, class F>
+void dim_loop_impl(std::integer_sequence<size_t, Inds...>, F &&f) {
+  (f(Inds), ...);
+}
+
+template <size_t count, class F> void dim_loop(F &&f) {
+  dim_loop_impl(std::make_index_sequence<count>{}, std::forward<F>(f));
+}
+
 } // namespace detail
 
 } // __SYCL_INLINE_VER_NAMESPACE(_V1)
diff --git a/sycl/include/sycl/group_algorithm.hpp b/sycl/include/sycl/group_algorithm.hpp
@@ -214,23 +214,24 @@ reduce_over_group(Group g, T x, BinaryOperation binary_op) {
 #endif
 }
 
-template <typename Group, typename T, class BinaryOperation>
-detail::enable_if_t<(is_group_v<std::decay_t<Group>> &&
-                     detail::is_vector_arithmetic<T>::value &&
-                     detail::is_native_op<T, BinaryOperation>::value),
-                    T>
-reduce_over_group(Group g, T x, BinaryOperation binary_op) {
+template <typename Group, typename T, int N, class BinaryOperation>
+detail::enable_if_t<
+    (is_group_v<std::decay_t<Group>> &&
+     detail::is_vector_arithmetic<sycl::vec<T, N>>::value &&
+     detail::is_native_op<sycl::vec<T, N>, BinaryOperation>::value),
+    sycl::vec<T, N>>
+reduce_over_group(Group g, sycl::vec<T, N> x, BinaryOperation binary_op) {
   // FIXME: Do not special-case for half precision
   static_assert(
       std::is_same<decltype(binary_op(x[0], x[0])),
-                   typename T::element_type>::value ||
-          (std::is_same<T, half>::value &&
+                   typename sycl::vec<T, N>::element_type>::value ||
+          (std::is_same<sycl::vec<T, N>, half>::value &&
            std::is_same<decltype(binary_op(x[0], x[0])), float>::value),
       "Result type of binary_op must match reduction accumulation type.");
-  T result;
-  for (int s = 0; s < x.size(); ++s) {
-    result[s] = reduce_over_group(g, x[s], binary_op);
-  }
+  sycl::vec<T, N> result;
+
+  detail::dim_loop<N>(
+      [&](size_t s) { result[s] = reduce_over_group(g, x[s], binary_op); });
   return result;
 }