Skip to content

Commit c7bb4c1

Browse files
JackAKirkJackAKirk
andauthored
[SYCL] Use dim_loop to unroll loops in reduce_over_group in cuda backend. (#7948)
A performance regression was reported when using `reduce_over_group` with sycl::vec. This was due to a loop over calls to the scalar `reduce_over_group` for each of the `sycl::vec` components that was not unrolled and led to register spills even at -O3. It was initially possible to fix the performance by calling `#pragma unroll` and declare `reduce_over_group` with `__attribute__((always_inline))`. However the `SYCL_UNROLL` macro that calls `#pragma unroll` has been removed in favour of `dim_loop` (#6939). I have used dim_loop to fix the loop unrolling. However, in the cuda backend, just using `dim_loop` in this way actually makes the performance worse. This is because `dim_loop` introduces new non inlined function calls in the cuda backend that lead to register spills. The solution to this coincides with the solution of several user reports that the cuda backend is not aggressive enough with inlining. In this PR I have also therefore increased the inlining threshold multiplier value to 11. See https://reviews.llvm.org/D142232/new/ for the corresponding upstream PR (for the inlining threshold change) that includes much more details on benchmarking dpc++ cuda with this change. In short, for dpc++ cuda, there is no other downside apart from a very small increase in compile time in some cases, but there is a massive benefit to increasing the inlining threshold across a large amount of applications. Testing using opencl cpu backend reveals that this code change has no effect on this backend. This change is required for the cuda backend but should have no performance effect for other backends. fixes #6583. --------- Signed-off-by: JackAKirk <[email protected]> Co-authored-by: JackAKirk <[email protected]>
1 parent 473bea3 commit c7bb4c1

File tree

4 files changed

+25
-24
lines changed

4 files changed

+25
-24
lines changed

llvm/lib/Target/NVPTX/NVPTXTargetTransformInfo.h

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -90,9 +90,9 @@ class NVPTXTTIImpl : public BasicTTIImplBase<NVPTXTTIImpl> {
9090
return true;
9191
}
9292

93-
// Increase the inlining cost threshold by a factor of 5, reflecting that
93+
// Increase the inlining cost threshold by a factor of 11, reflecting that
9494
// calls are particularly expensive in NVPTX.
95-
unsigned getInliningThresholdMultiplier() { return 5; }
95+
unsigned getInliningThresholdMultiplier() { return 11; }
9696

9797
InstructionCost getArithmeticInstrCost(
9898
unsigned Opcode, Type *Ty, TTI::TargetCostKind CostKind,

sycl/include/sycl/accessor.hpp

Lines changed: 0 additions & 10 deletions
Original file line numberDiff line numberDiff line change
@@ -241,16 +241,6 @@ struct AccHostDataT {
241241
void *Reserved = nullptr;
242242
};
243243

244-
// To ensure loop unrolling is done when processing dimensions.
245-
template <size_t... Inds, class F>
246-
void dim_loop_impl(std::integer_sequence<size_t, Inds...>, F &&f) {
247-
(f(Inds), ...);
248-
}
249-
250-
template <size_t count, class F> void dim_loop(F &&f) {
251-
dim_loop_impl(std::make_index_sequence<count>{}, std::forward<F>(f));
252-
}
253-
254244
void __SYCL_EXPORT constructorNotification(void *BufferObj, void *AccessorObj,
255245
access::target Target,
256246
access::mode Mode,

sycl/include/sycl/detail/helpers.hpp

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -245,6 +245,16 @@ getSPIRVMemorySemanticsMask(const access::fence_space AccessSpace,
245245
LocalScopeMask);
246246
}
247247

248+
// To ensure loop unrolling is done when processing dimensions.
249+
template <size_t... Inds, class F>
250+
void dim_loop_impl(std::integer_sequence<size_t, Inds...>, F &&f) {
251+
(f(Inds), ...);
252+
}
253+
254+
template <size_t count, class F> void dim_loop(F &&f) {
255+
dim_loop_impl(std::make_index_sequence<count>{}, std::forward<F>(f));
256+
}
257+
248258
} // namespace detail
249259

250260
} // __SYCL_INLINE_VER_NAMESPACE(_V1)

sycl/include/sycl/group_algorithm.hpp

Lines changed: 13 additions & 12 deletions
Original file line numberDiff line numberDiff line change
@@ -214,23 +214,24 @@ reduce_over_group(Group g, T x, BinaryOperation binary_op) {
214214
#endif
215215
}
216216

217-
template <typename Group, typename T, class BinaryOperation>
218-
detail::enable_if_t<(is_group_v<std::decay_t<Group>> &&
219-
detail::is_vector_arithmetic<T>::value &&
220-
detail::is_native_op<T, BinaryOperation>::value),
221-
T>
222-
reduce_over_group(Group g, T x, BinaryOperation binary_op) {
217+
template <typename Group, typename T, int N, class BinaryOperation>
218+
detail::enable_if_t<
219+
(is_group_v<std::decay_t<Group>> &&
220+
detail::is_vector_arithmetic<sycl::vec<T, N>>::value &&
221+
detail::is_native_op<sycl::vec<T, N>, BinaryOperation>::value),
222+
sycl::vec<T, N>>
223+
reduce_over_group(Group g, sycl::vec<T, N> x, BinaryOperation binary_op) {
223224
// FIXME: Do not special-case for half precision
224225
static_assert(
225226
std::is_same<decltype(binary_op(x[0], x[0])),
226-
typename T::element_type>::value ||
227-
(std::is_same<T, half>::value &&
227+
typename sycl::vec<T, N>::element_type>::value ||
228+
(std::is_same<sycl::vec<T, N>, half>::value &&
228229
std::is_same<decltype(binary_op(x[0], x[0])), float>::value),
229230
"Result type of binary_op must match reduction accumulation type.");
230-
T result;
231-
for (int s = 0; s < x.size(); ++s) {
232-
result[s] = reduce_over_group(g, x[s], binary_op);
233-
}
231+
sycl::vec<T, N> result;
232+
233+
detail::dim_loop<N>(
234+
[&](size_t s) { result[s] = reduce_over_group(g, x[s], binary_op); });
234235
return result;
235236
}
236237

0 commit comments

Comments
 (0)