-
Notifications
You must be signed in to change notification settings - Fork 790
[SYCL] Fix endless-loop in reduction with nd_range having 1 element l… #2022
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
d5620dd
82a8b8b
f9137f4
c5e1b62
3203415
63be65a
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,67 @@ | ||
//==---------------- reduction.cpp - SYCL reduction ------------*- C++ -*---==// | ||
// | ||
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions. | ||
// See https://llvm.org/LICENSE.txt for license information. | ||
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
// | ||
//===----------------------------------------------------------------------===// | ||
|
||
#include <CL/sycl/intel/reduction.hpp> | ||
#include <detail/queue_impl.hpp> | ||
|
||
__SYCL_INLINE_NAMESPACE(cl) { | ||
namespace sycl { | ||
namespace intel { | ||
namespace detail { | ||
|
||
// TODO: The algorithm of choosing the work-group size is definitely | ||
// imperfect now and can be improved. | ||
__SYCL_EXPORT size_t reduComputeWGSize(size_t NWorkItems, size_t MaxWGSize, | ||
size_t &NWorkGroups) { | ||
size_t WGSize = MaxWGSize; | ||
if (NWorkItems <= WGSize) { | ||
NWorkGroups = 1; | ||
WGSize = NWorkItems; | ||
} else { | ||
NWorkGroups = NWorkItems / WGSize; | ||
size_t Rem = NWorkItems % WGSize; | ||
if (Rem != 0) { | ||
// Let's say MaxWGSize = 128 and NWorkItems is (128+32). | ||
// It seems better to have 5 groups 32 work-items each than 2 groups with | ||
// 128 work-items in the 1st group and 32 work-items in the 2nd group. | ||
size_t NWorkGroupsAlt = NWorkItems / Rem; | ||
size_t RemAlt = NWorkItems % Rem; | ||
if (RemAlt == 0 && NWorkGroupsAlt <= MaxWGSize) { | ||
NWorkGroups = NWorkGroupsAlt; | ||
WGSize = Rem; | ||
} | ||
} else { | ||
NWorkGroups++; | ||
} | ||
} | ||
return WGSize; | ||
} | ||
|
||
__SYCL_EXPORT size_t | ||
reduGetMaxWGSize(shared_ptr_class<sycl::detail::queue_impl> Queue, | ||
size_t LocalMemBytesPerWorkItem) { | ||
device Dev = Queue->get_device(); | ||
size_t WGSize = Dev.get_info<info::device::max_work_group_size>(); | ||
Comment on lines
+48
to
+49
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you think it's worth adding a TODO here noting that the code assumes that the reduction kernel can be launched with the maximum work-group size? I think eventually we want to pay attention to the maximum work-group size for the kernel itself (which may be different). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added TODO comment. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Right, I don't think we can count on the query to account for local memory. We'd need two queries as you have here, but the max work-group size one would be kernel-specific instead of device-specific. It might be hard to do this right now, because I think the program interface only allows you to access this information today by re-compiling the kernel. Paying the JIT cost for the kernel every time it's launched just to check what sizes it supports doesn't seem like a good idea. |
||
if (LocalMemBytesPerWorkItem != 0) { | ||
size_t MemSize = Dev.get_info<info::device::local_mem_size>(); | ||
size_t WGSizePerMem = MemSize / LocalMemBytesPerWorkItem; | ||
|
||
// If the work group size is not pow of two, then an additional element | ||
// in local memory is needed for the reduction algorithm and thus the real | ||
// work-group size requirement per available memory is stricter. | ||
if ((WGSize & (WGSize - 1)) == 0) | ||
WGSizePerMem--; | ||
WGSize = (std::min)(WGSizePerMem, WGSize); | ||
} | ||
return WGSize; | ||
} | ||
|
||
} // namespace detail | ||
} // namespace intel | ||
} // namespace sycl | ||
} // __SYCL_INLINE_NAMESPACE(cl) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should set up some benchmarks for this sort of thing. I agree that load balance is important, but we might need to pay attention to the total number of kernels we're launching as well.