Add support for instance group of type 'MODEL' #107
Conversation
src/libtorch.cc
Outdated
auto parameters = (*torch_model)->parameters();
auto buffers = (*torch_model)->buffers();

for (const auto& parameter : parameters) {
Obtaining the device_id sets can be simplified to a templated utility function that can consume both the parameters and buffers data types. That should reduce the code duplication.
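For illustration, a minimal sketch of such a templated utility (the name GetDeviceIds and its exact signature are assumptions, not the PR's code):

#include <set>
#include <torch/torch.h>

// Collects the CUDA device indices referenced by any iterable of tensors,
// so the results of both parameters() and buffers() can be passed in.
template <typename TensorContainer>
std::set<int> GetDeviceIds(const TensorContainer& tensors)
{
  std::set<int> device_ids;
  for (const auto& tensor : tensors) {
    if (tensor.device().is_cuda()) {
      device_ids.insert(tensor.device().index());
    }
  }
  return device_ids;
}

// Usage sketch:
//   auto param_ids  = GetDeviceIds((*torch_model)->parameters());
//   auto buffer_ids = GetDeviceIds((*torch_model)->buffers());
//   param_ids.insert(buffer_ids.begin(), buffer_ids.end());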
src/libtorch.cc
Outdated
at::cuda::getStreamFromExternal(stream_, DeviceId());
if ((Kind() == TRITONSERVER_INSTANCEGROUPKIND_GPU) ||
    (Kind() == TRITONSERVER_INSTANCEGROUPKIND_MODEL && device_.is_cuda())) {
  at::cuda::CUDAStream torch_stream = at::cuda::getStreamFromExternal(
Is calling setCurrentCUDAStream with the CUDA stream corresponding to the first device really the right thing to do for KIND_MODEL? Is setting the current CUDA stream even required for KIND_MODEL?
Discussed offline that we should call setCurrentCUDAStream for the CUDA stream of every device to replace the default stream.
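A minimal sketch of that approach, assuming the backend already owns one externally created cudaStream_t per visible device (variable names here are illustrative):

#include <c10/cuda/CUDAStream.h>

// Wrap each raw stream and make it the current PyTorch stream for its
// device, replacing the default stream on that device.
for (int device_id = 0; device_id < static_cast<int>(streams.size());
     ++device_id) {
  at::cuda::CUDAStream torch_stream =
      at::cuda::getStreamFromExternal(streams[device_id], device_id);
  at::cuda::setCurrentCUDAStream(torch_stream);
}

Note that setCurrentCUDAStream applies to the device the passed stream belongs to, so no explicit cudaSetDevice call is needed in this loop.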
Could you please explain how this PR would solve the CUDA synchronization issue?
Just creating the streams doesn't mean that PyTorch will use them. It looks like you are only creating streams for the other devices, but they are not actually being used by PyTorch for the execution of the model.
Force-pushed from 12e8e56 to b3d6ec6
Updated the description to explain how to do the CUDA synchronization.
Documentation added in a separate PR: #110
Do we need to update this part too?
pytorch_backend/src/libtorch.cc
Lines 1774 to 1779 in ead0e23
if (device_.is_cpu()) {
  alloc_perference = {{TRITONSERVER_MEMORY_CPU_PINNED, 0},
                      {TRITONSERVER_MEMORY_CPU, 0}};
} else {
  alloc_perference = {{TRITONSERVER_MEMORY_GPU, device_.index()}};
}
I guess for KIND_MODEL the inputs will always be in CPU memory, since we don't have a way to query the input types required by the model beforehand?
That is correct.
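For illustration, one way the condition could be extended (an assumption based on this thread, not necessarily the merged diff) is to prefer CPU memory whenever the instance is KIND_MODEL, since the input placement cannot be determined up front:

// Hypothetical sketch -- not the PR's exact change.
if (device_.is_cpu() || (Kind() == TRITONSERVER_INSTANCEGROUPKIND_MODEL)) {
  // Input device unknown ahead of time: accept pinned or plain CPU memory.
  alloc_perference = {{TRITONSERVER_MEMORY_CPU_PINNED, 0},
                      {TRITONSERVER_MEMORY_CPU, 0}};
} else {
  alloc_perference = {{TRITONSERVER_MEMORY_GPU, device_.index()}};
}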
src/libtorch.cc
Outdated
std::mutex timestamp_mu;

uint64_t compute_input_start = 0;
Instead of the mutex/variable combination, you can use an atomic variable: std::atomic<uint64_t>.
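A minimal sketch of that replacement (NowNs is a hypothetical stand-in for the backend's timestamp helper, not the PR's code):

#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical helper returning a nanosecond steady-clock timestamp.
inline uint64_t NowNs()
{
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

std::atomic<uint64_t> compute_input_start{0};

// Writer side (e.g. the CUDA host-function callback thread):
//   compute_input_start.store(NowNs());
// Reader side (the request thread, after synchronization):
//   uint64_t start = compute_input_start.load();

The atomic makes the store/load safe across the callback thread and the request thread without any mutex.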
Updated to use std::atomic<uint64_t> and removed the mutex.
src/libtorch.cc
Outdated
#ifdef TRITON_ENABLE_GPU
  for (const auto& stream : stream_vec_) {
    cudaLaunchHostFunc(
        stream, CaptureFirstTimestampCallback,
        reinterpret_cast<void*>(&compute_input_cb_data));
  }
As discussed, I don't think we need to make any changes for the inputs. We only need the multi-stream logic for calculating the compute infer time.
Removed this part and used CUDA events for the compute input duration.
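For reference, the usual CUDA-event timing pattern (a generic sketch, not the PR's exact code); cudaEventElapsedTime reports milliseconds:

#include <cuda_runtime_api.h>

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, stream);  // enqueue before the input copies
// ... enqueue the work to be timed on `stream` ...
cudaEventRecord(stop, stream);   // enqueue after the input copies

cudaEventSynchronize(stop);      // wait until the recorded work has drained

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);

cudaEventDestroy(start);
cudaEventDestroy(stop);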
src/libtorch.cc
Outdated
uint64_t compute_infer_start = 0;

std::tuple<uint64_t*, std::mutex*> compute_infer_cb_data(
    &compute_infer_start, &timestamp_mu);
I don't think these changes are required.
We are using CUDA events for the compute input duration, but we would still need the compute_infer_start timestamp for the compute infer duration.
Edit: for compute_infer_start, there is no need to use a callback function; simply use SET_TIMESTAMP.
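SET_TIMESTAMP is the host-side timestamp macro used across Triton backends; approximately (the exact definition may differ):

#include <chrono>
#include <cstdint>

// Approximation of Triton's SET_TIMESTAMP macro.
#define SET_TIMESTAMP(TS_NS)                                         \
  {                                                                  \
    TS_NS = std::chrono::duration_cast<std::chrono::nanoseconds>(    \
                std::chrono::steady_clock::now().time_since_epoch()) \
                .count();                                            \
  }

// compute_infer_start can then be captured directly on the host thread
// right before the model execution is launched:
//   uint64_t compute_infer_start = 0;
//   SET_TIMESTAMP(compute_infer_start);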
@Tabrizian Addressed all the comments from the offline discussion. Please review, thank you!
@tanmayv25 would be great if you can take a look as well.
Fix up the initialization of
* Add support for instance group of type 'MODEL'
* Format
* Handle multi GPU cases when recording timestamps
* Address comment
* Use callback function to record timestamp for 'MODEL' kind
* Add missing #ifdef
* Update comment and if condition for input tensor memory alloc_perference
* Fix for cuda stream. Use separate cuda callback to capture timestamp
* Add comment to mention the possible timestamp issue
* For 'KIND_MODEL', use cuda events for compute_input_duration and use callback for compute_infer_duration
* Move the cudaLaunchHostFunc from RecordBackendTimestamp function
* Fix up naming
* Fix up
* Fix up atomic initialization
* Capture the timestamp after synchronization
For the MODEL instance group type, use cudaLaunchHostFunc to launch a callback function to record the timestamp.

Added testing: triton-inference-server/server#5810
Closes: triton-inference-server/server#5694
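A minimal sketch of the callback pattern the description refers to (the callback and variable names are illustrative, not the PR's exact identifiers):

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cuda_runtime_api.h>

// Host function executed by the CUDA runtime once all preceding work on the
// stream has completed; it stores a nanosecond steady-clock timestamp.
static void CUDART_CB CaptureTimestampCallback(void* data)
{
  auto* timestamp = reinterpret_cast<std::atomic<uint64_t>*>(data);
  timestamp->store(
      std::chrono::duration_cast<std::chrono::nanoseconds>(
          std::chrono::steady_clock::now().time_since_epoch())
          .count());
}

// Enqueue the callback on every instance stream so the recorded time
// reflects when the device work on each stream has drained:
//   std::atomic<uint64_t> compute_infer_end{0};
//   for (cudaStream_t stream : streams) {
//     cudaLaunchHostFunc(
//         stream, CaptureTimestampCallback,
//         reinterpret_cast<void*>(&compute_infer_end));
//   }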