Add support for instance group of type 'MODEL' #107

Merged: 15 commits merged into main from krish-pytorch on Jun 13, 2023

Conversation

krishung5 (Contributor) commented May 17, 2023:

For the MODEL instance group type:

  • Don't specify the device when loading the model, so that the backend does not select a device for the model (see the loading sketch below).
  • For the compute input duration, only one stream is used for input collection, so CUDA events can be recorded and used to calculate the elapsed time. For the compute infer duration, however, multiple streams are involved, which makes CUDA events hard to use as timestamps; instead, cudaLaunchHostFunc launches a host callback to record the timestamp.

Added testing: triton-inference-server/server#5810
Closes: triton-inference-server/server#5694
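To make the first bullet concrete, here is a minimal, hypothetical sketch (not the PR's exact code) of loading a TorchScript model with and without a device override. torch::jit::load takes an optional device argument; omitting it for KIND_MODEL leaves the placements to the model itself. LoadModule and is_kind_model are illustrative names, not the backend's actual helpers.

#include <torch/script.h>

torch::jit::script::Module LoadModule(
    const std::string& path, bool is_kind_model, const torch::Device& device)
{
  if (is_kind_model) {
    // KIND_MODEL: no device override, the model keeps/chooses its own placements.
    return torch::jit::load(path);
  }
  // KIND_GPU / KIND_CPU: pin the model to the instance's device.
  return torch::jit::load(path, device);
}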

@krishung5 krishung5 changed the title from "Add support for instance group of type 'MODEL" to "Add support for instance group of type 'MODEL'" May 25, 2023
@krishung5 krishung5 marked this pull request as ready for review May 25, 2023 21:26
src/libtorch.cc Outdated
auto parameters = (*torch_model)->parameters();
auto buffers = (*torch_model)->buffers();

for (const auto& parameter : parameters) {
Contributor commented:

Obtaining the device_id sets can be simplified into a templated utility function that can consume both the parameters and buffers data types. That should reduce the code duplication. (A minimal sketch follows below.)
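A minimal sketch of that suggestion, assuming a helper name like CollectCudaDeviceIds (hypothetical). It works for both parameters() and buffers() because both iterate over at::Tensor.

#include <set>
#include <torch/script.h>

template <typename TensorRange>
void CollectCudaDeviceIds(const TensorRange& tensors, std::set<int>* device_ids)
{
  for (const auto& tensor : tensors) {
    if (tensor.device().is_cuda()) {
      device_ids->insert(tensor.device().index());
    }
  }
}

// Illustrative usage: both calls share the same helper.
// std::set<int> ids;
// CollectCudaDeviceIds((*torch_model)->parameters(), &ids);
// CollectCudaDeviceIds((*torch_model)->buffers(), &ids);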

src/libtorch.cc Outdated
at::cuda::getStreamFromExternal(stream_, DeviceId());
if ((Kind() == TRITONSERVER_INSTANCEGROUPKIND_GPU) ||
(Kind() == TRITONSERVER_INSTANCEGROUPKIND_MODEL && device_.is_cuda())) {
at::cuda::CUDAStream torch_stream = at::cuda::getStreamFromExternal(
Contributor commented:

Is calling setCurrentCUDAStream with the CUDA stream corresponding to the first device really the right thing to do for KIND_MODEL? Is setting the current CUDA stream even required for KIND_MODEL?

krishung5 (author) replied:

Discussed offline: we should call setCurrentCUDAStream for the CUDA stream of every device, replacing the default stream on each (a sketch follows below).
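A sketch of that approach, under the assumption that stream_vec_ holds one cudaStream_t per visible device indexed by device id (the member name, indexing, and header choices are assumptions, not the PR's exact code).

#include <vector>
#include <c10/cuda/CUDAStream.h>

void SetCurrentStreamOnEveryDevice(const std::vector<cudaStream_t>& streams)
{
  for (size_t device_id = 0; device_id < streams.size(); ++device_id) {
    at::cuda::CUDAStream torch_stream = at::cuda::getStreamFromExternal(
        streams[device_id], static_cast<c10::DeviceIndex>(device_id));
    // setCurrentCUDAStream only affects the device the stream belongs to,
    // so it has to be called once per device to replace each default stream.
    at::cuda::setCurrentCUDAStream(torch_stream);
  }
}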

Tabrizian (Member) left a comment:

Could you please explain how this PR would solve the cuda synchronization issue?

Just creating the streams doesn't mean that PyTorch would use them. Looks like you are only creating streams for other devices but they are not actually being used by PyTorch for the execution of the model.

@krishung5 krishung5 force-pushed the krish-pytorch branch 2 times, most recently from 12e8e56 to b3d6ec6 on June 5, 2023 16:43
krishung5 (author):

Updated the description to explain how CUDA synchronization is done for KIND_MODEL. Let me know if anything is unclear.

krishung5 (author):

Documentation added in a separate PR: #110

Tabrizian (Member) left a comment:

Do we need to update this part too?

if (device_.is_cpu()) {
alloc_perference = {{TRITONSERVER_MEMORY_CPU_PINNED, 0},
{TRITONSERVER_MEMORY_CPU, 0}};
} else {
alloc_perference = {{TRITONSERVER_MEMORY_GPU, device_.index()}};
}

I guess for KIND_MODEL the inputs will always be on CPU, since we don't have a way to query the input types required by the model beforehand?

krishung5 (author), replying to the comment above:

Do we need to update this part too?

if (device_.is_cpu()) {
alloc_perference = {{TRITONSERVER_MEMORY_CPU_PINNED, 0},
{TRITONSERVER_MEMORY_CPU, 0}};
} else {
alloc_perference = {{TRITONSERVER_MEMORY_GPU, device_.index()}};
}

I guess for KIND_MODEL the inputs will always be on CPU, since we don't have a way to query the input types required by the model beforehand?

That is correct. Here device_.is_cpu() will be true for KIND_MODEL because device_ is initialized with torch::kCPU and is never updated to a GPU device for KIND_MODEL. However, relying on that is indirect, so I have updated the if condition and the comment to make it clearer (a hypothetical sketch of a more explicit condition follows below).
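A hypothetical sketch of what a more explicit condition could look like; the actual change made in the PR may differ.

// KIND_MODEL cannot know the model's device placements ahead of time, so
// inputs are always collected on CPU; spelling the kind out avoids relying on
// device_ happening to stay at torch::kCPU.
if ((Kind() == TRITONSERVER_INSTANCEGROUPKIND_MODEL) || device_.is_cpu()) {
  alloc_perference = {
      {TRITONSERVER_MEMORY_CPU_PINNED, 0}, {TRITONSERVER_MEMORY_CPU, 0}};
} else {
  alloc_perference = {{TRITONSERVER_MEMORY_GPU, device_.index()}};
}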

krishung5 (author) commented Jun 7, 2023:

  • For KIND_MODEL, used the first stream from stream_vec_ as the stream for the input/output collector/responder.
  • Added a CUDA stream synchronization before reading the output tensors.
  • Removed the CUDA synchronization across all streams from places where only one stream is used (input/output collector/responder).
  • Used two variants of CUDA callback functions to capture the first and the last timestamp (see the sketch after this list).
  • Since we are adding a CUDA stream synchronization before reading the output tensors, we don't need first_compute_infer_start; we only need last_compute_infer_start (compute_output_start). Attaching the graph from yesterday's discussion:
    [Attached image: Capture]
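A sketch of the two callback variants. CaptureFirstTimestampCallback matches a name used in this PR's diff; the "last" variant, the atomics, and the surrounding wiring are assumptions about the shape of the change, not the PR's exact code.

#include <atomic>
#include <chrono>
#include <cstdint>
#include <cuda_runtime_api.h>

static uint64_t NowNs()
{
  return std::chrono::duration_cast<std::chrono::nanoseconds>(
             std::chrono::steady_clock::now().time_since_epoch())
      .count();
}

// Keeps the earliest timestamp seen across all streams (0 means "unset").
void CUDART_CB CaptureFirstTimestampCallback(void* data)
{
  auto* ts = reinterpret_cast<std::atomic<uint64_t>*>(data);
  uint64_t now = NowNs(), prev = ts->load();
  while ((prev == 0 || now < prev) && !ts->compare_exchange_weak(prev, now)) {
  }
}

// Keeps the latest timestamp seen across all streams.
void CUDART_CB CaptureLastTimestampCallback(void* data)
{
  auto* ts = reinterpret_cast<std::atomic<uint64_t>*>(data);
  uint64_t now = NowNs(), prev = ts->load();
  while (now > prev && !ts->compare_exchange_weak(prev, now)) {
  }
}

// Illustrative usage: enqueue the "last" callback on every stream the model
// ran on; after synchronizing the streams, the atomic holds the end of the
// compute-infer span.
// for (cudaStream_t stream : stream_vec_) {
//   cudaLaunchHostFunc(stream, CaptureLastTimestampCallback, &compute_infer_end);
// }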

@krishung5 krishung5 requested review from tanmayv25 and Tabrizian June 7, 2023 16:48
src/libtorch.cc Outdated
Comment on lines 1241 to 1243
std::mutex timestamp_mu;

uint64_t compute_input_start = 0;
Member commented:

Instead of the mutex/variable combination, you can use an atomic variable: std::atomic<uint64_t>.

krishung5 (author) replied:

Updated to use std::atomic<uint64_t> and removed the mutex.

src/libtorch.cc Outdated
Comment on lines 1256 to 1261
#ifdef TRITON_ENABLE_GPU
for (const auto& stream : stream_vec_) {
cudaLaunchHostFunc(
stream, CaptureFirstTimestampCallback,
reinterpret_cast<void*>(&compute_input_cb_data));
}
Member commented:

As discussed, I don't think we need to make any changes for the inputs. We only need the multi-stream logic for calculating the compute infer time.

krishung5 (author) replied:

Removed this part and used CUDA events for the compute input duration (sketched below).
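A sketch of the CUDA-event timing for the compute input duration, assuming a single input-collection stream; the function and variable names are illustrative.

#include <cuda_runtime_api.h>

// Returns the elapsed milliseconds between two events recorded around the
// input-collection work on the single input stream.
float ComputeInputDurationMs(cudaStream_t input_stream)
{
  cudaEvent_t start, end;
  cudaEventCreate(&start);
  cudaEventCreate(&end);

  cudaEventRecord(start, input_stream);
  // ... enqueue input-collection work on input_stream here ...
  cudaEventRecord(end, input_stream);

  cudaEventSynchronize(end);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, end);

  cudaEventDestroy(start);
  cudaEventDestroy(end);
  return ms;
}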

src/libtorch.cc Outdated
Comment on lines 1287 to 1290
uint64_t compute_infer_start = 0;

std::tuple<uint64_t*, std::mutex*> compute_infer_cb_data(
&compute_infer_start, &timestamp_mu);
Member commented:

I don't think these changes are required.

krishung5 (author) replied on Jun 8, 2023:

We are using CUDA events for the compute input duration, but we would still need the compute_infer_start timestamp for the compute infer duration.

Edit: for compute_infer_start, there is no need for a callback function; simply use SET_TIMESTAMP (illustrated below).
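For reference, a minimal sketch of taking that timestamp on the host with Triton's SET_TIMESTAMP macro (from the backend common utilities) instead of a CUDA callback; the variable name is illustrative.

#include "triton/backend/backend_common.h"  // provides SET_TIMESTAMP

uint64_t compute_infer_start = 0;
// Captures a steady-clock timestamp in nanoseconds on the CPU, right before
// the model execution is launched.
SET_TIMESTAMP(compute_infer_start);
// ... run inference ...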

krishung5 (author):

@Tabrizian Addressed all the comments from the offline discussion. Please review, thank you!

Tabrizian (Member) left a comment:

@tanmayv25 it would be great if you could take a look as well.

@krishung5 krishung5 requested a review from Tabrizian June 9, 2023 15:56
Tabrizian previously approved these changes Jun 9, 2023
krishung5 (author):

Fixed up the initialization of std::atomic, which was failing on the Jetson build (a guess at the fix is sketched below).
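A guess at the likely failure mode and fix; this is an assumption, since the actual Jetson error is not shown in the thread. Older toolchains reject copy-initialization of std::atomic because its copy constructor is deleted, while direct (brace) initialization compiles everywhere.

#include <atomic>
#include <cstdint>

// std::atomic<uint64_t> compute_infer_end = 0;  // copy-init: may fail to build
std::atomic<uint64_t> compute_infer_end{0};      // direct-init: portable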

@krishung5 krishung5 requested a review from Tabrizian June 12, 2023 16:36
Tabrizian previously approved these changes Jun 12, 2023
@krishung5 krishung5 merged commit 83d2ada into main Jun 13, 2023
@krishung5 krishung5 deleted the krish-pytorch branch June 13, 2023 05:43
krishung5 added a commit that referenced this pull request Jun 13, 2023
* Add support for instance group of type 'MODEL'

* Format

* Handle multi GPU cases when recording timestamps

* Address comment

* Use callback function to record timestamp for 'MODEL' kind

* Add missing #ifdef

* Update comment and if condition for input tensor memory alloc_perference

* Fix for cuda stream. Use separate cuda callback to capture timestamp

* Add comment to mention the possible timestamp issue

* For 'KIND_MODEL', use cuda events for compute_input_duration and use callback for compute_infer_duration

* Move the cudaLaunchHostFunc from RecordBackendTimestamp function

* Fix up naming

* Fix up

* Fix up atomic initialization

* Capture the timestamp after synchronization
@krishung5 krishung5 restored the krish-pytorch branch June 13, 2023 05:54
mc-nv pushed a commit that referenced this pull request Jun 13, 2023 (same commit messages as listed above).

Successfully merging this pull request may close these issues: Support instance group of type 'MODEL' in pytorch backend.
4 participants