@@ -2,8 +2,8 @@ Deploying a Torch-TensorRT model (to Triton)
============================================
Optimization and deployment go hand in hand in a discussion about Machine
- Learning infrastructure. For a Torch-TensorRT user, network level optimzation
- to get the maximum performance would already be an area of expertize.
+ Learning infrastructure. Once network level optimizations are done
+ to get the maximum performance, the next step would be to deploy it.
However, serving this optimized model comes with its own set of considerations
and challenges like: building an infrastructure to support concurrent model
@@ -18,7 +18,7 @@ Step 1: Optimize your model with Torch-TensorRT
-----------------------------------------------
Most Torch-TensorRT users will be familiar with this step. For the purpose of
- this demoonstration, we will be using a ResNet50 model from Torchhub.
+ this demonstration, we will be using a ResNet50 model from Torchhub.
Let’s first pull the NGC PyTorch Docker container. You may need to create
an account and get the API key from `here <https://ngc.nvidia.com/setup/>`__.
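
With the API key in hand, Docker needs to be logged in to the NGC registry before the container can be pulled. A minimal sketch of that step, assuming the standard NGC flow where the username is the literal string ``$oauthtoken`` and the password is the API key:

::

   # Log in to NVIDIA's container registry; when prompted, use
   # "$oauthtoken" as the username and the NGC API key as the password
   docker login nvcr.io
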
@@ -30,7 +30,7 @@ Sign up and login with your key (follow the instructions
# <xx.xx> is the yy.mm for the publishing tag for NVIDIA's Pytorch
# container; e.g. 22.04
- docker run -it --gpus all -v /path/to/folder:/resnet50_eg nvcr.io/nvidia/pytorch:<xx.xx>-py3
+ docker run -it --gpus all -v /path/to/local/folder/to/copy/model:/resnet50_eg nvcr.io/nvidia/pytorch:<xx.xx>-py3
Once inside the container, we can proceed to download a ResNet model from
Torchhub and optimize it with Torch-TensorRT.
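
A rough sketch of that step is shown below; the Torch Hub entry point, input shape, precision and output file name are illustrative assumptions rather than the exact tutorial script.

::

   import torch
   import torch_tensorrt

   # Load a pretrained ResNet50 from Torch Hub and move it to the GPU
   model = torch.hub.load("pytorch/vision", "resnet50", pretrained=True).eval().cuda()

   # Compile with Torch-TensorRT; the input shape and precision are example assumptions
   trt_model = torch_tensorrt.compile(
       model,
       inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
       enabled_precisions={torch.float32},
   )

   # Save the compiled module as TorchScript so Triton's PyTorch backend can load it
   torch.jit.save(trt_model, "model.pt")
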
@@ -180,25 +180,25 @@ with the Triton Inference Server.
::
# Setting up client
- triton_client = httpclient.InferenceServerClient(url="localhost:8000")
+ client = httpclient.InferenceServerClient(url="localhost:8000")
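
The client snippets in this section assume that the Triton HTTP client module has been imported and that a preprocessed input image is available from earlier in the script, roughly along these lines (the preprocessing is only stubbed out here as an assumption):

::

   import numpy as np
   import tritonclient.http as httpclient

   # Placeholder for the preprocessed input: a float32 NumPy array shaped
   # to the model's expected input, e.g. (1, 3, 224, 224) for ResNet50.
   # In the full client script this comes from an actual image preprocessing step.
   transformed_img = np.random.rand(1, 3, 224, 224).astype(np.float32)
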
Secondly, we specify the names of the input and output layer(s) of our model.
::
- test_input = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
- test_input.set_data_from_numpy(transformed_img, binary_data=True)
+ inputs = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
+ inputs.set_data_from_numpy(transformed_img, binary_data=True)

- test_output = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000)
+ outputs = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000)
Lastly, we send an inference request to the Triton Inference Server.
::
# Querying the server
- results = triton_client.infer(model_name="resnet50", inputs=[test_input], outputs=[test_output])
- test_output_fin = results.as_numpy('output__0')
- print(test_output_fin[:5])
+ results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
+ inference_output = results.as_numpy('output__0')
+ print(inference_output[:5])
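
Since ``class_count`` was requested on the output, each returned entry is a classification string rather than a raw tensor value. A small, assumed post-processing step to make the top entries readable might look like:

::

   # Each entry is assumed to be a byte string such as b"<score>:<class_index>";
   # decode it before printing so the top results are human readable
   for entry in inference_output[:5]:
       print(entry.decode("utf-8") if isinstance(entry, bytes) else entry)
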
The output of the query above should look like the following: