
Commit c2cb969

added triton deployment
1 parent cb1c0b2 commit c2cb969

2 files changed: +214 -0 lines changed


docsrc/index.rst

Lines changed: 2 additions & 0 deletions
@@ -28,6 +28,7 @@ Getting Started
  * :ref:`use_from_pytorch`
  * :ref:`runtime`
  * :ref:`using_dla`
+ * :ref:`deploy_torch_tensorrt_to_triton`

.. toctree::
   :caption: Getting Started
@@ -43,6 +44,7 @@ Getting Started
   tutorials/use_from_pytorch
   tutorials/runtime
   tutorials/using_dla
+  tutorials/deploy_torch_tensorrt_to_triton

.. toctree::
   :caption: Notebooks
Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
Deploying a Torch-TensorRT model (to Triton)
============================================

Optimization and deployment go hand in hand in any discussion about Machine
Learning infrastructure. For a Torch-TensorRT user, network-level optimization
to get the maximum performance is already familiar territory.

However, serving this optimized model comes with its own set of considerations
and challenges, like building an infrastructure to support concurrent model
executions, supporting clients over HTTP or gRPC, and more.

The `Triton Inference Server <https://github.com/triton-inference-server/server>`__
addresses these challenges and more. Let's discuss, step by step, the process of
optimizing a model with Torch-TensorRT, deploying it on Triton Inference
Server, and building a client to query the model.

Step 1: Optimize your model with Torch-TensorRT
-----------------------------------------------

Most Torch-TensorRT users will be familiar with this step. For the purpose of
this demonstration, we will be using a ResNet50 model from Torchhub.

Let's first pull the NGC PyTorch Docker container. You may need to create
an account and get the API key from `here <https://ngc.nvidia.com/setup/>`__.
Sign up and log in with your key (follow the instructions
`here <https://ngc.nvidia.com/setup/api-key>`__ after signing up).

::

   # <xx.xx> is the yy.mm of the publishing tag for NVIDIA's PyTorch
   # container; e.g. 22.04

   docker run -it --gpus all -v /path/to/folder:/resnet50_eg nvcr.io/nvidia/pytorch:<xx.xx>-py3

Once inside the container, we can proceed to download a ResNet model from
Torchhub and optimize it with Torch-TensorRT.

::

   import torch
   import torch_tensorrt
   torch.hub._validate_not_a_forked_repo = lambda a, b, c: True

   # load model
   model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True).eval().to("cuda")

   # Compile with Torch-TensorRT
   trt_model = torch_tensorrt.compile(model,
       inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
       enabled_precisions={torch.half}  # Run with FP16
   )

   # Save the model
   torch.jit.save(trt_model, "model.pt")

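Before moving on, it is worth a quick smoke test of the compiled module. The
snippet below is a minimal sketch (it assumes ``model.pt`` was saved in the
current directory as above): it reloads the TorchScript module and runs a
dummy FP32 batch to confirm the engine executes.

::

   import torch
   import torch_tensorrt  # needed so the TensorRT engine ops are registered before loading

   # Optional sanity check (a sketch; assumes model.pt from the step above)
   loaded = torch.jit.load("model.pt")
   with torch.no_grad():
       out = loaded(torch.randn(1, 3, 224, 224, device="cuda"))
   print(out.shape)  # expected: torch.Size([1, 1000])
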
The next step in the process is to set up a Triton Inference Server.

Step 2: Set Up Triton Inference Server
--------------------------------------

If you are new to the Triton Inference Server and want to learn more, we
highly recommend checking out our `GitHub
repository <https://github.com/triton-inference-server>`__.

To use Triton, we need to make a model repository. A model repository, as the
name suggests, is a repository of the models the Inference Server hosts. While
Triton can serve models from multiple repositories, in this example, we will
discuss the simplest possible form of the model repository.

The structure of this repository should look something like this:

::

   model_repository
   |
   +-- resnet50
       |
       +-- config.pbtxt
       +-- 1
           |
           +-- model.pt

There are two files that Triton requires to serve the model: the model itself
and a model configuration file, which is typically provided in ``config.pbtxt``.
For the model we prepared in step 1, the following configuration can be used:

::

   name: "resnet50"
   platform: "pytorch_libtorch"
   max_batch_size: 0
   input [
     {
       name: "input__0"
       data_type: TYPE_FP32
       dims: [ 3, 224, 224 ]
       reshape { shape: [ 1, 3, 224, 224 ] }
     }
   ]
   output [
     {
       name: "output__0"
       data_type: TYPE_FP32
       dims: [ 1, 1000, 1, 1 ]
       reshape { shape: [ 1, 1000 ] }
     }
   ]

The ``config.pbtxt`` file is used to describe the exact model configuration
with details like the names and shapes of the input and output layer(s),
datatypes, scheduling and batching details, and more. If you are new to Triton,
we highly encourage you to check out this `section of our
documentation <https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md>`__
for more details.

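As a convenience, the layout above can be created with a few lines of Python.
This is just a sketch: it assumes ``model.pt`` from Step 1 and the
``config.pbtxt`` shown above are in the current directory, and that the
repository is rooted at ``model_repository``.

::

   import os
   import shutil

   # Triton expects <repository>/<model_name>/config.pbtxt
   # and <repository>/<model_name>/<version>/model.pt
   os.makedirs("model_repository/resnet50/1", exist_ok=True)
   shutil.copy("model.pt", "model_repository/resnet50/1/model.pt")
   shutil.copy("config.pbtxt", "model_repository/resnet50/config.pbtxt")
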
With the model repository set up, we can proceed to launch the Triton server
with the Docker command below.

::

   # Make sure that the TensorRT version in the Triton container
   # and the TensorRT version in the environment used to optimize the model
   # are the same.

   docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models

This should spin up a Triton Inference Server. The next step is building a
simple HTTP client to query the server.

Step 3: Building a Triton Client to Query the Server
----------------------------------------------------

Before proceeding, make sure to have a sample image on hand. If you don't
have one, download an example image to test inference. In this section, we
will be going over a very basic client. For a variety of more fleshed-out
examples, refer to the `Triton Client Repository <https://github.com/triton-inference-server/client/tree/main/src/python/examples>`__.

::

   wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"

We then need to install the dependencies for building a Python client. These will
change from client to client. For a full list of all languages supported by Triton,
please refer to `Triton's client repository <https://github.com/triton-inference-server/client>`__.

::

   pip install torchvision
   pip install attrdict
   pip install nvidia-pyindex
   pip install tritonclient[all]

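Before building the full client, we can optionally confirm that the server
launched in Step 2 is live and that our model loaded successfully. The
following is a minimal sketch, assuming the server is reachable on the
default HTTP port 8000.

::

   import tritonclient.http as httpclient

   # Probe the server started in Step 2 (assumes the default HTTP port 8000)
   client = httpclient.InferenceServerClient(url="localhost:8000")
   print(client.is_server_live())            # True once the server is up
   print(client.is_model_ready("resnet50"))  # True once the model has loaded
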
Let's jump into the client. First, we write a small preprocessing function to
resize and normalize the query image.

::

   import numpy as np
   from torchvision import transforms
   from PIL import Image
   import tritonclient.http as httpclient
   from tritonclient.utils import triton_to_np_dtype

   # preprocessing function: resize, center-crop and normalize with ImageNet statistics
   def rn50_preprocess(img_path="img1.jpg"):
       img = Image.open(img_path)
       preprocess = transforms.Compose([
           transforms.Resize(256),
           transforms.CenterCrop(224),
           transforms.ToTensor(),
           transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
       ])
       return preprocess(img).numpy()

   transformed_img = rn50_preprocess()

Building a client requires three basic steps. First, we set up a connection
with the Triton Inference Server.

::

   # Setting up client
   triton_client = httpclient.InferenceServerClient(url="localhost:8000")

Second, we specify the names of the input and output layer(s) of our model.

::

   test_input = httpclient.InferInput("input__0", transformed_img.shape, datatype="FP32")
   test_input.set_data_from_numpy(transformed_img, binary_data=True)

   test_output = httpclient.InferRequestedOutput("output__0", binary_data=True, class_count=1000)

Lastly, we send an inference request to the Triton Inference Server.

::

   # Querying the server
   results = triton_client.infer(model_name="resnet50", inputs=[test_input], outputs=[test_output])
   test_output_fin = results.as_numpy('output__0')
   print(test_output_fin[:5])

The output should look like the following:

::

   [b'12.468750:90' b'11.523438:92' b'9.664062:14' b'8.429688:136'
    b'8.234375:11']

The output format here is ``<confidence_score>:<classification_index>``.
To learn how to map these to the label names and more, refer to our
`documentation <https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_classification.md>`__.

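For a quick look at the results without a label file, the returned byte
strings can be split into (confidence, class index) pairs. This is just a
sketch that continues from the client code above.

::

   # Parse Triton's classification strings, e.g. b'12.468750:90',
   # into (confidence, class index) pairs.
   for entry in test_output_fin[:5]:
       confidence, class_idx = entry.decode("utf-8").split(":")
       print(f"class {class_idx}: confidence {float(confidence):.4f}")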
