
Commit d992c5b

enhancement: Dynamically updating CUDA EP options (#256)
* dynamic CUDA and TRT options updating
* Fix up
* Add doc
* Format

Co-authored-by: Maximilian Müller <[email protected]>
1 parent e2061b7 commit d992c5b

2 files changed: +235 -58 lines changed

README.md

Lines changed: 91 additions & 24 deletions
@@ -1,5 +1,5 @@
<!--
# Copyright (c) 2020-2024, NVIDIA CORPORATION. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
@@ -83,7 +83,11 @@ $ make install
## ONNX Runtime with TensorRT optimization

TensorRT can be used in conjunction with an ONNX model to further optimize
performance. To enable TensorRT optimization you must set the model
configuration appropriately. There are several optimizations available for
TensorRT, such as selection of the compute precision and workspace size. The
optimization parameters and their descriptions are as follows.

* `precision_mode`: The precision used for optimization. Allowed values are "FP32", "FP16" and "INT8". Default value is "FP32".
@@ -93,9 +97,11 @@ TensorRT can be used in conjunction with an ONNX model to further optimize the p
* `trt_engine_cache_enable`: Enable engine caching.
* `trt_engine_cache_path`: Specify engine cache path.

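For illustration, a minimal sketch of a model configuration that passes these parameters through the `tensorrt` execution accelerator, using the same `parameters { key: ... value: ... }` map syntax as the CUDA example later in this document. The precision and cache-path values below are placeholders, not recommendations:

```
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "tensorrt"
    parameters { key: "precision_mode" value: "FP16" }
    parameters { key: "trt_engine_cache_enable" value: "1" }
    # the cache path below is a placeholder; point it at a writable directory
    parameters { key: "trt_engine_cache_path" value: "/tmp/trt_cache" }}
  ]
}}
```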
To explore the usage of more parameters, follow the mapping table below and
check the [ONNX Runtime doc](https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#execution-provider-options) for details.

> Please link to the latest ONNX Runtime binaries in CMake or build from the
> [main branch of ONNX Runtime](https://github.com/microsoft/onnxruntime/tree/main) to enable the latest options.

### Parameter mapping between ONNX Runtime and Triton ONNXRuntime Backend

@@ -155,17 +161,50 @@ optimization { execution_accelerators {
## ONNX Runtime with CUDA Execution Provider optimization

When GPU is enabled for ORT, the CUDA execution provider is enabled. If TensorRT
is also enabled, the CUDA EP is treated as a fallback option (it only comes into
the picture for nodes which TensorRT cannot execute). If TensorRT is not enabled,
the CUDA EP is the primary EP which executes the models. ORT allows configuring
CUDA EP options to further optimize based on the specific model and user
scenarios. There are several optimizations available; please refer to the
[ONNX Runtime doc](https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#cuda-execution-provider)
for more details. To enable CUDA EP optimization you must set the model
configuration appropriately:

```
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "cuda"
    parameters { key: "cudnn_conv_use_max_workspace" value: "0" }
    parameters { key: "use_ep_level_unified_stream" value: "1" }}
  ]
}}
```
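Other option keys from the linked ONNX Runtime CUDA EP documentation can presumably be supplied through the same block, assuming the backend forwards them to the execution provider unchanged (this diff only demonstrates the two keys above, so that is an assumption). A hedged sketch with illustrative values:

```
optimization { execution_accelerators {
  gpu_execution_accelerator : [ {
    name : "cuda"
    # illustrative values only: ~4 GiB arena limit, copy on the compute stream
    parameters { key: "gpu_mem_limit" value: "4294967296" }
    parameters { key: "do_copy_in_default_stream" value: "1" }}
  ]
}}
```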
### Deprecated Parameters

Specifying these parameters individually as shown below is deprecated. For
backward compatibility they are still supported, but please use the method
shown above instead.

* `cudnn_conv_algo_search`: CUDA convolution algorithm search configuration.
  Available options are 0 - EXHAUSTIVE (expensive exhaustive benchmarking using
  cudnnFindConvolutionForwardAlgorithmEx); this is also the default option,
  1 - HEURISTIC (lightweight heuristic-based search using
  cudnnGetConvolutionForwardAlgorithm_v7), and 2 - DEFAULT (default algorithm
  using CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM).

* `gpu_mem_limit`: CUDA memory limit. To use all possible memory pass in the
  maximum size_t. Defaults to SIZE_MAX.

* `arena_extend_strategy`: Strategy used to grow the memory arena. Available
  options are: 0 = kNextPowerOfTwo, 1 = kSameAsRequested. Defaults to 0.

* `do_copy_in_default_stream`: Flag indicating whether copying needs to take
  place on the same stream as the compute stream in the CUDA EP. Available
  options are: 0 = Use separate streams for copying and compute, 1 = Use the
  same stream for copying and compute. Defaults to 1.

In the model config file, specifying these parameters will look like:

```
.
@@ -203,14 +242,26 @@ optimization { execution_accelerators {
## Other Optimization Options with ONNX Runtime

Details regarding when to use these options and what to expect from them can be
found [here](https://onnxruntime.ai/docs/performance/tune-performance.html).

### Model Config Options

* `intra_op_thread_count`: Sets the number of threads used to parallelize
  execution within nodes. A value of 0 means ORT will pick a default, which is
  the number of cores.
* `inter_op_thread_count`: Sets the number of threads used to parallelize
  execution of the graph (across nodes). If sequential execution is enabled this
  value is ignored. A value of 0 means ORT will pick a default, which is the
  number of cores.
* `execution_mode`: Controls whether operators in the graph are executed
  sequentially or in parallel. Usually when the model has many branches, setting
  this option to 1, i.e. "parallel", will give you better performance. Default
  is 0, which is "sequential execution".
* `level`: Refers to the graph optimization level. By default all optimizations
  are enabled. Allowed values are -1, 1 and 2. -1 refers to BASIC optimizations,
  1 refers to basic plus extended optimizations like fusions, and 2 refers to
  all optimizations being disabled. Please find the details
  [here](https://onnxruntime.ai/docs/performance/graph-optimizations.html).

```
optimization {
@@ -223,32 +274,48 @@ parameters { key: "execution_mode" value: { string_value: "0" } }
parameters { key: "inter_op_thread_count" value: { string_value: "0" } }
```

* `enable_mem_arena`: Use 1 to enable the arena and 0 to disable. See
  [this](https://onnxruntime.ai/docs/api/c/struct_ort_api.html#a0bbd62df2b3c119636fba89192240593)
  for more information.
* `enable_mem_pattern`: Use 1 to enable memory pattern and 0 to disable. See
  [this](https://onnxruntime.ai/docs/api/c/struct_ort_api.html#ad13b711736956bf0565fea0f8d7a5d75)
  for more information.
* `memory.enable_memory_arena_shrinkage`: See
  [this](https://github.com/microsoft/onnxruntime/blob/master/include/onnxruntime/core/session/onnxruntime_run_options_config_keys.h)
  for more information.

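These memory options are presumably set as top-level model-config `parameters`, the same way as the thread-count options in the example above; the diff itself does not show such an example, so the sketch below is an assumption. The shrinkage value format ("cpu:0") follows the convention in the linked ONNX Runtime run-options header:

```
parameters { key: "enable_mem_arena" value: { string_value: "0" } }
parameters { key: "enable_mem_pattern" value: { string_value: "0" } }
# assumed value format from the linked ORT header: devices whose arenas shrink after each run
parameters { key: "memory.enable_memory_arena_shrinkage" value: { string_value: "cpu:0" } }
```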
### Command line options

#### Thread Pools

When intra and inter op threads are set to 0 or a value higher than 1, by
default ORT creates a threadpool per session. This may not be ideal in every
scenario, so ORT also supports global threadpools. When global threadpools are
enabled, ORT creates one global threadpool which is shared by every session.
Use the backend config to enable the global threadpool. When the global
threadpool is enabled, the intra and inter op thread count config should also
be provided via the backend config; config values provided in the model config
will be ignored.

```
--backend-config=onnxruntime,enable-global-threadpool=<0,1>, --backend-config=onnxruntime,intra_op_thread_count=<int>, --backend-config=onnxruntime,inter_op_thread_count=<int>
```
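As a usage illustration, a launch command combining these backend-config flags; the model repository path and the thread counts are placeholders:

```
# /models, 8 and 2 are placeholder values
tritonserver --model-repository=/models \
    --backend-config=onnxruntime,enable-global-threadpool=1 \
    --backend-config=onnxruntime,intra_op_thread_count=8 \
    --backend-config=onnxruntime,inter_op_thread_count=2
```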

#### Default Max Batch Size

The default-max-batch-size value is used for max_batch_size during
[Autocomplete](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#auto-generated-model-configuration)
when no other value is found. Assuming the server was not launched with the
`--disable-auto-complete-config` command-line option, the onnxruntime backend
will set the max_batch_size of the model to this default value under the
following conditions:

1. Autocomplete has determined the model is capable of batching requests.
2. max_batch_size is 0 in the model configuration or max_batch_size
   is omitted from the model configuration.

If max_batch_size > 1 and no
[scheduler](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#scheduling-and-batching)
is provided, the dynamic batch scheduler will be used.

```
--backend-config=onnxruntime,default-max-batch-size=<int>
```
