Commit 1e9554d

Author: Talia Chopra
documentation: fix smp code example, add note for CUDA 11 to sdp
1 parent db22f2e commit 1e9554d

3 files changed: +11 -2 lines changed

doc/api/training/smd_data_parallel.rst

Lines changed: 9 additions & 0 deletions
@@ -20,6 +20,15 @@ with multiple GPUs. As the cluster size increases, so does the significant drop
 in performance. This drop in performance is primarily caused the communications
 overhead between nodes in a cluster.
 
+.. important::
+SDP only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
+``Estimator`` with ``dataparallel`` parameter ``enabled`` set to ``True``,
+it uses CUDA 11. When you extend or customize your own training image
+you must use a CUDA 11 base image. See
+`SageMaker Python SDK's SDP APIs
+<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__
+for more information.
+
 .. rubric:: Customize your training script
 
 To customize your own training script, you will need the following:
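
For readers of this hunk, the ``dataparallel`` parameter named in the added note is an entry nested under ``smdistributed`` in the estimator's ``distribution`` argument. The following is a minimal sketch of that shape using plain dictionaries only. The helper ``sdp_enabled`` and the variable names are hypothetical and are not part of the SageMaker Python SDK.

```python
# Sketch only: plain dictionaries, no SageMaker import required.
# The nesting follows the note in this diff; names here are illustrative.
sdp_options = {"enabled": True}

# Shape of the value passed as the estimator's `distribution` argument.
distribution = {"smdistributed": {"dataparallel": sdp_options}}


def sdp_enabled(distribution):
    """Return True if the dict enables the data parallel library (hypothetical helper)."""
    smd = distribution.get("smdistributed", {})
    return bool(smd.get("dataparallel", {}).get("enabled", False))


print(sdp_enabled(distribution))
```

Per the note above, a training job configured this way runs on CUDA 11, so a customized training image needs a CUDA 11 base image.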

doc/api/training/smd_model_parallel.rst

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@ Use the following sections to learn more about the model parallelism and the SMP
 
 .. important::
 SMP only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
-``Estimator`` with ``smdistributed`` ``enabled``,
+``Estimator`` with ``modelparallel`` parameter ``enabled`` set to ``True``,
 it uses CUDA 11. When you extend or customize your own training image
 you must use a CUDA 11 base image. See
 `Extend or Adapt A Docker Container that Contains SMP

doc/api/training/smd_model_parallel_general.rst

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ The following is an example of how you can launch a new PyTorch training job wit
 py_version='py3',
 instance_count=1,
 distribution={
-"smdistributed": smp_options,
+"smdistributed": {"modelparallel": smp_options},
 "mpi": mpi_options
 },
 base_job_name="SMD-MP-demo",
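
The fix in this hunk is purely about nesting: the options dictionary belongs under a ``modelparallel`` key inside ``smdistributed``, not directly under ``smdistributed``. The sketch below contrasts the two shapes with plain dictionaries. The helper ``smd_nesting_errors`` is hypothetical (not an SDK function), and the option values are illustrative placeholders rather than recommended settings.

```python
# Sketch only: compares the pre-fix and post-fix `distribution` shapes.
# Option values are illustrative; the helper below is not part of any SDK.
smp_options = {"enabled": True, "parameters": {"partitions": 2, "microbatches": 4}}
mpi_options = {"enabled": True, "processes_per_host": 2}

KNOWN_LIBRARIES = ("dataparallel", "modelparallel")


def smd_nesting_errors(distribution):
    """List keys placed directly under 'smdistributed' that are not library names."""
    smd = distribution.get("smdistributed")
    if smd is None:
        return []
    if not isinstance(smd, dict):
        return ["'smdistributed' must map to a dict"]
    return [
        f"unexpected key {key!r} directly under 'smdistributed'"
        for key in smd
        if key not in KNOWN_LIBRARIES
    ]


# Shape before this commit: options sit directly under "smdistributed".
old_distribution = {"smdistributed": smp_options, "mpi": mpi_options}
# Shape after this commit: options are nested under "modelparallel".
new_distribution = {"smdistributed": {"modelparallel": smp_options}, "mpi": mpi_options}

print(smd_nesting_errors(old_distribution))
print(smd_nesting_errors(new_distribution))
```

With the old shape the helper flags both ``enabled`` and ``parameters`` as misplaced. With the corrected shape it returns an empty list.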
