Commit 30c6dca

Merge branch 'master' into master
2 parents 735711a + 2cc9bc3 commit 30c6dca

4 files changed: +21 -4 lines changed
doc/api/training/smd_data_parallel.rst

Lines changed: 8 additions & 0 deletions
@@ -20,6 +20,14 @@ with multiple GPUs. As the cluster size increases, so does the significant drop
 in performance. This drop in performance is primarily caused the communications
 overhead between nodes in a cluster.

+.. important::
+   SDP only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
+   ``Estimator`` with ``dataparallel`` parameter ``enabled`` set to ``True``,
+   it uses CUDA 11. When you extend or customize your own training image
+   you must use a CUDA 11 base image. See
+   `SageMaker Python SDK's SDP APIs
+   <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__
+   for more information.

 .. rubric:: Customize your training script
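
For readers of this hunk, the setting the added note describes (``dataparallel`` with ``enabled`` set to ``True``) might be expressed as the following ``distribution`` value. This is a minimal sketch using a plain dictionary; the variable name ``sdp_distribution`` is illustrative and does not appear in the commit, and handing the dictionary to an ``Estimator`` is assumed rather than shown.

```python
# Illustrative only: the nesting of the ``distribution`` argument that turns on
# the data parallel library. ``sdp_distribution`` is a hypothetical name.
sdp_distribution = {
    "smdistributed": {
        "dataparallel": {"enabled": True},
    }
}

# The flag the note refers to sits two levels below the top-level key.
print(sdp_distribution["smdistributed"]["dataparallel"]["enabled"])
```

No SDK import is needed to inspect the shape, which keeps the sketch independent of any particular SDK version.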

doc/api/training/smd_model_parallel.rst

Lines changed: 9 additions & 0 deletions
@@ -11,6 +11,15 @@ across multiple GPUs with minimal code changes. The SMP API can be accessed thro

 Use the following sections to learn more about the model parallelism and the SMP library.

+.. important::
+   SMP only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
+   ``Estimator`` with ``modelparallel`` parameter ``enabled`` set to ``True``,
+   it uses CUDA 11. When you extend or customize your own training image
+   you must use a CUDA 11 base image. See
+   `Extend or Adapt A Docker Container that Contains SMP
+   <https://integ-docs-aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
+   for more information.
+
 It is recommended to use this documentation alongside `SageMaker Distributed Model Parallel
 <http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`__ in the Amazon SageMaker
 developer guide. This developer guide documentation includes:

doc/api/training/smd_model_parallel_general.rst

Lines changed: 1 addition & 1 deletion
@@ -47,7 +47,7 @@ The following is an example of how you can launch a new PyTorch training job wit
 py_version='py3',
 instance_count=1,
 distribution={
-    "smdistributed": smp_options,
+    "smdistributed": {"modelparallel": smp_options},
     "mpi": mpi_options
 },
 base_job_name="SMD-MP-demo",
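
The one-line fix in this hunk nests ``smp_options`` under a ``modelparallel`` key instead of assigning it directly to ``smdistributed``. A minimal sketch of the resulting structure follows. The contents of ``smp_options`` and ``mpi_options`` are hypothetical placeholder values, since the hunk does not show how they are defined; only the nesting mirrors the corrected line.

```python
# Hypothetical option values: the commit does not show these definitions.
smp_options = {"enabled": True, "parameters": {"partitions": 2}}
mpi_options = {"enabled": True, "processes_per_host": 8}

# Shape after the fix: the library name is its own key under "smdistributed".
distribution = {
    "smdistributed": {"modelparallel": smp_options},
    "mpi": mpi_options,
}

# Before the fix, "smdistributed" pointed straight at smp_options, so there
# was no "modelparallel" key to look up.
print("modelparallel" in distribution["smdistributed"])
```

Keeping the library name as an explicit key matches the ``dataparallel`` and ``modelparallel`` wording used in the notes added elsewhere in this commit.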

doc/frameworks/pytorch/using_pytorch.rst

Lines changed: 3 additions & 3 deletions
@@ -1,6 +1,6 @@
-###########################################
-Using PyTorch with the SageMaker Python SDK
-###########################################
+#########################################
+Use PyTorch with the SageMaker Python SDK
+#########################################

 With PyTorch Estimators and Models, you can train and host PyTorch models on Amazon SageMaker.
