
Commit 8b9cca7

aaronmarkham authored and ChoiByungWook committed
documentation: add SageMaker distributed libraries documentation (#549)
1 parent 7a2c466 commit 8b9cca7

13 files changed (+2586 −3 lines)

doc/_static/theme_overrides.css

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+/* override table width restrictions */
+.wy-table-responsive table td, .wy-table-responsive table th {
+    white-space: normal;
+}
+
+.wy-table-responsive {
+    margin-bottom: 24px;
+    max-width: 100%;
+    overflow: visible;
+}

doc/api/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,5 +9,6 @@ The SageMaker Python SDK consists of a variety classes for preparing data, train
 
     prep_data/feature_store
     training/index
+    training/distributed
     inference/index
     utility/index

doc/api/training/distributed.rst

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+Distributed Training APIs
+-------------------------
+SageMaker distributed training libraries offer both data parallel and model parallel training strategies.
+They combine software and hardware technologies to improve inter-GPU and inter-node communications.
+They extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.
+
+.. toctree::
+    :maxdepth: 3
+
+    smd_data_parallel
+    smd_model_parallel
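
To make the "small code changes" mentioned above concrete, the following is a minimal sketch of how a PyTorch training script might be adapted for the data parallel library. It assumes the smdistributed.dataparallel PyTorch API documented in the pages this commit adds; the linear model, synthetic dataset, and output path are placeholders for illustration only.

    # Sketch only: assumes the smdistributed.dataparallel PyTorch API; the
    # model, dataset, and output path below are placeholders.
    import torch
    import smdistributed.dataparallel.torch.distributed as dist
    from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

    dist.init_process_group()                    # initialize the SDP process group
    local_rank = dist.get_local_rank()           # one GPU per process on each host
    torch.cuda.set_device(local_rank)

    # Wrap the model so gradients are averaged across GPUs via SDP's AllReduce.
    model = DDP(torch.nn.Linear(10, 1).to(local_rank))

    # Shard the data so each process trains on a distinct slice.
    dataset = torch.utils.data.TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank())
    loader = torch.utils.data.DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.MSELoss()
    for x, y in loader:
        x, y = x.to(local_rank), y.to(local_rank)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

    if dist.get_rank() == 0:                     # save from one process only to avoid duplicate writes
        torch.save(model.state_dict(), "/opt/ml/model/model.pt")

The framework-specific pages referenced in the toctree above remain the authoritative guide for adapting real training scripts.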

doc/api/training/index.rst

Lines changed: 9 additions & 3 deletions
@@ -3,7 +3,13 @@ Training APIs
 #############
 
 .. toctree::
-    :maxdepth: 1
-    :glob:
+    :maxdepth: 4
 
-    *
+    analytics
+    automl
+    debugger
+    estimators
+    algorithm
+    tuner
+    parameter
+    processing

doc/api/training/smd_data_parallel.rst

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
+###################################
+Distributed data parallel
+###################################
+
+SageMaker distributed data parallel (SDP) extends SageMaker’s training
+capabilities on deep learning models with near-linear scaling efficiency,
+achieving fast time-to-train with minimal code changes.
+
+- SDP optimizes your training job for AWS network infrastructure and EC2 instance topology.
+- SDP takes advantage of gradient updates to communicate between nodes with a custom AllReduce algorithm.
+
+When training a model on a large amount of data, machine learning practitioners
+often turn to distributed training to reduce the time to train.
+In some cases, where time is of the essence,
+the business requirement is to finish training as quickly as possible or at
+least within a constrained time period.
+Distributed training is then scaled to use a cluster of multiple nodes,
+meaning not just multiple GPUs in a computing instance, but multiple instances
+with multiple GPUs. As the cluster size increases, performance can drop
+significantly. This drop is caused primarily by the communication
+overhead between nodes in a cluster.
+
+
+.. rubric:: Customize your training script
+
+To customize your own training script, you will need the following:
+
+.. raw:: html
+
+    <div data-section-style="5" style="">
+
+- You must provide TensorFlow / PyTorch training scripts that are
+  adapted to use SDP.
+- Your input data must be in an S3 bucket or in FSx in the AWS region
+  that you will use to launch your training job. If you use the Jupyter
+  notebooks provided, create a SageMaker notebook instance in the same
+  region as the bucket that contains your input data. For more
+  information about storing your training data, refer to
+  the `SageMaker Python SDK data
+  inputs <https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs>`__ documentation.
+
+.. raw:: html
+
+    </div>
+
+Use the API guides for each framework to see examples of training scripts
+adapted to use SDP, and use them as a reference to convert your own scripts.
+Then, use one of the example notebooks as your template to launch a training job.
+You’ll need to swap in your own training script for the one that came with the
+notebook and modify any input functions as necessary.
+Once you have launched a training job, you can monitor it using CloudWatch.
+
+Then you can see how to deploy your trained model to an endpoint by
+following one of the example notebooks for deploying a model. Finally,
+you can follow an example notebook to test inference on your deployed
+model.
+
+
+
+.. toctree::
+    :maxdepth: 2
+
+    smd_data_parallel_pytorch
+    smd_data_parallel_tensorflow
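
As a companion to the "launch a training job" step described in this page, here is a minimal sketch of starting an SDP-enabled job with the SageMaker Python SDK's PyTorch estimator. The entry point, IAM role, S3 path, and instance settings are placeholder assumptions; the distribution argument enables the data parallel library for the job.

    # Sketch only: entry point, role, bucket, and instance settings are placeholders.
    from sagemaker.pytorch import PyTorch

    estimator = PyTorch(
        entry_point="train.py",                  # an SDP-adapted training script
        role="arn:aws:iam::111122223333:role/SageMakerRole",  # placeholder IAM role
        framework_version="1.6.0",
        py_version="py36",
        instance_count=2,                        # multiple instances, each with multiple GPUs
        instance_type="ml.p3.16xlarge",
        # Enable the SageMaker distributed data parallel library for this job.
        distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
    )

    # Input data must already be in S3 (or FSx) in the same region as the job.
    estimator.fit({"training": "s3://my-bucket/path/to/training-data"})

Once the job is running, its logs and metrics are available in CloudWatch, and the trained model can then be deployed to an endpoint (for example, with the estimator's deploy() method) and tested, matching the workflow outlined above.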
