Commit 93f7f1f

Merge branch 'master' into master
2 parents 414de39 + 36b5f95 · commit 93f7f1f

28 files changed: +1406 -135 lines

CHANGELOG.md

Lines changed: 28 additions & 0 deletions
@@ -1,5 +1,33 @@
 # Changelog

+## v2.23.0 (2020-12-23)
+
+### Features
+
+* Add support for actions in debugger rules.
+
+### Bug Fixes and Other Changes
+
+* include sparkml 2.4 in image uri config properly
+* Mount metadata dir only if it exists
+* allow urllib3 1.26
+
+## v2.22.0 (2020-12-22)
+
+### Features
+
+* Support local mode for Amazon SageMaker Processing jobs
+
+### Bug Fixes and Other Changes
+
+* Add API enhancements for SMP
+* adjust naming convention; fix links
+* lower value used in featurestore test
+
+### Documentation Changes
+
+* Update GTDD instructions
+
 ## v2.21.0 (2020-12-21)

 ### Features

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.21.1.dev0
+2.23.1.dev0

doc/api/training/smd_data_parallel.rst

Lines changed: 7 additions & 7 deletions
@@ -2,12 +2,12 @@
 Distributed data parallel
 ###################################

-SageMaker distributed data parallel (SDP) extends SageMaker’s training
+SageMaker's distributed data parallel library extends SageMaker’s training
 capabilities on deep learning models with near-linear scaling efficiency,
 achieving fast time-to-train with minimal code changes.

-- SDP optimizes your training job for AWS network infrastructure and EC2 instance topology.
-- SDP takes advantage of gradient update to communicate between nodes with a custom AllReduce algorithm.
+- optimizes your training job for AWS network infrastructure and EC2 instance topology.
+- takes advantage of gradient update to communicate between nodes with a custom AllReduce algorithm.

 When training a model on a large amount of data, machine learning practitioners
 will often turn to distributed training to reduce the time to train.
@@ -21,11 +21,11 @@ in performance. This drop in performance is primarily caused the communications
 overhead between nodes in a cluster.

 .. important::
-SDP only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
+The distributed data parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
 ``Estimator`` with ``dataparallel`` parameter ``enabled`` set to ``True``,
 it uses CUDA 11. When you extend or customize your own training image
 you must use a CUDA 11 base image. See
-`SageMaker Python SDK's SDP APIs
+`SageMaker Python SDK's distributed data parallel library APIs
 <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__
 for more information.

@@ -38,7 +38,7 @@ To customize your own training script, you will need the following:
 <div data-section-style="5" style="">

 - You must provide TensorFlow / PyTorch training scripts that are
-adapted to use SDP.
+adapted to use the distributed data parallel library.
 - Your input data must be in an S3 bucket or in FSx in the AWS region
 that you will use to launch your training job. If you use the Jupyter
 notebooks provided, create a SageMaker notebook instance in the same
@@ -53,7 +53,7 @@ To customize your own training script, you will need the following:

 Use the API guides for each framework to see
 examples of training scripts that can be used to convert your training scripts.
-Then, use one of the example notebooks as your template to launch a training job.
+Then use one of the example notebooks as your template to launch a training job.
 You’ll need to swap your training script with the one that came with the
 notebook and modify any input functions as necessary.
 Once you have launched a training job, you can monitor it using CloudWatch.

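The admonition in the diff above refers to defining a PyTorch or TensorFlow ``Estimator`` with the ``dataparallel`` parameter ``enabled`` set to ``True``. For orientation only, here is a minimal hedged sketch of how that option is typically passed through the SageMaker Python SDK's ``distribution`` argument; the entry point, IAM role, instance type, and framework versions below are illustrative assumptions and are not part of this commit.

```python
# Hedged sketch: enabling the data parallel library when launching a training job.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # assumed: a training script adapted as in the guides below
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder IAM role
    instance_count=2,
    instance_type="ml.p3.16xlarge",  # multi-GPU instance type
    framework_version="1.6.0",       # assumed supported framework version
    py_version="py3",
    # Turns on SageMaker's distributed data parallel library for this job.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

estimator.fit("s3://example-bucket/training-data")  # placeholder S3 input
```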
doc/api/training/smd_data_parallel_pytorch.rst

Lines changed: 22 additions & 22 deletions
@@ -1,6 +1,6 @@
-####################
-PyTorch Guide to SDP
-####################
+##############################################################
+PyTorch Guide to SageMaker's distributed data parallel library
+##############################################################

 .. admonition:: Contents

@@ -13,16 +13,16 @@ Modify a PyTorch training script to use SageMaker data parallel
 ======================================================================

 The following steps show you how to convert a PyTorch training script to
-utilize SageMaker Distributed Data Parallel (SDP).
+utilize SageMaker's distributed data parallel library.

-The SDP APIs are designed to be close to PyTorch Distributed Data
-Parallel (DDP) APIs. Please see `SageMaker Distributed Data Parallel
-PyTorch API documentation <http://#>`__ for additional details on each
-API SDP offers for PyTorch.
+The distributed data parallel library APIs are designed to be close to PyTorch Distributed Data
+Parallel (DDP) APIs.
+See `SageMaker distributed data parallel PyTorch examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#pytorch-distributed>`__ for additional details on how to implement the data parallel library
+API offered for PyTorch.


-- First import SDP’s PyTorch client and initialize it. You also import
-the SDP module for distributed training.
+- First import the distributed data parallel library’s PyTorch client and initialize it. You also import
+the distributed data parallel library module for distributed training.

 .. code:: python

@@ -33,7 +33,7 @@ API SDP offers for PyTorch.
 dist.init_process_group()


-- Pin each GPU to a single SDP process with ``local_rank`` - this
+- Pin each GPU to a single distributed data parallel library process with ``local_rank`` - this
 refers to the relative rank of the process within a given node.
 ``smdistributed.dataparallel.torch.get_local_rank()`` API provides
 you the local rank of the device. The leader node will be rank 0, and
@@ -45,12 +45,12 @@ API SDP offers for PyTorch.
 torch.cuda.set_device(dist.get_local_rank())


-- Then wrap the PyTorch model with SDP’s DDP.
+- Then wrap the PyTorch model with the distributed data parallel library’s DDP.

 .. code:: python

 model = ...
-# Wrap model with SDP DistributedDataParallel
+# Wrap model with SageMaker's DistributedDataParallel
 model = DDP(model)


@@ -82,17 +82,17 @@ API SDP offers for PyTorch.


 All put together, the following is an example PyTorch training script
-you will have for distributed training with SDP:
+you will have for distributed training with the distributed data parallel library:

 .. code:: python

-# SDP: Import SDP PyTorch API
+# Import distributed data parallel library PyTorch API
 import smdistributed.dataparallel.torch.distributed as dist

-# SDP: Import SDP PyTorch DDP
+# Import distributed data parallel library PyTorch DDP
 from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

-# SDP: Initialize SDP
+# Initialize distributed data parallel library
 dist.init_process_group()

 class Net(nn.Module):
@@ -109,25 +109,25 @@ you will have for distributed training with SDP:

 def main():

-    # SDP: Scale batch size by world size
+    # Scale batch size by world size
     batch_size //= dist.get_world_size() // 8
     batch_size = max(batch_size, 1)

     # Prepare dataset
     train_dataset = torchvision.datasets.MNIST(...)

-    # SDP: Set num_replicas and rank in DistributedSampler
+    # Set num_replicas and rank in DistributedSampler
     train_sampler = torch.utils.data.distributed.DistributedSampler(
             train_dataset,
             num_replicas=dist.get_world_size(),
             rank=dist.get_rank())

     train_loader = torch.utils.data.DataLoader(..)

-    # SDP: Wrap the PyTorch model with SDP’s DDP
+    # Wrap the PyTorch model with distributed data parallel library’s DDP
     model = DDP(Net().to(device))

-    # SDP: Pin each GPU to a single SDP process.
+    # Pin each GPU to a single distributed data parallel library process.
     torch.cuda.set_device(local_rank)
     model.cuda(local_rank)

@@ -140,7 +140,7 @@ you will have for distributed training with SDP:
             test(...)
         scheduler.step()

-    # SDP: Save model on master node.
+    # Save model on master node.
     if dist.get_rank() == 0:
         torch.save(...)

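The full script in the diff above references ``local_rank`` and ``device`` without showing where they come from. Below is a short hedged sketch of that plumbing, built only from the library calls already shown in this guide; the variable names are illustrative.

```python
import torch
import smdistributed.dataparallel.torch.distributed as dist

dist.init_process_group()

# Per-node rank: pin this process to its own GPU, as in the "Pin each GPU" step above.
local_rank = dist.get_local_rank()
torch.cuda.set_device(local_rank)
device = torch.device("cuda", local_rank)

# Global rank and world size: used for the DistributedSampler arguments and
# for master-only actions such as saving the model when dist.get_rank() == 0.
rank = dist.get_rank()
world_size = dist.get_world_size()
```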
doc/api/training/smd_data_parallel_tensorflow.rst

Lines changed: 18 additions & 18 deletions
@@ -1,6 +1,6 @@
-#######################
-TensorFlow Guide to SDP
-#######################
+#################################################################
+TensorFlow Guide to SageMaker's distributed data parallel library
+#################################################################

 .. admonition:: Contents

@@ -13,13 +13,13 @@ Modify a TensorFlow 2.x training script to use SageMaker data parallel
 ======================================================================

 The following steps show you how to convert a TensorFlow 2.x training
-script to utilize SDP.
+script to utilize the distributed data parallel library.

-The SDP APIs are designed to be close to Horovod APIs. Please see the
-SDP TensorFlow API specification for additional details on each API that
-SDP offers for TensorFlow.
+The distributed data parallel library APIs are designed to be close to Horovod APIs.
+See `SageMaker distributed data parallel TensorFlow examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#tensorflow-distributed>`__ for additional details on how to implement the data parallel library
+API offered for TensorFlow.

-- First import SDP’s TensorFlow client and initialize it:
+- First import the distributed data parallel library’s TensorFlow client and initialize it:

 .. code:: python

@@ -54,7 +54,7 @@ SDP offers for TensorFlow.
 learning_rate = learning_rate * sdp.size()


-- Use SDP’s ``DistributedGradientTape`` to optimize AllReduce
+- Use the library’s ``DistributedGradientTape`` to optimize AllReduce
 operations during training. This wraps ``tf.GradientTape``.

 .. code:: python
@@ -63,7 +63,7 @@ SDP offers for TensorFlow.
   output = model(input)
   loss_value = loss(label, output)

-# SDP: Wrap tf.GradientTape with SDP's DistributedGradientTape
+# Wrap tf.GradientTape with the library's DistributedGradientTape
 tape = sdp.DistributedGradientTape(tape)


@@ -92,23 +92,23 @@ SDP offers for TensorFlow.


 All put together, the following is an example TensorFlow2 training
-script you will have for distributed training with SDP.
+script you will have for distributed training with the library.

 .. code:: python

 import tensorflow as tf

-# SDP: Import SDP TF API
+# Import the library's TF API
 import smdistributed.dataparallel.tensorflow as sdp

-# SDP: Initialize SDP
+# Initialize the library
 sdp.init()

 gpus = tf.config.experimental.list_physical_devices('GPU')
 for gpu in gpus:
     tf.config.experimental.set_memory_growth(gpu, True)
 if gpus:
-    # SDP: Pin GPUs to a single SDP process
+    # Pin GPUs to a single process
     tf.config.experimental.set_visible_devices(gpus[sdp.local_rank()], 'GPU')

 # Prepare Dataset
@@ -118,7 +118,7 @@ script you will have for distributed training with SDP.
 mnist_model = tf.keras.Sequential(...)
 loss = tf.losses.SparseCategoricalCrossentropy()

-# SDP: Scale Learning Rate
+# Scale Learning Rate
 # LR for 8 node run : 0.000125
 # LR for single node run : 0.001
 opt = tf.optimizers.Adam(0.000125 * sdp.size())
@@ -129,22 +129,22 @@ script you will have for distributed training with SDP.
         probs = mnist_model(images, training=True)
         loss_value = loss(labels, probs)

-    # SDP: Wrap tf.GradientTape with SDP's DistributedGradientTape
+    # Wrap tf.GradientTape with the library's DistributedGradientTape
     tape = sdp.DistributedGradientTape(tape)

     grads = tape.gradient(loss_value, mnist_model.trainable_variables)
     opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

     if first_batch:
-       # SDP: Broadcast model and optimizer variables
+       # Broadcast model and optimizer variables
        sdp.broadcast_variables(mnist_model.variables, root_rank=0)
        sdp.broadcast_variables(opt.variables(), root_rank=0)

     return loss_value

 ...

-# SDP: Save checkpoints only from master node.
+# Save checkpoints only from master node.
 if sdp.rank() == 0:
     checkpoint.save(checkpoint_dir)

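As a hedged, self-contained sketch only, the TensorFlow fragments above typically fit together with ``@tf.function`` and a driving loop as shown below; the model, dataset, and batch size here are stand-ins chosen for illustration and are not taken from this commit.

```python
import tensorflow as tf
import smdistributed.dataparallel.tensorflow as sdp

sdp.init()

# Stand-ins for the model, loss, and optimizer sketched in the excerpt above.
mnist_model = tf.keras.Sequential([tf.keras.layers.Flatten(), tf.keras.layers.Dense(10)])
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)
opt = tf.optimizers.Adam(0.000125 * sdp.size())  # learning rate scaled by world size

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = mnist_model(images, training=True)
        loss_value = loss(labels, probs)

    # Wrap tf.GradientTape with the library's DistributedGradientTape
    tape = sdp.DistributedGradientTape(tape)
    grads = tape.gradient(loss_value, mnist_model.trainable_variables)
    opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

    if first_batch:
        # Broadcast initial model and optimizer state from rank 0, exactly once
        sdp.broadcast_variables(mnist_model.variables, root_rank=0)
        sdp.broadcast_variables(opt.variables(), root_rank=0)
    return loss_value

# Placeholder dataset; batch == 0 marks the first batch so the broadcast runs once.
(x, y), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(x[..., tf.newaxis] / 255.0, tf.float32), tf.cast(y, tf.int64))
).batch(128)

for batch, (images, labels) in enumerate(dataset.take(100)):
    loss_value = training_step(images, labels, batch == 0)
```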
doc/api/training/smd_model_parallel.rst

Lines changed: 29 additions & 19 deletions
@@ -1,30 +1,45 @@
 Distributed model parallel
 --------------------------

-Amazon SageMaker Distributed Model Parallel (SMP) is a model parallelism library for training
+The Amazon SageMaker distributed model parallel library is a model parallelism library for training
 large deep learning models that were previously difficult to train due to GPU memory limitations.
-SMP automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training,
+The library automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training,
 allowing you to increase prediction accuracy by creating larger models with more parameters.

-You can use SMP to automatically partition your existing TensorFlow and PyTorch workloads
-across multiple GPUs with minimal code changes. The SMP API can be accessed through the Amazon SageMaker SDK.
+You can use the library to automatically partition your existing TensorFlow and PyTorch workloads
+across multiple GPUs with minimal code changes. The library's API can be accessed through the Amazon SageMaker SDK.

-Use the following sections to learn more about the model parallelism and the SMP library.
+Use the following sections to learn more about the model parallelism and the library.

 .. important::
-SMP only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
+The model parallel library only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
 ``Estimator`` with ``modelparallel`` parameter ``enabled`` set to ``True``,
 it uses CUDA 11. When you extend or customize your own training image
 you must use a CUDA 11 base image. See
-`Extend or Adapt A Docker Container that Contains SMP
+`Extend or Adapt A Docker Container that Contains the Model Parallel Library
 <https://integ-docs-aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
 for more information.

+How to Use this Guide
+=====================
+
+The library contains a Common API that is shared across frameworks, as well as APIs
+that are specific to supported frameworks, TensorFlow and PyTorch. To use the library, reference the
+**Common API** documentation alongside the framework specific API documentation.
+
+.. toctree::
+:maxdepth: 1
+
+smd_model_parallel_general
+smd_model_parallel_common_api
+smd_model_parallel_pytorch
+smd_model_parallel_tensorflow
+
 It is recommended to use this documentation alongside `SageMaker Distributed Model Parallel
 <http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html>`__ in the Amazon SageMaker
 developer guide. This developer guide documentation includes:

-- An overview of model parallelism and the SMP library
+- An overview of model parallelism and the library
 `core features <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html>`__
 - Instructions on how to modify `TensorFlow
 <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-training-script.html#model-parallel-customize-training-script-tf>`__
@@ -34,17 +49,12 @@ developer guide. This developer guide documentation includes:
 - `Configuration tips and pitfalls
 <https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-tips-pitfalls.html>`__

-**How to Use this Guide**
-
-The SMP library contains a Common API that is shared across frameworks, as well as APIs
-that are specific to supported frameworks, TensorFlow and PyTroch. To use SMP, reference the
-**Common API** documentation alongside framework specific API documentation.
+Latest Updates
+==============

+New features, bug fixes, and improvements are regularly made to the SageMaker distributed model parallel library.

-.. toctree::
-:maxdepth: 1
+To see the the latest changes made to the library, refer to the library
+`Release Notes
+<https://github.com/aws/sagemaker-python-sdk/blob/master/doc/api/training/smd_model_parallel_release_notes/>`_.

-smd_model_parallel_general
-smd_model_parallel_common_api
-smd_model_parallel_pytorch
-smd_model_parallel_tensorflow

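Analogously to the ``dataparallel`` flag, the ``modelparallel`` parameter mentioned in the admonition above is passed through the estimator's ``distribution`` argument together with MPI settings. A minimal hedged sketch follows; the script name, IAM role, and parameter values are illustrative assumptions only.

```python
# Hedged sketch: enabling the model parallel library on a PyTorch estimator.
from sagemaker.pytorch import PyTorch

smp_options = {
    "enabled": True,
    "parameters": {       # illustrative partition/pipeline settings
        "partitions": 2,
        "microbatches": 4,
    },
}

estimator = PyTorch(
    entry_point="train_smp.py",  # assumed: a script adapted for the model parallel library
    role="arn:aws:iam::123456789012:role/ExampleSageMakerRole",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.p3.16xlarge",
    framework_version="1.6.0",
    py_version="py3",
    distribution={
        "smdistributed": {"modelparallel": smp_options},
        "mpi": {"enabled": True, "processes_per_host": 8},
    },
)

estimator.fit("s3://example-bucket/training-data")  # placeholder S3 input
```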