Documentation: SageMaker distributed data parallel doc versioning #2120

Merged: 3 commits, Feb 2, 2021
9 changes: 9 additions & 0 deletions doc/api/training/sdp_versions/v1_0_0.rst
@@ -0,0 +1,9 @@

Version 1.0.0 (Latest)
======================

.. toctree::
   :maxdepth: 1

   v1.0.0/smd_data_parallel_pytorch.rst
   v1.0.0/smd_data_parallel_tensorflow.rst
99 changes: 62 additions & 37 deletions doc/api/training/smd_data_parallel.rst
@@ -6,39 +6,36 @@ SageMaker's distributed data parallel library extends SageMaker’s training
capabilities on deep learning models with near-linear scaling efficiency,
achieving fast time-to-train with minimal code changes.

- optimizes your training job for AWS network infrastructure and EC2 instance topology.
- takes advantage of the gradient update step to communicate between nodes with a custom AllReduce algorithm.

When training a model on a large amount of data, machine learning practitioners
will often turn to distributed training to reduce the time to train.
In some cases, where time is of the essence,
the business requirement is to finish training as quickly as possible or at
least within a constrained time period.
Then, distributed training is scaled to use a cluster of multiple nodes,
meaning not just multiple GPUs in a computing instance, but multiple instances
with multiple GPUs. However, as the cluster size increases, it is possible to see a significant drop
in performance due to communications overhead between nodes in a cluster.

SageMaker's distributed data parallel library addresses communications overhead in two ways:

1. The library performs AllReduce, a key operation during distributed training that is responsible for a
large portion of communication overhead.
2. The library performs optimized node-to-node communication by fully utilizing AWS’s network
infrastructure and Amazon EC2 instance topology.

To learn more about the core features of this library, see
`Introduction to SageMaker's Distributed Data Parallel Library
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html>`_
in the SageMaker Developer Guide.

Use with the SageMaker Python SDK
=================================

To use the SageMaker distributed data parallel library with the SageMaker Python SDK, you will need the following:

- A TensorFlow or PyTorch training script that is
adapted to use the distributed data parallel library. The :ref:`sdp_api_docs` includes
framework-specific examples of training scripts that are adapted to use this library
(a minimal, illustrative sketch also follows this list).
- Your input data must be in an S3 bucket or in FSx in the AWS region
that you will use to launch your training job. If you use the Jupyter
notebooks provided, create a SageMaker notebook instance in the same
@@ -47,32 +44,60 @@ To customize your own training script, you will need the following:
the `SageMaker Python SDK data
inputs <https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs>`__ documentation.
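
The framework guides linked from :ref:`sdp_api_docs` show the exact script changes. As a
minimal, illustrative sketch only, an adapted PyTorch script typically initializes the
library's v1.0.0 API along the following lines; the tiny ``torch.nn.Linear`` model, the
random data, and the training loop are placeholder assumptions, not part of this
documentation, and the import paths reflect the v1.0.0 PyTorch API as documented at the time:

.. code:: python

   # Illustrative sketch of a PyTorch script adapted for the library's v1.0.0 API.
   # The model, data, and optimizer below are placeholders for your own training code.
   import torch
   import smdistributed.dataparallel.torch.distributed as dist
   from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

   dist.init_process_group()                    # initialize the library's process group
   local_rank = dist.get_local_rank()           # GPU index on this instance
   torch.cuda.set_device(local_rank)

   model = DDP(torch.nn.Linear(10, 1).to(local_rank))        # wrap the model with the library's DDP
   optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

   for _ in range(10):                          # placeholder training loop
       inputs = torch.randn(32, 10).to(local_rank)
       targets = torch.randn(32, 1).to(local_rank)
       loss = torch.nn.functional.mse_loss(model(inputs), targets)
       optimizer.zero_grad()
       loss.backward()                          # gradients are averaged across GPUs via AllReduce
       optimizer.step()

   if dist.get_rank() == 0:                     # save the model only on the leader process
       torch.save(model.state_dict(), "model.pt")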

When you define
a PyTorch or TensorFlow ``Estimator`` using the SageMaker Python SDK,
you must select ``dataparallel`` as your ``distribution`` strategy:

.. code::

   distribution = { "smdistributed": { "dataparallel": { "enabled": True } } }
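
For illustration only, this ``distribution`` dictionary is passed to the framework estimator
when you construct it. In the sketch below, the entry point, IAM role, instance type, and
framework/Python versions are placeholder assumptions, not values prescribed by this
documentation:

.. code:: python

   # Hypothetical launch of a distributed data parallel training job with the SageMaker Python SDK.
   # train.py, the IAM role ARN, instance type, and version strings are example placeholders.
   from sagemaker.pytorch import PyTorch

   estimator = PyTorch(
       entry_point="train.py",                  # your script adapted for the library
       role="arn:aws:iam::111122223333:role/SageMakerRole",
       instance_count=2,                        # multi-node: two instances, each with multiple GPUs
       instance_type="ml.p3.16xlarge",          # a multi-GPU instance type supported by the library
       framework_version="1.6.0",
       py_version="py36",
       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
   )

   estimator.fit("s3://your-bucket/training-data")  # input data must be in S3 or FSx in the same region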

We recommend you use one of the example notebooks as your template to launch a training job. When
you use an example notebook, you'll need to swap in your own training script for the one that came with the
notebook and modify any input functions as necessary. For instructions on how to get started using a
Jupyter Notebook example, see `Distributed Training Jupyter Notebook Examples
<https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training-notebook-examples.html>`_.

Once you have launched a training job, you can monitor it using CloudWatch. To learn more, see
`Monitor and Analyze Training Jobs Using Metrics
<https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html>`_.


After you train a model, you can deploy it to an endpoint for inference by
following one of the `example notebooks for deploying a model
<https://sagemaker-examples.readthedocs.io/en/latest/inference/index.html>`_.
For more information, see `Deploy Models for Inference
<https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-model.html>`_.

.. _sdp_api_docs:

API Documentation
=================

This section contains the SageMaker distributed data parallel API documentation. If you are a
new user of this library, it is recommended that you use this guide alongside
`SageMaker's Distributed Data Parallel Library
<https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel.html>`_.

Select a version to see the API documentation for that version.

.. toctree::
   :maxdepth: 1

   sdp_versions/v1_0_0.rst

.. important::
   The distributed data parallel library only supports training jobs using CUDA 11.
   When you define a PyTorch or TensorFlow ``Estimator`` with the ``dataparallel``
   parameter ``enabled`` set to ``True``, it uses CUDA 11. When you extend or
   customize your own training image, you must use a CUDA 11 base image. See
   `SageMaker Python SDK's distributed data parallel library APIs
   <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
   for more information.


Release Notes
=============

New features, bug fixes, and improvements are regularly made to the SageMaker distributed data parallel library.

@@ -1,9 +1,10 @@
# SageMaker Distributed Data Parallel 1.0.0 Release Notes

- First Release
- Getting Started

## First Release

SageMaker's distributed data parallel library extends SageMaker’s training
capabilities on deep learning models with near-linear scaling efficiency,
achieving fast time-to-train with minimal code changes.
@@ -12,7 +13,8 @@ SageMaker Distributed Data Parallel:
- optimizes your training job for AWS network infrastructure and EC2 instance topology.
- takes advantage of the gradient update step to communicate between nodes with a custom AllReduce algorithm.

The library currently supports TensorFlow v2 and PyTorch via [AWS Deep Learning Containers](https://aws.amazon.com/machine-learning/containers/).

## Getting Started

To get started, refer to the [SageMaker Distributed Data Parallel Python SDK Guide](https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api).
18 changes: 9 additions & 9 deletions doc/api/training/smd_model_parallel.rst
@@ -11,15 +11,6 @@ across multiple GPUs with minimal code changes. The library's API can be accessed

Use the following sections to learn more about model parallelism and the library.


Use with the SageMaker Python SDK
=================================

@@ -61,6 +52,15 @@ developer guide. This developer guide documentation includes:
<https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-customize-tips-pitfalls.html>`__


.. important::
   The model parallel library only supports training jobs using CUDA 11. When you
   define a PyTorch or TensorFlow ``Estimator`` with the ``modelparallel`` parameter
   ``enabled`` set to ``True``, it uses CUDA 11. When you extend or customize your
   own training image, you must use a CUDA 11 base image. See
   `Extend or Adapt A Docker Container that Contains the Model Parallel Library
   <https://integ-docs-aws.amazon.com/sagemaker/latest/dg/model-parallel-use-api.html#model-parallel-customize-container>`__
   for more information.
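
As a rough, non-authoritative sketch of where the ``modelparallel`` parameter appears, the
``distribution`` argument of a framework estimator typically looks like the following; the
partition count, microbatch count, and MPI settings are illustrative assumptions only:

.. code:: python

   # Illustrative sketch: enabling the model parallel library on a SageMaker estimator.
   # Setting "enabled": True is what selects a CUDA 11 training container; the other
   # values below are example placeholders, not recommended settings.
   distribution = {
       "smdistributed": {
           "modelparallel": {
               "enabled": True,
               "parameters": {"partitions": 2, "microbatches": 4},
           }
       },
       "mpi": {"enabled": True, "processes_per_host": 8},
   }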

Release Notes
=============
