Commit 9a5e8bc

update distributed training doc
1 parent 5159c85 commit 9a5e8bc

File tree: 9 files changed (+208 −203 lines)

doc/api/training/distributed.rst

Lines changed: 8 additions & 2 deletions
@@ -4,14 +4,20 @@ SageMaker distributed training libraries offer both data parallel and model para
 They combine software and hardware technologies to improve inter-GPU and inter-node communications.
 They extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.

+.. _sdp_api_docs_toc:
+
 The SageMaker Distributed Data Parallel Library
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. toctree::
-   :maxdepth: 3
+   :maxdepth: 3

-   smd_data_parallel
+   smd_data_parallel
+   sdp_versions/latest
+   smd_data_parallel_use_sm_pysdk
+   smd_data_parallel_release_notes/smd_data_parallel_change_log

+.. _smp_api_docs_toc:

 The SageMaker Distributed Model Parallel Library
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

doc/api/training/sdp_versions/latest.rst

Lines changed: 9 additions & 7 deletions
@@ -1,8 +1,8 @@
 .. _sdp_api_docs:

-###############################################
-Use the Library's API to Adapt Training Scripts
-###############################################
+#############################################
+Use the Library to Adapt Your Training Script
+#############################################

 This section contains the SageMaker distributed data parallel API documentation.
 If you are a new user of this library, it is recommended you use this guide alongside
@@ -15,11 +15,13 @@ Select the latest or one of the previous versions of the API documentation
 depending on the version of the library you use.

 .. important::
+
    The distributed data parallel library supports training jobs using CUDA 11 or later.
-   When you define a SageMaker ``PyTorch`` or ``TensorFlow``
-   estimator with ``dataparallel`` parameter ``enabled`` set to ``True``,
-   it uses CUDA 11. When you extend or customize your own training image,
-   you must use a CUDA 11 base image. See
+   When you define a :class:`sagemaker.tensorflow.estimator.TensorFlow` or
+   :class:`sagemaker.pytorch.estimator.PyTorch`
+   estimator with the data parallel library enabled,
+   SageMaker uses CUDA 11. When you extend or customize your own training image,
+   you must use a base image with CUDA 11 or later. See
    `SageMaker Python SDK's distributed data parallel library APIs
    <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`_
    for more information.

doc/api/training/sdp_versions/latest/smd_data_parallel_pytorch.rst

Lines changed: 20 additions & 134 deletions
@@ -11,154 +11,40 @@ data parallel library API for PyTorch.

 .. _pytorch-sdp-modify:

-Modify a PyTorch training script to use the SageMaker data parallel library
-===========================================================================
+Use the SageMaker Distributed Data Parallel Library as a Backend of ``torch.distributed``
+===========================================================================================

-The following steps show you how to convert a PyTorch training script to
-utilize SageMaker's distributed data parallel library.
-
-The distributed data parallel library works as a backend of the PyTorch distributed package.
-See `SageMaker distributed data parallel PyTorch examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#pytorch-distributed>`__
-for additional details on how to use the library.
-
-1. Import the SageMaker distributed data parallel library’s PyTorch client.
-
-   .. code:: python
-
-      import smdistributed.dataparallel.torch.torch_smddp
-
-2. Import the PyTorch distributed modules.
-
-   .. code:: python
-
-      import torch
-      import torch.distributed as dist
-      from torch.nn.parallel import DistributedDataParallel as DDP
-
-3. Set the backend of ``torch.distributed`` as ``smddp``.
-
-   .. code:: python
-
-      dist.init_process_group(backend='smddp')
-
-4. After parsing arguments and defining a batch size parameter
-   (for example, ``batch_size=args.batch_size``), add a two-line of code to
-   resize the batch size per worker (GPU). PyTorch's DataLoader operation
-   does not automatically handle the batch resizing for distributed training.
-
-   .. code:: python
-
-      batch_size //= dist.get_world_size()
-      batch_size = max(batch_size, 1)
-
-5. Pin each GPU to a single SageMaker data parallel library process with
-   ``local_rank``. This refers to the relative rank of the process within a given node.
-
-   You can retrieve the rank of the process from the ``LOCAL_RANK`` environment variable.
-
-   .. code:: python
-
-      import os
-      local_rank = os.environ["LOCAL_RANK"]
-      torch.cuda.set_device(local_rank)
-
-6. After defining a model, wrap it with the PyTorch DDP.
-
-   .. code:: python
-
-      model = ...
-
-      # Wrap the model with the PyTorch DistributedDataParallel API
-      model = DDP(model)
-
-7. When you call the ``torch.utils.data.distributed.DistributedSampler`` API,
-   specify the total number of processes (GPUs) participating in training across
-   all the nodes in the cluster. This is called ``world_size``, and you can retrieve
-   the number from the ``torch.distributed.get_world_size()`` API. Also, specify
-   the rank of each process among all processes using the ``torch.distributed.get_rank()`` API.
-
-   .. code:: python
-
-      train_sampler = DistributedSampler(
-          train_dataset,
-          num_replicas = dist.get_world_size(),
-          rank = dist.get_rank()
-      )
-
-8. Modify your script to save checkpoints only on the leader process (rank 0).
-   The leader process has a synchronized model. This also avoids other processes
-   overwriting the checkpoints and possibly corrupting the checkpoints.
-
-The following example code shows the structure of a PyTorch training script with DDP and smddp as the backend.
+To use the SageMaker distributed data parallel library,
+the only thing you need to do is to import the SageMaker distributed data
+parallel library’s PyTorch client (``smdistributed.dataparallel.torch.torch_smddp``).
+The client registers ``smddp`` as a backend for PyTorch.
+When you initialize the PyTorch distributed process group using
+the ``torch.distributed.init_process_group`` API,
+make sure you specify ``'smddp'`` to the backend argument.

 .. code:: python

-   import os
-   import torch
-
-   # SageMaker data parallel: Import the library PyTorch API
    import smdistributed.dataparallel.torch.torch_smddp
-
-   # SageMaker data parallel: Import PyTorch's distributed API
    import torch.distributed as dist
-   from torch.nn.parallel import DistributedDataParallel as DDP

-   # SageMaker data parallel: Initialize the process group
    dist.init_process_group(backend='smddp')

-   class Net(nn.Module):
-       ...
-       # Define model
-
-   def train(...):
-       ...
-       # Model training
-
-   def test(...):
-       ...
-       # Model evaluation
-
-   def main():
-
-       # SageMaker data parallel: Scale batch size by world size
-       batch_size //= dist.get_world_size()
-       batch_size = max(batch_size, 1)
-
-       # Prepare dataset
-       train_dataset = torchvision.datasets.MNIST(...)
-
-       # SageMaker data parallel: Set num_replicas and rank in DistributedSampler
-       train_sampler = torch.utils.data.distributed.DistributedSampler(
-           train_dataset,
-           num_replicas=dist.get_world_size(),
-           rank=dist.get_rank())
-
-       train_loader = torch.utils.data.DataLoader(..)
-
-       # SageMaker data parallel: Wrap the PyTorch model with the library's DDP
-       model = DDP(Net().to(device))

-       # SageMaker data parallel: Pin each GPU to a single library process.
-       local_rank = os.environ["LOCAL_RANK"]
-       torch.cuda.set_device(local_rank)
-       model.cuda(local_rank)
+If you already have a working PyTorch script and only need to add the
+backend specification, you can proceed to Using the SageMaker PyTorch Estimator
+in the Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK topic.

-       # Train
-       optimizer = optim.Adadelta(...)
-       scheduler = StepLR(...)
-       for epoch in range(1, args.epochs + 1):
-           train(...)
-           if rank == 0:
-               test(...)
-           scheduler.step()
+.. note::

-       # SageMaker data parallel: Save model on the leader node (rank 0).
-       if dist.get_rank() == 0:
-           torch.save(...)
+   The ``smddp`` backend currently does not support creating subprocess groups
+   with the ``torch.distributed.new_group()`` API.
+   You cannot use the ``smddp`` backend concurrently with other backends.

-   if __name__ == '__main__':
-       main()
+.. seealso::

+   If you still need to modify your training script to properly use
+   the PyTorch distributed package, see `Preparing a PyTorch Training Script for Distributed Training <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-modify-sdp-pt.html>`_
+   in the *Amazon SageMaker Developer Guide*.

 .. _pytorch-sdp-api:

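For reference, the removed numbered steps above reduce to the following structure. This is a minimal sketch that reassembles them in one place; ``MyModel``, ``args.batch_size``, the dataset, and the training loop are hypothetical placeholders for your own code.

.. code:: python

   import os

   import torch
   import torch.distributed as dist
   from torch.nn.parallel import DistributedDataParallel as DDP
   from torch.utils.data.distributed import DistributedSampler

   # Importing the client registers 'smddp' as a torch.distributed backend.
   import smdistributed.dataparallel.torch.torch_smddp  # noqa: F401

   # Initialize the process group with the smddp backend.
   dist.init_process_group(backend='smddp')

   # Pin each GPU to a single library process using the local rank.
   local_rank = int(os.environ["LOCAL_RANK"])
   torch.cuda.set_device(local_rank)

   # Scale the per-worker batch size by the world size.
   batch_size = max(args.batch_size // dist.get_world_size(), 1)

   # Shard the dataset across all processes in the cluster.
   train_dataset = ...  # for example, torchvision.datasets.MNIST(...)
   train_sampler = DistributedSampler(
       train_dataset,
       num_replicas=dist.get_world_size(),
       rank=dist.get_rank(),
   )
   train_loader = torch.utils.data.DataLoader(
       train_dataset, batch_size=batch_size, sampler=train_sampler
   )

   # Wrap the model with PyTorch DDP; smddp handles the inter-GPU communication.
   model = DDP(MyModel().cuda(local_rank))

   # ... training loop over train_loader ...

   # Save checkpoints only on the leader process (rank 0) so that other
   # processes do not overwrite or corrupt the checkpoint.
   if dist.get_rank() == 0:
       torch.save(model.state_dict(), "model.pt")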

doc/api/training/smd_data_parallel.rst

Lines changed: 0 additions & 7 deletions
@@ -27,10 +27,3 @@ To learn more about the core features of this library, see
 `Introduction to SageMaker's Distributed Data Parallel Library
 <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-intro.html>`_
 in the SageMaker Developer Guide.
-
-.. toctree::
-   :maxdepth: 3
-
-   sdp_versions/latest
-   smd_data_parallel_use_sm_pysdk
-   smd_data_parallel_release_notes/smd_data_parallel_change_log

doc/api/training/smd_data_parallel_release_notes/smd_data_parallel_change_log.rst

Lines changed: 39 additions & 4 deletions
@@ -10,20 +10,55 @@ distributed data parallel library.
 SageMaker Distributed Data Parallel 1.4.0 Release Notes
 =======================================================

-*Date: Feb. 18. 2022*
+*Date: Feb. 24. 2022*

 **New Features**

 * Integrated to PyTorch DDP as a backend option
 * Added support for PyTorch 1.10.2

-**Bug Fixes**
+**Breaking Changes**
+
+* As the library is migrated into the PyTorch distributed package as a backend,
+  the following smdistributed implementation APIs are deprecated in
+  the SageMaker data parallel library v1.4.0 and later.
+  Please use the `PyTorch distributed APIs <https://pytorch.org/docs/stable/distributed.html>`_ instead.
+
+  * ``smdistributed.dataparallel.torch.distributed``
+  * ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``
+* Please note the slight differences between the deprecated
+  ``smdistributed.dataparallel.torch`` APIs and the
+  `PyTorch distributed APIs <https://pytorch.org/docs/stable/distributed.html>`_.

-*
+  * `torch.distributed.barrier <https://pytorch.org/docs/master/distributed.html#torch.distributed.barrier)>`_
+    takes ``device_ids``, which the ``smddp`` backend does not support.
+  * The ``gradient_accumulation_steps`` option in
+    ``smdistributed.dataparallel.torch.parallel.DistributedDataParallel``
+    is no longer supported. Please use the PyTorch
+    `no_sync <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=no_sync#torch.nn.parallel.DistributedDataParallel.no_sync>`_ API.
+
+
+* If you want to find documentation for the previous versions of the library
+  (v1.3.0 or before), see the `archived SageMaker distributed data parallel library documentation <https://sagemaker.readthedocs.io/en/stable/api/training/sdp_versions/latest.html#documentation-archive>`_.

 **Improvements**

-*
+* Support AllReduce Large Tensors
+* Support the following new arguments in the `PyTorch DDP class <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel>`_.
+
+  * ``broadcast_buffers``
+  * ``find_unused_parameters``
+  * ``gradient_as_bucket_view``
+
+**Bug Fixes**
+
+* Fixed stalling issues when training on ``ml.p3.16xlarge``.
+
+**Known Issues**
+
+* The library currently does not support the PyTorch sub-process groups API (`torch.distributed.new_group <https://pytorch.org/docs/stable/distributed.html#torch.distributed.new_group>`_).
+  This means that you cannot use the ``smddp`` backend concurrently with other
+  process group backends such as NCCL and Gloo.

 **Migration to AWS Deep Learning Containers**

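To illustrate the migration described in the breaking changes above, the following is a rough sketch of gradient accumulation using PyTorch's ``no_sync()`` context manager in place of the removed ``gradient_accumulation_steps`` option. The ``model`` (a DDP-wrapped module), ``optimizer``, ``loss_fn``, ``train_loader``, and the four-step accumulation window are assumed placeholders.

.. code:: python

   # Sketch only: accumulate gradients over several micro-batches and
   # synchronize them once per window with DistributedDataParallel.no_sync().
   accumulation_steps = 4  # hypothetical window size

   optimizer.zero_grad()
   for step, (inputs, targets) in enumerate(train_loader):
       if (step + 1) % accumulation_steps != 0:
           # Skip the gradient allreduce for intermediate micro-batches.
           with model.no_sync():
               loss = loss_fn(model(inputs), targets)
               loss.backward()
       else:
           # Gradients for the whole window are synchronized on this step.
           loss = loss_fn(model(inputs), targets)
           loss.backward()
           optimizer.step()
           optimizer.zero_grad()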

doc/api/training/smd_data_parallel_use_sm_pysdk.rst

Lines changed: 10 additions & 5 deletions
@@ -1,5 +1,5 @@
-Run a Distributed Training Job Using the SageMaker Python SDK
-=============================================================
+Launch a Distributed Training Job Using the SageMaker Python SDK
+================================================================

 To use the SageMaker distributed data parallel library with the SageMaker Python SDK,
 you will need the following:
@@ -18,13 +18,19 @@ you will need the following:
   inputs <https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs>`__ documentation.

 When you define
-a Pytorch or TensorFlow ``Estimator`` using the SageMaker Python SDK,
-you must select ``dataparallel`` as your ``distribution`` strategy:
+a :class:`sagemaker.tensorflow.estimator.TensorFlow` or :class:`sagemaker.pytorch.estimator.PyTorch` estimator,
+you must select ``smdistributed`` and then ``dataparallel`` as your ``distribution`` strategy.

 .. code:: python

    distribution = { "smdistributed": { "dataparallel": { "enabled": True } } }

+.. seealso::
+
+   To learn more, see `Step 2: Launch a SageMaker Distributed Training Job Using the SageMaker Python SDK
+   <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html>`_
+   in the *Amazon SageMaker Developer Guide*.
+
 We recommend you use one of the example notebooks as your template to launch a training job. When
 you use an example notebook you’ll need to swap your training script with the one that came with the
 notebook and modify any input functions as necessary. For instructions on how to get started using a
@@ -35,7 +41,6 @@ Once you have launched a training job, you can monitor it using CloudWatch. To l
 `Monitor and Analyze Training Jobs Using Metrics
 <https://docs.aws.amazon.com/sagemaker/latest/dg/training-metrics.html>`_.

-
 After you train a model, you can see how to deploy your trained model to an endpoint for inference by
 following one of the `example notebooks for deploying a model
 <https://sagemaker-examples.readthedocs.io/en/latest/inference/index.html>`_.
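As a quick illustration of the ``distribution`` setting above, the following is a sketch of a SageMaker ``PyTorch`` estimator with the data parallel library enabled. The entry point, IAM role, framework and Python versions, instance type and count, and the S3 input path are hypothetical values to replace with your own.

.. code:: python

   from sagemaker.pytorch import PyTorch

   estimator = PyTorch(
       entry_point="train.py",  # your training script
       role="arn:aws:iam::111122223333:role/SageMakerRole",  # example role ARN
       framework_version="1.10.2",
       py_version="py38",
       instance_count=2,
       instance_type="ml.p3.16xlarge",
       # Enable the SageMaker distributed data parallel library.
       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
   )

   estimator.fit("s3://your-bucket/your-training-data")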

doc/api/training/smd_model_parallel_general.rst

Lines changed: 2 additions & 0 deletions
@@ -1,3 +1,5 @@
+.. _sm-sdk-modelparallel-general:
+
 #############################################################
 Run a Distributed Training Job Using the SageMaker Python SDK
 #############################################################
