###################################
Distributed data parallel
###################################

SageMaker distributed data parallel (SDP) extends SageMaker’s training
capabilities for deep learning models with near-linear scaling efficiency,
achieving fast time-to-train with minimal code changes.

- SDP optimizes your training job for AWS network infrastructure and EC2 instance topology.
- SDP takes advantage of gradient updates to communicate between nodes with a custom AllReduce algorithm.

When training a model on a large amount of data, machine learning practitioners
often turn to distributed training to reduce the time to train.
In some cases, where time is of the essence,
the business requirement is to finish training as quickly as possible or at
least within a constrained time period.
Then, distributed training is scaled to use a cluster of multiple nodes,
meaning not just multiple GPUs in a computing instance, but multiple instances
with multiple GPUs. However, as the cluster size increases, scaling efficiency
drops significantly. This drop in performance is primarily caused by the
communication overhead between nodes in the cluster.

.. rubric:: Customize your training script

To customize your own training script, you will need the following:

- You must provide TensorFlow / PyTorch training scripts that are
  adapted to use SDP (a minimal PyTorch sketch is shown after this list).
- Your input data must be in an S3 bucket or in FSx in the AWS region
  that you will use to launch your training job. If you use the Jupyter
  notebooks provided, create a SageMaker notebook instance in the same
  region as the bucket that contains your input data. For more
  information about storing your training data, refer to
  the `SageMaker Python SDK data
  inputs <https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs>`__ documentation.

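As a rough illustration of what such an adaptation might look like, the
following is a minimal, hypothetical PyTorch sketch using the
``smdistributed.dataparallel`` API covered in the PyTorch guide below. The tiny
model, the synthetic dataset, and the hyperparameters are placeholders rather
than part of this documentation, and the ``smdistributed`` package is only
available inside SageMaker's training containers.

.. code-block:: python

   # Hypothetical minimal PyTorch training script adapted for SDP.
   # The model, dataset, and hyperparameters below are placeholders.
   import torch
   import torch.nn as nn
   import torch.optim as optim
   from torch.utils.data import DataLoader, TensorDataset
   from torch.utils.data.distributed import DistributedSampler

   # SDP's PyTorch modules (available in SageMaker's training containers)
   import smdistributed.dataparallel.torch.distributed as dist
   from smdistributed.dataparallel.torch.parallel.distributed import (
       DistributedDataParallel as DDP,
   )

   dist.init_process_group()              # initialize the SDP process group
   local_rank = dist.get_local_rank()     # GPU index on this instance
   torch.cuda.set_device(local_rank)

   # Placeholder model and synthetic data; replace with your own.
   net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
   model = DDP(net.to(local_rank))        # wrap the model with SDP's DDP
   dataset = TensorDataset(torch.randn(4096, 32), torch.randint(0, 2, (4096,)))

   # Shard the data so each GPU trains on a distinct slice.
   sampler = DistributedSampler(
       dataset, num_replicas=dist.get_world_size(), rank=dist.get_rank()
   )
   loader = DataLoader(dataset, batch_size=64, sampler=sampler)

   optimizer = optim.SGD(model.parameters(), lr=0.01)
   loss_fn = nn.CrossEntropyLoss()

   for epoch in range(2):
       sampler.set_epoch(epoch)
       for x, y in loader:
           x, y = x.to(local_rank), y.to(local_rank)
           optimizer.zero_grad()
           loss_fn(model(x), y).backward()  # gradients are AllReduced by SDP
           optimizer.step()

   if dist.get_rank() == 0:               # checkpoint only from the leader
       torch.save(model.state_dict(), "/opt/ml/model/model.pt")

Because SDP performs the AllReduce of gradients during the backward pass, the
body of the training loop is the same as in a single-GPU script; the changes
are limited to initialization, device pinning, data sharding, and
rank-0-only checkpointing.
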
Use the API guides for each framework to see examples of training scripts
adapted to use SDP, and use them as a reference when converting your own
training scripts. Then, use one of the example notebooks as your template to
launch a training job. You’ll need to swap your training script with the one
that came with the notebook and modify any input functions as necessary.
Once you have launched a training job, you can monitor it using CloudWatch.

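As an illustration, the sketch below shows one way such a launch might look
with the SageMaker Python SDK's PyTorch estimator. The entry point name,
framework version, instance settings, and S3 path are placeholder values to
replace with your own.

.. code-block:: python

   # Hypothetical launch of an SDP training job with the SageMaker Python SDK.
   import sagemaker
   from sagemaker.pytorch import PyTorch

   estimator = PyTorch(
       entry_point="train.py",             # your SDP-adapted training script
       role=sagemaker.get_execution_role(),
       framework_version="1.8.1",
       py_version="py36",
       instance_count=2,                   # two instances with 8 GPUs each
       instance_type="ml.p3.16xlarge",
       # Enable the SageMaker distributed data parallel library
       distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
   )

   # fit() streams the job's CloudWatch logs to the notebook while it runs.
   estimator.fit({"train": "s3://your-bucket/path/to/training/data"})

SDP runs on multi-GPU instance types such as ``ml.p3.16xlarge``,
``ml.p3dn.24xlarge``, and ``ml.p4d.24xlarge``.
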
Then you can see how to deploy your trained model to an endpoint by
following one of the example notebooks for deploying a model. Finally,
you can follow an example notebook to test inference on your deployed
model.

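Continuing from the ``estimator`` above, a hypothetical sketch of that flow
might look like the following; the endpoint instance type and the random
sample input are placeholders.

.. code-block:: python

   import numpy as np

   # Deploy the trained model to a real-time endpoint.
   predictor = estimator.deploy(
       initial_instance_count=1,
       instance_type="ml.m5.xlarge",
   )

   # Send a sample request to the endpoint and inspect the result.
   sample_input = np.random.randn(1, 32).astype("float32")
   print(predictor.predict(sample_input))

   # Delete the endpoint when you are done to stop incurring charges.
   predictor.delete_endpoint()

Depending on the framework, your script may also need to define an inference
handler (for example, a ``model_fn`` for PyTorch) before the endpoint can
serve requests.
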
.. toctree::
   :maxdepth: 2

   smd_data_parallel_pytorch
   smd_data_parallel_tensorflow