
Commit 0c1755e

Merge pull request #123 from TEChopra1000/smp-notebook-updates
SMP example notebook updates
2 parents bd122a1 + c6a07c3 commit 0c1755e

2 files changed: +47 −38 lines changed


training/distributed_training/pytorch/model_parallel/bert/smp_bert_tutorial.ipynb

Lines changed: 38 additions & 29 deletions
Original file line number · Diff line number · Diff line change
@@ -6,32 +6,47 @@
66
"source": [
77
"# Use Amazon Sagemaker Distributed Model Parallel to Launch a BERT Training Job with Model Parallelization\n",
88
"\n",
9-
"SMP (Sagemaker Distributed Model Parallel) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SMP automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.\n",
9+
"Sagemaker distributed model parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SMP automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.\n",
1010
"\n",
1111
"Use this notebook to configure SMP to train a model using PyTorch (version 1.6.0) and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk).\n",
1212
"\n",
1313
"In this notebook, you will use a BERT example training script with SMP.\n",
14-
"The example script is based on [Nvidia Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT) and will require you to download the datasets and upload to s3 as provided in the instructions below."
14+
"The example script is based on [Nvidia Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT) and requires you to download the datasets and upload them to Amazon Simple Storage Service (Amazon S3) as explained in the instructions below. This is a large dataset, and so depending on your connection speed, this process can take hours to complete. \n",
15+
"\n",
16+
"This notebook depends on the following files. You can find all files in the [bert directory](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/pytorch/model_parallel/bert) in the model parllel section of the Amazon SageMaker Examples notebooks repo.\n",
17+
"\n",
18+
"* `bert_example/sagemaker_smp_pretrain.py`: This is an entrypoint script that is passed to the Pytorch estimator in the notebook instructions. This script is responsible for end to end training of the BERT model with SMP. The script has additional comments at places where the SMP API is used.\n",
19+
"\n",
20+
"* `bert_example/modeling.py`: This contains the model definition for the BERT model.\n",
21+
"\n",
22+
"* `bert_example/bert_config.json`: This allows for additional configuration of the model and is used by `modeling.py`. Additional configuration includes dropout probabilities, pooler and encoder sizes, number of hidden layers in the encoder, size of the intermediate layers in the encoder etc.\n",
23+
"\n",
24+
"* `bert_example/schedulers.py`: contains definitions for learning rate schedulers used in end to end training of the BERT model (`bert_example/sagemaker_smp_pretrain.py`).\n",
25+
"\n",
26+
"* `bert_example/utils.py`: This contains different helper utility functions used in end to end training of the BERT model (`bert_example/sagemaker_smp_pretrain.py`).\n",
27+
"\n",
28+
"* `bert_example/file_utils.py`: Contains different file utility functions used in model definition (`bert_example/modeling.py`).\n"
1529
]
1630
},
1731
{
1832
"cell_type": "markdown",
1933
"metadata": {},
2034
"source": [
2135
"### Additional Resources\n",
22-
"If you are a new user of Amazon SageMaker, you may find the following helpful to understand how SageMaker uses Docker to train custom models.\n",
23-
"* To learn more about using Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms\n",
24-
"](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).\n",
36+
"If you are a new user of Amazon SageMaker, you may find the following helpful to learn more about SMP and using SageMaker with Pytorch. \n",
37+
"\n",
38+
"* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).\n",
39+
"\n",
40+
"* To learn more about using the SageMaker Python SDK with Pytorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).\n",
2541
"\n",
26-
"* To learn more about using Docker to train your own models with Amazon SageMaker, see [Example Notebooks: Use Your Own Algorithm or Model](https://docs.aws.amazon.com/sagemaker/latest/dg/adv-bring-own-examples.html).\n",
27-
"* To see other examples of distributed training using Amazon SageMaker and Pytorch, see [Distributed TensorFlow training using Amazon SageMaker\n",
28-
"](https://github.com/awslabs/amazon-sagemaker-examples/tree/master/advanced_functionality/distributed_tensorflow_mask_rcnn).\n",
42+
"* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html).\n",
2943
"\n",
3044
"\n",
3145
"### Prerequisites \n",
3246
"\n",
33-
"* A S3 bucket to store the input data to be used for training.\n",
34-
"* The input data you use for training must be in an Amazon S3 bucket in the same AWS Region as this notebook instances."
47+
"1. You must create an S3 bucket to store the input data to be used for training. This bucket must must be located in the same AWS Region you use to launch your training job. This is the AWS Region you use to run this notebook. To learn how, see [Creating a bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html) in the Amazon S3 documentation.\n",
48+
"\n",
49+
"2. You must download the dataset that you use for training from [Nvidia Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT) and upload it to the S3 bucket you created. To learn more about the datasets and scripts provided to preprocess and download it, see [Getting the data](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md#getting-the-data) in the Nvidia Deep Learning Examples repo README. You can also use the [Quick Start Guide](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md#quick-start-guide) to learn how to download the dataset. The repository consists of three datasets. Optionally, you can to use the `wiki_only` parameter to only download the Uncyclopedia dataset. "
3550
]
3651
},
3752
{
@@ -40,7 +55,7 @@
4055
"source": [
4156
"## Amazon SageMaker Initialization\n",
4257
"\n",
43-
"Initialize the notebook instance. Get the aws region, sagemaker execution role"
58+
"Initialize the notebook instance. Get the AWS Region, SageMaker execution role Amazon Resource Name (ARN). "
4459
]
4560
},
4661
{
@@ -75,16 +90,11 @@
7590
"cell_type": "markdown",
7691
"metadata": {},
7792
"source": [
78-
"## Prepare/Identify your Training Data in Amazon S3"
79-
]
80-
},
81-
{
82-
"cell_type": "markdown",
83-
"metadata": {},
84-
"source": [
85-
"If you don't already have the BERT dataset in a S3 bucket, please see the instructions in [Nvidia BERT Example](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md) to download the dataset and upload it to a s3 bucket. \n",
93+
"## Prepare/Identify your Training Data in Amazon S3\n",
94+
"\n",
95+
"If you don't already have the BERT dataset in an S3 bucket, please see the instructions in [Nvidia BERT Example](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md) to download the dataset and upload it to a s3 bucket. See the prerequisites at the beginning of this notebook for more information.\n",
8696
"\n",
87-
"Uncomment and use the following cell to specify the Amazon S3 bucket and prefix that contains your training data. For example, if your training data is in s3://your-bucket/training, enter your-bucket for s3_bucket and training for prefix. Note that your output data will be stored in the same bucket, under the \"output\" prefix."
97+
"Uncomment and use the following cell to specify the Amazon S3 bucket and prefix that contains your training data. For example, if your training data is in s3://your-bucket/training, enter `'your-bucket'` for s3_bucket and `'training'` for prefix. Note that your output data will be stored in the same bucket, under the `output/` prefix."
8898
]
8999
},
90100
{
@@ -103,7 +113,7 @@
103113
"source": [
104114
"## Define SageMaker Data Channels\n",
105115
"\n",
106-
"In this step, you define Amazon SageMaker training data channel. "
116+
"In this step, you define Amazon SageMaker training data channel and output data path. The training data channel identifies where your training data is located in S3. "
107117
]
108118
},
109119
{
@@ -123,7 +133,7 @@
123133
"cell_type": "markdown",
124134
"metadata": {},
125135
"source": [
126-
"Required: Set your output data path:"
136+
"Set your output data path. This is where model artifacts are stored. "
127137
]
128138
},
129139
{
@@ -176,9 +186,9 @@
176186
"metadata": {},
177187
"outputs": [],
178188
"source": [
179-
"mpioptions = \"-verbose --mca orte_base_help_aggregate 0 \"\n",
180-
"mpioptions += \"--mca btl_vader_single_copy_mechanism none\"\n",
181-
"parameters = {\"optimize\": \"speed\", \"microbatches\": 12, \"partitions\": 2, \"ddp\": True, \"pipeline\": \"interleaved\", \"overlapping_allreduce\": True, \"placement_strategy\": \"cluster\", \"memory_weight\": 0.3}\n",
189+
"mpi_options = \"-verbose --mca orte_base_help_aggregate 0 \"\n",
190+
"mpi_options += \"--mca btl_vader_single_copy_mechanism none\"\n",
191+
"smp_parameters = {\"optimize\": \"speed\", \"microbatches\": 12, \"partitions\": 2, \"ddp\": True, \"pipeline\": \"interleaved\", \"overlapping_allreduce\": True, \"placement_strategy\": \"cluster\", \"memory_weight\": 0.3}\n",
182192
"timeout = 60 * 60\n",
183193
"metric_definitions = [{\"Name\": \"base_metric\", \"Regex\": \"<><><><><><>\"}]\n",
184194
"\n",
@@ -230,13 +240,13 @@
230240
" \"smdistributed\": {\n",
231241
" \"modelparallel\": {\n",
232242
" \"enabled\": True,\n",
233-
" \"parameters\": parameters\n",
243+
" \"parameters\": smp_parameters\n",
234244
" }\n",
235245
" },\n",
236246
" \"mpi\": {\n",
237247
" \"enabled\": True,\n",
238248
" \"process_per_host\": 8,\n",
239-
" \"custom_mpi_options\": mpioptions,\n",
249+
" \"custom_mpi_options\": mpi_options,\n",
240250
" }\n",
241251
" },\n",
242252
" source_dir='bert_example',\n",
@@ -284,5 +294,4 @@
284294
},
285295
"nbformat": 4,
286296
"nbformat_minor": 4
287-
}
288-
297+
}
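
The hunks above show the SMP parameters, the MPI options, and the `smdistributed`/`mpi` distribution dictionary, but the surrounding data-channel and estimator cells fall mostly outside the diff context. The following is a minimal sketch of how these pieces typically fit together with the SageMaker Python SDK (v2 is assumed); the bucket name, prefix, instance type and count, and Python version are placeholder assumptions, not values taken from this commit.

```python
# Hedged sketch only -- not the notebook's exact cells. Placeholders and
# assumptions are marked in comments.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()
role = sagemaker.get_execution_role()

s3_bucket = "your-bucket"   # assumption: the bucket created as a prerequisite
prefix = "training"         # assumption: prefix under which the BERT data was uploaded

# Training data channel and output path, as described in the notebook text.
data_channels = {"train": f"s3://{s3_bucket}/{prefix}"}
s3_output_location = f"s3://{s3_bucket}/output"

# Values below mirror the cells shown in the diff.
mpi_options = "-verbose --mca orte_base_help_aggregate 0 "
mpi_options += "--mca btl_vader_single_copy_mechanism none"
smp_parameters = {
    "optimize": "speed", "microbatches": 12, "partitions": 2, "ddp": True,
    "pipeline": "interleaved", "overlapping_allreduce": True,
    "placement_strategy": "cluster", "memory_weight": 0.3,
}

estimator = PyTorch(
    entry_point="sagemaker_smp_pretrain.py",  # entry point named in the file list above
    source_dir="bert_example",
    role=role,
    instance_count=1,                # assumption
    instance_type="ml.p3.16xlarge",  # assumption: an 8-GPU instance to match 8 processes per host
    framework_version="1.6.0",
    py_version="py36",               # assumption
    distribution={
        "smdistributed": {
            "modelparallel": {"enabled": True, "parameters": smp_parameters},
        },
        # Key name per current SDK docs; the notebook cell shown in the diff uses "process_per_host".
        "mpi": {"enabled": True, "processes_per_host": 8, "custom_mpi_options": mpi_options},
    },
    output_path=s3_output_location,
    sagemaker_session=session,
)

# estimator.fit(data_channels)  # starts the training job
```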

training/distributed_training/pytorch/model_parallel/mnist/pytorch_smmodelparallel_mnist.ipynb

Lines changed: 9 additions & 9 deletions
Original file line number · Diff line number · Diff line change
@@ -8,20 +8,18 @@
88
"\n",
99
"SageMaker Distributed Model Parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SageMaker Distributed Model Parallel automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.\n",
1010
"\n",
11-
"Use this notebook to configure Sagemaker Distributed Model Parallel to train a model using an example PyTorch training script and [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk). \n",
11+
"Use this notebook to configure Sagemaker Distributed Model Parallel to train a model using an example PyTorch training script, `utils/pt_mnist.py` and [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk). \n",
1212
"\n",
1313
"\n",
1414
"### Additional Resources\n",
1515
"\n",
1616
"If you are a new user of Amazon SageMaker, you may find the following helpful to learn more about SMP and using SageMaker with Pytorch. \n",
1717
"\n",
18-
"* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed\n",
19-
"](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training-model-parallel.html).\n",
18+
"* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).\n",
2019
"\n",
21-
"* To learn more about using the SageMaker Python SDK with Pytorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries)\n",
20+
"* To learn more about using the SageMaker Python SDK with Pytorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).\n",
2221
"\n",
23-
"* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms\n",
24-
"](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)."
22+
"* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)."
2523
]
2624
},
2725
{
@@ -60,7 +58,9 @@
6058
"source": [
6159
"## Prepare your training script\n",
6260
"\n",
63-
"Run the following cell to view an example-training script for PyTorch 1.6"
61+
"Run the following cell to view an example-training script you will use in this demo. This is a PyTorch 1.6 trianing script that uses the MNIST dataset. \n",
62+
"\n",
63+
"You will see that the script contains `SMP` specific operations and decorators, which configure model parallel training. See the training script comments to learn more about the SMP functions and types used in the script."
6464
]
6565
},
6666
{
@@ -166,7 +166,7 @@
166166
" \"pipeline\": \"interleaved\",\n",
167167
" \"optimize\": \"speed\",\n",
168168
" \"partitions\": 2,\n",
169-
" \"ddp\": 1,\n",
169+
" \"ddp\": True,\n",
170170
" }\n",
171171
" }\n",
172172
" },\n",
@@ -223,4 +223,4 @@
223223
},
224224
"nbformat": 4,
225225
"nbformat_minor": 4
226-
}
226+
}
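
The MNIST notebook text above says the training script contains SMP-specific operations and decorators. For orientation, here is a minimal sketch of that pattern using the `smdistributed.modelparallel.torch` API; it is not the contents of `utils/pt_mnist.py`, and the tiny network, synthetic batch, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch of the SMP training-script pattern (not utils/pt_mnist.py).
import torch
import torch.nn as nn
import torch.nn.functional as F
import smdistributed.modelparallel.torch as smp


class Net(nn.Module):
    """Tiny stand-in classifier; the real script defines its own model."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x.view(x.size(0), -1)))
        return F.log_softmax(self.fc2(x), dim=1)


@smp.step
def train_step(model, data, target):
    # Code inside @smp.step is split into microbatches and pipelined
    # across the model partitions.
    output = model(data)
    loss = F.nll_loss(output, target)
    model.backward(loss)  # with SMP, call model.backward instead of loss.backward
    return output, loss


def main():
    smp.init()  # picks up the "modelparallel" parameters passed via the estimator
    torch.cuda.set_device(smp.local_rank())

    model = smp.DistributedModel(Net().cuda())
    optimizer = smp.DistributedOptimizer(torch.optim.SGD(model.parameters(), lr=0.01))

    model.train()
    data = torch.randn(64, 1, 28, 28).cuda()      # stand-in for an MNIST batch
    target = torch.randint(0, 10, (64,)).cuda()

    optimizer.zero_grad()
    _, loss_mb = train_step(model, data, target)  # per-microbatch StepOutput objects
    loss = loss_mb.reduce_mean()                  # average the loss across microbatches
    optimizer.step()


if __name__ == "__main__":
    main()
```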
