
Commit 4509ca9

Author: Talia Chopra
Commit message: additional small typo fixes
1 parent bc206ae, commit 4509ca9

File tree: 2 files changed (+13, -20 lines)

training/distributed_training/pytorch/model_parallel/bert/smp_bert_tutorial.ipynb

Lines changed: 10 additions & 17 deletions
@@ -13,7 +13,7 @@
     "In this notebook, you will use a BERT example training script with SMP.\n",
     "The example script is based on [Nvidia Deep Learning Examples](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT) and requires you to download the datasets and upload them to Amazon Simple Storage Service (Amazon S3) as explained in the instructions below. This is a large dataset, and so depending on your connection speed, this process can take hours to complete. \n",
     "\n",
-    "This notebook depends on the following files:\n",
+    "This notebook depends on the following files. You can find all files in the [bert directory](https://github.com/aws/amazon-sagemaker-examples/tree/master/training/distributed_training/pytorch/model_parallel/bert) in the model parallel section of the Amazon SageMaker Examples notebooks repo.\n",
     "\n",
     "* `bert_example/sagemaker_smp_pretrain.py`: This is an entrypoint script that is passed to the Pytorch estimator in the notebook instructions. This script is responsible for end to end training of the BERT model with SMP. The script has additional comments at places where the SMP API is used.\n",
     "\n",
@@ -25,9 +25,7 @@
     "\n",
     "* `bert_example/utils.py`: This contains different helper utility functions used in end to end training of the BERT model (`bert_example/sagemaker_smp_pretrain.py`).\n",
     "\n",
-    "* `bert_example/file_utils.py`: Contains different file utility functions used in model definition (*bert_example/modeling.py*).\n",
-    "\n",
-    "*Getting Started*: The bert directory needs to be zipped and uploaded to a Sagemaker notebook instance. Unzip on the notebook instance and follow the instructions in the notebook.\n"
+    "* `bert_example/file_utils.py`: Contains different file utility functions used in model definition (`bert_example/modeling.py`).\n"
     ]
    },
    {
@@ -92,13 +90,8 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-    "## Prepare/Identify your Training Data in Amazon S3"
-    ]
-   },
-   {
-    "cell_type": "markdown",
-    "metadata": {},
-    "source": [
+    "## Prepare/Identify your Training Data in Amazon S3\n",
+    "\n",
     "If you don't already have the BERT dataset in an S3 bucket, please see the instructions in [Nvidia BERT Example](https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md) to download the dataset and upload it to a s3 bucket. See the prerequisites at the beginning of this notebook for more information.\n",
     "\n",
     "Uncomment and use the following cell to specify the Amazon S3 bucket and prefix that contains your training data. For example, if your training data is in s3://your-bucket/training, enter `'your-bucket'` for s3_bucket and `'training'` for prefix. Note that your output data will be stored in the same bucket, under the `output/` prefix."
@@ -193,9 +186,9 @@
     "metadata": {},
     "outputs": [],
     "source": [
-    "mpioptions = \"-verbose --mca orte_base_help_aggregate 0 \"\n",
-    "mpioptions += \"--mca btl_vader_single_copy_mechanism none\"\n",
-    "parameters = {\"optimize\": \"speed\", \"microbatches\": 12, \"partitions\": 2, \"ddp\": True, \"pipeline\": \"interleaved\", \"overlapping_allreduce\": True, \"placement_strategy\": \"cluster\", \"memory_weight\": 0.3}\n",
+    "mpi_options = \"-verbose --mca orte_base_help_aggregate 0 \"\n",
+    "mpi_options += \"--mca btl_vader_single_copy_mechanism none\"\n",
+    "smp_parameters = {\"optimize\": \"speed\", \"microbatches\": 12, \"partitions\": 2, \"ddp\": True, \"pipeline\": \"interleaved\", \"overlapping_allreduce\": True, \"placement_strategy\": \"cluster\", \"memory_weight\": 0.3}\n",
     "timeout = 60 * 60\n",
     "metric_definitions = [{\"Name\": \"base_metric\", \"Regex\": \"<><><><><><>\"}]\n",
     "\n",
@@ -235,7 +228,7 @@
     "metadata": {},
     "outputs": [],
     "source": [
-    "pytorch_estimator = PyTorch(\"sagemaker_smp_pretrain.py\",\n",
+    "pytorch_estimator = PyTorch(\"sagemaker_rbk_pretrain.py\",\n",
     " role=role,\n",
     " instance_type=\"ml.p3.16xlarge\",\n",
     " volume_size=200,\n",
@@ -247,13 +240,13 @@
     " \"smdistributed\": {\n",
     " \"modelparallel\": {\n",
     " \"enabled\": True,\n",
-    " \"parameters\": parameters\n",
+    " \"parameters\": smp_parameters\n",
     " }\n",
     " },\n",
     " \"mpi\": {\n",
     " \"enabled\": True,\n",
     " \"process_per_host\": 8,\n",
-    " \"custom_mpi_options\": mpioptions,\n",
+    " \"custom_mpi_options\": mpi_options,\n",
     " }\n",
     " },\n",
     " source_dir='bert_example',\n",

training/distributed_training/pytorch/model_parallel/mnist/pytorch_smmodelparallel_mnist.ipynb

Lines changed: 3 additions & 3 deletions
@@ -8,7 +8,7 @@
     "\n",
     "SageMaker Distributed Model Parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SageMaker Distributed Model Parallel automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.\n",
     "\n",
-    "Use this notebook to configure Sagemaker Distributed Model Parallel to train a model using an example PyTorch training script, `pt_mnist.py` and [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk). \n",
+    "Use this notebook to configure Sagemaker Distributed Model Parallel to train a model using an example PyTorch training script, `utils/pt_mnist.py` and [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk). \n",
     "\n",
     "\n",
     "### Additional Resources\n",
@@ -17,7 +17,7 @@
     "\n",
     "* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](http://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel.html).\n",
     "\n",
-    "* To learn more about using the SageMaker Python SDK with Pytorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries).\n",
+    "* To learn more about using the SageMaker Python SDK with Pytorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html).\n",
     "\n",
     "* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)."
     ]
@@ -166,7 +166,7 @@
     " \"pipeline\": \"interleaved\",\n",
     " \"optimize\": \"speed\",\n",
     " \"partitions\": 2,\n",
-    " \"ddp\": 1,\n",
+    " \"ddp\": True,\n",
     " }\n",
     " }\n",
     " },\n",
