Commit 7bf22e2
Notebook examples for SMDataParallel MaskRCNN and BERT training with PT and TF2 (#98)
* Add SMDataParallel PyTorch MaskRCNN example
* Add TF2 SMDataParallel MaskRCNN example notebooks
* Add PT SMDataParallel BERT example notebooks
* Add SMDataParallel TF2 BERT example demo
* Add config files for 1 node, 2 node, 4 node maskrcnn training
* Update batch size for PT BERT
1 parent 6365190 commit 7bf22e2

18 files changed: +2014 −0 lines

Dockerfile — Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
ARG region

FROM 763104351884.dkr.ecr.${region}.amazonaws.com/pytorch-training:1.6.0-gpu-py36-cu110-ubuntu18.04

ARG WORK_DIR="apex_build"

# Install training dependencies, then build NVIDIA Apex with its CUDA and C++
# extensions in a scratch directory that is removed afterwards to keep the image small.
RUN pip install --no-cache-dir h5py boto3 'git+https://github.com/NVIDIA/dllogger' tqdm requests; \
    mkdir -p $WORK_DIR; cd $WORK_DIR; \
    git clone https://github.com/NVIDIA/apex; cd apex; \
    python setup.py install --cuda_ext --cpp_ext; \
    cd ../..; rm -rf $WORK_DIR
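Note: the FROM image above lives in the AWS Deep Learning Containers registry (account 763104351884), which requires its own ECR login before the base image can be pulled during the build. A minimal sketch, assuming AWS CLI v2 and us-east-1 (use whichever region you pass as the build arg):

# Authenticate to the AWS DLC registry so `docker build` can pull the base image.
aws ecr get-login-password --region us-east-1 \
    | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com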
build_and_push.sh — Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
#!/usr/bin/env bash

# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.
# The arguments to this script are the AWS region, the image repository name, and the
# image tag. The repository name and tag are combined with the account and region to
# form the full ECR repository URI.

DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

if [ "$#" -eq 3 ]; then
    region=$1
    image=$2
    tag=$3
else
    echo "usage: $0 <aws-region> <image-repo> <image-tag>"
    exit 1
fi

# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)
if [ $? -ne 0 ]
then
    exit 255
fi

fullname="${account}.dkr.ecr.${region}.amazonaws.com/${image}:${tag}"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --region ${region} --repository-names "${image}" > /dev/null 2>&1
if [ $? -ne 0 ]; then
    aws ecr create-repository --region ${region} --repository-name "${image}" > /dev/null
fi

# Log in to ECR for the current account, then build the Docker image locally with the
# image name and push it to ECR under the full name.
aws ecr get-login-password --region ${region} | docker login --username AWS --password-stdin ${account}.dkr.ecr.${region}.amazonaws.com

docker build ${DIR}/ -t ${image} -f ${DIR}/Dockerfile --build-arg region=${region}
docker tag ${image} ${fullname}
docker push ${fullname}
if [ $? -eq 0 ]; then
    echo "Amazon ECR URI: ${fullname}"
else
    echo "Error: Image build and push failed"
    exit 1
fi
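For reference, the notebook in this commit invokes the script as `bash build_and_push.sh {region} {image} {tag}`. A concrete invocation, using the example repository name and tag suggested in the notebook (all three values are illustrative):

bash build_and_push.sh us-east-1 bert-smdataparallel-sagemaker pt1.6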
BERT training notebook (.ipynb) — Lines changed: 303 additions & 0 deletions
@@ -0,0 +1,303 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Distributed data parallel BERT model training with PyTorch and SMDataParallel\n",
    "\n",
    "SMDataParallel is a new capability in Amazon SageMaker to train deep learning models faster and more cheaply. SMDataParallel is a distributed data parallel training framework for PyTorch, TensorFlow, and MXNet.\n",
    "\n",
    "This notebook example shows how to use SMDataParallel with PyTorch (version 1.6.0) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train a BERT model using an [Amazon FSx for Lustre file system](https://aws.amazon.com/fsx/lustre/) as the data source.\n",
    "\n",
    "\n",
    "The outline of steps is as follows:\n",
    "\n",
    "1. Stage the dataset in [Amazon S3](https://aws.amazon.com/s3/). The original dataset for BERT pretraining consists of text passages from BooksCorpus (800M words) (Zhu et al. 2015) and English Wikipedia (2,500M words). Please follow the original guidelines from NVIDIA to prepare the training data in hdf5 format - \n",
    "https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/LanguageModeling/BERT/README.md#getting-the-data\n",
    "2. Create an Amazon FSx for Lustre file system and import data into it from S3\n",
    "3. Build the Docker training image and push it to [Amazon ECR](https://aws.amazon.com/ecr/)\n",
    "4. Configure data input channels for SageMaker\n",
    "5. Configure hyperparameters\n",
    "6. Define training metrics\n",
    "7. Define the training job, set the distribution strategy to SMDataParallel, and start training\n",
    "\n",
    "**NOTE:** With a large training dataset, we recommend using [Amazon FSx](https://aws.amazon.com/fsx/) as the input file system for the SageMaker training job. FSx input significantly cuts down training startup time because it avoids re-downloading the training data each time a job starts (as happens with S3 input) and provides good data read throughput.\n",
    "\n",
    "\n",
    "**NOTE:** This example requires SageMaker Python SDK v2.X.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Amazon SageMaker Initialization\n",
    "\n",
    "Initialize the notebook instance. Get the AWS Region and the SageMaker execution role.\n",
    "\n",
    "The IAM role ARN is used to give training and hosting access to your data. See [Amazon SageMaker Roles](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, please replace the `sagemaker.get_execution_role()` call with the appropriate full IAM role ARN string(s). As described above, since we will be using FSx, please make sure to attach the `FSx Access` permission to this IAM role."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "! python3 -m pip install --upgrade sagemaker\n",
    "import sagemaker\n",
    "from sagemaker import get_execution_role\n",
    "from sagemaker.estimator import Estimator\n",
    "import boto3\n",
    "\n",
    "sagemaker_session = sagemaker.Session()\n",
    "bucket = sagemaker_session.default_bucket()\n",
    "\n",
    "role = get_execution_role()  # provide a pre-existing role ARN as an alternative to creating a new role\n",
    "print(f'SageMaker Execution Role: {role}')\n",
    "\n",
    "client = boto3.client('sts')\n",
    "account = client.get_caller_identity()['Account']\n",
    "print(f'AWS account: {account}')\n",
    "\n",
    "session = boto3.session.Session()\n",
    "region = session.region_name\n",
    "print(f'AWS region: {region}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare SageMaker Training Images\n",
    "\n",
    "1. By default, SageMaker uses the latest [Amazon Deep Learning Container Images (DLC)](https://github.com/aws/deep-learning-containers/blob/master/available_images.md) PyTorch training image. In this step, we use it as a base image and install the additional dependencies required for training the BERT model.\n",
    "2. In the GitHub repository https://github.com/HerringForks/DeepLearningExamples.git we have made the PyTorch-SMDataParallel BERT training script available for your use. This repository will be cloned into the training image for running the model training.\n",
    "\n",
    "### Build and Push Docker Image to ECR\n",
    "\n",
    "Run the command below to build the Docker image and push it to ECR."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "image = \"<ADD NAME OF REPO>\"  # Example: bert-smdataparallel-sagemaker\n",
    "tag = \"<ADD TAG FOR IMAGE>\"  # Example: pt1.6"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize ./Dockerfile"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize ./build_and_push.sh"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "! chmod +x build_and_push.sh; bash build_and_push.sh {region} {image} {tag}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training script\n",
    "\n",
    "In the GitHub repository https://github.com/HerringForks/DeepLearningExamples.git we have made the PyTorch-SMDataParallel BERT training script available for your use. Clone the repository."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!rm -rf DeepLearningExamples\n",
    "!git clone https://github.com/HerringForks/DeepLearningExamples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Configure hyperparameters for your training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize train.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preparing FSx Input for SageMaker\n",
    "\n",
    "1. Download and prepare your training dataset on S3.\n",
    "2. Follow the steps listed here to create an FSx file system linked with the S3 bucket containing your training data - https://docs.aws.amazon.com/fsx/latest/LustreGuide/create-fs-linked-data-repo.html. Make sure to add an endpoint to your VPC allowing S3 access.\n",
    "3. Follow the steps listed here to configure your SageMaker training job to use FSx - https://aws.amazon.com/blogs/machine-learning/speed-up-training-on-amazon-sagemaker-using-amazon-efs-or-amazon-fsx-for-lustre-file-systems/\n",
    "\n",
    "\n",
    "### Important Caveats\n",
    "\n",
    "1. You need to use the same `subnet`, `vpc`, and `security group` used with FSx when launching the SageMaker notebook instance. The same configurations will be used by your SageMaker training job.\n",
    "2. Make sure you set appropriate inbound/outbound rules in the `security group`. Specifically, opening up these ports is necessary for SageMaker to access the FSx file system in the training job - https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html\n",
    "3. Make sure the `SageMaker IAM Role` used to launch this SageMaker training job has access to `AmazonFSx`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.inputs import FileSystemInput\n",
    "\n",
    "subnets = ['<SUBNET_ID>']  # Should be the same subnet used for FSx. Example: subnet-01aXXXX\n",
    "security_group_ids = ['<SECURITY_GROUP_ID>']  # Should be the same security group used for FSx. Example: sg-075ZZZZZZ\n",
    "file_system_id = '<FSX_ID>'  # FSx file system ID with your training dataset. Example: 'fs-0bYYYYYY'\n",
    "\n",
    "file_system_directory_path = '<YOUR_MOUNT_PATH_FOR_TRAINING_DATA>'  # NOTE: '/fsx/' will be the root mount path. Example: '/fsx/bert/pt/phase1'\n",
    "file_system_access_mode = \"ro\"\n",
    "\n",
    "file_system_type = 'FSxLustre'\n",
    "\n",
    "train_fs = FileSystemInput(file_system_id=file_system_id,\n",
    "                           file_system_type=file_system_type,\n",
    "                           directory_path=file_system_directory_path,\n",
    "                           file_system_access_mode=file_system_access_mode)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## SageMaker PyTorch Estimator function options\n",
    "\n",
    "In the following code block, you can update the estimator function to use a different instance type, instance count, and distribution strategy. You're also passing in the training script you reviewed in the previous cell.\n",
    "\n",
    "**Instance types**\n",
    "\n",
    "SMDataParallel supports model training on SageMaker with the following instance types only:\n",
    "1. ml.p3.16xlarge\n",
    "1. ml.p3dn.24xlarge [Recommended]\n",
    "1. ml.p4d.24xlarge [Recommended]\n",
    "\n",
    "**Instance count**\n",
    "\n",
    "To get the best performance and the most out of SMDataParallel, you should use at least 2 instances, but you can also use 1 for testing this example.\n",
    "\n",
    "**Distribution strategy**\n",
    "\n",
    "Note that to use DDP mode, you update the `distribution` strategy and set it to use `smdistributed dataparallel`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.pytorch import PyTorch\n",
    "\n",
    "docker_image = f\"{account}.dkr.ecr.{region}.amazonaws.com/{image}:{tag}\"  # YOUR_ECR_IMAGE_BUILT_WITH_ABOVE_DOCKER_FILE\n",
    "instance_type = \"ml.p3dn.24xlarge\"  # Other supported instance types: ml.p3.16xlarge, ml.p4d.24xlarge\n",
    "instance_count = 2  # You can use 2, 4, 8, etc.\n",
    "# This job name is used as a prefix for the SageMaker training job, making it easy to find your training job in the SageMaker console.\n",
    "job_name = 'pt-bert-smdataparallel-N%d-%s' % (instance_count, instance_type.split(\".\")[1])\n",
    "print(\"Job name: \", job_name)\n",
    "\n",
    "estimator = PyTorch(base_job_name=job_name,\n",
    "                    source_dir=\".\",\n",
    "                    entry_point=\"train.sh\",\n",
    "                    role=role,\n",
    "                    image_uri=docker_image,\n",
    "                    framework_version='1.6.0',\n",
    "                    instance_count=instance_count,  # SDK v2 name (was train_instance_count in v1)\n",
    "                    instance_type=instance_type,  # SDK v2 name (was train_instance_type in v1)\n",
    "                    sagemaker_session=sagemaker_session,\n",
    "                    subnets=subnets,\n",
    "                    security_group_ids=security_group_ids,\n",
    "                    debugger_hook_config=False,\n",
    "                    # Training using the SMDataParallel distributed training framework\n",
    "                    distribution={'smdistributed': {'dataparallel': {'enabled': True}}})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pygmentize train.sh\n",
    "\n",
    "estimator.fit(train_fs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model_data = estimator.model_data\n",
    "print(\"Storing {} as model_data\".format(model_data))\n",
    "%store model_data"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "conda_pytorch_p36",
   "language": "python",
   "name": "conda_pytorch_p36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
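The `distribution={'smdistributed': {'dataparallel': {'enabled': True}}}` setting above only launches the job; the training script itself must call the SMDataParallel PyTorch API. That code lives in the cloned HerringForks/DeepLearningExamples fork and is not part of this commit, so the following is only a minimal sketch of the pattern from the `smdistributed.dataparallel` documentation, with a toy `nn.Linear` standing in for the BERT model:

import torch
import torch.nn as nn
import smdistributed.dataparallel.torch.distributed as dist
from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP

# Initialize the SMDataParallel process group and pin this process to its GPU.
dist.init_process_group()
torch.cuda.set_device(dist.get_local_rank())

# Wrap the model so gradients are all-reduced across every GPU in the job.
model = DDP(nn.Linear(10, 2).cuda())  # toy stand-in for the BERT model

# By convention, only rank 0 saves checkpoints and prints summaries.
if dist.get_rank() == 0:
    print("world size:", dist.get_world_size())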
train.sh — Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
#!/usr/bin/env bash

# Disable HDF5 file locking, since the training data is read from a shared FSx mount.
export HDF5_USE_FILE_LOCKING=FALSE

python ./DeepLearningExamples/PyTorch/LanguageModeling/BERT/run_pretraining.py \
    --input_dir=$SM_CHANNEL_TRAINING \
    --output_dir=$SM_MODEL_DIR \
    --config_file=./DeepLearningExamples/PyTorch/LanguageModeling/BERT/bert_config.json \
    --bert_model=bert-large-uncased \
    --train_batch_size=64 \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --max_steps=900864 \
    --warmup_proportion=0.2843 \
    --num_steps_per_checkpoint=900864 \
    --learning_rate=1.12e-3 \
    --seed=42 \
    --fp16 \
    --do_train \
    --json-summary ./DeepLearningExamples/PyTorch/LanguageModeling/BERT/results/dllogger.json 2>&1
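train.sh relies on environment variables that SageMaker injects into the training container: `SM_CHANNEL_TRAINING` points at the mounted data channel (here, the FSx directory configured in the notebook) and `SM_MODEL_DIR` is where SageMaker collects model artifacts. For a quick smoke test outside SageMaker you would have to supply stand-in values yourself; the paths below are illustrative only:

# Hypothetical stand-ins for values SageMaker provides at runtime.
export SM_CHANNEL_TRAINING=/fsx/bert/pt/phase1  # training data location
export SM_MODEL_DIR=/opt/ml/model               # model artifact directory
bash train.sh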
