Commit d6d721b

Add sm model parallel MNIST examples for tf and pt (#104)
1 parent 5f50b59 commit d6d721b

5 files changed: +1139, -0 lines changed

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Use SageMaker Distributed Model Parallel with Amazon SageMaker to Launch a Training Job with Model Parallelism\n",
"\n",
"SageMaker Distributed Model Parallel (SMP) is a model parallelism library for training large deep learning models that were previously difficult to train due to GPU memory limitations. SMP automatically and efficiently splits a model across multiple GPUs and instances and coordinates model training, allowing you to increase prediction accuracy by creating larger models with more parameters.\n",
"\n",
"Use this notebook to configure SageMaker Distributed Model Parallel to train a model using an example PyTorch training script and the [Amazon SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/overview.html#train-a-model-with-the-sagemaker-python-sdk).\n",
"\n",
"\n",
"### Additional Resources\n",
"\n",
"If you are a new user of Amazon SageMaker, you may find the following helpful for learning more about SMP and using SageMaker with PyTorch.\n",
"\n",
"* To learn more about the SageMaker model parallelism library, see [Model Parallel Distributed Training with SageMaker Distributed](https://docs.aws.amazon.com/sagemaker/latest/dg/distributed-training-model-parallel.html).\n",
"\n",
"* To learn more about using the SageMaker Python SDK with PyTorch, see [Using PyTorch with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/pytorch/using_pytorch.html#using-third-party-libraries).\n",
"\n",
"* To learn more about launching a training job in Amazon SageMaker with your own training image, see [Use Your Own Training Algorithms](https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Amazon SageMaker Initialization\n",
"\n",
"Run the following cell to initialize the notebook instance and get the SageMaker execution role used to run this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"import sagemaker\n",
"from sagemaker import get_execution_role\n",
"from sagemaker.pytorch import PyTorch\n",
"from smexperiments.experiment import Experiment\n",
"from smexperiments.trial import Trial\n",
"import boto3\n",
"from time import gmtime, strftime\n",
"\n",
"role = get_execution_role()  # provide a pre-existing role ARN as an alternative to creating a new role\n",
"print(f'SageMaker Execution Role: {role}')\n",
"\n",
"session = boto3.session.Session()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare your training script\n",
"\n",
"Run the following cell to view an example training script for PyTorch 1.6."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run this cell to see an example of a training script that you can use to configure\n",
"# SageMaker Distributed Model Parallel with PyTorch version 1.6\n",
"!cat utils/pt_mnist.py"
]
},
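{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before defining the training job, it helps to see the core pattern that `pt_mnist.py` follows. The cell below is an abridged sketch for illustration only and is not meant to be run in this notebook (it only works inside the SMP training container): initialize the library with `smp.init()`, wrap the model and optimizer with `smp.DistributedModel` and `smp.DistributedOptimizer`, and decorate the combined forward and backward pass with `@smp.step` so SMP can pipeline microbatches across the model partitions. Note that `model.backward(loss)` replaces the usual `loss.backward()`, and the decorated function returns per-microbatch outputs that are aggregated with methods such as `reduce_mean()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Abridged sketch (for illustration) of the SMP training pattern used in utils/pt_mnist.py\n",
"import torch.nn.functional as F\n",
"import smdistributed.modelparallel.torch as smp\n",
"\n",
"smp.init()  # picks up the \"parameters\" dict passed via the estimator's distribution argument\n",
"\n",
"@smp.step\n",
"def train_step(model, data, target):\n",
"    output = model(data)\n",
"    loss = F.nll_loss(output, target)\n",
"    model.backward(loss)  # SMP replaces loss.backward()\n",
"    return output, loss\n",
"\n",
"# In the training loop (see the full script for the model/optimizer definitions):\n",
"#   model = smp.DistributedModel(net)          # partitions the model across GPUs\n",
"#   optimizer = smp.DistributedOptimizer(opt)  # wraps the plain optimizer\n",
"#   _, loss_mb = train_step(model, data, target)\n",
"#   loss = loss_mb.reduce_mean()               # average over the microbatches"
]
},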
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Define the SageMaker Training Job\n",
"\n",
"Next, you will use the SageMaker Estimator API to define a SageMaker training job. You will use an [`Estimator`](https://sagemaker.readthedocs.io/en/stable/api/training/estimators.html) to define the number and type of EC2 instances Amazon SageMaker uses for training, as well as the size of the volume attached to those instances.\n",
"\n",
"You can update the following:\n",
"* `processes_per_host`\n",
"* `entry_point`\n",
"* `instance_count`\n",
"* `instance_type`\n",
"* `base_job_name`\n",
"\n",
"In addition, you can supply and modify configuration parameters for the SageMaker Distributed Model Parallel library. These parameters are passed in through the `distribution` argument, as shown below.\n",
"\n",
"### Update the Type and Number of EC2 Instances Used\n",
"\n",
"Specify `processes_per_host`. Note that it must be a multiple of the number of partitions, which defaults to 2.\n",
"\n",
"The instance type and number of instances you specify in `instance_type` and `instance_count`, respectively, determine the number of GPUs Amazon SageMaker uses during training: `instance_type` determines the number of GPUs on a single instance, and that number is multiplied by `instance_count`. For example, `ml.p3.16xlarge` has 8 GPUs, so `instance_count=1` provides 8 GPUs in total.\n",
"\n",
"You must specify values for `instance_type` and `instance_count` so that the total number of GPUs available for training equals `partitions` in the `config` of `smp.init` in your training script.\n",
"\n",
"\n",
"To look up instance types, see [Amazon EC2 Instance Types](https://aws.amazon.com/sagemaker/pricing/).\n",
"\n",
"\n",
"### Uploading Checkpoints During Training or Resuming from a Previous Training Run\n",
"We also provide a custom way to upload checkpoints during training and to resume from checkpoints saved by a previous training run. This is integrated into the `pt_mnist.py` example script; see the functions `aws_s3_sync`, `sync_local_checkpoints_to_s3`, and `sync_s3_checkpoints_to_local`. For the purpose of this example, we only upload a checkpoint during training, using `sync_local_checkpoints_to_s3`. A minimal sketch of this pattern follows below.\n"
]
},
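{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an illustration of this checkpoint-sync approach, the cell below gives a minimal sketch of what a local-to-S3 sync helper can look like. The bucket path is a placeholder, and the actual helpers in `utils/pt_mnist.py` may differ in detail."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch of a local-to-S3 checkpoint sync helper (placeholder paths;\n",
"# see utils/pt_mnist.py for the actual helpers used by the training script)\n",
"import subprocess\n",
"\n",
"\n",
"def aws_s3_sync(source, destination):\n",
"    \"\"\"Sync a local directory with an S3 URI (either direction) using the AWS CLI.\"\"\"\n",
"    subprocess.run([\"aws\", \"s3\", \"sync\", source, destination], check=True)\n",
"\n",
"\n",
"# Example: upload checkpoints written under /opt/ml/checkpoints during training\n",
"# aws_s3_sync(\"/opt/ml/checkpoints\", \"s3://<your-bucket>/smp-demo/checkpoints\")"
]
},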
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After you have updated `entry_point`, `instance_count`, `instance_type`, and `base_job_name`, run the following to create an estimator."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sagemaker_session = sagemaker.session.Session(boto_session=session)\n",
"mpioptions = \"-verbose -x orte_base_help_aggregate=0 \"\n",
"mpioptions += \"--mca btl_vader_single_copy_mechanism none \"\n",
"\n",
"all_experiment_names = [exp.experiment_name for exp in Experiment.list()]\n",
"\n",
"# Choose an experiment name (it only needs to be created once)\n",
"experiment_name = \"SM-MP-DEMO\"\n",
"\n",
"# Load the experiment if it exists, otherwise create it\n",
"if experiment_name not in all_experiment_names:\n",
"    mnist_experiment = Experiment.create(\n",
"        experiment_name=experiment_name, sagemaker_boto_client=boto3.client(\"sagemaker\")\n",
"    )\n",
"else:\n",
"    mnist_experiment = Experiment.load(\n",
"        experiment_name=experiment_name, sagemaker_boto_client=boto3.client(\"sagemaker\")\n",
"    )\n",
"\n",
"# Create a trial for the current run\n",
"trial = Trial.create(\n",
"    trial_name=\"SMD-MP-demo-{}\".format(strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())),\n",
"    experiment_name=mnist_experiment.experiment_name,\n",
"    sagemaker_boto_client=boto3.client(\"sagemaker\"),\n",
")\n",
"\n",
"\n",
"smd_mp_estimator = PyTorch(\n",
"    entry_point=\"pt_mnist.py\",  # pick your training script\n",
"    source_dir=\"utils\",\n",
"    role=role,\n",
"    instance_type=\"ml.p3.16xlarge\",\n",
"    sagemaker_session=sagemaker_session,\n",
"    framework_version=\"1.6.0\",\n",
"    py_version=\"py3\",\n",
"    instance_count=1,\n",
"    distribution={\n",
"        \"smdistributed\": {\n",
"            \"modelparallel\": {\n",
"                \"enabled\": True,\n",
"                \"parameters\": {\n",
"                    \"microbatches\": 4,\n",
"                    \"placement_strategy\": \"spread\",\n",
"                    \"pipeline\": \"interleaved\",\n",
"                    \"optimize\": \"speed\",\n",
"                    \"partitions\": 2,\n",
"                    \"ddp\": 1,\n",
"                }\n",
"            }\n",
"        },\n",
"        \"mpi\": {\n",
"            \"enabled\": True,\n",
"            \"processes_per_host\": 2,  # pick your processes_per_host\n",
"            \"custom_mpi_options\": mpioptions\n",
"        },\n",
"    },\n",
"    base_job_name=\"SMD-MP-demo\",\n",
")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, use the estimator to launch the SageMaker training job."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"smd_mp_estimator.fit(\n",
"    experiment_config={\n",
"        \"ExperimentName\": mnist_experiment.experiment_name,\n",
"        \"TrialName\": trial.trial_name,\n",
"        \"TrialComponentDisplayName\": \"Training\",\n",
"    }\n",
")"
]
},
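{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once the training job completes, SageMaker uploads the packed model artifact to S3. As a small convenience (assuming the job above ran to completion), the cell below prints where to find it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# S3 URI of the model artifact produced by the completed training job\n",
"print(smd_mp_estimator.model_data)"
]
}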
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
