
Commit 6d2c0a7

Merge pull request aws#291 from awslabs/pytorch
Add SageMaker PyTorch notebook examples.
2 parents 898fb73 + 122ae25 commit 6d2c0a7


14 files changed: +2091 -0 lines changed

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Lines changed: 296 additions & 0 deletions
@@ -0,0 +1,296 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PyTorch Cifar10 local training\n",
    "\n",
    "## Pre-requisites\n",
    "\n",
    "This notebook shows how to use the SageMaker Python SDK to run your code in a local container before deploying to SageMaker's managed training or hosting environments. This can speed up iterative testing and debugging while using the same familiar Python SDK interface. Just change your estimator's `train_instance_type` to `local` (or `local_gpu` if you're using an ml.p2 or ml.p3 notebook instance).\n",
    "\n",
    "In order to use this feature you'll need to install docker-compose (and nvidia-docker if training with a GPU).\n",
    "\n",
    "**Note, you can only run a single local notebook at one time.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!/bin/bash ./setup.sh"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "The **SageMaker Python SDK** helps you deploy your models for training and hosting in optimized, production-ready containers in SageMaker. The SageMaker Python SDK is easy to use, modular, extensible, and compatible with TensorFlow, MXNet, PyTorch, and Chainer. This tutorial focuses on how to create a convolutional neural network model and train it on the [Cifar10 dataset](https://www.cs.toronto.edu/~kriz/cifar.html) using **PyTorch in local mode**.\n",
    "\n",
    "### Set up the environment\n",
    "\n",
    "This notebook was created and tested on a single ml.p2.xlarge notebook instance.\n",
    "\n",
    "Let's start by specifying:\n",
    "\n",
    "- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the Notebook Instance, training, and hosting.\n",
    "- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note: if more than one role is required for notebook instances, training, and/or hosting, please replace `sagemaker.get_execution_role()` with the appropriate full IAM role ARN string(s)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sagemaker\n",
    "\n",
    "sagemaker_session = sagemaker.Session()\n",
    "\n",
    "bucket = sagemaker_session.default_bucket()\n",
    "prefix = 'sagemaker/DEMO-pytorch-cnn-cifar10'\n",
    "\n",
    "role = sagemaker.get_execution_role()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import subprocess\n",
    "\n",
    "instance_type = 'local'\n",
    "\n",
    "if subprocess.call('nvidia-smi') == 0:\n",
    "    ## Set type to GPU if one is present\n",
    "    instance_type = 'local_gpu'\n",
    "\n",
    "print(\"Instance type = \" + instance_type)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download the Cifar10 dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from utils_cifar import get_train_data_loader, get_test_data_loader, imshow, classes\n",
    "\n",
    "trainloader = get_train_data_loader()\n",
    "testloader = get_test_data_loader()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Preview"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import torchvision, torch\n",
    "\n",
    "# get some random training images\n",
    "dataiter = iter(trainloader)\n",
    "images, labels = dataiter.next()\n",
    "\n",
    "# show images\n",
    "imshow(torchvision.utils.make_grid(images))\n",
    "\n",
    "# print labels\n",
    "print(' '.join('%9s' % classes[labels[j]] for j in range(4)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Upload the data\n",
    "We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value, `inputs`, identifies the location -- we will use this later when we start the training job."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "inputs = sagemaker_session.upload_data(path='data', bucket=bucket, key_prefix='data/cifar10')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Construct a script for training\n",
    "Here is the full code for the network model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "!pygmentize source/cifar10.py"
   ]
  },
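  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The cell above only displays `source/cifar10.py`; the script itself is not embedded in this notebook. For orientation, the sketch below shows the kind of convolutional network such a CIFAR-10 script typically defines. It is illustrative only -- the class name `Net` and the layer sizes are assumptions, not necessarily what `source/cifar10.py` actually contains.\n",
    "\n",
    "```python\n",
    "# Illustrative sketch -- see the pygmentize output above for the real script.\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "class Net(nn.Module):\n",
    "    def __init__(self):\n",
    "        super(Net, self).__init__()\n",
    "        self.conv1 = nn.Conv2d(3, 6, 5)    # 3x32x32 input -> 6x28x28\n",
    "        self.pool = nn.MaxPool2d(2, 2)\n",
    "        self.conv2 = nn.Conv2d(6, 16, 5)   # 6x14x14 -> 16x10x10\n",
    "        self.fc1 = nn.Linear(16 * 5 * 5, 120)\n",
    "        self.fc2 = nn.Linear(120, 84)\n",
    "        self.fc3 = nn.Linear(84, 10)       # 10 CIFAR-10 classes\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = self.pool(F.relu(self.conv1(x)))\n",
    "        x = self.pool(F.relu(self.conv2(x)))\n",
    "        x = x.view(-1, 16 * 5 * 5)\n",
    "        x = F.relu(self.fc1(x))\n",
    "        x = F.relu(self.fc2(x))\n",
    "        return self.fc3(x)\n",
    "```"
   ]
  },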
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Script Functions\n",
    "\n",
    "SageMaker invokes the main function defined within your training script for training. When deploying your trained model to an endpoint, the model_fn() is called to determine how to load your trained model. The model_fn(), along with a few other functions listed below, is called to enable predictions on SageMaker.\n",
    "\n",
    "### [Predicting Functions](https://github.com/aws/sagemaker-pytorch-containers/blob/master/src/sagemaker_pytorch_container/serving.py)\n",
    "* model_fn(model_dir) - loads your model.\n",
    "* input_fn(serialized_input_data, content_type) - deserializes the request data and passes it to predict_fn.\n",
    "* output_fn(prediction_output, accept) - serializes predictions from predict_fn.\n",
    "* predict_fn(input_data, model) - calls the model on the data deserialized in input_fn.\n",
    "\n",
    "The model_fn() is the only function that doesn't have a default implementation; you must provide it when using PyTorch on SageMaker."
   ]
  },
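  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, a minimal `model_fn` could look like the sketch below. It assumes the training script saves the network's `state_dict` to a file named `model.pth` under `model_dir` and that a `Net` class is defined in the same script; the actual `source/cifar10.py` may use different names.\n",
    "\n",
    "```python\n",
    "# Minimal sketch of the one serving function you must provide.\n",
    "import os\n",
    "import torch\n",
    "\n",
    "def model_fn(model_dir):\n",
    "    # Rebuild the network and load the weights written during training.\n",
    "    model = Net()\n",
    "    with open(os.path.join(model_dir, 'model.pth'), 'rb') as f:\n",
    "        model.load_state_dict(torch.load(f))\n",
    "    return model\n",
    "```"
   ]
  },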
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create a training job using the sagemaker.PyTorch estimator\n",
    "\n",
    "The `PyTorch` class allows us to run our training function on SageMaker. We need to configure it with our training script, an IAM role, the number of training instances, and the training instance type. For local training with a GPU, we could set this to \"local_gpu\". In this case, `instance_type` was set above based on whether you're running on a GPU instance.\n",
    "\n",
    "After we've constructed our `PyTorch` object, we fit it using the data we uploaded to S3. Even though we're in local mode, using S3 as our data source makes sense because it maintains consistency with how SageMaker's distributed, managed training ingests data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.pytorch import PyTorch\n",
    "\n",
    "cifar10_estimator = PyTorch(entry_point=\"source/cifar10.py\",\n",
    "                            role=role,\n",
    "                            train_instance_count=1,\n",
    "                            train_instance_type=instance_type)\n",
    "\n",
    "cifar10_estimator.fit(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deploy the trained model to prepare for predictions\n",
    "\n",
    "The deploy() method creates an endpoint (in this case locally) which serves prediction requests in real-time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sagemaker.pytorch import PyTorchModel\n",
    "\n",
    "cifar10_predictor = cifar10_estimator.deploy(initial_instance_count=1,\n",
    "                                             instance_type=instance_type)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Invoking the endpoint"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get some test images\n",
    "dataiter = iter(testloader)\n",
    "images, labels = dataiter.next()\n",
    "\n",
    "# print images\n",
    "imshow(torchvision.utils.make_grid(images))\n",
    "print('GroundTruth: ', ' '.join('%4s' % classes[labels[j]] for j in range(4)))\n",
    "\n",
    "outputs = cifar10_predictor.predict(images.numpy())\n",
    "\n",
    "_, predicted = torch.max(torch.from_numpy(np.array(outputs)), 1)\n",
    "\n",
    "print('Predicted: ', ' '.join('%4s' % classes[predicted[j]]\n",
    "                              for j in range(4)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clean-up\n",
    "\n",
    "Deleting the local endpoint when you're finished is important since you can only run one local endpoint at a time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cifar10_estimator.delete_endpoint()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Environment (conda_pytorch_p27)",
   "language": "python",
   "name": "conda_pytorch_p27"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  },
  "notice": "Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
#!/bin/bash

# Do we have GPU support?
nvidia-smi > /dev/null 2>&1
if [ $? -eq 0 ]; then
    # check if we have nvidia-docker
    NVIDIA_DOCKER=`rpm -qa | grep -c nvidia-docker2`
    if [ $NVIDIA_DOCKER -eq 0 ]; then
        # Install nvidia-docker2
        #sudo pkill -SIGHUP dockerd
        sudo yum -y remove docker
        sudo yum -y install docker-17.09.1ce-1.111.amzn1

        sudo /etc/init.d/docker start

        curl -s -L https://nvidia.github.io/nvidia-docker/amzn1/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
        sudo yum install -y nvidia-docker2-2.0.3-1.docker17.09.1.ce.amzn1
        sudo cp daemon.json /etc/docker/daemon.json
        sudo pkill -SIGHUP dockerd
        echo "installed nvidia-docker2"
    else
        echo "nvidia-docker2 already installed. We are good to go!"
    fi
fi

# This is common for both GPU and CPU instances

# check if we have docker-compose
docker-compose version >/dev/null 2>&1
if [ $? -ne 0 ]; then
    # install docker compose
    pip install docker-compose
fi

# check if we need to configure our docker interface
SAGEMAKER_NETWORK=`docker network ls | grep -c sagemaker-local`
if [ $SAGEMAKER_NETWORK -eq 0 ]; then
    docker network create --driver bridge sagemaker-local
fi

# Notebook instance Docker networking fixes
RUNNING_ON_NOTEBOOK_INSTANCE=`sudo iptables -S OUTPUT -t nat | grep -c 169.254.0.2`

# Get the Docker Network CIDR and IP for the sagemaker-local docker interface.
SAGEMAKER_INTERFACE=br-`docker network ls | grep sagemaker-local | cut -d' ' -f1`
DOCKER_NET=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f1`
DOCKER_IP=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f12`

# check if both IPTables and the Route Table are OK.
IPTABLES_PATCHED=`sudo iptables -S PREROUTING -t nat | grep -c 169.254.0.2`
ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE`

if [ $RUNNING_ON_NOTEBOOK_INSTANCE -gt 0 ]; then

    if [ $ROUTE_TABLE_PATCHED -eq 0 ]; then
        # fix routing
        sudo ip route add $DOCKER_NET via $DOCKER_IP dev $SAGEMAKER_INTERFACE table agent
    else
        echo "SageMaker instance route table setup is ok. We are good to go."
    fi

    if [ $IPTABLES_PATCHED -eq 0 ]; then
        sudo iptables -t nat -A PREROUTING -i $SAGEMAKER_INTERFACE -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.0.2:9081
        echo "iptables for Docker setup done"
    else
        echo "SageMaker instance routing for Docker is ok. We are good to go!"
    fi
fi
