Commit 988bc3c

Merge pull request aws#238 from awslabs/arpin_cifar_local_mode
Added: MXNet Gluon CIFAR-10 local mode example
2 parents 8b1dd26 + e58ca73 commit 988bc3c

File tree

3 files changed: +362 additions, -0 deletions
Lines changed: 284 additions & 0 deletions
@@ -0,0 +1,284 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gluon CIFAR-10 Trained in Local Mode\n",
"_**ResNet model in Gluon trained locally in a notebook instance**_\n",
"\n",
"---\n",
"\n",
"---\n",
"\n",
"_This notebook was created and tested on an ml.p3.8xlarge notebook instance._\n",
"\n",
"## Setup\n",
"\n",
"Import libraries and set the IAM role ARN."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"from sagemaker.mxnet import MXNet\n",
"\n",
"sagemaker_session = sagemaker.Session()\n",
"role = sagemaker.get_execution_role()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install prerequisites for local training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!/bin/bash setup.sh"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Data\n",
"\n",
"We use the helper scripts to download the CIFAR-10 training data and sample images."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from cifar10_utils import download_training_data\n",
"download_training_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies the location -- we will use it later when we start the training job.\n",
"\n",
"Even though we are training within our notebook instance, we'll continue to use the S3 data location, since it allows us to easily transition to training in SageMaker's managed environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-gluon-cifar10')\n",
"print('input spec (in this case, just an S3 path): {}'.format(inputs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Script\n",
"\n",
"We need to provide a training script that can run on the SageMaker platform. When SageMaker calls your function, it passes in arguments that describe the training environment. Check the script below to see how this works.\n",
"\n",
"The network itself is a pre-built version contained in the [Gluon Model Zoo](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/model_zoo.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!cat 'cifar10.py'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Train (Local Mode)\n",
"\n",
"The `MXNet` estimator creates our training job. To switch from training in SageMaker's managed environment to training within a notebook instance, just set `train_instance_type` to `local_gpu`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m = MXNet('cifar10.py',\n",
"          role=role,\n",
"          train_instance_count=1,\n",
"          train_instance_type='local_gpu',\n",
"          hyperparameters={'batch_size': 1024,\n",
"                           'epochs': 50,\n",
"                           'learning_rate': 0.1,\n",
"                           'momentum': 0.9})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we've constructed our `MXNet` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"m.fit(inputs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Host\n",
"\n",
"After training, we use the MXNet estimator object to deploy an endpoint. Because we trained locally, we'll also deploy the endpoint locally. The predictor object returned by `deploy` lets us call the endpoint and perform inference on our sample images."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictor = m.deploy(initial_instance_count=1, instance_type='local_gpu')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate\n",
"\n",
"We'll use these CIFAR-10 sample images to test the service:\n",
"\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/airplane1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/automobile1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/bird1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/cat1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/deer1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/dog1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/frog1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/horse1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/ship1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/truck1.png\" />\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# load the CIFAR-10 samples and convert them into a format we can use with the prediction endpoint\n",
"from cifar10_utils import read_images\n",
"\n",
"filenames = ['images/airplane1.png',\n",
"             'images/automobile1.png',\n",
"             'images/bird1.png',\n",
"             'images/cat1.png',\n",
"             'images/deer1.png',\n",
"             'images/dog1.png',\n",
"             'images/frog1.png',\n",
"             'images/horse1.png',\n",
"             'images/ship1.png',\n",
"             'images/truck1.png']\n",
"\n",
"image_data = read_images(filenames)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predictor runs inference on our input data and returns the predicted class label (as a float value, so we convert it to an int for display)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"for i, img in enumerate(image_data):\n",
"    response = predictor.predict(img)\n",
"    print('image {}: class: {}'.format(i, int(response)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Cleanup\n",
"\n",
"After you have finished with this example, remember to delete the prediction endpoint; only one local endpoint can be running at a time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m.delete_endpoint()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_mxnet_p27",
"language": "python",
"name": "conda_mxnet_p27"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
"nbformat": 4,
"nbformat_minor": 2
}
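The notebook's prediction loop prints only the integer class index. CIFAR-10's label order is fixed, so a small helper (hypothetical, not part of the notebook's files) can translate the float index returned by the predictor into a human-readable name:

```python
# Canonical CIFAR-10 label names, in class-index order.
CIFAR10_LABELS = [
    'airplane', 'automobile', 'bird', 'cat', 'deer',
    'dog', 'frog', 'horse', 'ship', 'truck',
]


def class_name(response):
    """Convert a predictor response (a float class index) into a label name."""
    index = int(response)
    if not 0 <= index < len(CIFAR10_LABELS):
        raise ValueError('class index out of range: {}'.format(response))
    return CIFAR10_LABELS[index]


print(class_name(0.0))  # airplane
print(class_name(9.0))  # truck
```

This could replace `int(response)` in the evaluation loop to make the printed results easier to read.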
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
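This file makes `nvidia` Docker's default runtime, which local-mode GPU containers need. Because setup.sh copies it straight into `/etc/docker/daemon.json` and reloads the daemon, a quick sanity check before copying can catch a malformed edit; a minimal sketch, using only the stdlib and the contents shown above:

```python
import json

# The daemon.json contents from this commit, inlined for the check.
daemon_config = '''
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
'''

config = json.loads(daemon_config)  # raises ValueError if the JSON is malformed
# The declared default runtime must actually exist in the "runtimes" table.
assert config['default-runtime'] in config['runtimes']
print(config['runtimes']['nvidia']['path'])  # /usr/bin/nvidia-container-runtime
```

In practice you would `json.load()` the real file before overwriting the system copy; a broken daemon.json can prevent Docker from restarting.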
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
#!/bin/bash

# Do we have GPU support?
nvidia-smi > /dev/null 2>&1
if [ $? -eq 0 ]; then
    # check if we have nvidia-docker
    NVIDIA_DOCKER=`rpm -qa | grep -c nvidia-docker2`
    if [ $NVIDIA_DOCKER -eq 0 ]; then
        # Install nvidia-docker2
        #sudo pkill -SIGHUP dockerd
        sudo yum -y remove docker
        sudo yum -y install docker-17.09.1ce-1.111.amzn1

        sudo /etc/init.d/docker start

        curl -s -L https://nvidia.github.io/nvidia-docker/amzn1/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
        sudo yum install -y nvidia-docker2
        sudo cp daemon.json /etc/docker/daemon.json
        sudo pkill -SIGHUP dockerd
        echo "installed nvidia-docker2"
    else
        echo "nvidia-docker2 already installed. We are good to go!"
    fi
fi

# This is common for both GPU and CPU instances

# check if we have docker-compose
docker-compose version >/dev/null 2>&1
if [ $? -ne 0 ]; then
    # install docker-compose
    pip install docker-compose
fi

# check if we need to configure our docker interface
SAGEMAKER_NETWORK=`docker network ls | grep -c sagemaker-local`
if [ $SAGEMAKER_NETWORK -eq 0 ]; then
    docker network create --driver bridge sagemaker-local
fi

# Notebook instance Docker networking fixes
RUNNING_ON_NOTEBOOK_INSTANCE=`sudo iptables -S OUTPUT -t nat | grep -c 169.254.0.2`

# Get the Docker network CIDR and IP for the sagemaker-local docker interface.
SAGEMAKER_INTERFACE=br-`docker network ls | grep sagemaker-local | cut -d' ' -f1`
DOCKER_NET=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f1`
DOCKER_IP=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f12`

# check if both iptables and the route table are OK.
IPTABLES_PATCHED=`sudo iptables -S PREROUTING -t nat | grep -c 169.254.0.2`
ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE`

if [ $RUNNING_ON_NOTEBOOK_INSTANCE -gt 0 ]; then

    if [ $ROUTE_TABLE_PATCHED -eq 0 ]; then
        # fix routing
        sudo ip route add $DOCKER_NET via $DOCKER_IP dev $SAGEMAKER_INTERFACE table agent
    else
        echo "SageMaker instance route table setup is ok. We are good to go."
    fi

    if [ $IPTABLES_PATCHED -eq 0 ]; then
        sudo iptables -t nat -A PREROUTING -i $SAGEMAKER_INTERFACE -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.0.2:9081
        echo "iptables for Docker setup done"
    else
        echo "SageMaker instance routing for Docker is ok. We are good to go!"
    fi
fi
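setup.sh decides whether the `sagemaker-local` bridge already exists by piping `docker network ls` through `grep -c`. The same idempotency check can be sketched in Python by counting matching lines of the command's output; the sample output below is illustrative, not captured from a real instance:

```python
def network_exists(network_ls_output, name='sagemaker-local'):
    """Mirror `docker network ls | grep -c <name>`:
    return True if any output line mentions the network name."""
    return sum(1 for line in network_ls_output.splitlines() if name in line) > 0


# Illustrative `docker network ls` output (hypothetical IDs).
sample = """NETWORK ID     NAME              DRIVER    SCOPE
1a2b3c4d5e6f   bridge            bridge    local
2b3c4d5e6f7a   sagemaker-local   bridge    local
"""

if not network_exists(sample):
    print('would run: docker network create --driver bridge sagemaker-local')
else:
    print('sagemaker-local network already exists')
```

Create-only-if-missing checks like this keep the script safe to re-run, which matters because the notebook invokes setup.sh on every fresh kernel.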
