
Added: MXNet Gluon CIFAR-10 local mode example #238

Merged 1 commit on Apr 25, 2018
284 changes: 284 additions & 0 deletions sagemaker-python-sdk/mxnet_gluon_cifar10/cifar10_local_mode.ipynb
@@ -0,0 +1,284 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gluon CIFAR-10 Trained in Local Mode\n",
"_**ResNet model in Gluon trained locally in a notebook instance**_\n",
"\n",
"---\n",
"\n",
"---\n",
"\n",
"_This notebook was created and tested on an ml.p3.8xlarge notebook instance._\n",
"\n",
"## Setup\n",
"\n",
"Import libraries and set IAM role ARN."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"from sagemaker.mxnet import MXNet\n",
"\n",
"sagemaker_session = sagemaker.Session()\n",
"role = sagemaker.get_execution_role()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install pre-requisites for local training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!/bin/bash setup.sh"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Data\n",
"\n",
"We use the helper scripts to download CIFAR-10 training data and sample images."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from cifar10_utils import download_training_data\n",
"download_training_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies the location -- we will use this later when we start the training job.\n",
"\n",
"Even though we are training within our notebook instance, we'll continue to use the S3 data location since it will allow us to easily transition to training in SageMaker's managed environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-gluon-cifar10')\n",
"print('input spec (in this case, just an S3 path): {}'.format(inputs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Script\n",
"\n",
"We need to provide a training script that can run on the SageMaker platform. When SageMaker calls your function, it will pass in arguments that describe the training environment. Check the script below to see how this works.\n",
"\n",
"The network itself is a pre-built version contained in the [Gluon Model Zoo](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/model_zoo.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!cat 'cifar10.py'"
]
},
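{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of what such a script looks like (the function and parameter names below are illustrative, not the exact container interface), SageMaker invokes a training function in the script and passes keyword arguments describing the environment, such as the hyperparameters and the input data location:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative skeleton only -- see the real entry point in cifar10.py above.\n",
"def train_sketch(channel_input_dirs, hyperparameters, num_gpus, **kwargs):\n",
"    batch_size = hyperparameters.get('batch_size', 128)\n",
"    epochs = hyperparameters.get('epochs', 10)\n",
"    data_dir = channel_input_dirs['training']  # where SageMaker stages the S3 data locally\n",
"    # ... build the Gluon network, train it for `epochs`, and return the trained net\n",
"    # so the container can serialize it for hosting."
]
},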
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Train (Local Mode)\n",
"\n",
"The ```MXNet``` estimator will create our training job. To switch from training in SageMaker's managed environment to training within a notebook instance, just set `train_instance_type` to `local_gpu`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m = MXNet('cifar10.py', \n",
" role=role, \n",
" train_instance_count=1, \n",
" train_instance_type='local_gpu',\n",
" hyperparameters={'batch_size': 1024, \n",
" 'epochs': 50, \n",
" 'learning_rate': 0.1, \n",
" 'momentum': 0.9})"
]
},
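{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison (left commented out so it isn't run by accident), the same estimator would train in SageMaker's managed environment just by swapping the instance type, e.g. `ml.p3.2xlarge`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Same estimator, pointed at a managed training instance instead of local mode.\n",
"# m = MXNet('cifar10.py',\n",
"#           role=role,\n",
"#           train_instance_count=1,\n",
"#           train_instance_type='ml.p3.2xlarge',\n",
"#           hyperparameters={'batch_size': 1024,\n",
"#                            'epochs': 50,\n",
"#                            'learning_rate': 0.1,\n",
"#                            'momentum': 0.9})"
]
},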
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we've constructed our `MXNet` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"m.fit(inputs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Host\n",
"\n",
"After training, we use the MXNet estimator object to deploy an endpoint. Because we trained locally, we'll also deploy the endpoint locally. The predictor object returned by `deploy` lets us call the endpoint and perform inference on our sample images."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictor = m.deploy(initial_instance_count=1, instance_type='local_gpu')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate\n",
"\n",
"We'll use these CIFAR-10 sample images to test the service:\n",
"\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/airplane1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/automobile1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/bird1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/cat1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/deer1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/dog1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/frog1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/horse1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/ship1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/truck1.png\" />\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# load the CIFAR-10 samples, and convert them into format we can use with the prediction endpoint\n",
"from cifar10_utils import read_images\n",
"\n",
"filenames = ['images/airplane1.png',\n",
" 'images/automobile1.png',\n",
" 'images/bird1.png',\n",
" 'images/cat1.png',\n",
" 'images/deer1.png',\n",
" 'images/dog1.png',\n",
" 'images/frog1.png',\n",
" 'images/horse1.png',\n",
" 'images/ship1.png',\n",
" 'images/truck1.png']\n",
"\n",
"image_data = read_images(filenames)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predictor runs inference on our input data and returns the predicted class label (as a float value, so we convert to int for display)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"for i, img in enumerate(image_data):\n",
" response = predictor.predict(img)\n",
" print('image {}: class: {}'.format(i, int(response)))"
]
},
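{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the model follows the standard CIFAR-10 class ordering (an assumption -- the mapping depends on how cifar10.py indexes its labels), the integer responses can be translated into readable names:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Assumes the standard CIFAR-10 class ordering; adjust if cifar10.py labels differ.\n",
"classes = ['airplane', 'automobile', 'bird', 'cat', 'deer',\n",
"           'dog', 'frog', 'horse', 'ship', 'truck']\n",
"\n",
"for fname, img in zip(filenames, image_data):\n",
"    label = int(predictor.predict(img))\n",
"    print('{}: predicted {}'.format(fname, classes[label]))"
]
},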
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Cleanup\n",
"\n",
"After you have finished with this example, remember to delete the prediction endpoint. Only one local endpoint can be running at a time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m.delete_endpoint()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_mxnet_p27",
"language": "python",
"name": "conda_mxnet_p27"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
"nbformat": 4,
"nbformat_minor": 2
}
10 changes: 10 additions & 0 deletions sagemaker-python-sdk/mxnet_gluon_cifar10/daemon.json
@@ -0,0 +1,10 @@

{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
68 changes: 68 additions & 0 deletions sagemaker-python-sdk/mxnet_gluon_cifar10/setup.sh
@@ -0,0 +1,68 @@
#!/bin/bash

# Do we have GPU support?
nvidia-smi > /dev/null 2>&1
if [ $? -eq 0 ]; then
# check if we have nvidia-docker
NVIDIA_DOCKER=`rpm -qa | grep -c nvidia-docker2`
if [ $NVIDIA_DOCKER -eq 0 ]; then
# Install nvidia-docker2
#sudo pkill -SIGHUP dockerd
sudo yum -y remove docker
sudo yum -y install docker-17.09.1ce-1.111.amzn1

sudo /etc/init.d/docker start

curl -s -L https://nvidia.github.io/nvidia-docker/amzn1/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo cp daemon.json /etc/docker/daemon.json
sudo pkill -SIGHUP dockerd
echo "installed nvidia-docker2"
else
echo "nvidia-docker2 already installed. We are good to go!"
fi
fi

# This is common for both GPU and CPU instances

# check if we have docker-compose
docker-compose version >/dev/null 2>&1
if [ $? -ne 0 ]; then
# install docker compose
pip install docker-compose
fi

# check if we need to configure our docker interface
SAGEMAKER_NETWORK=`docker network ls | grep -c sagemaker-local`
if [ $SAGEMAKER_NETWORK -eq 0 ]; then
docker network create --driver bridge sagemaker-local
fi

# Notebook instance Docker networking fixes
RUNNING_ON_NOTEBOOK_INSTANCE=`sudo iptables -S OUTPUT -t nat | grep -c 169.254.0.2`

# Get the Docker Network CIDR and IP for the sagemaker-local docker interface.
SAGEMAKER_INTERFACE=br-`docker network ls | grep sagemaker-local | cut -d' ' -f1`
DOCKER_NET=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f1`
DOCKER_IP=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f12`

# check if both IPTables and the Route Table are OK.
IPTABLES_PATCHED=`sudo iptables -S PREROUTING -t nat | grep -c 169.254.0.2`
ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE`

if [ $RUNNING_ON_NOTEBOOK_INSTANCE -gt 0 ]; then

if [ $ROUTE_TABLE_PATCHED -eq 0 ]; then
# fix routing
sudo ip route add $DOCKER_NET via $DOCKER_IP dev $SAGEMAKER_INTERFACE table agent
else
echo "SageMaker instance route table setup is ok. We are good to go."
fi

if [ $IPTABLES_PATCHED -eq 0 ]; then
sudo iptables -t nat -A PREROUTING -i $SAGEMAKER_INTERFACE -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.0.2:9081
echo "iptables for Docker setup done"
else
echo "SageMaker instance routing for Docker is ok. We are good to go!"
fi
fi