Commit 988bc3c

Merge pull request aws#238 from awslabs/arpin_cifar_local_mode
Added: MXNet Gluon CIFAR-10 local mode example
2 parents 8b1dd26 + e58ca73 commit 988bc3c

File tree

3 files changed: +362 additions, -0 deletions
Lines changed: 284 additions & 0 deletions
@@ -0,0 +1,284 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gluon CIFAR-10 Trained in Local Mode\n",
"_**ResNet model in Gluon trained locally in a notebook instance**_\n",
"\n",
"---\n",
"\n",
"---\n",
"\n",
"_This notebook was created and tested on an ml.p3.8xlarge notebook instance._\n",
"\n",
"## Setup\n",
"\n",
"Import libraries and set the IAM role ARN."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import sagemaker\n",
"from sagemaker.mxnet import MXNet\n",
"\n",
"sagemaker_session = sagemaker.Session()\n",
"role = sagemaker.get_execution_role()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install prerequisites for local training."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!/bin/bash setup.sh"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Data\n",
"\n",
"We use the helper scripts to download the CIFAR-10 training data and sample images."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from cifar10_utils import download_training_data\n",
"download_training_data()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use the `sagemaker.Session.upload_data` function to upload our datasets to an S3 location. The return value `inputs` identifies the location -- we will use it later when we start the training job.\n",
"\n",
"Even though we are training within our notebook instance, we'll continue to use the S3 data location, since it allows us to easily transition to training in SageMaker's managed environment."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"inputs = sagemaker_session.upload_data(path='data', key_prefix='data/DEMO-gluon-cifar10')\n",
"print('input spec (in this case, just an S3 path): {}'.format(inputs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Script\n",
"\n",
"We need to provide a training script that can run on the SageMaker platform. When SageMaker calls your function, it passes in arguments that describe the training environment. Check the script below to see how this works.\n",
"\n",
"The network itself is a pre-built version contained in the [Gluon Model Zoo](https://mxnet.incubator.apache.org/versions/master/api/python/gluon/model_zoo.html)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!cat 'cifar10.py'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Train (Local Mode)\n",
"\n",
"The `MXNet` estimator creates our training job. To switch from training in SageMaker's managed environment to training within a notebook instance, just set `train_instance_type` to `local_gpu`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m = MXNet('cifar10.py',\n",
"          role=role,\n",
"          train_instance_count=1,\n",
"          train_instance_type='local_gpu',\n",
"          hyperparameters={'batch_size': 1024,\n",
"                           'epochs': 50,\n",
"                           'learning_rate': 0.1,\n",
"                           'momentum': 0.9})"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After we've constructed our `MXNet` object, we can fit it using the data we uploaded to S3. SageMaker makes sure our data is available in the local filesystem, so our training script can simply read the data from disk."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"m.fit(inputs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Host\n",
"\n",
"After training, we use the MXNet estimator object to deploy an endpoint. Because we trained locally, we'll also deploy the endpoint locally. The predictor object returned by `deploy` lets us call the endpoint and perform inference on our sample images."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"predictor = m.deploy(initial_instance_count=1, instance_type='local_gpu')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate\n",
"\n",
"We'll use these CIFAR-10 sample images to test the service:\n",
"\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/airplane1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/automobile1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/bird1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/cat1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/deer1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/dog1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/frog1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/horse1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/ship1.png\" />\n",
"<img style=\"display: inline; height: 32px; margin: 0.25em\" src=\"images/truck1.png\" />\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# load the CIFAR-10 samples and convert them into a format we can use with the prediction endpoint\n",
"from cifar10_utils import read_images\n",
"\n",
"filenames = ['images/airplane1.png',\n",
"             'images/automobile1.png',\n",
"             'images/bird1.png',\n",
"             'images/cat1.png',\n",
"             'images/deer1.png',\n",
"             'images/dog1.png',\n",
"             'images/frog1.png',\n",
"             'images/horse1.png',\n",
"             'images/ship1.png',\n",
"             'images/truck1.png']\n",
"\n",
"image_data = read_images(filenames)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predictor runs inference on our input data and returns the predicted class label (as a float value, so we convert it to an int for display)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"for i, img in enumerate(image_data):\n",
"    response = predictor.predict(img)\n",
"    print('image {}: class: {}'.format(i, int(response)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---\n",
"\n",
"## Cleanup\n",
"\n",
"After you have finished with this example, remember to delete the prediction endpoint; only one local endpoint can be running at a time."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"m.delete_endpoint()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_mxnet_p27",
"language": "python",
"name": "conda_mxnet_p27"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.14"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
"nbformat": 4,
"nbformat_minor": 2
}
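The notebook's prediction loop prints only the integer class index. CIFAR-10's label order is fixed, so a small helper (hypothetical, not part of the notebook's files) can translate the float index returned by the predictor into a human-readable name:

```python
# Canonical CIFAR-10 label names, in class-index order.
CIFAR10_LABELS = [
    'airplane', 'automobile', 'bird', 'cat', 'deer',
    'dog', 'frog', 'horse', 'ship', 'truck',
]


def class_name(response):
    """Convert a predictor response (a float class index) into a label name."""
    index = int(response)
    if not 0 <= index < len(CIFAR10_LABELS):
        raise ValueError('class index out of range: {}'.format(response))
    return CIFAR10_LABELS[index]


print(class_name(0.0))  # airplane
print(class_name(9.0))  # truck
```

This could replace `int(response)` in the evaluation loop to make the printed results easier to read.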
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
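This file makes `nvidia` Docker's default runtime, which local-mode GPU containers need. Because setup.sh copies it straight into `/etc/docker/daemon.json` and reloads the daemon, a quick sanity check before copying can catch a malformed edit; a minimal sketch, using only the stdlib and the contents shown above:

```python
import json

# The daemon.json contents from this commit, inlined for the check.
daemon_config = '''
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
'''

config = json.loads(daemon_config)  # raises ValueError if the JSON is malformed
# The declared default runtime must actually exist in the "runtimes" table.
assert config['default-runtime'] in config['runtimes']
print(config['runtimes']['nvidia']['path'])  # /usr/bin/nvidia-container-runtime
```

In practice you would `json.load()` the real file before overwriting the system copy; a broken daemon.json can prevent Docker from restarting.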
Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
#!/bin/bash

# Do we have GPU support?
nvidia-smi > /dev/null 2>&1
if [ $? -eq 0 ]; then
    # check if we have nvidia-docker
    NVIDIA_DOCKER=`rpm -qa | grep -c nvidia-docker2`
    if [ $NVIDIA_DOCKER -eq 0 ]; then
        # Install nvidia-docker2
        #sudo pkill -SIGHUP dockerd
        sudo yum -y remove docker
        sudo yum -y install docker-17.09.1ce-1.111.amzn1

        sudo /etc/init.d/docker start

        curl -s -L https://nvidia.github.io/nvidia-docker/amzn1/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
        sudo yum install -y nvidia-docker2
        sudo cp daemon.json /etc/docker/daemon.json
        sudo pkill -SIGHUP dockerd
        echo "installed nvidia-docker2"
    else
        echo "nvidia-docker2 already installed. We are good to go!"
    fi
fi

# This is common for both GPU and CPU instances

# check if we have docker-compose
docker-compose version >/dev/null 2>&1
if [ $? -ne 0 ]; then
    # install docker-compose
    pip install docker-compose
fi

# check if we need to configure our docker interface
SAGEMAKER_NETWORK=`docker network ls | grep -c sagemaker-local`
if [ $SAGEMAKER_NETWORK -eq 0 ]; then
    docker network create --driver bridge sagemaker-local
fi

# Notebook instance Docker networking fixes
RUNNING_ON_NOTEBOOK_INSTANCE=`sudo iptables -S OUTPUT -t nat | grep -c 169.254.0.2`

# Get the Docker network CIDR and IP for the sagemaker-local docker interface.
SAGEMAKER_INTERFACE=br-`docker network ls | grep sagemaker-local | cut -d' ' -f1`
DOCKER_NET=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f1`
DOCKER_IP=`ip route | grep $SAGEMAKER_INTERFACE | cut -d" " -f12`

# check if both iptables and the route table are OK.
IPTABLES_PATCHED=`sudo iptables -S PREROUTING -t nat | grep -c 169.254.0.2`
ROUTE_TABLE_PATCHED=`sudo ip route show table agent | grep -c $SAGEMAKER_INTERFACE`

if [ $RUNNING_ON_NOTEBOOK_INSTANCE -gt 0 ]; then

    if [ $ROUTE_TABLE_PATCHED -eq 0 ]; then
        # fix routing
        sudo ip route add $DOCKER_NET via $DOCKER_IP dev $SAGEMAKER_INTERFACE table agent
    else
        echo "SageMaker instance route table setup is ok. We are good to go."
    fi

    if [ $IPTABLES_PATCHED -eq 0 ]; then
        sudo iptables -t nat -A PREROUTING -i $SAGEMAKER_INTERFACE -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination 169.254.0.2:9081
        echo "iptables for Docker setup done"
    else
        echo "SageMaker instance routing for Docker is ok. We are good to go!"
    fi
fi
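setup.sh decides whether the `sagemaker-local` bridge already exists by piping `docker network ls` through `grep -c`. The same idempotency check can be sketched in Python by counting matching lines of the command's output; the sample output below is illustrative, not captured from a real instance:

```python
def network_exists(network_ls_output, name='sagemaker-local'):
    """Mirror `docker network ls | grep -c <name>`:
    return True if any output line mentions the network name."""
    return sum(1 for line in network_ls_output.splitlines() if name in line) > 0


# Illustrative `docker network ls` output (hypothetical IDs).
sample = """NETWORK ID     NAME              DRIVER    SCOPE
1a2b3c4d5e6f   bridge            bridge    local
2b3c4d5e6f7a   sagemaker-local   bridge    local
"""

if not network_exists(sample):
    print('would run: docker network create --driver bridge sagemaker-local')
else:
    print('sagemaker-local network already exists')
```

Create-only-if-missing checks like this keep the script safe to re-run, which matters because the notebook invokes setup.sh on every fresh kernel.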
