More WIP BYO stuff

tomfaulhaber · tomfaulhaber · commit 092d65b6eea5 · 2017-11-25T17:55:47.000-08:00
diff --git a/scikit_bring_your_own/container/decision_trees/train b/scikit_bring_your_own/container/decision_trees/train
@@ -65,21 +65,20 @@ def train():
         # save the model
         with open(os.path.join(model_path, 'decision-tree-model.pkl'), 'w') as out:
             pickle.dump(clf, out)
-
-        # Write out the success file
-        with open(os.path.join(output_path, 'success'), 'w') as s:
-            s.write('Done')
         print('Training complete.')
     except Exception as e:
-        # Write out an error file
-        # Either a failure file or non-zero exit code will indicate failure. Here we do
-        # both.
+        # Write out an error file. This will be returned as the FailureReason in the
+        # DescribeTrainingJob result.
         trc = traceback.format_exc()
         with open(os.path.join(output_path, 'failure'), 'w') as s:
             s.write('Exception during training: ' + str(e) + '\n' + trc)
+        # Printing this causes the exception to be in the training job logs, as well.
         print('Exception during training: ' + str(e) + '\n' + trc, file=sys.stderr)
+        # A non-zero exit code causes the training job to be marked as Failed.
         sys.exit(255)
 
 if __name__ == '__main__':
     train()
+
+    # A zero exit code causes the job to be marked as Succeeded.
     sys.exit(0)
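
Note on the hunk above: the failure-reporting convention (write `failure`, echo to stderr, exit non-zero) can be factored into a function that returns the exit code, which makes it testable outside a container. This is a minimal sketch, not the code in this commit. The name `run_training` and its `fit`, `model_dir`, and `output_dir` parameters are hypothetical. It also assumes Python 3, where the pickle file has to be opened in binary mode (`'wb'`); the committed script targets Python 2 and opens it with `'w'`.

```python
from __future__ import print_function

import os
import pickle
import sys
import traceback


def run_training(fit, model_dir, output_dir):
    """Run fit(), save the model it returns, and return a process exit code.

    `fit` is a hypothetical zero-argument callable that returns a trained model.
    """
    try:
        model = fit()
        # pickle writes bytes, so the file must be opened in binary mode on Python 3.
        with open(os.path.join(model_dir, 'decision-tree-model.pkl'), 'wb') as out:
            pickle.dump(model, out)
        print('Training complete.')
        return 0
    except Exception as e:
        trc = traceback.format_exc()
        message = 'Exception during training: ' + str(e) + '\n' + trc
        # The contents of this file are what the training job reports as its failure reason.
        with open(os.path.join(output_dir, 'failure'), 'w') as s:
            s.write(message)
        # Echo to stderr so the exception also appears in the training job logs.
        print(message, file=sys.stderr)
        return 255


# Inside the container this would be called roughly as:
#   sys.exit(run_training(fit, '/opt/ml/model', '/opt/ml/output'))
```

Returning the code instead of calling `sys.exit` inside the handler keeps a single exit point in the `__main__` block.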
diff --git a/scikit_bring_your_own/scikit_bring_your_own.ipynb b/scikit_bring_your_own/scikit_bring_your_own.ipynb
@@ -10,7 +10,27 @@
     "\n",
     "By packaging an algorithm in a container, you can bring almost any code to the Amazon SageMaker environment, regardless of programming language, environment, framework, or dependencies. \n",
     "\n",
-    "_TODO: Insert TOC here_\n",
+    "_TODO: make sure TOC is up-to-date_\n",
+    "\n",
+    "1. [Building your own algorithm container](#Building-your-own-algorithm-container)\n",
+    "  1. [When should I build my own algorithm container?](#When-should-I-build-my-own-algorithm-container?)\n",
+    "  1. [Permissions](#Permissions)\n",
+    "  1. [The example](#The-example)\n",
+    "  1. [The presentation](#The-presentation)\n",
+    "1. [Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker](#Part-1:-Packaging-and-Uploading-your-Algorithm-for-use-with-Amazon-SageMaker)\n",
+    "    1. [An overview of Docker](#An-overview-of-Docker)\n",
+    "    1. [How Amazon SageMaker runs your Docker container](#How-Amazon-SageMaker-runs-your-Docker-container)\n",
+    "      1. [Running your container during training](#Running-your-container-during-training)\n",
+    "        1. [The input](#The-input)\n",
+    "        1. [The output](#The-output)\n",
+    "      1. [Running your container during hosting](#Running-your-container-during-hosting)\n",
+    "    1. [The parts of the sample container](#The-parts-of-the-sample-container)\n",
+    "    1. [The Dockerfile](#The-Dockerfile)\n",
+    "  1. [Testing your algorithm on your local machine or on an Amazon SageMaker notebook instance](#Testing-your-algorithm-on-your-local-machine-or-on-an-Amazon-SageMaker-notebook-instance)\n",
+    "1. [Part 2: Training and Hosting your Algorithm in Amazon SageMaker](#Part-2:-Training-and-Hosting-your-Algorithm-in-Amazon-SageMaker)\n",
+    "  1. [Upload the data for training](#Upload-the-data-for-training)\n",
+    "  \n",
+    "_or_ I'm impatient, just [let me see the code](#The-Dockerfile)!\n",
     "\n",
     "## When should I build my own algorithm container?\n",
     "\n",
@@ -20,18 +40,23 @@
     "\n",
     "If there isn't direct SDK support for your environment, don't worry. You'll see in this walk-through that building your own container is quite straightforward.\n",
     "\n",
-    "## The example\n",
+    "## Permissions\n",
+    "\n",
+    "Running this notebook requires permissions in addition to the normal `SageMakerFullAccess` permissions. This is because we'll be creating new repositories in Amazon ECR. The easiest way to add these permissions is simply to add the managed policy `AmazonEC2ContainerRegistryPowerUser` to the role that you used to start your notebook instance. There's no need to restart your notebook instance when you do this; the new permissions will be available immediately.\n",
     "\n",
-    "TODO: links\n",
+    "## The example\n",
     "\n",
-    "Here, we'll show how to package a simple Python example which showcases the decision tree algorithm from the widely used scikit-learn machine learning package. The example is purposefully fairly trivial since the point is to show the surrounding structure that you'll want to add to your own code so you can train and host it in Amazon SageMaker.\n",
+    "Here, we'll show how to package a simple Python example which showcases the [decision tree][] algorithm from the widely used [scikit-learn][] machine learning package. The example is purposefully fairly trivial since the point is to show the surrounding structure that you'll want to add to your own code so you can train and host it in Amazon SageMaker.\n",
     "\n",
     "The ideas shown here will work in any language or environment. You'll need to choose the right tools for your environment to serve HTTP requests for inference, but good HTTP environments are available in every language these days.\n",
     "\n",
     "In this example, we use a single image to support training and hosting. This is easy because it means that we only need to manage one image and we can set it up to do everything. Sometimes you'll want separate images for training and hosting because they have different requirements. Just separate the parts discussed below into separate Dockerfiles and build two images. Choosing whether to have a single image or two images is really a matter of which is more convenient for you to develop and manage.\n",
     "\n",
     "If you're only using Amazon SageMaker for training or hosting, but not both, there is no need to build the unused functionality into your container.\n",
     "\n",
+    "[scikit-learn]: http://scikit-learn.org/stable/\n",
+    "[decision tree]: http://scikit-learn.org/stable/modules/tree.html\n",
+    "\n",
     "## The presentation\n",
     "\n",
     "This presentation is divided into two parts: _building_ the container and _using_ the container."
@@ -41,13 +66,10 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker \n",
+    "# Part 1: Packaging and Uploading your Algorithm for use with Amazon SageMaker\n",
     "\n",
     "### An overview of Docker\n",
     "\n",
-    "TODO: links to docker, docker run, and Dockerfile reference. and ECS\n",
-    "TODO: check virtualenv spelling\n",
-    "\n",
     "If you're familiar with Docker already, you can skip ahead to the next section.\n",
     "\n",
     "For many data scientists, Docker containers are a new concept, but they are not difficult, as you'll see here. \n",
@@ -60,12 +82,21 @@
     "\n",
     "Docker uses a simple file called a `Dockerfile` to specify how the image is assembled. We'll see an example of that below. You can build your Docker images based on Docker images built by yourself or others, which can simplify things quite a bit.\n",
     "\n",
-    "Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as Amazon ECS.\n",
+    "Docker has become very popular in the programming and devops communities for its flexibility and well-defined specification of the code to be run. It is the underpinning of many services built in the past few years, such as [Amazon ECS].\n",
     "\n",
     "Amazon SageMaker uses Docker to allow users to train and deploy arbitrary algorithms.\n",
     "\n",
     "In Amazon SageMaker, Docker containers are invoked in a certain way for training and a slightly different way for hosting. The following sections outline how to build containers for the SageMaker environment.\n",
     "\n",
+    "Some helpful links:\n",
+    "\n",
+    "* [Docker home page](http://www.docker.com)\n",
+    "* [Getting started with Docker](https://docs.docker.com/get-started/)\n",
+    "* [Dockerfile reference](https://docs.docker.com/engine/reference/builder/)\n",
+    "* [`docker run` reference](https://docs.docker.com/engine/reference/run/)\n",
+    "\n",
+    "[Amazon ECS]: https://aws.amazon.com/ecs/\n",
+    "\n",
     "### How Amazon SageMaker runs your Docker container\n",
     "\n",
     "Because you can run the same image in training or hosting, Amazon SageMaker runs your container with the argument `train` or `serve`. How your container processes this argument depends on the container:\n",
@@ -76,15 +107,50 @@
     "\n",
     "#### Running your container during training\n",
     "\n",
-    "The container is run with the argument \"train\"\n",
+    "When Amazon SageMaker runs training, your `train` script is run just like a regular Python program. A number of files are laid out for your use, under the `/opt/ml` directory:\n",
     "\n",
-    "The container gets some special files:\n",
+    "    /opt/ml\n",
+    "    ├── input\n",
+    "    │   ├── config\n",
+    "    │   │   ├── hyperparameters.json\n",
+    "    │   │   └── resourceConfig.json\n",
+    "    │   └── data\n",
+    "    │       └── <channel_name>\n",
+    "    │           └── <input data>\n",
+    "    ├── model\n",
+    "    │   └── <model files>\n",
+    "    └── output\n",
+    "        └── failure\n",
     "\n",
-    "TODO: Insert overview of file system here\n",
+    "##### The input\n",
+    "\n",
+    "* `/opt/ml/input/config` contains information to control how your program runs. `hyperparameters.json` is a JSON-formatted dictionary of hyperparameter names to values. These values will always be strings, so you may need to convert them. `resourceConfig.json` is a JSON-formatted file that describes the network layout used for distributed training. Since scikit-learn doesn't support distributed training, we'll ignore it here.\n",
+    "* `/opt/ml/input/data/<channel_name>/` (for File mode) contains the input data for that channel. The channels are created based on the call to `CreateTrainingJob`, but it's generally important that the channels match what the algorithm expects. The files for each channel will be copied from S3 to this directory, preserving the tree structure indicated by the S3 key structure. \n",
+    "* `/opt/ml/input/data/<channel_name>_<epoch_number>` (for Pipe mode) is the pipe for a given epoch. Epochs start at zero and go up by one each time you read them. There is no limit to the number of epochs that you can run, but you must close each pipe before reading the next epoch.\n",
+    "\n",
+    "##### The output\n",
+    "\n",
+    "* `/opt/ml/model/` is the directory where you write the model that your algorithm generates. Your model can be in any format that you want. It can be a single file or a whole directory tree. SageMaker will package any files in this directory into a compressed tar archive file. This file will be available at the S3 location returned in the `DescribeTrainingJob` result.\n",
+    "* `/opt/ml/output` is a directory where the algorithm can write a file `failure` that describes why the job failed. The contents of this file will be returned in the `FailureReason` field of the `DescribeTrainingJob` result. For jobs that succeed, there is no reason to write this file as it will be ignored.\n",
     "\n",
     "#### Running your container during hosting\n",
     "\n",
-    "The container is run with the argument \"serve\". \n",
+    "Hosting has a very different model than training because hosting is responding to inference requests that come in via HTTP. In this example, we use our recommended Python serving stack to provide robust and scalable serving of inference requests:\n",
+    "\n",
+    "![Request serving stack](stack.png)\n",
+    "\n",
+    "This stack is implemented in the sample code here and you can mostly just leave it alone. \n",
+    "\n",
+    "Amazon SageMaker uses two URLs in the container:\n",
+    "\n",
+    "* `/ping` will receive `GET` requests from the infrastructure. Your program returns 200 if the container is up and accepting requests.\n",
+    "* `/invocations` is the endpoint that receives client inference `POST` requests. The format of the request and the response is up to the algorithm. If the client supplied `ContentType` and `Accept` headers, these will be passed in as well. \n",
+    "\n",
+    "The container will have the model files in the same place they were written during training:\n",
+    "\n",
+    "    /opt/ml\n",
+    "    └── model\n",
+    "        └── <model files>\n",
     "\n"
    ]
   },
@@ -128,20 +194,96 @@
     "\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### The Dockerfile\n",
+    "\n",
+    "The Dockerfile describes the image that we want to build. You can think of it as describing the complete operating system installation of the system that you want to run. A running Docker container is quite a bit lighter than a full operating system, however, because it takes advantage of Linux on the host machine for the basic operations. \n",
+    "\n",
+    "For the Python science stack, we will start from a standard Ubuntu installation and run the normal tools to install the things needed by scikit-learn. Finally, we add the code that implements our specific algorithm to the container and set up the right environment to run under.\n",
+    "\n",
+    "Along the way, we clean up extra space. This makes the container smaller and faster to start.\n",
+    "\n",
+    "Let's look at the Dockerfile for the example:"
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 23,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "# Build an image that can do training and inference in SageMaker\r\n",
+      "# This is a Python 2 image that uses the nginx, gunicorn, flask stack\r\n",
+      "# for serving inferences in a stable way.\r\n",
+      "\r\n",
+      "FROM ubuntu:16.04\r\n",
+      "\r\n",
+      "MAINTAINER Amazon AI <sage-learner@amazon.com>\r\n",
+      "\r\n",
+      "\r\n",
+      "RUN apt-get -y update && apt-get install -y --no-install-recommends \\\r\n",
+      "         wget \\\r\n",
+      "         python \\\r\n",
+      "         nginx \\\r\n",
+      "         ca-certificates \\\r\n",
+      "    && rm -rf /var/lib/apt/lists/*\r\n",
+      "\r\n",
+      "# Here we get all python packages.\r\n",
+      "# There's substantial overlap between scipy and numpy that we eliminate by\r\n",
+      "# linking them together. Likewise, pip leaves the install caches populated which uses\r\n",
+      "# a significant amount of space. These optimizations save a fair amount of space in the\r\n",
+      "# image, which reduces start up time.\r\n",
+      "RUN wget https://bootstrap.pypa.io/get-pip.py && python get-pip.py && \\\r\n",
+      "    pip install numpy scipy scikit-learn pandas flask gevent gunicorn && \\\r\n",
+      "        (cd /usr/local/lib/python2.7/dist-packages/scipy/.libs; rm *; ln ../../numpy/.libs/* .) && \\\r\n",
+      "        rm -rf /root/.cache\r\n",
+      "\r\n",
+      "# Set some environment variables. PYTHONUNBUFFERED keeps Python from buffering our standard\r\n",
+      "# output stream, which means that logs can be delivered to the user quickly. PYTHONDONTWRITEBYTECODE\r\n",
+      "# keeps Python from writing the .pyc files which are unnecessary in this case. We also update\r\n",
+      "# PATH so that the train and serve programs are found when the container is invoked.\r\n",
+      "\r\n",
+      "ENV PYTHONUNBUFFERED=TRUE\r\n",
+      "ENV PYTHONDONTWRITEBYTECODE=TRUE\r\n",
+      "ENV PATH=\"/opt/program:${PATH}\"\r\n",
+      "\r\n",
+      "# Make nginx log to stdout/err so that the log messages will be picked up by the\r\n",
+      "# Docker logger\r\n",
+      "RUN ln -s /dev/stdout /tmp/nginx.access.log && ln -s /dev/stderr /tmp/nginx.error.log\r\n",
+      "\r\n",
+      "# Set up the program in the image\r\n",
+      "COPY decision_trees /opt/program\r\n",
+      "WORKDIR /opt/program\r\n",
+      "\r\n"
+     ]
+    }
+   ],
    "source": [
     "!cat container/Dockerfile"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 24,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "sh: 25: docker: not found\n",
+      "sh: 29: docker: not found\n",
+      "sh: 30: docker: not found\n",
+      "sh: 32: docker: not found\n"
+     ]
+    }
+   ],
    "source": [
     "%%sh\n",
     "\n",
@@ -150,8 +292,6 @@
     "\n",
     "cd container\n",
     "\n",
-    "#set -e # stop if anything fails\n",
-    "\n",
     "account=$(aws sts get-caller-identity --query Account --output text)\n",
     "\n",
     "# Get the region defined in the current configuration (default to us-west-2 if none defined)\n",
diff --git a/scikit_bring_your_own/stack.png b/scikit_bring_your_own/stack.png
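
Note on the notebook text above: it states that the values in `/opt/ml/input/config/hyperparameters.json` always arrive as strings and may need converting. The following sketch shows one way a `train` script could handle that; it is not taken from this commit. The helper names `load_hyperparameters` and `int_param` are hypothetical, and `max_leaf_nodes` is used only as an illustrative hyperparameter name.

```python
import json
import os


def load_hyperparameters(config_dir='/opt/ml/input/config'):
    """Return the hyperparameter dictionary as written by SageMaker.

    Every value in the returned dictionary is a string.
    """
    with open(os.path.join(config_dir, 'hyperparameters.json')) as f:
        return json.load(f)


def int_param(params, name, default=None):
    """Convert one string-valued hyperparameter to int, or return the default if absent."""
    value = params.get(name)
    return default if value is None else int(value)


# Example use inside a train script (names are illustrative):
#   params = load_hyperparameters()
#   max_leaf_nodes = int_param(params, 'max_leaf_nodes')
```

Keeping the conversion in one small helper means a malformed value raises a `ValueError` at a predictable point, which the `train` script's exception handler can then report through the `failure` file.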