Commit 1d8f732

add sagemaker spark kmeans mnist notebook

1 parent 0636049 commit 1d8f732

1 file changed: +260 -0

@@ -0,0 +1,260 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved.\n",
"\n",
"Licensed under the Apache License, Version 2.0 (the \"License\").\n",
"You may not use this file except in compliance with the License.\n",
"A copy of the License is located at\n",
" \n",
" http://aws.amazon.com/apache2.0/\n",
"\n",
"or in the \"license\" file accompanying this file. This file is distributed\n",
"on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either\n",
"express or implied. See the License for the specific language governing\n",
"permissions and limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SageMaker PySpark MNIST Example\n",
"\n",
"1. [Introduction](#Introduction)\n",
"2. [Data Inspection](#Data-Inspection)\n",
"3. [Training the K-Means Model](#Training-the-K-Means-Model)\n",
"4. [Validate the Model for use](#Validate-the-Model-for-use)\n",
"5. [Bring your Own Algorithm](#Bring-your-Own-Algorithm)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"This notebook shows how to cluster handwritten digits with the K-Means algorithm using the SageMaker PySpark SDK.\n",
"\n",
"You can visit the SageMaker Spark GitHub repository at https://github.com/aws/sagemaker-spark for more information about SageMaker Spark.\n",
"\n",
"We will train a K-Means model on Amazon SageMaker using the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.\n",
"\n",
"First, we load the MNIST dataset into a Spark DataFrame. The dataset is available in LibSVM format at\n",
"\n",
"s3://sagemaker-sample-data-[region, such as us-east-1]/spark/mnist/train/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pyspark import SparkContext, SparkConf\n",
"from pyspark.sql import SparkSession\n",
"import os\n",
"import sagemaker_pyspark\n",
"import sagemaker\n",
"from sagemaker import get_execution_role\n",
"\n",
"sagemaker_session = sagemaker.Session()\n",
"\n",
"role = get_execution_role()\n",
"\n",
"# Configure Spark to use the SageMaker Spark dependency jars\n",
"jars = sagemaker_pyspark.classpath_jars()\n",
"\n",
"classpath = \":\".join(jars)\n",
"\n",
"# See the SageMaker Spark Github repo under sagemaker-pyspark-sdk\n",
"# to learn how to connect to a remote EMR cluster running Spark from a Notebook Instance.\n",
"spark = SparkSession.builder.config(\"spark.driver.extraClassPath\", classpath)\\\n",
"    .master(\"local[*]\").getOrCreate()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# replace this with your own region, such as us-east-1\n",
"region = 'us-east-1'\n",
"trainingData = spark.read.format('libsvm')\\\n",
"    .option('numFeatures', '784')\\\n",
"    .load('s3a://sagemaker-sample-data-{}/spark/mnist/train/'.format(region))\n",
"\n",
"testData = spark.read.format('libsvm')\\\n",
"    .option('numFeatures', '784')\\\n",
"    .load('s3a://sagemaker-sample-data-{}/spark/mnist/test/'.format(region))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Inspection\n",
"In order to train and make inferences, our input DataFrame must have a column of Doubles (named \"label\" by default) and a column of Vectors of Doubles (named \"features\" by default).\n",
"\n",
"Spark's LibSVM DataFrameReader loads a DataFrame already suitable for training and inference."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trainingData.show()"
]
},
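{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional check (added sketch, not part of the original walkthrough): printSchema() confirms the\n",
"# DataFrame has the \"label\" (double) and \"features\" (vector) columns described above.\n",
"trainingData.printSchema()"
]
},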
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training the K-Means Model\n",
"Now we create a KMeansSageMakerEstimator, which uses the Amazon SageMaker KMeans algorithm to train on our input data and uses the KMeans Amazon SageMaker model image to host our model.\n",
"\n",
"Calling fit() on this estimator will train our model on Amazon SageMaker, and then create an Amazon SageMaker Endpoint to host our model.\n",
"\n",
"We can then use the SageMakerModel returned by this call to fit() to transform DataFrames using our hosted model.\n",
"\n",
"The following cell runs a training job and creates an endpoint to host the resulting model, so it can take up to twenty minutes to complete."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import random\n",
"from sagemaker_pyspark import IAMRole, S3DataPath\n",
"from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator\n",
"\n",
"# the IAM role obtained above from get_execution_role() is used for both training and hosting\n",
"kmeans_estimator = KMeansSageMakerEstimator(\n",
"    sagemakerRole=IAMRole(role),\n",
"    trainingInstanceType='ml.p2.xlarge',\n",
"    trainingInstanceCount=1,\n",
"    endpointInstanceType='ml.c4.xlarge',\n",
"    endpointInitialInstanceCount=1)\n",
"\n",
"kmeans_estimator.setK(10)\n",
"kmeans_estimator.setFeatureDim(784)\n",
"\n",
"# train\n",
"model = kmeans_estimator.fit(trainingData)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Validate the Model for use\n",
"Now we transform our DataFrame.\n",
"To do this, we serialize each row's \"features\" Vector of Doubles into a Protobuf format for inference against the Amazon SageMaker Endpoint. We deserialize the Protobuf responses back into our DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"transformedData = model.transform(testData)\n",
"\n",
"transformedData.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql.types import DoubleType\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# helper function to display a digit\n",
"def show_digit(img, caption='', xlabel='', subplot=None):\n",
"    if subplot is None:\n",
"        _, (subplot) = plt.subplots(1, 1)\n",
"    imgr = img.reshape((28, 28))\n",
"    subplot.axes.get_xaxis().set_ticks([])\n",
"    subplot.axes.get_yaxis().set_ticks([])\n",
"    plt.title(caption)\n",
"    plt.xlabel(xlabel)\n",
"    subplot.imshow(imgr, cmap='gray')\n",
"\n",
"images = np.array(transformedData.select(\"features\").cache().take(250))\n",
"clusters = transformedData.select(\"closest_cluster\").cache().take(250)\n",
"\n",
"for cluster in range(10):\n",
"    print('\\n\\n\\nCluster {}:'.format(int(cluster)))\n",
"    digits = [img for l, img in zip(clusters, images) if int(l.closest_cluster) == cluster]\n",
"    height = ((len(digits) - 1) // 5) + 1\n",
"    width = 5\n",
"    plt.rcParams[\"figure.figsize\"] = (width, height)\n",
"    _, subplots = plt.subplots(height, width)\n",
"    subplots = np.ndarray.flatten(subplots)\n",
"    for subplot, image in zip(subplots, digits):\n",
"        show_digit(image, subplot=subplot)\n",
"    for subplot in subplots[len(digits):]:\n",
"        subplot.axis('off')\n",
"\n",
"    plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Delete the endpoint and the other SageMaker resources created by fit()\n",
"\n",
"from sagemaker_pyspark import SageMakerResourceCleanup\n",
"\n",
"resource_cleanup = SageMakerResourceCleanup(model.sagemakerClient)\n",
"resource_cleanup.deleteResources(model.getCreatedResources())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bring your Own Algorithm\n",
"\n",
"The SageMaker Spark GitHub repository has more information about SageMaker Spark, including how to use SageMaker Spark with your own algorithms on Amazon SageMaker: https://github.com/aws/sagemaker-spark\n"
]
},
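{
"cell_type": "markdown",
"metadata": {},
"source": [
"The next cell is an illustrative sketch only (added for context, not part of the original example): the generic SageMakerEstimator can train and host a custom algorithm container much like the KMeansSageMakerEstimator above. The ECR image URI is a placeholder, and the serializer, deserializer, and hyperparameters shown are assumptions borrowed from the built-in KMeans setup; adjust them to match what your container actually accepts and returns, and see the repository for the exact API."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch: training and hosting your own algorithm container with the generic\n",
"# SageMakerEstimator. The image URI below is a placeholder, and the protobuf serializer and\n",
"# deserializer assume a container that speaks the same recordIO-protobuf format as the\n",
"# built-in KMeans algorithm.\n",
"from sagemaker_pyspark import SageMakerEstimator, IAMRole\n",
"from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer\n",
"from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer\n",
"\n",
"custom_estimator = SageMakerEstimator(\n",
"    trainingImage='123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algorithm:latest',  # placeholder\n",
"    modelImage='123456789012.dkr.ecr.us-east-1.amazonaws.com/my-algorithm:latest',  # placeholder\n",
"    requestRowSerializer=ProtobufRequestRowSerializer(),\n",
"    responseRowDeserializer=KMeansProtobufResponseRowDeserializer(),\n",
"    hyperParameters={'k': '10', 'feature_dim': '784'},\n",
"    sagemakerRole=IAMRole(role),\n",
"    trainingInstanceType='ml.p2.xlarge',\n",
"    trainingInstanceCount=1,\n",
"    endpointInstanceType='ml.c4.xlarge',\n",
"    endpointInitialInstanceCount=1,\n",
"    trainingSparkDataFormat='sagemaker')\n",
"\n",
"# custom_model = custom_estimator.fit(trainingData)"
]
}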
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
