
Commit 71242a2

Merge branch 'master' into arpin_blazingtext_readme
2 parents: 901bb84 + e2f6338

6 files changed: +568 -3 lines

advanced_functionality/working_with_redshift_data/working_with_redshift_data.ipynb

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,10 @@
      "1. Preload that cluster with data from the [iris data set](https://archive.ics.uci.edu/ml/datasets/iris) in a table named public.irisdata.\n",
      "1. Update the credential file (`redshift_creds_template.json.nogit`) with the appropriate information.\n",
      "\n",
+     "Also, note that this Notebook instance needs to resolve to a private IP when connecting to the Redshift instance. There are two ways to resolve the Redshift DNS name to a private IP:\n",
+     "1. If the Redshift cluster is not publicly accessible, it resolves to a private IP by default.\n",
+     "1. If the Redshift cluster is publicly accessible and has an EIP associated with it, it can still resolve to the cluster's private IP when accessed from within the VPC. To enable this, set the following two VPC attributes to yes: DNS resolution and DNS hostnames. For instructions, see the Redshift documentation on [Managing Clusters in an Amazon Virtual Private Cloud (VPC)](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-vpc.html).\n",
+     "\n",
      "### Notebook Setup\n",
      "Let's start by installing `psycopg2`, a PostgreSQL database adapter for Python, adding a few imports, and specifying a few configurations. "
     ]
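For reference, a minimal sketch of the connection step this notebook goes on to perform with `psycopg2`, assuming the credential file holds hypothetical keys such as `host_name`, `port_num`, `db_name`, `db_user`, and `db_pass` (the actual key names come from `redshift_creds_template.json.nogit`):

```python
import json

import psycopg2  # PostgreSQL adapter; Amazon Redshift speaks the same wire protocol

# Hypothetical credential layout; align the keys with your own
# redshift_creds_template.json.nogit contents.
with open("redshift_creds_template.json.nogit") as f:
    creds = json.load(f)

# The hostname must resolve to the cluster's private IP from this notebook's VPC
# (see the DNS resolution / DNS hostnames attributes discussed above).
conn = psycopg2.connect(
    host=creds["host_name"],
    port=creds["port_num"],
    dbname=creds["db_name"],
    user=creds["db_user"],
    password=creds["db_pass"],
)

with conn.cursor() as cur:
    cur.execute("SELECT * FROM public.irisdata LIMIT 5;")
    for row in cur.fetchall():
        print(row)

conn.close()
```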

introduction_to_amazon_algorithms/imageclassification_caltech/Image-classification-fulltraining.ipynb

Lines changed: 1 addition & 1 deletion
@@ -298,7 +298,7 @@
      "\n",
      "> `Training job ended with status: Completed`\n",
      "\n",
-     "then that means training sucessfully completed and the output model was stored in the output path specified by `training_params['OutputDataConfig']`.\n",
+     "then that means training successfully completed and the output model was stored in the output path specified by `training_params['OutputDataConfig']`.\n",
      "\n",
      "You can also view information about and the status of a training job using the AWS SageMaker console. Just click on the \"Jobs\" tab."
     ]
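The console check described in this cell can also be scripted; a small sketch using `boto3` (the training job name below is a placeholder, not a value taken from the notebook):

```python
import boto3

sm = boto3.client("sagemaker")

job_name = "image-classification-example"  # placeholder; use the notebook's job name
desc = sm.describe_training_job(TrainingJobName=job_name)

status = desc["TrainingJobStatus"]  # e.g. InProgress, Completed, Failed, Stopped
print("Training job ended with status: " + status)

if status == "Completed":
    # S3 location of the trained model, under the prefix set in OutputDataConfig
    print(desc["ModelArtifacts"]["S3ModelArtifacts"])
elif status == "Failed":
    print(desc.get("FailureReason", "no failure reason reported"))
```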

sagemaker-python-sdk/tensorflow_distributed_mnist/tensorflow_distributed_mnist.ipynb

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@
      "When distributed training happens, the same neural network will be sent to multiple training instances. Each instance will predict on a batch of the dataset, calculate the loss, and run the optimizer to minimize it. One entire loop of this process is called a **training step**.\n",
      "\n",
      "### Synchronizing training steps\n",
-     "A [global step](https://www.tensorflow.org/api_docs/python/tf/train/global_step) is a global variable shared between the instances. It necessary for distributed training, so the optimizer will keep track of the number of **training steps** between runs: \n",
+     "A [global step](https://www.tensorflow.org/api_docs/python/tf/train/global_step) is a global variable shared between the instances. It's necessary for distributed training, so the optimizer will keep track of the number of **training steps** between runs: \n",
      "\n",
      "```python\n",
      "train_op = optimizer.minimize(loss, tf.train.get_or_create_global_step())\n",

sagemaker-spark/README.md

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@
 
 These examples show how to use Amazon SageMaker for model training, hosting, and inference through Apache Spark using [SageMaker Spark](https://github.com/aws/sagemaker-spark). SageMaker Spark allows you to interleave Spark Pipeline stages with Pipeline stages that interact with Amazon SageMaker.
 
-- [MNIST with SageMaker PySpark](pyspark_mnist)
+- [MNIST with SageMaker PySpark](pyspark_mnist)
Lines changed: 280 additions & 0 deletions
@@ -0,0 +1,280 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SageMaker PySpark K-Means Clustering MNIST Example\n",
"\n",
"1. [Introduction](#Introduction)\n",
"2. [Setup](#Setup)\n",
"3. [Loading the Data](#Loading-the-Data)\n",
"4. [Training and Hosting a Model](#Training-and-Hosting-a-Model)\n",
"5. [Inference](#Inference)\n",
"6. [More on SageMaker Spark](#More-on-SageMaker-Spark)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
"This notebook will show how to classify handwritten digits using the K-Means clustering algorithm through the SageMaker PySpark library. We will train on Amazon SageMaker using K-Means clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.\n",
"\n",
"Unlike the other notebooks that demonstrate K-Means clustering on Amazon SageMaker, this notebook uses a SparkSession to manipulate data, and uses the SageMaker Spark library to interact with SageMaker using Spark Estimators and Transformers.\n",
"\n",
"You can visit SageMaker Spark's GitHub repository at https://github.com/aws/sagemaker-spark to learn more about SageMaker Spark.\n",
"\n",
"This notebook was created and tested on an ml.m4.xlarge notebook instance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"First, we import the necessary modules and create the SparkSession with the SageMaker Spark dependencies."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"\n",
"from pyspark import SparkContext, SparkConf\n",
"from pyspark.sql import SparkSession\n",
"\n",
"import sagemaker\n",
"from sagemaker import get_execution_role\n",
"import sagemaker_pyspark\n",
"\n",
"role = get_execution_role()\n",
"\n",
"# Configure Spark to use the SageMaker Spark dependency jars\n",
"jars = sagemaker_pyspark.classpath_jars()\n",
"\n",
"classpath = \":\".join(jars)\n",
"\n",
"# See the SageMaker Spark GitHub repo under sagemaker-pyspark-sdk\n",
"# to learn how to connect to a remote EMR cluster running Spark from a Notebook Instance.\n",
"spark = SparkSession.builder.config(\"spark.driver.extraClassPath\", classpath)\\\n",
"    .master(\"local[*]\").getOrCreate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the Data\n",
"\n",
"Now, we load the MNIST dataset into a Spark DataFrame. The dataset is available in LibSVM format at\n",
"\n",
"`s3://sagemaker-sample-data-[region]/spark/mnist/train/`\n",
"\n",
"where `[region]` is replaced with a supported AWS region, such as us-east-1.\n",
"\n",
"In order to train and make inferences, our input DataFrame must have a column of Doubles (named \"label\" by default) and a column of Vectors of Doubles (named \"features\" by default).\n",
"\n",
"Spark's LibSVM DataFrameReader loads a DataFrame already suitable for training and inference.\n",
"\n",
"Here, we load into a DataFrame in the SparkSession running on the local Notebook Instance, but you can connect your Notebook Instance to a remote Spark cluster for heavier workloads. Starting from EMR 5.11.0, SageMaker Spark is pre-installed on EMR Spark clusters. For more on connecting your SageMaker Notebook Instance to a remote EMR cluster, please see [this blog post](https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import boto3\n",
"\n",
"region = boto3.Session().region_name\n",
"\n",
"trainingData = spark.read.format('libsvm')\\\n",
"    .option('numFeatures', '784')\\\n",
"    .load('s3a://sagemaker-sample-data-{}/spark/mnist/train/'.format(region))\n",
"\n",
"testData = spark.read.format('libsvm')\\\n",
"    .option('numFeatures', '784')\\\n",
"    .load('s3a://sagemaker-sample-data-{}/spark/mnist/test/'.format(region))\n",
"\n",
"trainingData.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Training and Hosting a Model\n",
"Now we create a KMeansSageMakerEstimator, which uses the KMeans Amazon SageMaker Algorithm to train on our input data, and uses the KMeans Amazon SageMaker model image to host our model.\n",
"\n",
"Calling fit() on this estimator will train our model on Amazon SageMaker, and then create an Amazon SageMaker Endpoint to host our model.\n",
"\n",
"We can then use the SageMakerModel returned by this call to fit() to transform DataFrames using our hosted model.\n",
"\n",
"The following cell runs a training job and creates an endpoint to host the resulting model, so this cell can take up to twenty minutes to complete."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import random\n",
"\n",
"from sagemaker_pyspark import IAMRole, S3DataPath\n",
"from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator\n",
"\n",
"kmeans_estimator = KMeansSageMakerEstimator(\n",
"    sagemakerRole=IAMRole(role),\n",
"    trainingInstanceType='ml.m4.xlarge',\n",
"    trainingInstanceCount=1,\n",
"    endpointInstanceType='ml.m4.xlarge',\n",
"    endpointInitialInstanceCount=1)\n",
"\n",
"kmeans_estimator.setK(10)\n",
"kmeans_estimator.setFeatureDim(784)\n",
"\n",
"# train\n",
"model = kmeans_estimator.fit(trainingData)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Inference\n",
"\n",
"Now we transform our DataFrame.\n",
"To do this, we serialize each row's \"features\" Vector of Doubles into a Protobuf format for inference against the Amazon SageMaker Endpoint. We deserialize the Protobuf responses back into our DataFrame. This serialization and deserialization is handled automatically by the `transform()` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"transformedData = model.transform(testData)\n",
"\n",
"transformedData.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How well did the algorithm perform? Let us display the digits from each of the clusters and manually inspect the results:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from pyspark.sql.types import DoubleType\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"\n",
"# helper function to display a digit\n",
"def show_digit(img, caption='', xlabel='', subplot=None):\n",
"    if subplot==None:\n",
"        _,(subplot)=plt.subplots(1,1)\n",
"    imgr=img.reshape((28,28))\n",
"    subplot.axes.get_xaxis().set_ticks([])\n",
"    subplot.axes.get_yaxis().set_ticks([])\n",
"    plt.title(caption)\n",
"    plt.xlabel(xlabel)\n",
"    subplot.imshow(imgr, cmap='gray')\n",
"\n",
"images = np.array(transformedData.select(\"features\").cache().take(250))\n",
"clusters = transformedData.select(\"closest_cluster\").cache().take(250)\n",
"\n",
"for cluster in range(10):\n",
"    print('\\n\\n\\nCluster {}:'.format(int(cluster)))\n",
"    digits = [ img for l, img in zip(clusters, images) if int(l.closest_cluster) == cluster ]\n",
"    height=((len(digits)-1)//5)+1\n",
"    width=5\n",
"    plt.rcParams[\"figure.figsize\"] = (width,height)\n",
"    _, subplots = plt.subplots(height, width)\n",
"    subplots=np.ndarray.flatten(subplots)\n",
"    for subplot, image in zip(subplots, digits):\n",
"        show_digit(image, subplot=subplot)\n",
"    for subplot in subplots[len(digits):]:\n",
"        subplot.axis('off')\n",
"\n",
"    plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we don't need to make any more inferences, now we delete the endpoint:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Delete the endpoint\n",
"\n",
"from sagemaker_pyspark import SageMakerResourceCleanup\n",
"\n",
"resource_cleanup = SageMakerResourceCleanup(model.sagemakerClient)\n",
"resource_cleanup.deleteResources(model.getCreatedResources())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## More on SageMaker Spark\n",
"\n",
"The SageMaker Spark GitHub repository has more information about SageMaker Spark, including how to use SageMaker Spark with your own algorithms on Amazon SageMaker: https://github.com/aws/sagemaker-spark\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "conda_python3",
"language": "python",
"name": "conda_python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
},
"nbformat": 4,
"nbformat_minor": 2
}
