Pyspark kmeans clustering on MNIST #168
Conversation
Force-pushed from 029cba4 to caa8bc5
Overall looks good to me - just a few small suggestions.
> First, we load the MNIST dataset into a Spark Dataframe, which dataset is available in LibSVM format at
>
> s3://sagemaker-sample-data-[region, such as us-east-1]/spark/mnist/train/
this might be clearer as:

`s3://sagemaker-sample-data-[region]/spark/mnist/train/`

where `[region]` is replaced with a supported AWS region, such as `us-east-1`
Good idea, done.
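As a concrete illustration of the suggested wording, the sample-data path can be built from a region variable; a minimal sketch (the `us-east-1` value here is only an example, to be replaced with your own supported region):

```python
# Hypothetical sketch: build the sample-data S3 path from a region variable.
# "us-east-1" is only an example region; substitute your own.
region = "us-east-1"
train_path = "s3://sagemaker-sample-data-{}/spark/mnist/train/".format(region)
print(train_path)  # s3://sagemaker-sample-data-us-east-1/spark/mnist/train/
```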
```python
import os
import sagemaker_pyspark
import sagemaker
from sagemaker import get_execution_role
```
nitpick - Python import statements should be grouped by standard library imports, third-party imports, and local imports. So these should become something like:

```python
import os

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import sagemaker
import sagemaker_pyspark
from sagemaker import get_execution_role
```

reference: https://www.python.org/dev/peps/pep-0008/#imports
Thanks! Done.
```python
# replace this with your role ARN
kmeans_estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole(role),
```
does this mean `get_execution_role()` doesn't need to be called above?
Removed this comment.
```python
# Delete the endpoint
```
making a short text cell to explain deleting the endpoint instead of this comment might be more effective, but YMMV
Good idea, done.
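For reference, the cleanup behind that comment boils down to a `delete_endpoint` call against the SageMaker API; a hedged sketch using boto3 (the function name and the injectable `client` parameter are illustrative assumptions, not the notebook's actual code):

```python
def delete_sagemaker_endpoint(endpoint_name, client=None):
    """Delete a hosted SageMaker endpoint so it stops accruing charges.

    `client` is injectable for testing; by default a boto3 SageMaker
    client is created (which requires AWS credentials and a region).
    """
    if client is None:
        import boto3  # deferred import so a stub client can be passed in
        client = boto3.client("sagemaker")
    client.delete_endpoint(EndpointName=endpoint_name)
```

In the notebook itself the endpoint name would come from the fitted model rather than being passed in by hand.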
looks good to me!
Looks really good. I had a couple comments. Let me know if you want to discuss in person. Thanks for the contribution!
> or in the "license" file accompanying this file. This file is distributed
> on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
> express or implied. See the License for the specific language governing
> permissions and limitations under the License.
Minor point... You may want to add this to the Notebook's metadata (Edit -> Notebook Metadata) since putting it at the top seems a bit distracting.
> ## Introduction
> This notebook will show how to classify handwritten digits using the K-Means clustering algorithm through the SageMaker PySpark library. We will train on Amazon SageMaker using K-Means clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.
Can you add a sentence about how this k-means notebook differs from `kmeans_mnist.ipynb` (e.g. this one uses a Spark session to manipulate the data and call estimators/transformers)?
Done -- added to introduction
> We will train on Amazon SageMaker using the KMeans Clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.
>
> First, we load the MNIST dataset into a Spark Dataframe, which dataset is available in LibSVM format at
Can we add a sentence that provides clarity on where this Spark Dataframe is living? (i.e. It's in a Spark context on the Notebook Instance). And maybe a sentence that mentions how this could be extended to connect with a Spark EMR cluster to orchestrate large data manipulations there. Maybe include a link to this post too?
Done and done
```python
# replace this with your own region, such as us-east-1
region = 'us-east-1'
```
Can we set region dynamically with `boto3.Session().region_name`?
Done
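A sketch of that dynamic lookup, with a fallback for environments where boto3 is unavailable or no region is configured (the fallback value is an assumption, not part of the notebook):

```python
try:
    import boto3
    # region_name is None when no region is configured in the environment,
    # so fall back to a default (assumed here to be us-east-1)
    region = boto3.Session().region_name or "us-east-1"
except ImportError:
    region = "us-east-1"  # fallback when boto3 is not installed
```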
```python
    sagemakerRole=IAMRole(role),
    trainingInstanceType='ml.p2.xlarge',
    trainingInstanceCount=1,
    endpointInstanceType='ml.c4.xlarge',
```
Can we change both training and hosting to an ml.m4.xlarge instance? They are free-tier eligible and so preferred.
Yep, done
> ## Inference
>
> Now we transform our DataFrame.
> To do this, we serialize each row's "features" Vector of Doubles into a Protobuf format for inference against the Amazon SageMaker Endpoint. We deserialize the Protobuf responses back into our DataFrame:
Maybe add a sentence that says, "This serialization and deserialization is handled automatically by the `.transform()` method."
Good idea, done.
> ## Introduction
Can we add a comment like: "This notebook was created and tested on an ml.m4.xlarge notebook instance."?
Done
Great, thanks David! I also moved a couple of markdown cells related to data loading around, out of the introduction.
Awesome! Really excited to add these to the repository.
Update mxnet.ipynb
SageMaker Spark notebook that runs KMeans on MNIST with local Spark, conda_python3 kernel.