New version of pyspark mnist kmeans example notebook #233

Conversation
Thanks @JonathanTaws! The content is excellent. It might make sense to split this up into multiple notebooks, since the running time for the whole notebook is pretty long: one for the hybrid pipeline, one for the SageMaker-only pipeline, and one for the custom SageMakerEstimator. (I think it makes sense to keep re-using existing endpoints in this one.) I think it's fine to include the images in the notebooks; the effect of reasonably sized images on notebook load time should be negligible. Thoughts @djarpin?
I went ahead and split up the notebooks and included some diagrams. Any thoughts @andremoeller and @djarpin?
This is really great content. Thanks a ton for putting this together. I've left a few comments.
"7. [More on SageMaker Spark](#More-on-SageMaker-Spark)\n", | ||
"\n", | ||
"## Introduction\n", | ||
"This notebook will show how to classify handwritten digits through the SageMaker PySpark library. \n", |
I might change the wording to say we'll show how to "cluster handwritten digits" rather than "classify" them, just to avoid supervised/unsupervised learning confusion.
Note: I realize this same language shows up in the original k-means PySpark notebook, but we should probably change it there too.
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that we've defined the `Pipeline`, we can call fit on the training data. " |
Can you add a call-out here that the next code block will take several minutes to run? There's no immediate cell output, so it's hard to tell whether the cell is still working or something got stuck.
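A hedged sketch of what such a call-out could look like in the fit cell — `pipeline` and `trainingData` are names assumed from the surrounding notebook, not confirmed by the diff:

    # Call-out: this launches a SageMaker training job and typically takes
    # several minutes. There is no incremental cell output while it runs,
    # so a quiet cell does not mean something is stuck.
    model = pipeline.fit(trainingData)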
"7. [More on SageMaker Spark](#More-on-SageMaker-Spark)\n", | ||
"\n", | ||
"## Introduction\n", | ||
"This notebook will show how to classify handwritten digits through the SageMaker PySpark library. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might change the wording to say we'll show how to "cluster handwritten digits" rather than "classify" them just to avoid supervised/unsupervised learning confusion.
Note, I realize this same language shows up in the original k-means PySpark notebook, but we should probably change there too.
"metadata": {}, | ||
"source": [ | ||
"### Create a hybrid pipeline with Spark PCA and SageMaker K-Means\n", | ||
"To perform the clustering task, we will first running PCA on our feature vector, reducing it to 50 features. Then, we can use K-Means on the result of PCA to apply the final clustering. We will create a **Pipeline** consisting of 2 stages: the PCA stage, and the K-Means stage. \n", |
I think it's good to have both a notebook that shows PCA being run in Spark and one that shows PCA being run in SageMaker. I'm wondering if it makes sense to add a bit more detail about why you might pick one over the other, though. In this case, Spark PCA can stand in for any one of a large number of pre-processing steps that could be done in Spark, so I think it's key to make the point that the Spark PCA is representative of all of those.
Good point. Would a sentence such as the following capture this?
"The use of Spark MLlib PCA in this notebook is meant to showcase how you can use different pre-processing steps, ranging from data transformers to algorithms, with tools such as Spark MLlib that are well suited for data pre-processing. You can then use SageMaker algorithms and features through the SageMaker-Spark SDK. In our case here, PCA is in charge of reducing the feature vector as a pre-processing step, and K-Means is responsible for clustering the data."
This sounds good to me. Let's go with it. Thanks.
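For reference, a minimal sketch of what the hybrid pipeline cell might look like with that framing — the role ARN, instance types, and column names are assumptions for illustration, not the notebook's exact values:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import PCA
    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator
    from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer

    # Stage 1: Spark MLlib PCA reduces each 784-pixel image vector to 50 features.
    pca = PCA(k=50, inputCol="features", outputCol="projectedFeatures")

    # Stage 2: SageMaker K-Means clusters the reduced vectors. The format
    # options and serializer point the estimator at the PCA output column.
    kmeans = KMeansSageMakerEstimator(
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/ExampleRole"),  # placeholder ARN
        trainingSparkDataFormatOptions={"featuresColumnName": "projectedFeatures"},
        requestRowSerializer=ProtobufRequestRowSerializer(
            featuresColumnName="projectedFeatures"),
        trainingInstanceType="ml.m4.xlarge",
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1)
    kmeans.setK(10)           # ten digit classes -> ten clusters
    kmeans.setFeatureDim(50)  # dimensionality after PCA

    pipeline = Pipeline(stages=[pca, kmeans])

Here Spark MLlib does the pre-processing and SageMaker does the clustering, which is exactly the division of labor the proposed sentence describes.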
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that we've defined the `Pipeline`, we can call fit on the training data. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a call-out here that the next code block will take a several minutes to run? There's no immediate cell output, so it's hard to tell if the cell is still working or if something got stuck.
"## Introduction\n", | ||
"This notebook will show how to classify handwritten digits using the K-Means clustering algorithm through the SageMaker PySpark library. We will train on Amazon SageMaker using K-Means clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.\n", | ||
"This notebook will show how to classify handwritten digits through the SageMaker PySpark library. \n", |
Same "classify" to "cluster" comment as below.
@@ -34,18 +32,19 @@
 "source": [
 "## Setup\n",
 "\n",
-"First, we import the necessary modules and create the SparkSession with the SageMaker Spark dependencies."
+"First, we import the necessary modules and create the SparkSession and `SparkSession` with the SageMaker-Spark dependencies attached. "
Do we need the double "SparkSession and `SparkSession`" here?
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from pyspark.sql.types import DoubleType\n", |
Same note as below on the "%matplotlib inline"
"pygments_lexer": "ipython3", | ||
"version": "3.6.4" | ||
}, | ||
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." |
Can you change to 2018?
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's train this estimator by calling fit on it with the training data." |
Can you add a call-out to this one and the following cells that they will take a bit of time to run? Thanks.
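Something along these lines would do it — a sketch only, with assumed variable names:

    # Call-out: fit() starts a SageMaker training job; expect several minutes
    # with no cell output until the job completes.
    kmeans_model = kmeans_estimator.fit(trainingData)

    # The inference cells that follow also take a few minutes each, since
    # transform() calls the hosted SageMaker endpoint.
    transformed = kmeans_model.transform(testData)
    transformed.show(5)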
I am really looking forward to trying out this set of sample notebooks after the PR is merged.
Thanks @djarpin for looking into this. I've taken all the changes into account, and they have been pushed to the PR.
Awesome! In that case we're good to go if there's nothing else to address.
Proposal for a new version and revision of the content for the pyspark mnist kmeans example notebook.
Main changes:
I also have a few diagrams for these that I can include. As they are images, however, they might increase the notebook's load time in the Jupyter browser.