New version of pyspark mnist kmeans example notebook #233

Conversation
Thanks @JonathanTaws! The content is excellent. It might make sense to split this up into multiple notebooks, since the running time for the whole notebook is pretty long: one for the hybrid pipeline, one for the SageMaker-only pipeline, and one for the custom SageMakerEstimator. (I think it makes sense to keep re-using existing endpoints in this one.) I think it's fine to include the images in the notebooks; the effect of reasonably sized images on notebook load time should be negligible. Thoughts @djarpin?
I went ahead and split up the notebooks and included some diagrams. Any thoughts @andremoeller and @djarpin?
This is really great content. Thanks a ton for putting this together. I've left a few comments.
"7. [More on SageMaker Spark](#More-on-SageMaker-Spark)\n", | ||
"\n", | ||
"## Introduction\n", | ||
"This notebook will show how to classify handwritten digits through the SageMaker PySpark library. \n", |
I might change the wording to say we'll show how to "cluster handwritten digits" rather than "classify" them, just to avoid supervised/unsupervised learning confusion.
Note: I realize this same language shows up in the original k-means PySpark notebook, but we should probably change it there too.
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that we've defined the `Pipeline`, we can call fit on the training data. " |
Can you add a call-out here that the next code block will take several minutes to run? There's no immediate cell output, so it's hard to tell whether the cell is still working or something got stuck.
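A hedged sketch of what such a call-out could look like in the fit cell — `pipeline` and `trainingData` are names assumed from the surrounding notebook, not confirmed by the diff:

    # Call-out: this launches a SageMaker training job and typically takes
    # several minutes. There is no incremental cell output while it runs,
    # so a quiet cell does not mean something is stuck.
    model = pipeline.fit(trainingData)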
"7. [More on SageMaker Spark](#More-on-SageMaker-Spark)\n", | ||
"\n", | ||
"## Introduction\n", | ||
"This notebook will show how to classify handwritten digits through the SageMaker PySpark library. \n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might change the wording to say we'll show how to "cluster handwritten digits" rather than "classify" them just to avoid supervised/unsupervised learning confusion.
Note, I realize this same language shows up in the original k-means PySpark notebook, but we should probably change there too.
"metadata": {}, | ||
"source": [ | ||
"### Create a hybrid pipeline with Spark PCA and SageMaker K-Means\n", | ||
"To perform the clustering task, we will first running PCA on our feature vector, reducing it to 50 features. Then, we can use K-Means on the result of PCA to apply the final clustering. We will create a **Pipeline** consisting of 2 stages: the PCA stage, and the K-Means stage. \n", |
I think it's good to have both a notebook that shows PCA being run in Spark and one that shows PCA being run in SageMaker. I'm wondering if it makes sense to add a bit more detail about why you might pick one over the other, though. In this case, Spark PCA can stand in for any one of a large number of pre-processing steps that could be done in Spark, so I think it's key to make the point that the Spark PCA is representative of all of those.
Good point. Would a sentence such as the following capture this?
"The use of Spark MLlib PCA in this notebook is meant to showcase how you can use different pre-processing steps, ranging from data transformers to algorithms, with tools such as Spark MLlib that are well suited for data pre-processing. You can then use SageMaker algorithms and features through the SageMaker-Spark SDK. In our case here, PCA is in charge of reducing the feature vector as a pre-processing step, and K-Means is responsible for clustering the data."
This sounds good to me. Let's go with it. Thanks.
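For reference, a minimal sketch of what the hybrid pipeline cell might look like with that framing — the role ARN, instance types, and column names are assumptions for illustration, not the notebook's exact values:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import PCA
    from sagemaker_pyspark import IAMRole
    from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator
    from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer

    # Stage 1: Spark MLlib PCA reduces each 784-pixel image vector to 50 features.
    pca = PCA(k=50, inputCol="features", outputCol="projectedFeatures")

    # Stage 2: SageMaker K-Means clusters the reduced vectors. The format
    # options and serializer point the estimator at the PCA output column.
    kmeans = KMeansSageMakerEstimator(
        sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/ExampleRole"),  # placeholder ARN
        trainingSparkDataFormatOptions={"featuresColumnName": "projectedFeatures"},
        requestRowSerializer=ProtobufRequestRowSerializer(
            featuresColumnName="projectedFeatures"),
        trainingInstanceType="ml.m4.xlarge",
        trainingInstanceCount=1,
        endpointInstanceType="ml.m4.xlarge",
        endpointInitialInstanceCount=1)
    kmeans.setK(10)           # ten digit classes -> ten clusters
    kmeans.setFeatureDim(50)  # dimensionality after PCA

    pipeline = Pipeline(stages=[pca, kmeans])

Here Spark MLlib does the pre-processing and SageMaker does the clustering, which is exactly the division of labor the proposed sentence describes.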
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now that we've defined the `Pipeline`, we can call fit on the training data. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a call-out here that the next code block will take a several minutes to run? There's no immediate cell output, so it's hard to tell if the cell is still working or if something got stuck.
"## Introduction\n", | ||
"This notebook will show how to classify handwritten digits using the K-Means clustering algorithm through the SageMaker PySpark library. We will train on Amazon SageMaker using K-Means clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.\n", | ||
"This notebook will show how to classify handwritten digits through the SageMaker PySpark library. \n", |
Same "classify" to "cluster" comment as below.
@@ -34,18 +32,19 @@
 "source": [
 "## Setup\n",
 "\n",
-"First, we import the necessary modules and create the SparkSession with the SageMaker Spark dependencies."
+"First, we import the necessary modules and create the SparkSession and `SparkSession` with the SageMaker-Spark dependencies attached. "
Do we need the double "SparkSession and `SparkSession`" here?
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from pyspark.sql.types import DoubleType\n", |
Same note as below on the "%matplotlib inline"
"pygments_lexer": "ipython3", | ||
"version": "3.6.4" | ||
}, | ||
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License." |
Can you change to 2018?
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Let's train this estimator by calling fit on it with the training data." |
Can you add a call-out to this one and the following cells that they will take a bit of time to run? Thanks.
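Something along these lines would do it — a sketch only, with assumed variable names:

    # Call-out: fit() starts a SageMaker training job; expect several minutes
    # with no cell output until the job completes.
    kmeans_model = kmeans_estimator.fit(trainingData)

    # The inference cells that follow also take a few minutes each, since
    # transform() calls the hosted SageMaker endpoint.
    transformed = kmeans_model.transform(testData)
    transformed.show(5)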
I am really looking forward to trying out this set of sample notebooks after the PR is merged.
Thanks @djarpin for looking into this. I've taken all the changes into account, and they have been pushed to the PR.
Awesome! In that case we're good to go if there's nothing else to address.
Proposal for a new version and revision of the content for the pyspark mnist kmeans example notebook.
Main changes:
I also have a few diagrams for these that I can include. As they are images, however, they might increase the notebook's load time in the Jupyter browser.