
New version of pyspark mnist kmeans example notebook #233


Merged

merged 4 commits into aws:master on May 24, 2018

Conversation

JonathanTaws
Contributor

Proposal for a new version of the pyspark mnist kmeans example notebook, with revised content.

Main changes:

  • Included an example of using a Spark pipeline, in both hybrid and SageMaker-only configurations
  • Showed how to create a custom SageMakerEstimator with the SageMaker-Spark SDK
  • Showcased re-use of endpoints, models, and training job data
  • Enhanced the clean-up step so that models created in Pipelines are also deleted

I also have a few diagrams for these that I can include. Since they are images, however, they might increase the notebook's load time in the Jupyter browser.
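For reviewers who want a feel for the hybrid pipeline before opening the notebook, it boils down to something like the following. This is a sketch only: `role` and `trainingData` are assumed to exist already, and the exact column-wiring options may differ slightly between sagemaker_pyspark versions.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import PCA
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer

# Stage 1: Spark MLlib PCA reduces the 784-pixel MNIST vectors to 50 features.
pca = PCA(k=50, inputCol="features", outputCol="projectedFeatures")

# Stage 2: SageMaker K-Means trains and hosts on the PCA output column.
kmeans = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole(role),  # 'role' is an assumed IAM role ARN
    trainingSparkDataFormatOptions={"featuresColumnName": "projectedFeatures"},
    requestRowSerializer=ProtobufRequestRowSerializer(
        featuresColumnName="projectedFeatures"),
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.t2.medium",
    endpointInitialInstanceCount=1)
kmeans.setK(10)
kmeans.setFeatureDim(50)  # 50 features after PCA

# PCA runs on the Spark cluster; K-Means training and hosting run on SageMaker.
pipeline = Pipeline(stages=[pca, kmeans])
```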

@andremoeller
Contributor

Thanks, @JonathanTaws!

The content is excellent. It might make sense to split this up into multiple notebooks, since the running time for the whole notebook is pretty long: one for the hybrid pipeline, one for the SageMaker-only pipeline, and one for the custom SageMakerEstimator. (I think it makes sense to keep the re-use of existing endpoints in this one.)

I think it's fine to include the images in the notebooks. The effect of reasonably sized images on notebook load time should be negligible.

Thoughts, @djarpin?

@JonathanTaws
Contributor Author

I went ahead and split up the notebooks and included some diagrams. Any thoughts, @andremoeller and @djarpin?
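For context on the custom-estimator notebook: the generic `SageMakerEstimator` from sagemaker_pyspark is constructed roughly as below. This is a sketch; the image URI variable, serializer choices, and hyperparameter values are illustrative assumptions, not the notebook's exact code.

```python
from sagemaker_pyspark import SageMakerEstimator, IAMRole
from sagemaker_pyspark.transformation.serializers import ProtobufRequestRowSerializer
from sagemaker_pyspark.transformation.deserializers import KMeansProtobufResponseRowDeserializer

# A "custom" estimator: point SageMaker-Spark at any training/inference
# container and describe how rows are (de)serialized on the wire.
estimator = SageMakerEstimator(
    trainingImage=kmeans_image_uri,   # assumed: ECR URI of the K-Means container
    modelImage=kmeans_image_uri,
    requestRowSerializer=ProtobufRequestRowSerializer(),
    responseRowDeserializer=KMeansProtobufResponseRowDeserializer(),
    hyperParameters={"k": "10", "feature_dim": "784"},
    sagemakerRole=IAMRole(role),      # assumed IAM role ARN
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.t2.medium",
    endpointInitialInstanceCount=1,
    trainingSparkDataFormat="sagemaker")
```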

Contributor

@djarpin left a comment

This is really great content. Thanks a ton for putting this together. I've left a few comments.

"7. [More on SageMaker Spark](#More-on-SageMaker-Spark)\n",
"\n",
"## Introduction\n",
"This notebook will show how to classify handwritten digits through the SageMaker PySpark library. \n",
Contributor

I might change the wording to say we'll show how to "cluster handwritten digits" rather than "classify" them just to avoid supervised/unsupervised learning confusion.

Note, I realize this same language shows up in the original k-means PySpark notebook, but we should probably change there too.

"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've defined the `Pipeline`, we can call fit on the training data. "
Contributor

Can you add a call-out here that the next code block will take several minutes to run? There's no immediate cell output, so it's hard to tell whether the cell is still working or something got stuck.
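For instance, a note in the markdown plus a comment on the cell itself would do it (a sketch; `pipeline` and `trainingData` are the notebook's own variable names):

```python
# NOTE: this cell launches a SageMaker training job and then deploys an
# endpoint, so it typically runs for several minutes with no output below
# the cell. The kernel isn't stuck; progress is visible in the SageMaker console.
model = pipeline.fit(trainingData)
```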

"7. [More on SageMaker Spark](#More-on-SageMaker-Spark)\n",
"\n",
"## Introduction\n",
"This notebook will show how to classify handwritten digits through the SageMaker PySpark library. \n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might change the wording to say we'll show how to "cluster handwritten digits" rather than "classify" them just to avoid supervised/unsupervised learning confusion.

Note, I realize this same language shows up in the original k-means PySpark notebook, but we should probably change there too.

"metadata": {},
"source": [
"### Create a hybrid pipeline with Spark PCA and SageMaker K-Means\n",
"To perform the clustering task, we will first running PCA on our feature vector, reducing it to 50 features. Then, we can use K-Means on the result of PCA to apply the final clustering. We will create a **Pipeline** consisting of 2 stages: the PCA stage, and the K-Means stage. \n",
Contributor

I think it's good to have both a notebook that shows PCA being run in Spark versus PCA being run in SageMaker. I'm wondering if it makes sense to add a bit more detail about why you may pick one over the other though. I think in this case, Spark PCA can represent any one of a large number of pre-processing steps that could be done in Spark. So, I think it's key to hit that point that the Spark PCA is representative of all of those.

Contributor Author

Good point. Would a sentence such as the following capture this?
" The use of Spark MLLib PCA in this notebook is meant to showcase how you can use different pre-processting steps, ranging from data transformers to algorithms, with tools such as Spark MLLib that are well suited for data pre-processing. You can then use SageMaker algorithms and features through the SageMaker-Spark SDK. Here in our case, PCA is in charge of reducing the feature vector as a pre-processing step, and K-Means responsible for clustering the data. "

Contributor

This sounds good to me. Let's go with it. Thanks.
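(As an aside, in the SageMaker-only companion notebook the reduction step is handed to SageMaker's own PCA algorithm instead, so the Spark cluster does no heavy lifting beyond data movement. A sketch, assuming sagemaker_pyspark's `PCASageMakerEstimator` exposes setters matching the algorithm's `num_components` and `feature_dim` hyperparameters:)

```python
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import PCASageMakerEstimator

# SageMaker-only alternative: PCA also runs as a SageMaker training job.
pca_sm = PCASageMakerEstimator(
    sagemakerRole=IAMRole(role),  # assumed IAM role ARN
    trainingInstanceType="ml.m4.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.t2.medium",
    endpointInitialInstanceCount=1)
pca_sm.setNumComponents(50)  # assumed setter for num_components
pca_sm.setFeatureDim(784)    # assumed setter for feature_dim
```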

"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've defined the `Pipeline`, we can call fit on the training data. "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a call-out here that the next code block will take a several minutes to run? There's no immediate cell output, so it's hard to tell if the cell is still working or if something got stuck.

"## Introduction\n",
"This notebook will show how to classify handwritten digits using the K-Means clustering algorithm through the SageMaker PySpark library. We will train on Amazon SageMaker using K-Means clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.\n",
"This notebook will show how to classify handwritten digits through the SageMaker PySpark library. \n",
Contributor

Same "classify" to "cluster" comment as below.

@@ -34,18 +32,19 @@
"source": [
"## Setup\n",
"\n",
"First, we import the necessary modules and create the SparkSession with the SageMaker Spark dependencies."
"First, we import the necessary modules and create the SparkSession and `SparkSession` with the SageMaker-Spark dependencies attached. "
Contributor

Do we need the double "SparkSession and SparkSession" here?
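For reference, attaching the dependencies looks like this in the published notebooks (a sketch of the standard sagemaker_pyspark session setup):

```python
from pyspark.sql import SparkSession
import sagemaker_pyspark

# Put the SageMaker-Spark JARs shipped with the package on the classpath
# so the JVM side of the SDK is available to the session.
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", classpath)
         .getOrCreate())
```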

"metadata": {},
"outputs": [],
"source": [
"from pyspark.sql.types import DoubleType\n",
Contributor

Same note as below on the "%matplotlib inline"

"pygments_lexer": "ipython3",
"version": "3.6.4"
},
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
Contributor

Can you change to 2018?

"cell_type": "markdown",
"metadata": {},
"source": [
"Let's train this estimator by calling fit on it with the training data."
Contributor

Can you call out that this cell and the following ones will take a bit of time to run? Thanks.
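Relatedly, each of these `fit` calls creates a training job, a model, and an endpoint, which is what the enhanced clean-up section deletes at the end. Roughly (a sketch; `model` is assumed to be the fitted SageMakerModel):

```python
from sagemaker_pyspark import SageMakerResourceCleanup

# Delete the model, endpoint config, and endpoint created during fit()
# so the example doesn't leave billable resources behind.
cleanup = SageMakerResourceCleanup(model.sagemakerClient)
cleanup.deleteResources(model.getCreatedResources())
```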

@lynnlangit

I am really looking forward to trying out this set of sample notebooks after the PR is merged.

@JonathanTaws
Contributor Author

Thanks @djarpin for looking into this. I've taken all the changes into account and pushed them to the PR.
Let's review exactly how we want to frame the PCA on Spark vs. PCA on SageMaker part.

@JonathanTaws
Contributor Author

JonathanTaws commented May 12, 2018 via email

@djarpin djarpin merged commit 7e1c06a into aws:master May 24, 2018
atqy pushed a commit to atqy/amazon-sagemaker-examples that referenced this pull request Aug 16, 2022
Replace 'collections.ts' with COLLECTIONS_FILE_NAME
atqy pushed a commit to atqy/amazon-sagemaker-examples that referenced this pull request Aug 16, 2022
* update tests to use pytest
* Added cov-append

Co-authored-by: Vikas-kum <[email protected]>