
Pyspark kmeans clustering on MNIST #168

Merged — 6 commits merged into aws:master on Jan 25, 2018

Conversation

andremoeller
Contributor

SageMaker Spark notebook that runs KMeans on MNIST with local Spark, using the conda_python3 kernel.

Contributor

@laurenyu laurenyu left a comment

overall looks good to me - just a few small suggestions

"\n",
"First, we load the MNIST dataset into a Spark Dataframe, which dataset is available in LibSVM format at\n",
"\n",
"s3://sagemaker-sample-data-[region, such as us-east-1]/spark/mnist/train/"
Contributor

this might be clearer as:

s3://sagemaker-sample-data-[region]/spark/mnist/train/

where [region] is replaced with a supported AWS region, such as us-east-1

Contributor Author

Good idea, done.

"import os\n",
"import sagemaker_pyspark\n",
"import sagemaker\n",
"from sagemaker import get_execution_role\n",
Contributor

nitpick - Python import statements should be grouped by standard library imports, third-party imports, and local imports. So these should become something like:

import os

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import sagemaker
import sagemaker_pyspark
from sagemaker import get_execution_role

reference: https://www.python.org/dev/peps/pep-0008/#imports

Contributor Author

Thanks! Done.

"\n",
"# replace this with your role ARN\n",
"kmeans_estimator = KMeansSageMakerEstimator(\n",
" sagemakerRole=IAMRole(role),\n",
Contributor

does this mean get_execution_role() doesn't need to be called above?

Contributor Author

Removed this comment.
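
For context, a minimal sketch of how the role wiring could look once get_execution_role() supplies the role instead of a hard-coded ARN. The parameter names mirror the snippet quoted above; the instance types, counts, and the endpointInitialInstanceCount parameter are illustrative assumptions rather than the notebook's final values.

from sagemaker import get_execution_role
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import KMeansSageMakerEstimator

# Use the notebook instance's execution role rather than pasting a role ARN.
role = get_execution_role()

# Instance types and counts are placeholders; adjust to whatever the notebook settles on.
kmeans_estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole(role),
    trainingInstanceType='ml.m4.xlarge',
    trainingInstanceCount=1,
    endpointInstanceType='ml.m4.xlarge',
    endpointInitialInstanceCount=1)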

},
"outputs": [],
"source": [
"# Delete the endpoint\n",
Contributor

making a short text cell to explain deleting the endpoint instead of this comment might be more effective, but YMMV

Contributor Author

Good idea, done.
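
For reference, one way that cleanup cell could look, assuming model is the SageMakerModel returned by the estimator's fit() call and using the SageMakerResourceCleanup helper from sagemaker_pyspark:

from sagemaker_pyspark import SageMakerResourceCleanup

# Tear down the endpoint (and related SageMaker resources) created by this notebook.
resource_cleanup = SageMakerResourceCleanup(model.sagemakerClient)
resource_cleanup.deleteResources(model.getCreatedResources())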

Contributor

@laurenyu laurenyu left a comment

looks good to me!

Contributor

@djarpin djarpin left a comment

Looks really good. I had a couple comments. Let me know if you want to discuss in person. Thanks for the contribution!

"or in the \"license\" file accompanying this file. This file is distributed\n",
"on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either\n",
"express or implied. See the License for the specific language governing\n",
"permissions and limitations under the License."
Contributor

Minor point... You may want to add this to the Notebook's metadata (Edit -> Notebook Metadata) since putting it at the top seems a bit distracting.

"metadata": {},
"source": [
"## Introduction\n",
"This notebook will show how to classify handwritten digits using the K-Means clustering algorithm through the SageMaker PySpark library. We will train on Amazon SageMaker using K-Means clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.\n",
Contributor

Can you add a sentence about how this k-means notebook differs from kmeans_mnist.ipynb (e.g. this one uses a Spark session to manipulate the data and call estimators/transformers).

Contributor Author

Done -- added to introduction

"\n",
"We will train on Amazon SageMaker using the KMeans Clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.\n",
"\n",
"First, we load the MNIST dataset into a Spark Dataframe, which dataset is available in LibSVM format at\n",
Contributor

Can we add a sentence that provides clarity on where this Spark Dataframe is living? (i.e. It's in a Spark context on the Notebook Instance). And maybe a sentence that mentions how this could be extended to connect with a Spark EMR cluster to orchestrate large data manipulations there. Maybe include a link to this post too?

Contributor Author

Done and done
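
For reference, a minimal sketch of that loading step against the local Spark session running on the notebook instance, following the bucket-path convention quoted earlier (variable names are illustrative):

import boto3
from pyspark.sql import SparkSession

import sagemaker_pyspark

# Put the SageMaker Spark JARs on the driver classpath of the local Spark session.
classpath = ":".join(sagemaker_pyspark.classpath_jars())
spark = SparkSession.builder \
    .config("spark.driver.extraClassPath", classpath) \
    .getOrCreate()

# Resolve the region dynamically rather than hard-coding it.
region = boto3.Session().region_name

# MNIST in LibSVM format; each image is a 784-dimensional (28x28) feature vector.
training_data = spark.read.format("libsvm") \
    .option("numFeatures", "784") \
    .load("s3a://sagemaker-sample-data-{}/spark/mnist/train/".format(region))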

"outputs": [],
"source": [
"# replace this with your own region, such as us-east-1\n",
"region = 'us-east-1'\n",
Contributor

Can we set region dynamically with boto3.Session().region_name?

Contributor Author

Done

" sagemakerRole=IAMRole(role),\n",
" trainingInstanceType='ml.p2.xlarge',\n",
" trainingInstanceCount=1,\n",
" endpointInstanceType='ml.c4.xlarge',\n",
Contributor

Can we change both training and hosting to an ml.m4.xlarge instance? They are free-tier eligible and so preferred.

Contributor Author

Yep, done

"## Inference\n",
"\n",
"Now we transform our DataFrame.\n",
"To do this, we serialize each row's \"features\" Vector of Doubles into a Protobuf format for inference against the Amazon SageMaker Endpoint. We deserialize the Protobuf responses back into our DataFrame:"
Contributor

Maybe add a sentence that says, "This serialization and deserialization is handled automatically by the .transform() method."

Contributor Author

Good idea, done.
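
For reference, a sketch of that inference step, assuming model is the SageMakerModel returned by kmeans_estimator.fit() and test_data is a DataFrame with a "features" column (names are illustrative):

# fit() runs a SageMaker training job and deploys the trained model to an endpoint.
model = kmeans_estimator.fit(training_data)

# transform() serializes each row's "features" vector to protobuf, invokes the
# endpoint, and deserializes the responses into new columns on the DataFrame.
transformed_data = model.transform(test_data)
transformed_data.show()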

"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction\n",
Contributor

Can we add a comment like "This notebook was created and tested on an ml.m4.xlarge notebook instance."?

Contributor Author

Done

@andremoeller
Contributor Author

Great, thanks David! I also moved a couple of markdown cells related to data loading around, out of the introduction.

Contributor

@djarpin djarpin left a comment

Awesome! Really excited to add these to the repository.

@djarpin djarpin merged commit cca2db6 into aws:master Jan 25, 2018
atqy pushed a commit to atqy/amazon-sagemaker-examples that referenced this pull request Aug 16, 2022