Pyspark kmeans clustering on MNIST #168
Conversation
Force-pushed from 029cba4 to caa8bc5
Overall looks good to me - just a few small suggestions.
> First, we load the MNIST dataset into a Spark Dataframe, which dataset is available in LibSVM format at
>
> s3://sagemaker-sample-data-[region, such as us-east-1]/spark/mnist/train/
this might be clearer as:

`s3://sagemaker-sample-data-[region]/spark/mnist/train/`

where `[region]` is replaced with a supported AWS region, such as `us-east-1`
Good idea, done.
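As a concrete illustration of the suggested wording, the sample-data path can be built from a region variable; a minimal sketch (the `us-east-1` value here is only an example, to be replaced with your own supported region):

```python
# Hypothetical sketch: build the sample-data S3 path from a region variable.
# "us-east-1" is only an example region; substitute your own.
region = "us-east-1"
train_path = "s3://sagemaker-sample-data-{}/spark/mnist/train/".format(region)
print(train_path)  # s3://sagemaker-sample-data-us-east-1/spark/mnist/train/
```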
```python
import os
import sagemaker_pyspark
import sagemaker
from sagemaker import get_execution_role
```
nitpick - Python import statements should be grouped by standard library imports, third-party imports, and local imports. So these should become something like:

```python
import os

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession

import sagemaker
import sagemaker_pyspark
from sagemaker import get_execution_role
```

reference: https://www.python.org/dev/peps/pep-0008/#imports
Thanks! Done.
```python
# replace this with your role ARN
kmeans_estimator = KMeansSageMakerEstimator(
    sagemakerRole=IAMRole(role),
```
does this mean `get_execution_role()` doesn't need to be called above?
Removed this comment.
```python
# Delete the endpoint
```
making a short text cell to explain deleting the endpoint instead of this comment might be more effective, but YMMV
Good idea, done.
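For reference, the cleanup behind that comment boils down to a `delete_endpoint` call against the SageMaker API; a hedged sketch using boto3 (the function name and the injectable `client` parameter are illustrative assumptions, not the notebook's actual code):

```python
def delete_sagemaker_endpoint(endpoint_name, client=None):
    """Delete a hosted SageMaker endpoint so it stops accruing charges.

    `client` is injectable for testing; by default a boto3 SageMaker
    client is created (which requires AWS credentials and a region).
    """
    if client is None:
        import boto3  # deferred import so a stub client can be passed in
        client = boto3.client("sagemaker")
    client.delete_endpoint(EndpointName=endpoint_name)
```

In the notebook itself the endpoint name would come from the fitted model rather than being passed in by hand.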
looks good to me!
Looks really good. I had a couple comments. Let me know if you want to discuss in person. Thanks for the contribution!
> or in the "license" file accompanying this file. This file is distributed
> on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
> express or implied. See the License for the specific language governing
> permissions and limitations under the License.
Minor point... You may want to add this to the Notebook's metadata (Edit -> Notebook Metadata) since putting it at the top seems a bit distracting.
> ## Introduction
> This notebook will show how to classify handwritten digits using the K-Means clustering algorithm through the SageMaker PySpark library. We will train on Amazon SageMaker using K-Means clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.
Can you add a sentence about how this k-means notebook differs from `kmeans_mnist.ipynb` (e.g. this one uses a Spark session to manipulate the data and call estimators/transformers)?
Done -- added to introduction
> We will train on Amazon SageMaker using the KMeans Clustering on the MNIST dataset, host the trained model on Amazon SageMaker, and then make predictions against that hosted model.
>
> First, we load the MNIST dataset into a Spark Dataframe, which dataset is available in LibSVM format at
Can we add a sentence that provides clarity on where this Spark Dataframe is living? (i.e. It's in a Spark context on the Notebook Instance). And maybe a sentence that mentions how this could be extended to connect with a Spark EMR cluster to orchestrate large data manipulations there. Maybe include a link to this post too?
Done and done
```python
# replace this with your own region, such as us-east-1
region = 'us-east-1'
```
Can we set region dynamically with `boto3.Session().region_name`?
Done
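A sketch of that dynamic lookup, with a fallback for environments where boto3 is unavailable or no region is configured (the fallback value is an assumption, not part of the notebook):

```python
try:
    import boto3
    # region_name is None when no region is configured in the environment,
    # so fall back to a default (assumed here to be us-east-1)
    region = boto3.Session().region_name or "us-east-1"
except ImportError:
    region = "us-east-1"  # fallback when boto3 is not installed
```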
```python
    sagemakerRole=IAMRole(role),
    trainingInstanceType='ml.p2.xlarge',
    trainingInstanceCount=1,
    endpointInstanceType='ml.c4.xlarge',
```
Can we change both training and hosting to an ml.m4.xlarge instance? They are free-tier eligible and so preferred.
Yep, done
> ## Inference
>
> Now we transform our DataFrame.
> To do this, we serialize each row's "features" Vector of Doubles into a Protobuf format for inference against the Amazon SageMaker Endpoint. We deserialize the Protobuf responses back into our DataFrame:
Maybe add a sentence that says, "This serialization and deserialization is handled automatically by the `.transform()` method."
Good idea, done.
> ## Introduction
Can we add a comment like: "This notebook was created and tested on an ml.m4.xlarge notebook instance."?
Done
Great, thanks David! I also moved a couple of markdown cells related to data loading around, out of the introduction.
Awesome! Really excited to add these to the repository.
Update mxnet.ipynb
SageMaker Spark notebook that runs KMeans on MNIST with local Spark, conda_python3 kernel.