2 | 2 | "cells": [
3 | 3 | {
4 | 4 | "cell_type": "markdown",
5 | | - "id": "ee5abd3d",
| 5 | + "id": "e950fa8e",
6 | 6 | "metadata": {},
7 | 7 | "source": [
8 | | - "# SKLearn Script Mode + Bring Your Own Model\n",
| 8 | + "# Train a SKLearn Model using Script Mode\n",
9 | 9 | "\n",
10 | | - "- [Documentation](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html)\n",
11 | | - "- Dataset: [Iris](https://archive.ics.uci.edu/ml/datasets/iris)"
| 10 | + "The aim of this notebook is to demonstrate how to train and deploy a scikit-learn model in Amazon SageMaker. The method used is called Script Mode, in which we write a script to train our model and submit it to the SageMaker Python SDK. For more information, feel free to read [Using Scikit-learn with the SageMaker Python SDK](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/using_sklearn.html).\n",
| 11 | + "\n",
| 12 | + "## Runtime\n",
| 13 | + "This notebook takes approximately 15 minutes to run.\n",
| 14 | + "\n",
| 15 | + "## Contents\n",
| 16 | + "1. [Download data](#Download-data)\n",
| 17 | + "1. [Prepare data](#Prepare-data)\n",
| 18 | + "1. [Train model](#Train-model)\n",
| 19 | + "1. [Deploy and test endpoint](#Deploy-and-test-endpoint)\n",
| 20 | + "1. [Cleanup](#Cleanup)"
12 | 21 | ]
13 | 22 | },
14 | 23 | {
15 | 24 | "cell_type": "markdown",
16 | | - "id": "04fb9160",
| 25 | + "id": "a16db1a6",
17 | 26 | "metadata": {},
18 | 27 | "source": [
19 | | - "# Read Data"
| 28 | + "## Download data\n",
| 29 | + "Download the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris), which is the dataset used to train the model in this demo."
20 | 30 | ]
21 | 31 | },
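A minimal sketch of this step, assuming the Iris data is read directly from the UCI repository with pandas; the file name and column names below are illustrative and may differ from the collapsed code cell:

```python
import pandas as pd

# Assumed source: the UCI copy of the Iris data (no header row in the file).
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]

df = pd.read_csv(url, header=None, names=columns)
df.head()
```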
22 | 32 | {
23 | 33 | "cell_type": "code",
24 | 34 | "execution_count": null,
25 | | - "id": "1de023a4",
| 35 | + "id": "a670c242",
26 | 36 | "metadata": {},
27 | 37 | "outputs": [],
28 | 38 | "source": [

39 | 49 | "df.head()"
40 | 50 | ]
41 | 51 | },
| 52 | + {
| 53 | + "cell_type": "markdown",
| 54 | + "id": "7c03b3d2",
| 55 | + "metadata": {},
| 56 | + "source": [
| 57 | + "## Prepare data\n",
| 58 | + "Next, we prepare the data for training by first converting the labels from strings to integers. Then we split the data into a train dataset (80% of the data) and a test dataset (the remaining 20%) and save them as CSV files. These files are then uploaded to S3, where the SageMaker SDK can access them to train the model."
| 59 | + ]
| 60 | + },
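A minimal sketch of this preparation step, assuming pandas and scikit-learn are available and that the default SageMaker bucket is used; the label encoding, prefix, and variable names are illustrative:

```python
import sagemaker
from sklearn.model_selection import train_test_split

# Map the string class labels to integer codes (encoding order is illustrative).
df["class"] = df["class"].astype("category").cat.codes

# 80/20 train/test split, saved locally as CSV.
train, test = train_test_split(df, test_size=0.2, random_state=42)
train.to_csv("train.csv", index=False)
test.to_csv("test.csv", index=False)

# Upload both files to the default SageMaker bucket so training can read them.
sess = sagemaker.Session()
trainpath = sess.upload_data(path="train.csv", key_prefix="sklearn-iris/data")
testpath = sess.upload_data(path="test.csv", key_prefix="sklearn-iris/data")
```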
42 | 61 | {
43 | 62 | "cell_type": "code",
44 | 63 | "execution_count": null,
45 | | - "id": "3eb3ab67",
| 64 | + "id": "72748b04",
46 | 65 | "metadata": {},
47 | 66 | "outputs": [],
48 | 67 | "source": [

56 | 75 | {
57 | 76 | "cell_type": "code",
58 | 77 | "execution_count": null,
59 | | - "id": "e75f4bc4",
| 78 | + "id": "fb5ea6cf",
60 | 79 | "metadata": {},
61 | 80 | "outputs": [],
62 | 81 | "source": [

71 | 90 | {
72 | 91 | "cell_type": "code",
73 | 92 | "execution_count": null,
74 | | - "id": "61c07854",
| 93 | + "id": "48770a6b",
75 | 94 | "metadata": {},
76 | 95 | "outputs": [],
77 | 96 | "source": [

80 | 99 | "test.to_csv(\"test.csv\", index=False)"
81 | 100 | ]
82 | 101 | },
83 | | - {
84 | | - "cell_type": "markdown",
85 | | - "id": "90842a0d",
86 | | - "metadata": {},
87 | | - "source": [
88 | | - "# Upload Data to S3"
89 | | - ]
90 | | - },
91 | 102 | {
92 | 103 | "cell_type": "code",
93 | 104 | "execution_count": null,
94 | | - "id": "f79c8e43",
| 105 | + "id": "ba40dab3",
95 | 106 | "metadata": {},
96 | 107 | "outputs": [],
97 | 108 | "source": [

107 | 118 | },
108 | 119 | {
109 | 120 | "cell_type": "markdown",
110 | | - "id": "40d87112",
| 121 | + "id": "9d52c534",
111 | 122 | "metadata": {},
112 | 123 | "source": [
113 | | - "# Train Estimator"
| 124 | + "## Train model\n",
| 125 | + "The model is trained using the SageMaker SDK's Estimator class. First, get the execution role for training. This role allows us to access the S3 bucket from the previous step, where the train and test datasets are located."
114 | 126 | ]
115 | 127 | },
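A minimal sketch of retrieving the role with the standard SageMaker SDK call; when running outside a SageMaker notebook environment, an IAM role ARN would have to be supplied explicitly instead:

```python
from sagemaker import get_execution_role

# Execution role assumed by the training job; it needs read access to the
# S3 location that holds train.csv and test.csv.
role = get_execution_role()
print(role)
```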
116 | 128 | {
117 | 129 | "cell_type": "code",
118 | 130 | "execution_count": null,
119 | | - "id": "98d6eb28",
| 131 | + "id": "f7cbdad2",
120 | 132 | "metadata": {},
121 | 133 | "outputs": [],
122 | 134 | "source": [

125 | 137 | "print(role)"
126 | 138 | ]
127 | 139 | },
| 140 | + {
| 141 | + "cell_type": "markdown",
| 142 | + "id": "10cdcfb6",
| 143 | + "metadata": {},
| 144 | + "source": [
| 145 | + "Then, it is time to define the SageMaker SDK Estimator. We use an Estimator class specifically designed to train scikit-learn models, called `SKLearn`. In this estimator, we define the following parameters:\n",
| 146 | + "1. The script that we want to use to train the model (i.e. `entry_point`). This is the heart of the Script Mode method; the `SKLearn` estimator runs the `entry_point` script in Script Mode by default.\n",
| 147 | + "1. The role that gives us access to the S3 bucket containing the train and test datasets (i.e. `role`)\n",
| 148 | + "1. How many instances (i.e. `instance_count`) and what type of instance (i.e. `instance_type`) we want to use for training\n",
| 149 | + "1. Which version of scikit-learn to use (i.e. `framework_version`)\n",
| 150 | + "1. Training hyperparameters (i.e. `hyperparameters`)\n",
| 151 | + "\n",
| 152 | + "After setting these parameters, the `fit` function is invoked to train the model."
| 153 | + ]
| 154 | + },
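As a rough illustration of what such an estimator definition can look like: the script name, instance type, framework version, and hyperparameters below are assumptions, not necessarily the values used in this notebook, and `trainpath`/`testpath` are the illustrative S3 URIs from the upload sketch above:

```python
from sagemaker.sklearn.estimator import SKLearn

# Illustrative configuration for a Script Mode training job.
sklearn_estimator = SKLearn(
    entry_point="train.py",          # hypothetical training script
    role=role,
    instance_count=1,
    instance_type="ml.c5.xlarge",
    framework_version="1.0-1",       # assumed scikit-learn container version
    hyperparameters={"max_depth": 3},  # illustrative hyperparameter
)

# The channel names "train" and "test" are exposed to the script as
# SM_CHANNEL_TRAIN and SM_CHANNEL_TEST inside the training container.
sklearn_estimator.fit({"train": trainpath, "test": testpath})
```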
128 | 155 | {
129 | 156 | "cell_type": "code",
130 | 157 | "execution_count": null,
131 | | - "id": "df113b51",
| 158 | + "id": "ac14dcb7",
132 | 159 | "metadata": {},
133 | 160 | "outputs": [],
134 | 161 | "source": [

153 | 180 | },
154 | 181 | {
155 | 182 | "cell_type": "markdown",
156 | | - "id": "5a8762c7",
| 183 | + "id": "3813b62c",
157 | 184 | "metadata": {},
158 | 185 | "source": [
159 | | - "# Deploy Endpoint"
| 186 | + "## Deploy and test endpoint\n",
| 187 | + "After training the model, it is time to deploy it as an endpoint. To do so, we invoke the `deploy` function on the scikit-learn estimator. As shown in the code below, one can define the number of instances (i.e. `initial_instance_count`) and instance type (i.e. `instance_type`) used to deploy the model."
160 | 188 | ]
161 | 189 | },
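A minimal sketch of the deploy call; the instance type is an assumption:

```python
# Deploy the trained model behind a real-time endpoint.
predictor = sklearn_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.large",  # illustrative hosting instance type
)
```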
|
162 | 190 | {
|
163 | 191 | "cell_type": "code",
|
164 | 192 | "execution_count": null,
|
165 |
| - "id": "9d39a7af", |
| 193 | + "id": "06aace5c", |
166 | 194 | "metadata": {},
|
167 | 195 | "outputs": [],
|
168 | 196 | "source": [
|
|
176 | 204 | },
|
177 | 205 | {
|
178 | 206 | "cell_type": "markdown",
|
179 |
| - "id": "6e1c4ac0", |
| 207 | + "id": "bbc747e1", |
180 | 208 | "metadata": {},
|
181 | 209 | "source": [
|
182 |
| - "# Test Endpoint\n", |
183 |
| - "- Can use [invoke endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html) or [predictor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-predictor), using invoke endpoint for this example. \n", |
184 |
| - "- For predictor make sure to [serialize](https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html) properly." |
| 210 | + "After the endpoint has been completely deployed, it can be invoked using the [SageMaker Runtime Client](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime.html) (which is the method used in the code cell below) or [Scikit Learn Predictor](https://sagemaker.readthedocs.io/en/stable/frameworks/sklearn/sagemaker.sklearn.html#scikit-learn-predictor). If you plan to use the latter method, make sure to use a [Serializer](https://sagemaker.readthedocs.io/en/stable/api/inference/serializers.html) to serialize your data properly." |
185 | 211 | ]
|
186 | 212 | },
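A minimal sketch of invoking the endpoint through the SageMaker Runtime Client; the CSV payload format is an assumption and must match what the inference code in the entry-point script expects:

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

# One illustrative flower measurement sent as a CSV row.
response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="text/csv",
    Body="5.0,3.4,1.5,0.2",
)
print(response["Body"].read().decode("utf-8"))
```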
187 | 213 | {
188 | 214 | "cell_type": "code",
189 | 215 | "execution_count": null,
190 | | - "id": "02bb9960",
| 216 | + "id": "85491166",
191 | 217 | "metadata": {},
192 | 218 | "outputs": [],
193 | 219 | "source": [

209 | 235 | },
210 | 236 | {
211 | 237 | "cell_type": "markdown",
212 | | - "id": "8bae026f",
| 238 | + "id": "90f26921",
213 | 239 | "metadata": {},
214 | 240 | "source": [
215 | | - "# Cleanup"
| 241 | + "## Cleanup\n",
| 242 | + "If the model and endpoint are no longer in use, they should be deleted to save costs and free up resources."
216 | 243 | ]
217 | 244 | },
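A minimal sketch of the cleanup step, using the predictor returned by `deploy` above:

```python
# Remove the hosted model and the endpoint to stop incurring charges.
predictor.delete_model()
predictor.delete_endpoint()
```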
218 | 245 | {
219 | 246 | "cell_type": "code",
220 | 247 | "execution_count": null,
221 | | - "id": "15f90752",
| 248 | + "id": "ec5a3a83",
222 | 249 | "metadata": {},
223 | 250 | "outputs": [],
224 | 251 | "source": [