|
44 | 44 | "\n",
|
45 | 45 | "Let's start by specifying:\n",
|
46 | 46 | "\n",
|
47 |    | - "* The SageMaker role arn used to give learning and hosting access to your data. See the documentation for how to create these. Note, if more than one role is required for notebook instances, training, and/or hosting, please replace the boto call with a the appropriate full SageMaker role arn string.\n",
   | 47 | + "* The SageMaker role arn used to give learning and hosting access to your data. The snippet below will use the same role used by your SageMaker notebook instance, if you're using one. Otherwise, specify the full ARN of a role with the SageMakerFullAccess policy attached.\n",
48 | 48 | "* The S3 bucket that you want to use for training and storing model objects."
|
49 | 49 | ]
|
50 | 50 | },
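When the notebook is not running on a SageMaker notebook instance, the role must be supplied as a full ARN string. A minimal sketch of that format, with a placeholder account id and role name (neither comes from this notebook):

```python
# Illustrative only: the full-ARN form of a SageMaker execution role.
# Account id and role name below are placeholders, not real values.
account_id = "123456789012"
role_name = "MySageMakerRole"
role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
```

Inside a notebook instance, `sagemaker.get_execution_role()` returns this string for you.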
|
|
158 | 158 | "metadata": {},
|
159 | 159 | "source": [
|
160 | 160 | "#### Key observations:\n",
|
161 |     | - "* Data has 569 observations and 33 columns\n",
162 |     | - "* First field is id\n",
163 |     | - "* Second field is an indicator of the diagnosis (M - Malignant and B- Benign)\n",
164 |     | - "* There are 30 other numeric features available for prediction"
    | 161 | + "* Data has 569 observations and 32 columns.\n",
    | 162 | + "* First field is 'id'.\n",
    | 163 | + "* Second field, 'diagnosis', is an indicator of the actual diagnosis ('M' = Malignant; 'B' = Benign).\n",
    | 164 | + "* There are 30 other numeric features available for prediction."
165 | 165 | ]
|
166 | 166 | },
|
167 | 167 | {
|
168 | 168 | "cell_type": "markdown",
|
169 | 169 | "metadata": {},
|
170 | 170 | "source": [
|
171 |     | - "#### Create features and labels\n",
172 |     | - "#### Split the data into 80% training, 10% validation and 10% testing"
    | 171 | + "## Create Features and Labels\n",
    | 172 | + "#### Split the data into 80% training, 10% validation and 10% testing."
173 | 173 | ]
|
174 | 174 | },
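The 80/10/10 split described above can be sketched with a shuffled index array and two cut points. This is a standalone illustration: the matrix below is random data with the same shape as the notebook's dataset (569 rows, 30 numeric features), not the actual breast-cancer data.

```python
import numpy as np

# Random stand-in data shaped like the notebook's dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(569, 30)).astype(np.float32)
y = rng.integers(0, 2, size=569).astype(np.float32)

# Shuffle row indices, then cut at 80% and 90% of the data.
idx = rng.permutation(len(X))
cut1, cut2 = int(0.8 * len(X)), int(0.9 * len(X))
train_idx, val_idx, test_idx = np.split(idx, [cut1, cut2])

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]
```

Shuffling before splitting matters here because the source CSV may be ordered by diagnosis.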
|
175 | 175 | {
|
|
205 | 205 | }
|
206 | 206 | },
|
207 | 207 | "source": [
|
208 |     | - "Now, we'll convert the datasets to the recordIO wrapped protobuf format used by the Amazon SageMaker algorithms and upload this data to S3. We'll start with training data."
    | 208 | + "Now, we'll convert the datasets to the recordIO-wrapped protobuf format used by the Amazon SageMaker algorithms, and then upload this data to S3. We'll start with training data."
209 | 209 | ]
|
210 | 210 | },
|
211 | 211 | {
|
|
288 | 288 | "metadata": {},
|
289 | 289 | "outputs": [],
|
290 | 290 | "source": [
|
    | 291 | + "# See 'Algorithms Provided by Amazon SageMaker: Common Parameters' in the SageMaker documentation for an explanation of these values.\n",
291 | 292 | "containers = {'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest',\n",
|
292 | 293 | " 'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest',\n",
|
293 | 294 | " 'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest',\n",
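The dictionary above maps each region to the account-specific ECR registry hosting the linear-learner image. Selecting the right image is a plain lookup; in the notebook the region would come from boto3 (e.g. `boto3.Session().region_name`), but it is hard-coded here so the sketch runs standalone:

```python
# Region -> linear-learner training image, as listed in the notebook.
containers = {
    'us-west-2': '174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest',
    'us-east-1': '382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest',
    'us-east-2': '404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest',
}

region = 'us-east-1'  # in the notebook: boto3.Session().region_name
image = containers[region]
```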
|
|
370 | 371 | "cell_type": "markdown",
|
371 | 372 | "metadata": {},
|
372 | 373 | "source": [
|
373 |     | - "Now let's kick off our training job in SageMaker's distributed, managed training, using the parameters we just created. Because training is managed, we don't have to wait for our job to finish to continue, but for this case, let's setup a while loop so we can monitor the status of our training." |
    | 374 | + "Now let's kick off our training job in SageMaker's distributed, managed training, using the parameters we just created. Because training is managed, we don't have to wait for our job to finish to continue, but for this case, let's use boto3's 'training_job_completed_or_stopped' waiter so we can monitor the job until it reaches a terminal state." |
374 | 375 | ]
|
375 | 376 | },
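Whether done via a boto3 waiter or a hand-rolled loop, the monitoring amounts to polling the job status until it reaches a terminal state. A stdlib-only sketch of that shape — the `get_status` callable is a stub standing in for something like `client.describe_training_job(...)['TrainingJobStatus']`, not a real SageMaker call:

```python
import time

def wait_for_training_job(get_status, poll_seconds=0.0):
    """Poll a status callable until it returns a terminal status."""
    terminal = {'Completed', 'Failed', 'Stopped'}
    while True:
        status = get_status()
        if status in terminal:
            return status
        time.sleep(poll_seconds)

# Usage with a stubbed sequence of statuses:
statuses = iter(['InProgress', 'InProgress', 'Completed'])
final_status = wait_for_training_job(lambda: next(statuses))
```

In real use, `poll_seconds` would be something like 60, since training jobs take minutes to hours.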
|
376 | 377 | {
|
|
612 | 613 | "cell_type": "markdown",
|
613 | 614 | "metadata": {},
|
614 | 615 | "source": [
|
615 |     | - "###### Uncomment the cell below to delete endpoint once you are done"
    | 616 | + "###### Uncomment the cell below to delete the endpoint once you are done."
616 | 617 | ]
|
617 | 618 | },
|
618 | 619 | {
|
|
631 | 632 | "---\n",
|
632 | 633 | "## Extensions\n",
|
633 | 634 | "\n",
|
634 |     | - "Our linear model does a good job of predicting breast cancer and has an overall accuracy of close to 92%. We can re-run the model with different values of the hyper-parameters, loss functions etc and see if we get improved prediction. Re-running the model with further tweaks to these hyperparameters may provide more accurate out-of-sample predictions. We also did not do much feature engineering. We can create additional features by considering cross-product/intreaction of multiple features, squaring or raising higher powers of the features to induce non-linear effects, etc. If we expand the features using non-linear terms and interactions, we can then tweak the regulaization parameter to optimize the expanded model and hence generate improved forecasts. As a further extension, we can use many of non-linear models available through SageMaker such as Xgboost, mxnet etc.\n" |
    | 635 | + "- Our linear model does a good job of predicting breast cancer, with an overall accuracy of close to 92%. We can re-run the model with different values of the hyperparameters, loss functions, etc., and see whether prediction improves. Re-running the model with further tweaks to these hyperparameters may provide more accurate out-of-sample predictions.\n", |
    | 636 | + "- We also did not do much feature engineering. We can create additional features by considering cross-products/interactions of multiple features, or by squaring features or raising them to higher powers to induce non-linear effects. If we expand the features using non-linear terms and interactions, we can then tweak the regularization parameter to optimize the expanded model and hence generate improved forecasts.\n", |
    | 637 | + "- As a further extension, we can use many of the non-linear models available through SageMaker, such as XGBoost, MXNet, etc.\n" |
635 | 638 | ]
|
636 | 639 | }
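The feature-engineering extension above — squared terms and pairwise interactions — can be sketched with NumPy index tricks. The 2-column matrix here is a toy stand-in, not the notebook's 30-feature data:

```python
import numpy as np

# Toy 2-feature matrix to illustrate the expansion.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

squares = X ** 2
i, j = np.triu_indices(X.shape[1], k=1)   # index pairs of distinct columns
interactions = X[:, i] * X[:, j]          # pairwise cross-product features

X_expanded = np.hstack([X, squares, interactions])
```

With 30 original features this yields 30 + 30 + 435 = 495 columns, which is exactly the situation where tuning the regularization parameter starts to matter.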
|
637 | 640 | ],
|
|