|
17 | 17 | "1. [Prepration](#Preparation)\n",
|
18 | 18 | "1. [Data](#Data)\n",
|
19 | 19 | " 1. [Exploration and Transformation](#Exploration) \n",
|
20 |
| - "1. [Training Xgboost model using Sagemaker](#Training)\n", |
| 20 | + "1. [Training Xgboost model using SageMaker](#Training)\n", |
21 | 21 | "1. [Hosting the model](#Hosting)\n",
|
22 | 22 | "1. [Evaluating the model on test samples](#Evaluation)\n",
|
23 |
| - "1. [Training a second Logistic Regression model using Sagemaker](#Linear-Model)\n", |
| 23 | + "1. [Training a second Logistic Regression model using SageMaker](#Linear-Model)\n", |
24 | 24 | "1. [Hosting the Second model](#Hosting:Linear-Learner)\n",
|
25 | 25 | "1. [Evaluating the model on test samples](#Prediction:Linear-Learner)\n",
|
26 | 26 | "1. [Combining the model results](#Ensemble)\n",
|
|
35 | 35 | "\n",
|
36 | 36 | "This notebook presents an illustrative example to predict if a person makes over 50K a year based on information about their education, work-experience, geneder etc.\n",
|
37 | 37 | "\n",
|
38 |
| - "* Preparing your _Sagemaker_ notebook\n", |
39 |
| - "* Loading a dataset from S3 using Sagemaker\n", |
40 |
| - "* Investigating and transforming the data so that it can be fed to _Sagemaker_ algorithms\n", |
41 |
| - "* Estimating a model using Sagemaker's XGBoost (eXtreme Gradient Boosting) algorithm\n", |
42 |
| - "* Hosting the model on Sagemaker to make on-going predictions\n", |
43 |
| - "* Estimating a second model using Sagemaker's -linear learner method\n", |
| 38 | + "* Preparing your _SageMaker_ notebook\n", |
| 39 | + "* Loading a dataset from S3 using SageMaker\n", |
| 40 | + "* Investigating and transforming the data so that it can be fed to _SageMaker_ algorithms\n", |
| 41 | + "* Estimating a model using SageMaker's XGBoost (eXtreme Gradient Boosting) algorithm\n", |
| 42 | + "* Hosting the model on SageMaker to make on-going predictions\n", |
| 43 | + "* Estimating a second model using SageMaker's Linear Learner method\n", |
44 | 44 | "* Combining the predictions from both the models and evluating the combined prediction\n",
|
45 | 45 | "* Generating final predictions on the test data set\n",
|
46 | 46 | "\n",
|
|
85 | 85 | "Now let's bring in the Python libraries that we'll use throughout the analysis"
|
86 | 86 | ]
|
87 | 87 | },
|
88 |
| - { |
89 |
| - "cell_type": "code", |
90 |
| - "execution_count": null, |
91 |
| - "metadata": {}, |
92 |
| - "outputs": [], |
93 |
| - "source": [ |
94 |
| - "!conda install -y -c conda-forge scikit-learn" |
95 |
| - ] |
96 |
| - }, |
97 | 88 | {
|
98 | 89 | "cell_type": "code",
|
99 | 90 | "execution_count": null,
|
|
245 | 236 | "\n",
|
246 | 237 | "## Training\n",
|
247 | 238 | "\n",
|
248 |
| - "As our first training algorithm we pick `xgboost` algorithm. `xgboost` is an extremely popular, open-source package for gradient boosted trees. It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions. Let's start with a simple `xgboost` model, trained using `Sagemaker's` serverless, distributed training framework.\n", |
| 239 | + "As our first training algorithm we pick `xgboost` algorithm. `xgboost` is an extremely popular, open-source package for gradient boosted trees. It is computationally powerful, fully featured, and has been successfully used in many machine learning competitions. Let's start with a simple `xgboost` model, trained using `SageMaker's` managed, distributed training framework.\n", |
249 | 240 | "\n",
|
250 | 241 | "First we'll need to specify training parameters. This includes:\n",
|
251 | 242 | "1. The role to use\n",
|
|
266 | 257 | "For csv input, right now we assume the input is separated by delimiter(automatically detect the separator by Python’s builtin sniffer tool), without a header line and also label is in the first column.\n",
|
267 | 258 | "Scoring Output Format: csv.\n",
|
268 | 259 | "\n",
|
269 |
| - "* Since our data is in CSV format, we will convert our dataset to the way Sagemaker's XGboost supports.\n", |
| 260 | + "* Since our data is in CSV format, we will convert our dataset to the way SageMaker's XGboost supports.\n", |
270 | 261 | "* We will keep the target field in first column and remaining features in the next few columns\n",
|
271 | 262 | "* We will remove the header line\n",
|
272 | 263 | "* We will also split the data into a separate training and validation sets\n",
|
|
411 | 402 | "cell_type": "markdown",
|
412 | 403 | "metadata": {},
|
413 | 404 | "source": [
|
414 |
| - "Now let's kick off our training job in SageMaker's distributed, serverless training, using the parameters we just created. Because training is serverless, we don't have to wait for our job to finish to continue, but for this case, let's setup a while loop so we can monitor the status of our training." |
| 405 | + "Now let's kick off our training job in SageMaker's distributed, managed training, using the parameters we just created. Because training is managed, we don't have to wait for our job to finish to continue, but for this case, let's setup a while loop so we can monitor the status of our training." |
415 | 406 | ]
|
416 | 407 | },
|
417 | 408 | {
|
|
496 | 487 | "source": [
|
497 | 488 | "Once we've setup a model, we can configure what our hosting endpoints should be. Here we specify:\n",
|
498 | 489 | "1. EC2 instance type to use for hosting\n",
|
499 |
| - "1. Lower and upper bounds for number of instances\n", |
| 490 | + "1. Initial number of instances\n", |
500 | 491 | "1. Our hosting model name"
|
501 | 492 | ]
|
502 | 493 | },
|
|
690 | 681 | "source": [
|
691 | 682 | "---\n",
|
692 | 683 | "## Linear-Model\n",
|
693 |
| - "### Train a second model using Sagemaker's Linear Learner" |
| 684 | + "### Train a second model using SageMaker's Linear Learner" |
694 | 685 | ]
|
695 | 686 | },
|
696 | 687 | {
|
|
699 | 690 | "metadata": {},
|
700 | 691 | "outputs": [],
|
701 | 692 | "source": [
|
702 |
| - "prefix = 'sagemaker/linear' ##subfolder inside the data bucket to be used for linear learner\n", |
| 693 | + "prefix = 'sagemaker/linear' ##subfolder inside the data bucket to be used for Linear Learner\n", |
703 | 694 | "\n",
|
704 | 695 | "data_train = pd.read_csv(\"formatted_train.csv\", sep=',', header=None) \n",
|
705 | 696 | "data_test = pd.read_csv(\"formatted_test.csv\", sep=',', header=None) \n",
|
|
871 | 862 | "cell_type": "markdown",
|
872 | 863 | "metadata": {},
|
873 | 864 | "source": [
|
874 |
| - "Now let's kick off our training job in SageMaker's distributed, serverless training, using the parameters we just created. Because training is serverless, we don't have to wait for our job to finish to continue, but for this case, let's setup a while loop so we can monitor the status of our training." |
| 865 | + "Now let's kick off our training job in SageMaker's distributed, managed training, using the parameters we just created. Because training is managed, we don't have to wait for our job to finish to continue, but for this case, let's setup a while loop so we can monitor the status of our training." |
875 | 866 | ]
|
876 | 867 | },
|
877 | 868 | {
|
|
936 | 927 | "source": [
|
937 | 928 | "Once we've setup a model, we can configure what our hosting endpoints should be. Here we specify:\n",
|
938 | 929 | "1. EC2 instance type to use for hosting\n",
|
939 |
| - "1. Lower and upper bounds for number of instances\n", |
| 930 | + "1. Initial number of instances\n", |
940 | 931 | "1. Our hosting model name"
|
941 | 932 | ]
|
942 | 933 | },
|
|
1001 | 992 | "metadata": {},
|
1002 | 993 | "source": [
|
1003 | 994 | "### Prediction:Linear-Learner\n",
|
1004 |
| - "#### Predict using Sagemaker's linear learner and evaluate the performance\n", |
| 995 | + "#### Predict using SageMaker's Linear Learner and evaluate the performance\n", |
1005 | 996 | "\n",
|
1006 | 997 | "Now that we have our hosted endpoint, we can generate statistical predictions from it. Let's predict on our test dataset to understand how accurate our model is on unseen samples using AUC metric."
|
1007 | 998 | ]
|
|
1202 | 1193 | "## Extensions\n",
|
1203 | 1194 | "\n",
|
1204 | 1195 | "This example analyzed a relatively small dataset, but utilized SageMaker features such as,\n",
|
1205 |
| - "* serverless single-machine training of XGboost model \n", |
1206 |
| - "* serverless training of Linear Learner\n", |
1207 |
| - "* highly available, autoscaling model hosting, \n", |
| 1196 | + "* managed single-machine training of XGboost model \n", |
| 1197 | + "* managed training of Linear Learner\n", |
| 1198 | + "* highly available, real-time model hosting, \n", |
1208 | 1199 | "* doing a batch prediction using the hosted model\n",
|
1209 | 1200 | "* Doing an ensemble of Xgboost and Linear Learner\n",
|
1210 | 1201 | "\n",
|
|