Commit 3be93e7

Merge branch 'master' into fix-tf-cifar
2 parents 0fbb6d1 + a0437cd commit 3be93e7

File tree

68 files changed, +9938 -265 lines


README.md

Lines changed: 2 additions & 0 deletions

@@ -15,6 +15,8 @@ These examples provide a gentle introduction to machine learning concepts as the
 - [Ensembling](introduction_to_applying_machine_learning/ensemble_modeling) predicts income using two Amazon SageMaker models to show the advantages of ensembling.
 - [Video Game Sales](introduction_to_applying_machine_learning/video_game_sales) develops a binary prediction model for the success of video games based on review scores.
 - [MXNet Gluon Recommender System](introduction_to_applying_machine_learning/gluon_recommender_system) uses neural network embeddings for non-linear matrix factorization to predict user movie ratings on Amazon digital reviews.
+- [Fair Linear Learner](introduction_to_applying_machine_learning/fair_linear_learner) is an example of an effective way to create fair linear models with respect to sensitive features.
+- [Population Segmentation of US Census Data using PCA and Kmeans](introduction_to_applying_machine_learning/US-census_population_segmentation_PCA_Kmeans) analyzes US census data, reduces dimensionality using PCA, then clusters US counties using KMeans to identify segments of similar counties.

 ### SageMaker Automatic Model Tuning

advanced_functionality/README.md

Lines changed: 1 addition & 1 deletion

@@ -14,4 +14,4 @@ These examples showcase unique functionality available in Amazon SageMaker.
 - [Bring Your Own R Algorithm](r_bring_your_own) shows how to bring your own algorithm container to Amazon SageMaker using the R language.
 - [Bring Your Own scikit Algorithm](scikit_bring_your_own) provides a detailed walkthrough on how to package a scikit-learn algorithm for training and production-ready hosting.
 - [Bring Your Own MXNet Model](mxnet_mnist_byom) shows how to bring a model trained anywhere using MXNet into Amazon SageMaker
-- [Bring Your Own TensorFlow Model](tensorflow_iris_byom) shows how to bring a model trained anywhere using TensorFlow into Amazon SageMaker
+- [Bring Your Own TensorFlow Model](tensorflow_iris_byom) shows how to bring a model trained anywhere using TensorFlow into Amazon SageMaker

advanced_functionality/handling_kms_encrypted_data/handling_kms_encrypted_data.ipynb

Lines changed: 112 additions & 87 deletions
@@ -50,7 +50,6 @@
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
-"collapsed": true,
 "isConfigCell": true
 },
 "outputs": [],
@@ -74,7 +73,7 @@
 "bucket='<s3-bucket>' # put your s3 bucket name here, and create s3 bucket\n",
 "prefix = 'sagemaker/DEMO-kms'\n",
 "# customize to your bucket where you have stored the data\n",
-"bucket_path = 'https://s3-{}.amazonaws.com/{}'.format(region,bucket)"
+"bucket_path = 's3://{}'.format(bucket)"
 ]
 },
 {
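The `bucket_path` change in this hunk swaps a region-specific, path-style HTTPS URL for the plain `s3://` scheme that the `S3Uri` fields used later in the notebook accept directly. A minimal sketch of the two forms, assuming `region` and `bucket` are defined as in the surrounding cells:

```python
# Old form: path-style HTTPS URL, tied to a specific region's S3 endpoint
bucket_path_old = 'https://s3-{}.amazonaws.com/{}'.format(region, bucket)

# New form: s3:// URI, region-agnostic and usable as-is in the
# S3DataSource / S3OutputPath fields of the batch transform job below
bucket_path = 's3://{}'.format(bucket)
```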
@@ -93,9 +92,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "from sklearn.datasets import load_boston\n",
@@ -116,15 +113,13 @@
 "source": [
 "### Data preprocessing\n",
 "\n",
-"Now that we have the dataset, we need to split it into *train*, *validation*, and *test* datasets which we can use to evaluate the accuracy of the machine learning algorithm. We randomly split the dataset into 60% training, 20% validation and 20% test. Note that SageMaker Xgboost, expects the label column to be the first one in the datasets. So, we'll move the median value column (`MEDV`) from the last to the first position within the `write_file` method below. "
+"Now that we have the dataset, we need to split it into *train*, *validation*, and *test* datasets which we can use to evaluate the accuracy of the machine learning algorithm. We'll also create a test dataset file with the labels removed so it can be fed into a batch transform job. We randomly split the dataset into 60% training, 20% validation, and 20% test. Note that SageMaker XGBoost expects the label column to be the first one in the datasets, so we'll move the median value column (`MEDV`) from the last to the first position within the `write_file` method below."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "from sklearn.model_selection import train_test_split\n",
@@ -135,37 +130,31 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
-"def write_file(X, y, fname):\n",
+"def write_file(X, y, fname, include_labels=True):\n",
 " feature_names = boston['feature_names']\n",
 " data = pd.DataFrame(X, columns=feature_names)\n",
-" target = pd.DataFrame(y, columns={'MEDV'})\n",
-" data['MEDV'] = y\n",
-" # bring this column to the front before writing the files\n",
-" cols = data.columns.tolist()\n",
-" cols = cols[-1:] + cols[:-1]\n",
-" data = data[cols]\n",
+" if include_labels:\n",
+" data.insert(0, 'MEDV', y)\n",
 " data.to_csv(fname, header=False, index=False)"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "train_file = 'train.csv'\n",
 "validation_file = 'val.csv'\n",
 "test_file = 'test.csv'\n",
+"test_no_labels_file = 'test_no_labels.csv'\n",
 "write_file(X_train, y_train, train_file)\n",
 "write_file(X_val, y_val, validation_file)\n",
-"write_file(X_test, y_test, test_file)"
+"write_file(X_test, y_test, test_file)\n",
+"write_file(X_test, y_test, test_no_labels_file, False)"
 ]
 },
 {
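The refactored `write_file` replaces the manual column-reordering dance with a single `DataFrame.insert`, and gains an `include_labels` flag so the same helper can emit the label-free test file that batch transform needs. A self-contained sketch of the revised helper, assuming the Boston housing data is loaded via scikit-learn as in the notebook (a single 80/20 split is shown for brevity; the notebook does 60/20/20):

```python
import pandas as pd
from sklearn.datasets import load_boston  # removed in newer scikit-learn
from sklearn.model_selection import train_test_split

boston = load_boston()

def write_file(X, y, fname, include_labels=True):
    data = pd.DataFrame(X, columns=boston['feature_names'])
    if include_labels:
        # SageMaker XGBoost wants the label first, with no header or index
        data.insert(0, 'MEDV', y)
    data.to_csv(fname, header=False, index=False)

X_train, X_test, y_train, y_test = train_test_split(
    boston['data'], boston['target'], test_size=0.2)
write_file(X_train, y_train, 'train.csv')
write_file(X_test, y_test, 'test.csv')
write_file(X_test, y_test, 'test_no_labels.csv', include_labels=False)
```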
@@ -178,9 +167,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "s3 = boto3.client('s3')\n",
@@ -207,7 +194,19 @@
 " ServerSideEncryption='aws:kms',\n",
 " SSEKMSKeyId=kms_key_id)\n",
 "\n",
-"print(\"Done uploading the validation dataset\")"
+"print(\"Done uploading the validation dataset\")\n",
+"\n",
+"data_test = open(test_no_labels_file, 'rb')\n",
+"key_test = '{}/test/{}'.format(prefix,test_no_labels_file)\n",
+"\n",
+"print(\"Put object...\")\n",
+"s3.put_object(Bucket=bucket,\n",
+" Key=key_test,\n",
+" Body=data_test,\n",
+" ServerSideEncryption='aws:kms',\n",
+" SSEKMSKeyId=kms_key_id)\n",
+"\n",
+"print(\"Done uploading the test dataset\")"
 ]
 },
 {
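The new upload block reuses the same SSE-KMS pattern as the train and validation uploads. A condensed sketch of that pattern, assuming `bucket`, `prefix`, and `kms_key_id` are configured as earlier in the notebook; a context manager (unlike the bare `open` in the diff) ensures the file handle gets closed:

```python
import boto3

s3 = boto3.client('s3')
test_no_labels_file = 'test_no_labels.csv'

with open(test_no_labels_file, 'rb') as data_test:
    s3.put_object(Bucket=bucket,
                  Key='{}/test/{}'.format(prefix, test_no_labels_file),
                  Body=data_test,
                  # server-side encryption with the customer-managed KMS key
                  ServerSideEncryption='aws:kms',
                  SSEKMSKeyId=kms_key_id)
```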
@@ -222,9 +221,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "from sagemaker.amazon.amazon_estimator import get_image_uri\n",
@@ -234,9 +231,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "%%time\n",
@@ -334,9 +329,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "%%time\n",
@@ -375,9 +368,7 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "from time import gmtime, strftime\n",
@@ -401,15 +392,13 @@
 "metadata": {},
 "source": [
 "### Create endpoint\n",
-"Lastly, create the endpoint that serves up the model, through specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 9-11 minutes to complete."
+"Create the endpoint that serves up the model by specifying the name and configuration defined above. The end result is an endpoint that can be validated and incorporated into production applications. This takes 9-11 minutes to complete."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "%%time\n",
@@ -449,15 +438,13 @@
 "metadata": {},
 "source": [
 "## Validate the model for use\n",
-"Finally, you can now validate the model for use. They can obtain the endpoint from the client library using the result from previous operations, and generate classifications from the trained model using that endpoint.\n"
+"You can now validate the model for use. Obtain the endpoint from the client library using the result from previous operations, and run a single prediction on the trained model using that endpoint.\n"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "runtime_client = boto3.client('runtime.sagemaker')"
@@ -466,87 +453,125 @@
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
 "import sys\n",
 "import math\n",
 "def do_predict(data, endpoint_name, content_type):\n",
-" payload = ''.join(data)\n",
 " response = runtime_client.invoke_endpoint(EndpointName=endpoint_name, \n",
 " ContentType=content_type, \n",
-" Body=payload)\n",
+" Body=data)\n",
 " result = response['Body'].read()\n",
 " result = result.decode(\"utf-8\")\n",
-" result = result.split(',')\n",
 " return result\n",
 "\n",
-"def batch_predict(data, batch_size, endpoint_name, content_type):\n",
-" items = len(data)\n",
-" arrs = []\n",
-" \n",
-" for offset in range(0, items, batch_size):\n",
-" if offset+batch_size < items:\n",
-" results = do_predict(data[offset:(offset+batch_size)], endpoint_name, content_type)\n",
-" arrs.extend(results)\n",
-" else:\n",
-" arrs.extend(do_predict(data[offset:items], endpoint_name, content_type))\n",
-" sys.stdout.write('.')\n",
-" return(arrs)"
+"# pull the first item from the test dataset\n",
+"with open('test.csv') as f:\n",
+" first_line = f.readline()\n",
+" features = first_line.split(',')[1:]\n",
+" feature_str = ','.join(features)\n",
+"\n",
+"prediction = do_predict(feature_str, endpoint_name, 'text/csv')\n",
+"print('Prediction: ' + prediction)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"The following helps us calculate the Median Absolute Percent Error (MdAPE) on the batch dataset. Note that the intent of this example is not to produce the most accurate regressor but to demonstrate how to handle KMS encrypted data with SageMaker. "
+"### (Optional) Delete the Endpoint\n",
+"\n",
+"If you're ready to be done with this notebook, please run the delete_endpoint line in the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
+"outputs": [],
+"source": [
+"client.delete_endpoint(EndpointName=endpoint_name)"
+]
+},
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Run batch prediction using batch transform\n",
+"Create a transform job to do batch prediction using the trained model. Similar to the training section above, the execution role assumed by this notebook must have permissions to encrypt and decrypt data with the KMS key (`kms_key_id`) used for S3 server-side encryption."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
 "outputs": [],
 "source": [
 "%%time\n",
-"import json\n",
-"import numpy as np\n",
-"\n",
+"transform_job_name = 'DEMO-xgboost-batch-prediction' + strftime(\"%Y-%m-%d-%H-%M-%S\", gmtime())\n",
+"print(\"Transform job\", transform_job_name)\n",
 "\n",
-"with open('test.csv') as f:\n",
-" lines = f.readlines()\n",
-"\n",
-"#remove the labels\n",
-"labels = [line.split(',')[0] for line in lines]\n",
-"features = [line.split(',')[1:] for line in lines]\n",
+"transform_params = \\\n",
+"{\n",
+" \"TransformJobName\": transform_job_name,\n",
+" \"ModelName\": model_name,\n",
+" \"TransformInput\": {\n",
+" \"ContentType\": \"text/csv\",\n",
+" \"DataSource\": {\n",
+" \"S3DataSource\": {\n",
+" \"S3DataType\": \"S3Prefix\",\n",
+" \"S3Uri\": bucket_path + \"/\"+ prefix + '/test'\n",
+" }\n",
+" },\n",
+" \"SplitType\": \"Line\"\n",
+" },\n",
+" \"TransformOutput\": {\n",
+" \"AssembleWith\": \"Line\",\n",
+" \"S3OutputPath\": bucket_path + \"/\"+ prefix + '/predict'\n",
+" },\n",
+" \"TransformResources\": {\n",
+" \"InstanceCount\": 1,\n",
+" \"InstanceType\": \"ml.c4.xlarge\"\n",
+" }\n",
+"}\n",
 "\n",
-"features_str = [','.join(row) for row in features]\n",
-"preds = batch_predict(features_str, 100, endpoint_name, 'text/csv')\n",
-"print('\\n Median Absolute Percent Error (MdAPE) = ', np.median(np.abs(np.asarray(labels, dtype=float) - np.asarray(preds, dtype=float)) / np.asarray(labels, dtype=float)))"
+"client.create_transform_job(**transform_params)\n",
+"\n",
+"while True:\n",
+" response = client.describe_transform_job(TransformJobName=transform_job_name)\n",
+" status = response['TransformJobStatus']\n",
+" if status == 'InProgress':\n",
+" time.sleep(15)\n",
+" elif status == 'Completed':\n",
+" print(\"Transform job completed!\")\n",
+" break\n",
+" else:\n",
+" print(\"Unexpected transform job status: \" + status)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"### (Optional) Delete the Endpoint\n",
+"### Evaluate the batch predictions\n",
 "\n",
-"If you're ready to be done with this notebook, please run the delete_endpoint line in the cell below. This will remove the hosted endpoint you created and avoid any charges from a stray instance being left on."
+"The following helps us calculate the Median Absolute Percent Error (MdAPE) on the batch prediction output in S3. Note that the intent of this example is not to produce the most accurate regressor but to demonstrate how to handle KMS encrypted data with SageMaker."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
-"metadata": {
-"collapsed": true
-},
+"metadata": {},
 "outputs": [],
 "source": [
-"client.delete_endpoint(EndpointName=endpoint_name)"
+"print(\"Downloading prediction object...\")\n",
+"s3.download_file(Bucket=bucket,\n",
+" Key=prefix + '/predict/' + test_no_labels_file + '.out',\n",
+" Filename='./predictions.csv')\n",
+"\n",
+"preds = np.loadtxt('predictions.csv')\n",
+"print('\\nMedian Absolute Percent Error (MdAPE) = ', np.median(np.abs(y_test - preds) / y_test))"
 ]
 }
 ],
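Two small caveats with the new polling loop are worth flagging: the `else` branch never `break`s, so a `Failed` or `Stopped` job would loop forever, and `time.sleep` needs an `import time` that the visible hunks only supply as `from time import gmtime, strftime`. A more defensive sketch, assuming `client = boto3.client('sagemaker')` and `transform_params` as built above:

```python
import time

client.create_transform_job(**transform_params)

while True:
    status = client.describe_transform_job(
        TransformJobName=transform_job_name)['TransformJobStatus']
    if status == 'Completed':
        print('Transform job completed!')
        break
    elif status in ('Failed', 'Stopped'):
        # exit on terminal states instead of looping forever
        raise RuntimeError('Transform job ended with status: ' + status)
    time.sleep(15)
```

Alternatively, recent boto3 releases expose a `transform_job_completed_or_stopped` waiter that can replace the loop entirely.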
Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+# use minimal alpine base image as we only need python and nothing else here
+FROM python:2-alpine3.6
+
+MAINTAINER Amazon SageMaker Examples <[email protected]>
+
+COPY train.py /train.py
+
+ENTRYPOINT ["python2.7", "-u", "/train.py"]

0 commit comments
