Commit bbe2e90

Merge branch 'master' into scikit_bring_your_own
2 parents f117fc6 + d795f77 commit bbe2e90

107 files changed, with 9781 additions and 15113 deletions

.DS_Store

-6 KB
Binary file not shown.

.gitignore

Lines changed: 3 additions & 0 deletions
@@ -1 +1,4 @@
11
**/.ipynb_checkpoints
2+
**/.idea
3+
**/__pycache__
4+
.DS_Store

README.md

Lines changed: 40 additions & 8 deletions
@@ -1,21 +1,53 @@
11
# Amazon SageMaker Examples
22

3-
This repository contains example notebooks that show how to apply machine learning and deep learning in Amazon SageMaker(https://aws.amazon.com/amazon-ai/).
3+
This repository contains example notebooks that show how to apply machine learning and deep learning in [Amazon SageMaker](https://aws.amazon.com/machine-learning/platforms/sagemaker).
44

55
## Examples
66

77
### Introduction to Applying Machine Learning
88

9-
- [XGBoost for Direct Marketing](xgboost_direct_marketing) targets potential customers that are most likely to convert based on customer and aggregate level metrics.
10-
- [PCA and k-means for Movie Clustering](pca_kmeans_movie_clustering) creates clusters of movies based on genre, ratings, and other characteristics.
9+
These examples provide a gentle introduction to machine learning concepts as they are applied in practical use cases across a variety of sectors.
1110

12-
### Amazon Algorithms - Basic Functionality
11+
- [Targeted Direct Marketing](introduction_to_applying_machine_learning/xgboost_direct_marketing) predicts potential customers that are most likely to convert based on customer and aggregate level metrics, using Amazon SageMaker's implementation of [XGBoost](https://github.com/dmlc/xgboost).
12+
- [Predicting Customer Churn](introduction_to_applying_machine_learning/xgboost_customer_churn) uses customer interaction and service usage data to find those most likely to churn, and then walks through the cost/benefit trade-offs of providing retention incentives. This uses Amazon SageMaker's implementation of [XGBoost](https://github.com/dmlc/xgboost) to create a highly predictive model.
13+
- [Time-series Forecasting](introduction_to_applying_machine_learning/linear_time_series_forecast) generates a forecast for topline product demand using Amazon SageMaker's Linear Learner algorithm.
14+
- [Cancer Prediction](introduction_to_applying_machine_learning/breast_cancer_prediction) predicts breast cancer based on features derived from images, using Amazon SageMaker's Linear Learner.
1315

14-
### Amazon Algorithms - Scientific Detail
16+
### Introduction to Amazon Algorithms
17+
18+
These examples provide quick walkthroughs to get you up and running with Amazon SageMaker's custom developed algorithms. Most of these algorithms can train on distributed hardware, scale incredibly well, and are faster and cheaper than popular alternatives.
19+
20+
- [k-means](introduction_to_amazon_algorithms/1P_kmeans_highlevel) is our introductory example for Amazon SageMaker. It walks through the process of clustering MNIST images of handwritten digits using Amazon SageMaker k-means.
21+
- [Factorization Machines](introduction_to_amazon_algorithms/factorization_machines_mnist) showcases Amazon SageMaker's implementation of the algorithm to predict whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier.
22+
- [Latent Dirichlet Allocation (LDA)](introduction_to_amazon_algorithms/lda_topic_modeling) introduces topic modeling using Amazon SageMaker Latent Dirichlet Allocation (LDA) on a synthetic dataset.
23+
- [Linear Learner](introduction_to_amazon_algorithms/linear_learner_mnist) predicts whether a handwritten digit from the MNIST dataset is a 0 or not using a binary classifier from Amazon SageMaker Linear Learner.
24+
- [Neural Topic Model (NTM)](introduction_to_amazon_algorithms/ntm_synthetic) uses Amazon SageMaker Neural Topic Model (NTM) to uncover topics in documents from a synthetic data source, where topic distributions are known.
25+
- [Principal Components Analysis (PCA)](introduction_to_amazon_algorithms/pca_mnist) uses Amazon SageMaker PCA to calculate eigendigits from MNIST.
26+
- [Seq2Seq](introduction_to_amazon_algorithms/seq2seq) uses the Amazon SageMaker Seq2Seq algorithm that's built on top of [Sockeye](https://github.com/awslabs/sockeye), which is a sequence-to-sequence framework for Neural Machine Translation based on MXNet. Seq2Seq implements state-of-the-art encoder-decoder architectures which can also be used for tasks like Abstractive Summarization in addition to Machine Translation. This notebook shows translation from English to German text.
27+
- [XGBoost for regression](introduction_to_amazon_algorithms/xgboost_abalone) predicts the age of abalone ([Abalone dataset](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html)) using regression from Amazon SageMaker's implementation of [XGBoost](https://github.com/dmlc/xgboost).
28+
- [XGBoost for multi-class classification](introduction_to_amazon_algorithms/xgboost_mnist) uses Amazon SageMaker's implementation of [XGBoost](https://github.com/dmlc/xgboost) to classify handwritten digits from the MNIST dataset as one of the ten digits using a multi-class classifier. Both single-machine and distributed use cases are presented.
29+
30+
### Scientific Details of Algorithms
31+
32+
These examples provide a more thorough mathematical treatment of a select group of algorithms.
33+
34+
- [Latent Dirichlet Allocation (LDA)](scientific_details_of_algorithms/lda_topic_modeling) dives into Amazon SageMaker's spectral decomposition approach to LDA.
1535

1636
### Advanced Amazon SageMaker Functionality
1737

18-
- [Installing the R Kernel](install_r_kernel) shows how to install the R kernel into an Amazon SageMaker Notebook Instance.
19-
- [Bring Your Own Model for k-means](kmeans_bring_your_own_model) shows how to take a model that's been fit elsewhere and use Amazon SageMaker containers to host.
20-
- [Bring Your Own Algorithm with R](r_bring_your_own) shows how to bring your own algorithm container to Amazon SageMaker using the R language.
38+
- [Installing the R Kernel](advanced_functionality/install_r_kernel) shows how to install the R kernel into an Amazon SageMaker Notebook Instance.
39+
- [Bring Your Own Model for k-means](advanced_functionality/kmeans_bring_your_own_model) shows how to take a model that's been fit elsewhere and use Amazon SageMaker Algorithms containers to host it.
40+
- [Bring Your Own Algorithm with R](advanced_functionality/r_bring_your_own) shows how to bring your own algorithm container to Amazon SageMaker using the R language.
2141
- [Bring Your Own Tensorflow Model](sagemaker-python-sdk/tensorflow_iris_byom) shows how to bring a model trained anywhere into Amazon SageMaker
42+
- [Bring Your Own MXNet Model](sagemaker-python-sdk/tensorflow_iris_byom) shows how to bring a model trained anywhere using MXNet into Amazon SageMaker
43+
- [Bring Your Own TensorFlow Model](sagemaker-python-sdk/tensorflow_iris_byom) shows how to bring a model trained anywhere using TensorFlow into Amazon SageMaker
44+
45+
## FAQ
46+
47+
*Will these examples work outside of Amazon SageMaker?*
48+
49+
- Although most examples utilize key Amazon SageMaker functionality like distributed, managed training or real-time hosted endpoints, these notebooks can be run outside of Amazon SageMaker Notebook Instances with minimal modification (updating IAM role definition and installing the necessary libraries).
50+
51+
*How do I contribute my own example notebook?*
52+
53+
- Although we're extremely excited to receive contributions from the community, we're still working on the best mechanism to take in examples from an external source. Please bear with us in the short term if pull requests take longer than expected or are closed.

advanced_functionality/README.md

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
1+
# Advanced Functionality
2+
3+
This directory includes examples which showcase unique functionality available in Amazon SageMaker. Examples cover a broad range of topics and will utilize a variety of methods, but aim to provide the user with sufficient insight or inspiration to develop within Amazon SageMaker.
4+
5+
Example Notebooks include:
6+
- *data_distribution_types*: Showcases the difference between two methods for sending data from S3 to Amazon SageMaker Training instances. This has particular implications for the scalability and accuracy of distributed training.
7+
- *install_r_kernel*: A quick introduction to getting R installed and running within Amazon SageMaker Notebook Instances.
8+
- *kmeans_bring_your_own_model*: How to use Amazon SageMaker Algorithms containers to bring a pre-trained model to a realtime hosted endpoint without ever needing to think about REST APIs.
9+
- *r_bring_your_own*: How to containerize an R algorithm using Docker and plumber for hosting so that it can be used in Amazon SageMaker's managed training and realtime hosting.
10+
- *xgboost_bring_your_own_model*: How to use Amazon SageMaker Algorithms containers to bring a pre-trained model to a realtime hosted endpoint without ever needing to think about REST APIs.
11+
- *handling_kms_encrypted_data.ipynb*: How to use server-side KMS-encrypted data with Amazon SageMaker training. The IAM role used for S3 access needs to have permissions to encrypt and decrypt data with the KMS key.
12+
- *parquet_to_recordio_protobuf.ipynb*: How to convert Parquet data format into the recordIO-protobuf format that many SageMaker algorithms consume.
13+
- *working_with_redshift_data.ipynb*: Demonstrates how to copy data from Redshift to S3 and vice-versa.
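
To make the *data_distribution_types* entry above more concrete, the sketch below simulates how a set of S3 object keys could be allotted to training instances under the two S3 data distribution settings, `FullyReplicated` and `ShardedByS3Key`. The function name `assign_objects`, the example keys, and the round-robin split are assumptions made purely for illustration; the service's actual assignment of keys to instances may differ.

```python
def assign_objects(keys, instance_count, distribution="FullyReplicated"):
    """Return one list of S3 keys per training instance (illustrative only)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    keys = sorted(keys)
    if distribution == "FullyReplicated":
        # Every instance receives a full copy of the dataset.
        return [list(keys) for _ in range(instance_count)]
    if distribution == "ShardedByS3Key":
        # Each instance receives a disjoint subset of the objects.
        # Round-robin is an assumption here, not documented service behavior.
        return [keys[i::instance_count] for i in range(instance_count)]
    raise ValueError("unknown distribution type: " + distribution)


# Hypothetical object keys, five files spread over three instances.
example_keys = ["train/part-{}.data".format(i) for i in range(5)]
replicated = assign_objects(example_keys, 3, "FullyReplicated")
sharded = assign_objects(example_keys, 3, "ShardedByS3Key")
print([len(part) for part in replicated])
print([len(part) for part in sharded])
```

With sharding, each instance sees only part of the data, which is why the choice can affect both training time and accuracy.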

data_distribution_types/data_distribution_types.ipynb renamed to advanced_functionality/data_distribution_types/data_distribution_types.ipynb

Lines changed: 33 additions & 14 deletions
@@ -4,7 +4,7 @@
44
"cell_type": "markdown",
55
"metadata": {},
66
"source": [
7-
"# Taking Fully Advantage of Parallelism With Data Distribution\n",
7+
"# Taking Full Advantage of Parallelism With Data Distribution\n",
88
"_**Using Amazon SageMaker's Managed, Distributed Training with Different Data Distribution Methods**_\n",
99
"\n",
1010
"---\n",
@@ -26,12 +26,12 @@
2626
"\n",
2727
"## Background\n",
2828
"\n",
29-
"Amazon SageMaker makes it easy to train machine learning models across a large number of machines. This a non-trivial process, but Amazon SageMaker Algorithms and pre-bruilt MXNet and TensorFlow containers hide most of the complexity from you. Nevertheless, there are decisions on how a user structures their data which will have an implication on how the distributed training is carried out. This notebook will walk through details on setting up your data to take full advantage of distributed processing.\n",
29+
"Amazon SageMaker makes it easy to train machine learning models across a large number of machines. This is a non-trivial process, but Amazon SageMaker Algorithms and pre-built MXNet and TensorFlow containers hide most of the complexity from you. Nevertheless, there are decisions on how a user structures their data which will have an implication on how the distributed training is carried out. This notebook will walk through details on setting up your data to take full advantage of distributed processing.\n",
3030
"\n",
3131
"---\n",
3232
"# Setup\n",
3333
"\n",
34-
"_This notebook was created and tested on an ml.m4xlarge notebook instance._\n",
34+
"_This notebook was created and tested on an ml.m4.xlarge notebook instance._\n",
3535
"\n",
3636
"Let's start by specifying:\n",
3737
"\n",
@@ -78,11 +78,12 @@
7878
"import matplotlib.pyplot as plt\n",
7979
"from IPython.display import display\n",
8080
"import io\n",
81-
"import convert_data\n",
8281
"import time\n",
8382
"import copy\n",
8483
"import json\n",
85-
"import sys"
84+
"import sys\n",
85+
"import sagemaker.amazon.common as smac\n",
86+
"import os"
8687
]
8788
},
8889
{
@@ -160,7 +161,7 @@
160161
"metadata": {},
161162
"source": [
162163
"We can see:\n",
163-
"- `EventCode` is pretty unevently distributed, with some events making up 7%+ of the observations and others being a thousandth of a percent.\n",
164+
"- `EventCode` is pretty unevenly distributed, with some events making up 7%+ of the observations and others being a thousandth of a percent.\n",
164165
"- `AvgTone` seems to be reasonably smoothly distributed, while `NumArticles` has a long tail, and `Actor` geo features have suspiciously large spikes near 0.\n",
165166
"\n",
166167
"Let's remove the (0, 0) lat-longs, one hot encode `EventCode`, and prepare our data for a machine learning model. For this example we'll keep things straightforward and try to predict `AvgTone`, using the other variables in our dataset as features.\n",
@@ -193,12 +194,10 @@
193194
"outputs": [],
194195
"source": [
195196
"def write_to_s3(bucket, prefix, channel, file_prefix, X, y):\n",
196-
" f = io.BytesIO()\n",
197-
" feature_size = X.shape[1]\n",
198-
" for features, target in zip(X, y):\n",
199-
" convert_data.write_recordio(f, convert_data.list_to_record_bytes(features, label=target, feature_size=feature_size))\n",
200-
" f.seek(0)\n",
201-
" boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, channel, file_prefix + '.data')).upload_fileobj(f)\n",
197+
" buf = io.BytesIO()\n",
198+
" smac.write_numpy_to_dense_tensor(buf, X.astype('float32'), y.astype('float32'))\n",
199+
" buf.seek(0)\n",
200+
" boto3.Session().resource('s3').Bucket(bucket).Object(os.path.join(prefix, channel, file_prefix + '.data')).upload_fileobj(buf)\n",
202201
"\n",
203202
"def transform_gdelt(df, events=None):\n",
204203
" df = df[['AvgTone', 'EventCode', 'NumArticles', 'Actor1Geo_Lat', 'Actor1Geo_Long', 'Actor2Geo_Lat', 'Actor2Geo_Long']]\n",
@@ -649,7 +648,7 @@
649648
"cell_type": "markdown",
650649
"metadata": {},
651650
"source": [
652-
"Next, because POST requests to our endpoint are limited to ~6MB, we'll setup a small function to split our test data up into mini-batches, loop through and invoke our endpoint to get predictions, and gather them into a single array."
651+
"Next, because POST requests to our endpoint are limited to ~6MB, we'll set up a small function to split our test data into mini-batches of about 5MB each, loop through them and invoke our endpoint to get predictions, and gather the results into a single array."
653652
]
654653
},
655654
{
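
The mini-batching step described in the markdown cell above can be sketched roughly as follows. This is not the notebook's own helper: `split_into_minibatches`, `predict_in_batches`, and the `predict_fn` callable are hypothetical names, and the size estimate assumes raw float32 rows, so the serialized request may be somewhat larger than the estimate.

```python
import numpy as np


def split_into_minibatches(X, max_bytes=5 * 1024 * 1024):
    """Split the rows of X into chunks of at most max_bytes of raw float32 data."""
    X = np.asarray(X, dtype="float32")
    if X.ndim != 2:
        raise ValueError("X must be a 2-D array of shape (rows, features)")
    bytes_per_row = max(1, X.shape[1] * X.itemsize)
    rows_per_batch = max(1, max_bytes // bytes_per_row)
    return [X[i:i + rows_per_batch] for i in range(0, X.shape[0], rows_per_batch)]


def predict_in_batches(X, predict_fn, max_bytes=5 * 1024 * 1024):
    """Call predict_fn on each mini-batch and gather the results into one array."""
    batches = split_into_minibatches(X, max_bytes)
    if not batches:
        return np.empty(0, dtype="float32")
    return np.concatenate([np.asarray(predict_fn(batch)) for batch in batches])
```

Here `predict_fn` stands in for whatever function wraps the endpoint invocation; keeping the target near 5MB leaves headroom under the ~6MB request limit.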
@@ -714,6 +713,25 @@
714713
"\n",
715714
"Different algorithms can be expected to show variation in which distribution mechanism is most effective at achieving optimal compute spend per point of model accuracy. The message remains the same though, that the process of finding the right distribution type is another experiment in optimizing model training times."
716715
]
716+
},
717+
{
718+
"cell_type": "markdown",
719+
"metadata": {},
720+
"source": [
721+
"### (Optional) Clean-up\n",
722+
"\n",
723+
"If you're ready to be done with this notebook, please uncomment and run the cell below. This will remove the hosted endpoints you created and avoid any charges from a stray instance being left on."
724+
]
725+
},
726+
{
727+
"cell_type": "code",
728+
"execution_count": null,
729+
"metadata": {},
730+
"outputs": [],
731+
"source": [
732+
"#sm.delete_endpoint(EndpointName=sharded_endpoint)\n",
733+
"#sm.delete_endpoint(EndpointName=replicated_endpoint)"
734+
]
717735
}
718736
],
719737
"metadata": {
@@ -733,7 +751,8 @@
733751
"nbconvert_exporter": "python",
734752
"pygments_lexer": "ipython3",
735753
"version": "3.6.3"
736-
}
754+
},
755+
"notice": "Copyright 2017 Amazon.com, Inc. or its affiliates. All Rights Reserved. Licensed under the Apache License, Version 2.0 (the \"License\"). You may not use this file except in compliance with the License. A copy of the License is located at http://aws.amazon.com/apache2.0/ or in the \"license\" file accompanying this file. This file is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License."
737756
},
738757
"nbformat": 4,
739758
"nbformat_minor": 2

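The clean-up cell added in this commit calls `delete_endpoint` once per endpoint. A small helper along the following lines could do the same for any number of endpoints. It is a sketch only: `delete_endpoints` is a hypothetical name, and `sm_client` is assumed to behave like a boto3 SageMaker client, as the notebook's `sm` variable appears to.

```python
def delete_endpoints(sm_client, endpoint_names):
    """Delete each named endpoint and return the names that were deleted."""
    deleted = []
    for name in endpoint_names:
        # Same call the notebook uses: sm.delete_endpoint(EndpointName=...)
        sm_client.delete_endpoint(EndpointName=name)
        deleted.append(name)
    return deleted


# Usage, left commented out like the notebook cell so nothing is removed by accident:
# import boto3
# sm = boto3.client("sagemaker")
# delete_endpoints(sm, [sharded_endpoint, replicated_endpoint])
```
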