documentation: Add processing readthedocs #1226

Merged
merged 2 commits into from
Jan 17, 2020
11 changes: 11 additions & 0 deletions README.rst
@@ -66,6 +66,7 @@ Table of Contents
22. `SageMaker Autopilot <#sagemaker-autopilot>`__
23. `Model Monitoring <#amazon-sagemaker-model-monitoring>`__
24. `SageMaker Debugger <#amazon-sagemaker-debugger>`__
25. `SageMaker Processing <#amazon-sagemaker-processing>`__


Installing the SageMaker Python SDK
@@ -377,3 +378,13 @@ For more information, see `Amazon SageMaker Debugger`_.

.. _Amazon SageMaker Debugger: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_debugger.html


Amazon SageMaker Processing
---------------------------------
Contributor

nit: make the line of dashes the same length as the line of text

same with the other headers


You can use Amazon SageMaker Processing to perform data processing tasks such as data pre- and post-processing, feature engineering, data validation, and model evaluation.


For more information, see `Amazon SageMaker Processing`_.

.. _Amazon SageMaker Processing: https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_processing.html
126 changes: 126 additions & 0 deletions doc/amazon_sagemaker_processing.rst
@@ -0,0 +1,126 @@
.. sectnum::

##############################
Amazon SageMaker Processing
##############################


Amazon SageMaker Processing allows you to run steps for data pre- or post-processing, feature engineering, data validation, or model evaluation workloads on Amazon SageMaker.

.. contents::

Background
==========

Amazon SageMaker lets developers and data scientists train and deploy machine learning models. With Amazon SageMaker Processing, you can run processing jobs for the data processing steps in your machine learning pipeline. Processing jobs accept data from Amazon S3 as input and store data in Amazon S3 as output.

.. image:: ./amazon_sagemaker_processing_image1.png

Setup
=====

The fastest way to get started with Amazon SageMaker Processing is by running a Jupyter notebook. You can follow the `Getting Started with Amazon SageMaker`_ guide to start running notebooks on Amazon SageMaker.

.. _Getting Started with Amazon SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/gs.html

You can run notebooks on Amazon SageMaker that demonstrate end-to-end examples of using processing jobs to perform data pre-processing, feature engineering and model evaluation steps. See `Learn More`_ at the bottom of this page for more in-depth information.


Data Pre-Processing and Model Evaluation with Scikit-Learn
Contributor

scikit-learn's docs leave "scikit-learn" lowercase as far as I can tell while Uncyclopedia capitalizes it as "Scikit-learn"

==================================================================

You can run a Scikit-Learn script to do data processing on SageMaker using the `SKLearnProcessor`_ class.

.. _SKLearnProcessor: https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor

You first create a ``SKLearnProcessor``:

.. code:: python

    from sagemaker.sklearn.processing import SKLearnProcessor

    sklearn_processor = SKLearnProcessor(framework_version='0.20.0',
                                         role='[Your SageMaker-compatible IAM role]',
Contributor

I might remove the brackets from the role placeholder lest anyone think it should be a list 😂

                                         instance_type='ml.m5.xlarge',
                                         instance_count=1)
Comment on lines +42 to +45
Contributor

inconsistent indentation

same below


Then you can run a Scikit-Learn script ``preprocessing.py`` in a processing job. In this example, our script takes one input from S3 and one command-line argument, processes the data, then splits the data into two datasets for output. When the job is finished, we can retrieve the output from S3.

.. code:: python

    from sagemaker.processing import ProcessingInput, ProcessingOutput

    sklearn_processor.run(code='preprocessing.py',
                          inputs=[ProcessingInput(
                              source='s3://your-bucket/path/to/your/data',
                              destination='/opt/ml/processing/input')],
                          outputs=[ProcessingOutput(output_name='train_data',
                                                    source='/opt/ml/processing/train'),
                                   ProcessingOutput(output_name='test_data',
                                                    source='/opt/ml/processing/test')],
                          arguments=['--train-test-split-ratio', '0.2']
                         )

    preprocessing_job_description = sklearn_processor.jobs[-1].describe()
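
To find the results after the job finishes, the S3 URI of each output can be read from the job description. The sketch below assumes that ``describe()`` returns the ``DescribeProcessingJob`` response as a dictionary with the outputs listed under ``ProcessingOutputConfig``. The helper name ``output_uris`` and the sample dictionary are illustrative and not part of the SDK.

```python
def output_uris(job_description):
    """Map each processing output name to the S3 URI it was uploaded to."""
    outputs = job_description["ProcessingOutputConfig"]["Outputs"]
    return {o["OutputName"]: o["S3Output"]["S3Uri"] for o in outputs}


# Hypothetical, trimmed-down description used only to illustrate the shape
# of the response; a real description contains many more fields.
sample_description = {
    "ProcessingOutputConfig": {
        "Outputs": [
            {"OutputName": "train_data",
             "S3Output": {"S3Uri": "s3://your-bucket/job/output/train_data"}},
            {"OutputName": "test_data",
             "S3Output": {"S3Uri": "s3://your-bucket/job/output/test_data"}},
        ]
    }
}

uris = output_uris(sample_description)
print(uris["train_data"])
```

With a real job, ``output_uris(preprocessing_job_description)`` would be the equivalent call.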

For an in-depth look, please see the `Scikit-Learn Data Processing and Model Evaluation`_ example notebook.

.. _Scikit-Learn Data Processing and Model Evaluation: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_processing/scikit_learn_data_processing_and_model_evaluation/scikit_learn_data_processing_and_model_evaluation.ipynb
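
The ``preprocessing.py`` script itself is not shown on this page. As a rough idea of its shape, here is a minimal sketch that reads the ``--train-test-split-ratio`` argument, loads a file from the input directory, and writes two datasets to the ``train`` and ``test`` output directories. It is an assumption-laden illustration, not the script from the example notebook: the file names (``data.csv``, ``train.csv``, ``test.csv``) and the helper functions are hypothetical, and the split uses only the standard library so that the sketch runs without extra dependencies, where a real script would more likely use scikit-learn.

```python
import argparse
import csv
import os
import random


def split_rows(rows, test_ratio, seed=0):
    """Shuffle rows deterministically and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def write_csv(path, rows):
    """Write rows to path, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def main(argv=None, base_dir="/opt/ml/processing"):
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-test-split-ratio", type=float, default=0.2)
    args = parser.parse_args(argv)

    # Assumed input file name; the job copies the S3 input under input/.
    with open(os.path.join(base_dir, "input", "data.csv"), newline="") as f:
        rows = list(csv.reader(f))

    train, test = split_rows(rows, args.train_test_split_ratio)

    # Files written here are uploaded to S3 when the job finishes.
    write_csv(os.path.join(base_dir, "train", "train.csv"), train)
    write_csv(os.path.join(base_dir, "test", "test.csv"), test)
```

Inside the container, the script would end by calling ``main()``. The ``base_dir`` parameter exists only so the sketch can be tried against a local directory.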


Data Pre-Processing with Spark
==============================

You can use the `ScriptProcessor`_ class to run a script in a processing container, including your own container.

.. _ScriptProcessor: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ScriptProcessor

This example shows how to run a processing job in a container that runs a Spark script called ``preprocess.py`` by invoking the command ``/opt/program/submit`` inside the container.

.. code:: python

    from sagemaker.processing import ScriptProcessor, ProcessingInput

    spark_processor = ScriptProcessor(base_job_name='spark-preprocessor',
                                      image_uri='<ECR repository URI to your Spark processing image>',
                                      command=['/opt/program/submit'],
                                      role=role,
                                      instance_count=2,
                                      instance_type='ml.r5.xlarge',
                                      max_runtime_in_seconds=1200,
                                      env={'mode': 'python'})

    spark_processor.run(code='preprocess.py',
                        arguments=['s3_input_bucket', bucket,
                                   's3_input_key_prefix', input_prefix,
                                   's3_output_bucket', bucket,
                                   's3_output_key_prefix', input_preprocessed_prefix],
                        logs=False)
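
The ``arguments`` list above passes names and values as alternating positional strings rather than ``--flag value`` options. One way the script could read them, sketched here under the assumption that arguments always arrive as complete name/value pairs, is to pair them up into a dictionary. The helper name ``parse_key_value_args`` and the sample values are hypothetical.

```python
def parse_key_value_args(argv):
    """Turn ['k1', 'v1', 'k2', 'v2'] into {'k1': 'v1', 'k2': 'v2'}."""
    if len(argv) % 2 != 0:
        raise ValueError("expected an even number of arguments (name/value pairs)")
    it = iter(argv)
    # zip() pulls from the same iterator twice per step: first name, then value.
    return dict(zip(it, it))


# Hypothetical values standing in for sys.argv[1:] inside the container.
args = parse_key_value_args([
    "s3_input_bucket", "my-bucket",
    "s3_input_key_prefix", "raw/",
    "s3_output_bucket", "my-bucket",
    "s3_output_key_prefix", "preprocessed/",
])
print(args["s3_output_key_prefix"])
```

In the script itself, the call would be ``parse_key_value_args(sys.argv[1:])``.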

For an in-depth look, please see the `Feature Transformation with Spark`_ example notebook.

.. _Feature Transformation with Spark: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_processing/feature_transformation_with_sagemaker_processing/feature_transformation_with_sagemaker_processing.ipynb


Learn More
==========

Processing class documentation
------------------------------

- ``Processor``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.Processor
- ``ScriptProcessor``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ScriptProcessor
- ``SKLearnProcessor``: https://sagemaker.readthedocs.io/en/stable/sagemaker.sklearn.html#sagemaker.sklearn.processing.SKLearnProcessor
- ``ProcessingInput``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ProcessingInput
- ``ProcessingOutput``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ProcessingOutput
- ``ProcessingJob``: https://sagemaker.readthedocs.io/en/stable/processing.html#sagemaker.processing.ProcessingJob


Further documentation
---------------------

- Processing class documentation: https://sagemaker.readthedocs.io/en/stable/processing.html
- AWS Documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html
- AWS Notebook examples: https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker_processing
- Processing API documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateProcessingJob.html
- Processing container specification: https://docs.aws.amazon.com/sagemaker/latest/dg/build-your-own-processing-container.html
Binary file added doc/amazon_sagemaker_processing_image1.png
10 changes: 10 additions & 0 deletions doc/index.rst
@@ -219,3 +219,13 @@ You can use Amazon SageMaker Debugger to automatically detect anomalies while tr
    :maxdepth: 2

    amazon_sagemaker_debugger

***************************
Amazon SageMaker Processing
***************************
You can use Amazon SageMaker Processing to perform data processing tasks such as data pre- and post-processing, feature engineering, data validation, and model evaluation.

.. toctree::
    :maxdepth: 2

    amazon_sagemaker_processing
13 changes: 13 additions & 0 deletions doc/overview.rst
@@ -8,6 +8,8 @@ SageMaker Python SDK provides several high-level abstractions for working with A
- **Models**: Encapsulate built ML models.
- **Predictors**: Provide real-time inference and transformation using Python data-types against a SageMaker endpoint.
- **Session**: Provides a collection of methods for working with SageMaker resources.
- **Transformers**: Encapsulate batch transform jobs for inference on SageMaker.
- **Processors**: Encapsulate running processing jobs for data processing on SageMaker.

``Estimator`` and ``Model`` implementations for MXNet, TensorFlow, Chainer, PyTorch, scikit-learn, Amazon SageMaker built-in algorithms, Reinforcement Learning, are included.
There's also an ``Estimator`` that runs SageMaker compatible custom Docker containers, enabling you to run your own ML algorithms by using the SageMaker Python SDK.
@@ -1057,6 +1059,17 @@ For more information, see `SageMaker Debugger`_.

.. _SageMaker Debugger: https://github.com/aws/sagemaker-python-sdk/blob/master/doc/amazon_sagemaker_debugger.rst

********************
SageMaker Processing
********************
You can use Amazon SageMaker Processing with "Processors" to perform data processing tasks such as data pre- and post-processing, feature engineering, data validation, and model evaluation.

.. toctree::
    :maxdepth: 2

    amazon_sagemaker_processing


***
FAQ
***
8 changes: 8 additions & 0 deletions doc/sagemaker.sklearn.rst
@@ -24,3 +24,11 @@ Scikit Learn Predictor
    :members:
    :undoc-members:
    :show-inheritance:

Scikit Learn Processor
----------------------

.. autoclass:: sagemaker.sklearn.processing.SKLearnProcessor
    :members:
    :undoc-members:
    :show-inheritance: