@@ -10,283 +10,7 @@ For more information about Coach, see https://github.com/NervanaSystems/coach
Supported versions of Ray: ``0.5.3`` with TensorFlow.
For more information about Ray, see https://github.com/ray-project/ray
- Table of Contents
- -----------------
-
- 1. `RL Training <#rl-training>`__
- 2. `RL Estimators <#rl-estimators>`__
- 3. `Distributed RL Training <#distributed-rl-training>`__
- 4. `Saving models <#saving-models>`__
- 5. `Deploying RL Models <#deploying-rl-models>`__
- 6. `RL Training Examples <#rl-training-examples>`__
- 7. `SageMaker RL Docker Containers <#sagemaker-rl-docker-containers>`__
-
-
- RL Training
- -----------
-
- Training RL models using ``RLEstimator`` is a two-step process:
-
- 1. Prepare a training script to run on SageMaker.
- 2. Run this script on SageMaker via an ``RLEstimator``.
-
- You should prepare your script in a separate source file from the notebook, terminal session, or source file you're
- using to submit the script to SageMaker via an ``RLEstimator``. This is discussed in further detail below.
-
- Suppose that you already have a training script called ``coach-train.py``.
- You can then create an ``RLEstimator`` with keyword arguments to point to this script and define how SageMaker runs it:
-
- .. code:: python
-
-     from sagemaker.rl import RLEstimator, RLToolkit, RLFramework
-
-     rl_estimator = RLEstimator(entry_point='coach-train.py',
-                                toolkit=RLToolkit.COACH,
-                                toolkit_version='0.11.1',
-                                framework=RLFramework.TENSORFLOW,
-                                role='SageMakerRole',
-                                train_instance_type='ml.p3.2xlarge',
-                                train_instance_count=1)
-
- After that, you simply tell the estimator to start a training job:
-
- .. code:: python
-
-     rl_estimator.fit()
-
- In the following sections, we'll discuss how to prepare a training script for execution on SageMaker
- and how to run that script on SageMaker using ``RLEstimator``.
-
-
- Preparing the RL Training Script
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- Your RL training script must be a Python 3.5 compatible source file if you use the MXNet framework, or Python 3.6 if you use TensorFlow.
-
- The training script is very similar to a training script you might run outside of SageMaker, but you
- can access useful properties about the training environment through various environment variables, such as:
-
- * ``SM_MODEL_DIR``: A string representing the path to the directory to write model artifacts to.
-   These artifacts are uploaded to S3 for model hosting.
- * ``SM_NUM_GPUS``: An integer representing the number of GPUs available to the host.
- * ``SM_OUTPUT_DATA_DIR``: A string representing the filesystem path to write output artifacts to. Output artifacts may
-   include checkpoints, graphs, and other files to save, not including model artifacts. These artifacts are compressed
-   and uploaded to the same S3 prefix as the model artifacts.
-
- For the exhaustive list of available environment variables, see the
- `SageMaker Containers documentation <https://github.com/aws/sagemaker-containers#list-of-provided-environment-variables-by-sagemaker-containers>`__.
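As a minimal sketch (not from the original document), a training script might read these variables with local-testing fallbacks. The helper name and the fallback values are assumptions for running outside SageMaker; on SageMaker the container sets the variables.

```python
import os

# Sketch: read the SageMaker-provided environment variables inside a
# training script. The fallback values are assumptions for local testing;
# on SageMaker the container sets these variables.
def get_training_env():
    return {
        "model_dir": os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
        "output_data_dir": os.environ.get("SM_OUTPUT_DATA_DIR", "/opt/ml/output/data"),
        "num_gpus": int(os.environ.get("SM_NUM_GPUS", "0")),
    }

env = get_training_env()
```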
-
-
- RL Estimators
- -------------
-
- The ``RLEstimator`` constructor takes both required and optional arguments.
-
- Required arguments
- ~~~~~~~~~~~~~~~~~~
-
- The following are required arguments to the ``RLEstimator`` constructor. When you create an instance of ``RLEstimator``, you must include
- these in the constructor, either positionally or as keyword arguments.
-
- - ``entry_point`` Path (absolute or relative) to the Python file which
-   should be executed as the entry point to training.
- - ``role`` An AWS IAM role (either name or full ARN). The Amazon
-   SageMaker training jobs and APIs that create Amazon SageMaker
-   endpoints use this role to access training data and model artifacts.
-   After the endpoint is created, the inference code might use the IAM
-   role, if it accesses AWS resources.
- - ``train_instance_count`` Number of Amazon EC2 instances to use for
-   training.
- - ``train_instance_type`` Type of EC2 instance to use for training, for
-   example, 'ml.m4.xlarge'.
-
- You must also include either:
-
- - ``toolkit`` RL toolkit (Ray RLlib or Coach) you want to use for executing your model training code.
-
- - ``toolkit_version`` RL toolkit version you want to use for executing your model training code.
-
- - ``framework`` Framework (MXNet or TensorFlow) you want to use as the
-   toolkit backend for reinforcement learning training.
-
- or provide:
-
- - ``image_name`` An alternative Docker image to use for training and
-   serving. If specified, the estimator uses this image for training and
-   hosting, instead of selecting the appropriate SageMaker official image based on
-   ``framework_version`` and ``py_version``. Refer to `SageMaker RL Docker Containers
-   <#sagemaker-rl-docker-containers>`_ for details on what the official images support
-   and where to find the source code to build your custom image.
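The either/or rule above can be expressed as a small check. This helper is purely hypothetical, written to illustrate the rule; it is not part of the SDK:

```python
# Hypothetical helper illustrating the argument rule above: either supply
# toolkit + toolkit_version + framework together, or supply image_name.
# Not SDK code; for illustration only.
def has_valid_image_config(toolkit=None, toolkit_version=None,
                           framework=None, image_name=None):
    if image_name is not None:
        return True
    return all(arg is not None for arg in (toolkit, toolkit_version, framework))
```

For example, passing only ``toolkit`` would fail the check, while passing ``image_name`` alone would pass it.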
-
-
- Optional arguments
- ~~~~~~~~~~~~~~~~~~
-
- The following are optional arguments. When you create an ``RLEstimator`` object, you can specify these as keyword arguments.
-
- - ``source_dir`` Path (absolute or relative) to a directory with any
-   other training source code dependencies, including the entry point
-   file. The structure within this directory is preserved when training
-   on SageMaker.
- - ``dependencies (list[str])`` A list of paths to directories (absolute or relative) with
-   any additional libraries that will be exported to the container (default: ``[]``).
-   The library folders are copied to SageMaker in the same folder as the entry point.
-   If ``source_dir`` points to S3, code is uploaded and the S3 location is used
-   instead. Example:
-
-   The following call
-
-   >>> RLEstimator(entry_point='train.py',
-                   toolkit=RLToolkit.COACH,
-                   toolkit_version='0.11.0',
-                   framework=RLFramework.TENSORFLOW,
-                   dependencies=['my/libs/common', 'virtual-env'])
-
-   results in the following inside the container:
-
-   >>> $ ls
-
-   >>> opt/ml/code
-   >>> ├── train.py
-   >>> ├── common
-   >>> └── virtual-env
-
- - ``hyperparameters`` Hyperparameters that will be used for training.
-   Will be made accessible as a ``dict[str, str]`` to the training code on
-   SageMaker. For convenience, this accepts other types besides strings, but
-   ``str`` will be called on keys and values to convert them before
-   training.
- - ``train_volume_size`` Size in GB of the EBS volume to use for storing
-   input data during training. Must be large enough to store training
-   data if ``input_mode='File'`` is used (which is the default).
- - ``train_max_run`` Timeout in seconds for training, after which Amazon
-   SageMaker terminates the job regardless of its current status.
- - ``input_mode`` The input mode that the algorithm supports. Valid
-   modes: 'File' - Amazon SageMaker copies the training dataset from the
-   S3 location to a directory in the Docker container. 'Pipe' - Amazon
-   SageMaker streams data directly from S3 to the container via a Unix
-   named pipe.
- - ``output_path`` S3 location where you want the training result (model
-   artifacts and optional output files) saved. If not specified, results
-   are stored to a default bucket. If the bucket with the specific name
-   does not exist, the estimator creates the bucket during the ``fit``
-   method execution.
- - ``output_kms_key`` Optional KMS key ID to encrypt training
-   output with.
- - ``job_name`` Name to assign to the training job that the ``fit``
-   method launches. If not specified, the estimator generates a default
-   job name, based on the training image name and current timestamp.
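The ``str`` conversion described for ``hyperparameters`` can be sketched as follows. The helper name is an assumption for illustration, not the SDK's internal function:

```python
# Sketch: mimic the str() conversion applied to hyperparameter keys and
# values before they reach the training job. Hypothetical helper, not the
# SDK's actual implementation.
def stringify_hyperparameters(hyperparameters):
    return {str(k): str(v) for k, v in hyperparameters.items()}

# The training code then receives a dict[str, str].
params = stringify_hyperparameters({"learning_rate": 0.001, "epochs": 10})
```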
-
- Calling fit
- ~~~~~~~~~~~
-
- You start your training script by calling ``fit`` on an ``RLEstimator``. ``fit`` takes a few optional
- arguments.
-
- Optional arguments
- ''''''''''''''''''
-
- - ``inputs``: This can take one of the following forms: a string
-   S3 URI, for example ``s3://my-bucket/my-training-data``. In this
-   case, the S3 objects rooted at the ``my-training-data`` prefix will
-   be available in the default ``train`` channel. Or a dict from
-   string channel names to S3 URIs. In this case, the objects rooted at
-   each S3 prefix will be available as files in each channel directory.
- - ``wait``: Defaults to True; whether to block and wait for the
-   training script to complete before returning.
- - ``logs``: Defaults to True; whether to show logs produced by the training
-   job in the Python session. Only meaningful when ``wait`` is True.
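The two accepted forms of ``inputs`` map to channels as sketched below. This normalization helper is a hypothetical illustration of the behavior described above, not SDK code:

```python
# Hypothetical sketch: normalize the two accepted ``inputs`` forms into
# the channel-name -> S3 URI mapping the training job sees.
def normalize_inputs(inputs):
    if isinstance(inputs, str):
        # A bare S3 URI populates the default 'train' channel.
        return {"train": inputs}
    # A dict maps each channel name to its own S3 prefix.
    return dict(inputs)

channels = normalize_inputs("s3://my-bucket/my-training-data")
```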
-
-
- Distributed RL Training
- -----------------------
-
- Amazon SageMaker RL supports multi-core and multi-instance distributed training.
- Depending on your use case, training and/or environment rollout can be distributed.
-
- Please see the `Amazon SageMaker examples <https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning>`_
- for how this can be done using different RL toolkits.
-
-
- Saving models
- -------------
-
- In order to save your trained model for deployment on SageMaker, your training script should save the model
- to the filesystem path ``/opt/ml/model``. This path is also accessible through the environment variable
- ``SM_MODEL_DIR``.
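A minimal sketch of that save step, assuming a JSON-serializable model purely for illustration; a temporary directory stands in for ``SM_MODEL_DIR`` so the snippet runs outside SageMaker:

```python
import json
import os
import tempfile

# Sketch of a training script's save step. On SageMaker you would use
# os.environ["SM_MODEL_DIR"], which resolves to /opt/ml/model; a temporary
# directory stands in for it here so the snippet runs anywhere.
model_dir = tempfile.mkdtemp()  # on SageMaker: os.environ["SM_MODEL_DIR"]
model_path = os.path.join(model_dir, "model.json")
with open(model_path, "w") as f:
    json.dump({"policy_weights": [0.1, 0.2, 0.3]}, f)
```

Anything written under this directory is compressed and uploaded to S3 as the model artifact when training finishes.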
-
- Deploying RL Models
- -------------------
-
- After an ``RLEstimator`` has been fit, you can host the newly created model in SageMaker.
-
- After calling ``fit``, you can call ``deploy`` on an ``RLEstimator`` to create a SageMaker Endpoint.
- The Endpoint runs one of the SageMaker-provided model servers, based on the ``framework`` parameter
- specified in the ``RLEstimator`` constructor, and hosts the model produced by your training script,
- which was run when you called ``fit``. This is the model you saved to ``model_dir``.
- If ``image_name`` was specified, the provided image is used for the deployment.
-
- ``deploy`` returns a ``sagemaker.mxnet.MXNetPredictor`` for MXNet or
- a ``sagemaker.tensorflow.serving.Predictor`` for TensorFlow.
-
- ``predict`` returns the result of inference against your model.
-
- .. code:: python
-
-     # Train my estimator
-     rl_estimator = RLEstimator(entry_point='coach-train.py',
-                                toolkit=RLToolkit.COACH,
-                                toolkit_version='0.11.0',
-                                framework=RLFramework.MXNET,
-                                role='SageMakerRole',
-                                train_instance_type='ml.c4.2xlarge',
-                                train_instance_count=1)
-
-     rl_estimator.fit()
-
-     # Deploy my estimator to a SageMaker Endpoint and get an MXNetPredictor
-     predictor = rl_estimator.deploy(instance_type='ml.m4.xlarge',
-                                     initial_instance_count=1)
-
-     response = predictor.predict(data)
-
- For more information, please see the `SageMaker MXNet Model Server <https://github.com/aws/sagemaker-python-sdk/tree/master/src/sagemaker/mxnet#the-sagemaker-mxnet-model-server>`_
- and `Deploying to TensorFlow Serving Endpoints <https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/tensorflow/deploying_tensorflow_serving.rst>`_ documentation.
-
-
- Working with Existing Training Jobs
- -----------------------------------
-
- Attaching to existing training jobs
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- You can attach an RL Estimator to an existing training job using the
- ``attach`` method.
-
- .. code:: python
-
-     my_training_job_name = 'MyAwesomeRLTrainingJob'
-     rl_estimator = RLEstimator.attach(my_training_job_name)
-
- After attaching, if the training job has finished with job status "Completed", it can be
- ``deploy``\ ed to create a SageMaker Endpoint and return a ``Predictor``. If the training job is in progress,
- ``attach`` will block and display log messages from the training job until the training job completes.
-
- The ``attach`` method accepts the following arguments:
-
- - ``training_job_name:`` The name of the training job to attach
-   to.
- - ``sagemaker_session:`` The Session used
-   to interact with SageMaker.
-
- RL Training Examples
- --------------------
-
- Amazon provides several example Jupyter notebooks that demonstrate end-to-end training on Amazon SageMaker using RL.
- Please refer to:
-
- https://github.com/awslabs/amazon-sagemaker-examples/tree/master/reinforcement_learning
-
- These examples are also available as Jupyter notebooks hosted on SageMaker Notebook Instances, under the sample notebooks folder.
-
+ For information about using RL with the SageMaker Python SDK, see https://sagemaker.readthedocs.io/en/stable/using_rl.html.

SageMaker RL Docker Containers
------------------------------