doc/using_mxnet.rst (11 additions & 7 deletions)
@@ -202,7 +202,7 @@ It is good practice to save the best model after each training epoch,
so that you can resume a training job if it gets interrupted.
This is particularly important if you are using Managed Spot training.

-To save MXNet model checkpoints, do the following in your training script:
+To save MXNet model checkpoints, do the following in your training script:

* Set the ``CHECKPOINTS_DIR`` environment variable and enable checkpoints.
@@ -213,7 +213,7 @@ To save MXNet model checkpoints, do the following in your training script:
* Make sure you are emitting a validation metric to test the model. For information, see `Evaluation Metric API <https://mxnet.incubator.apache.org/api/python/metric/metric.html>`_.
* After each training epoch, test whether the current model performs the best with respect to the validation metric, and if it does, save that model to ``CHECKPOINTS_DIR``.
-
+
.. code:: python

    if checkpoints_enabled and current_host == hosts[0]:
@@ -224,7 +224,7 @@ To save MXNet model checkpoints, do the following in your training script:
For a complete example of an MXNet training script that implements checkpointing, see https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/mxnet_gluon_cifar10/cifar10.py.
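
The "save only the best model" step can be sketched without any framework code. The following is a minimal illustration, not the linked script's implementation: ``save_if_best`` is a hypothetical helper, the JSON file stands in for whatever save call your model uses, a temporary directory stands in for ``CHECKPOINTS_DIR``, and it assumes a higher validation metric is better (for example, accuracy).

```python
import json
import os
import tempfile


def save_if_best(checkpoints_dir, epoch, metric, best_metric, params):
    """Write a checkpoint only if `metric` beats `best_metric`.

    Hypothetical helper: returns the best metric seen so far.
    Assumes a higher metric is better.
    """
    if best_metric is not None and metric <= best_metric:
        return best_metric  # no improvement, keep the previous checkpoint
    os.makedirs(checkpoints_dir, exist_ok=True)
    path = os.path.join(checkpoints_dir, "best-model.json")
    # JSON is a stand-in for the framework's own parameter-saving call.
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "metric": metric, "params": params}, f)
    return metric


# Assumed values for illustration; a real script reads these from its environment.
checkpoints_dir = tempfile.mkdtemp()  # stands in for CHECKPOINTS_DIR
hosts = ["algo-1", "algo-2"]
current_host = "algo-1"
checkpoints_enabled = True

best_metric = None
for epoch, val_accuracy in enumerate([0.61, 0.58, 0.70]):
    params = {"weight": epoch}  # placeholder for the model parameters
    # Only the first host writes checkpoints, as in the snippet above.
    if checkpoints_enabled and current_host == hosts[0]:
        best_metric = save_if_best(
            checkpoints_dir, epoch, val_accuracy, best_metric, params
        )
```

Restricting the write to ``hosts[0]`` keeps several hosts from overwriting the same file, and overwriting a single ``best-model`` file means a resumed job always finds exactly one checkpoint to load.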
-
+

Updating your MXNet training script
-----------------------------------
@@ -331,7 +331,7 @@ The following code sample shows how you train a custom MXNet script "train.py".
Then, when writing a distributed training script, use an MXNet kvstore to store and share model parameters.
During training, SageMaker automatically starts MXNet kvstore server and scheduler processes on hosts in your training job cluster.
Your script runs as an MXNet worker task, with one server process on each host in your cluster.
One host is selected arbitrarily to run the scheduler process.

To learn more about writing distributed MXNet programs, see `Distributed Training <https://mxnet.incubator.apache.org/versions/master/faq/distributed_training.html>`__ in the MXNet docs.