Commit cbf9b58

Merge remote-tracking branch 'upstream/master'
2 parents: 75b127c + 791bf0a

113 files changed, +17436 -1991 lines changed

.gitignore

Lines changed: 3 additions & 1 deletion

@@ -25,4 +25,6 @@ venv/
 *~
 .pytest_cache/
 *.swp
-.docker/
+.docker/
+env/
+.vscode/

CHANGELOG.md

Lines changed: 53 additions & 0 deletions

@@ -1,5 +1,58 @@
 # Changelog
 
+## v2.19.0 (2020-12-08)
+
+### Features
+
+* add tensorflow 1.15.4 and 2.3.1 as valid versions
+* add py36 as valid python version for pytorch 1.6.0
+* auto-select container version for p4d and smdistributed
+* add edge packaging job support
+* Add Clarify Processor, Model Bias, Explainability, and Quality Monitors support. (#494)
+* add model parallelism support
+* add data parallelism support (#454) (#511)
+* support creating and updating profiler in training job (#444) (#526)
+
+### Bug Fixes and Other Changes
+
+* bump boto3 and smdebug_rulesconfig versions for reinvent and enable data parallel integ tests
+* run UpdateTrainingJob tests only during allowed secondary status
+* Remove workarounds and apply fixes to Clarify and MM integ tests
+* add p4d to smdataparallel supported instances
+* Mount metadata directory when starting local mode docker container
+* add integ test for profiler
+* Re-enable model monitor integration tests.
+
+### Documentation Changes
+
+* add SageMaker distributed libraries documentation
+* update documentation for the new SageMaker Debugger APIs
+* minor updates to doc strings
+
+## v2.18.0 (2020-12-03)
+
+### Features
+
+* all de/serializers support content type
+* warn on 'Stopped' (non-Completed) jobs
+* all predictors support serializer/deserializer overrides
+
+### Bug Fixes and Other Changes
+
+* v2 upgrade tool should ignore cell starting with '%'
+* use iterrows to iterate pandas dataframe
+* check for distributions in TF estimator
+
+### Documentation Changes
+
+* Update link to Sagemaker PyTorch Docker Containers
+* create artifact restricted to SM context note
+
+### Testing and Release Infrastructure
+
+* remove flaky assertion in test_integ_history_server
+* adjust assertion of TensorFlow MNIST test
+
 ## v2.17.0 (2020-12-02)
 
 ### Features
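The v2.18.0 entry "all predictors support serializer/deserializer overrides" refers to the v2 Predictor constructor arguments. A minimal sketch, assuming an already-deployed endpoint; the endpoint name and payload are placeholders:

from sagemaker.deserializers import JSONDeserializer
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

# Override the default serializer/deserializer when attaching to an endpoint;
# "my-endpoint" is a placeholder for an existing SageMaker endpoint.
predictor = Predictor(
    endpoint_name="my-endpoint",
    serializer=JSONSerializer(),      # request body sent as application/json
    deserializer=JSONDeserializer(),  # response body parsed from application/json
)
result = predictor.predict({"instances": [[1.0, 2.0, 3.0]]})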

VERSION

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-2.17.1.dev0
+2.19.1.dev0

doc/_static/theme_overrides.css

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+/* override table width restrictions */
+.wy-table-responsive table td, .wy-table-responsive table th {
+    white-space: normal;
+}
+
+.wy-table-responsive {
+    margin-bottom: 24px;
+    max-width: 100%;
+    overflow: visible;
+}
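How this stylesheet gets picked up is not shown in the diff; a minimal sketch of the usual Sphinx wiring, assuming the project registers it in doc/conf.py (the actual hookup used by this commit is an assumption):

# doc/conf.py (sketch) -- registering the override stylesheet with Sphinx.
# Only the CSS file itself appears in the diff above; this wiring is assumed.
html_static_path = ["_static"]

def setup(app):
    # add_css_file() is the standard Sphinx API for extra stylesheets.
    app.add_css_file("theme_overrides.css")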

doc/api/index.rst

Lines changed: 1 addition & 0 deletions

@@ -9,5 +9,6 @@ The SageMaker Python SDK consists of a variety classes for preparing data, train
 
    prep_data/feature_store
    training/index
+   training/distributed
   inference/index
    utility/index

doc/api/training/debugger.rst

Lines changed: 75 additions & 3 deletions

@@ -1,7 +1,79 @@
 Debugger
 --------
 
-.. automodule:: sagemaker.debugger
-    :members:
-    :undoc-members:
+Amazon SageMaker Debugger provides full visibility
+into training jobs of state-of-the-art machine learning models.
+This SageMaker Debugger module provides high-level methods
+to set up Debugger configurations to
+monitor, profile, and debug your training job.
+Configure the Debugger-specific parameters when constructing
+a SageMaker estimator to gain visibility and insights
+into your training job.
+
+.. currentmodule:: sagemaker.debugger
+
+.. autoclass:: get_rule_container_image_uri
+    :show-inheritance:
+
+.. autoclass:: get_default_profiler_rule
+    :show-inheritance:
+
+.. class:: sagemaker.debugger.rule_configs
+
+    A helper module to configure the SageMaker Debugger built-in rules with
+    the :class:`~sagemaker.debugger.Rule` classmethods and
+    the :class:`~sagemaker.debugger.ProfilerRule` classmethods.
+
+    For a full list of built-in rules, see
+    `List of Debugger Built-in Rules
+    <https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html>`_.
+
+    This module is imported from the Debugger client library for rule configuration.
+    For more information, see
+    `Amazon SageMaker Debugger RulesConfig
+    <https://github.com/awslabs/sagemaker-debugger-rulesconfig>`_.
+
+.. autoclass:: RuleBase
+    :show-inheritance:
+
+.. autoclass:: Rule
+    :show-inheritance:
+    :inherited-members:
+
+.. autoclass:: ProfilerRule
+    :show-inheritance:
+    :inherited-members:
+
+.. autoclass:: CollectionConfig
+    :show-inheritance:
+
+.. autoclass:: DebuggerHookConfig
     :show-inheritance:
+
+.. autoclass:: TensorBoardOutputConfig
+    :show-inheritance:
+
+.. autoclass:: ProfilerConfig
+    :show-inheritance:
+
+.. autoclass:: FrameworkProfile
+    :show-inheritance:
+
+.. autoclass:: DetailedProfilingConfig
+    :show-inheritance:
+
+.. autoclass:: DataloaderProfilingConfig
+    :show-inheritance:
+
+.. autoclass:: PythonProfilingConfig
+    :show-inheritance:
+
+.. autoclass:: PythonProfiler
+    :show-inheritance:
+
+.. autoclass:: cProfileTimer
+    :show-inheritance:
+
+.. automodule:: sagemaker.debugger.metrics_config
+    :members: StepRange, TimeRange
+    :undoc-members:
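To illustrate the "configure the Debugger-specific parameters when constructing a SageMaker estimator" guidance above, a minimal sketch using the classes documented in this file; the script name, role ARN, instance choice, and the vanishing_gradient rule are placeholders:

from sagemaker.debugger import FrameworkProfile, ProfilerConfig, Rule, rule_configs
from sagemaker.tensorflow import TensorFlow

estimator = TensorFlow(
    entry_point="train.py",                               # placeholder training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    framework_version="2.3.1",
    py_version="py37",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    # A built-in rule configured through the rule_configs helper module.
    rules=[Rule.sagemaker(rule_configs.vanishing_gradient())],
    # Profiler settings documented above (ProfilerConfig / FrameworkProfile).
    profiler_config=ProfilerConfig(
        system_monitor_interval_millis=500,
        framework_profile_params=FrameworkProfile(),
    ),
)
estimator.fit("s3://my-bucket/training-data")  # placeholder S3 input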

doc/api/training/distributed.rst

Lines changed: 11 additions & 0 deletions

@@ -0,0 +1,11 @@
+Distributed Training APIs
+-------------------------
+SageMaker distributed training libraries offer both data parallel and model parallel training strategies.
+They combine software and hardware technologies to improve inter-GPU and inter-node communications.
+They extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.
+
+.. toctree::
+    :maxdepth: 3
+
+    smd_data_parallel
+    smd_model_parallel

doc/api/training/index.rst

Lines changed: 9 additions & 3 deletions

@@ -3,7 +3,13 @@ Training APIs
 #############
 
 .. toctree::
-    :maxdepth: 1
-    :glob:
+    :maxdepth: 4
 
-    *
+    analytics
+    automl
+    debugger
+    estimators
+    algorithm
+    tuner
+    parameter
+    processing

doc/api/training/processing.rst

Lines changed: 5 additions & 0 deletions

@@ -10,3 +10,8 @@ Processing
     :members:
     :undoc-members:
     :show-inheritance:
+
+.. automodule:: sagemaker.clarify
+    :members:
+    :undoc-members:
+    :show-inheritance:
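Since sagemaker.clarify is newly documented here (and new in v2.19.0 per the changelog), a minimal sketch of a pre-training bias job with it; the bucket, role ARN, and column names are placeholders:

from sagemaker import clarify

processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    instance_count=1,
    instance_type="ml.c5.xlarge",
)
data_config = clarify.DataConfig(
    s3_data_input_path="s3://my-bucket/train.csv",   # placeholder dataset
    s3_output_path="s3://my-bucket/clarify-output",  # placeholder output location
    label="target",                                  # placeholder label column
    dataset_type="text/csv",
)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],  # positive outcome value(s)
    facet_name="age",               # placeholder sensitive attribute
)
# Runs a SageMaker Processing job that computes pre-training bias metrics.
processor.run_pre_training_bias(data_config, bias_config)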
doc/api/training/smd_data_parallel.rst

Lines changed: 64 additions & 0 deletions

@@ -0,0 +1,64 @@
+###################################
+Distributed data parallel
+###################################
+
+SageMaker distributed data parallel (SDP) extends SageMaker’s training
+capabilities on deep learning models with near-linear scaling efficiency,
+achieving fast time-to-train with minimal code changes.
+
+- SDP optimizes your training job for AWS network infrastructure and EC2 instance topology.
+- SDP takes advantage of gradient updates to communicate between nodes with a custom AllReduce algorithm.
+
+When training a model on a large amount of data, machine learning practitioners
+will often turn to distributed training to reduce the time to train.
+In some cases, where time is of the essence,
+the business requirement is to finish training as quickly as possible or at
+least within a constrained time period.
+Then, distributed training is scaled to use a cluster of multiple nodes,
+meaning not just multiple GPUs in a computing instance, but multiple instances
+with multiple GPUs. However, as the cluster size increases, performance can drop
+significantly. This drop is primarily caused by the communication
+overhead between nodes in a cluster.
+
+.. rubric:: Customize your training script
+
+To customize your own training script, you will need the following:
+
+- You must provide TensorFlow / PyTorch training scripts that are
+  adapted to use SDP.
+- Your input data must be in an S3 bucket or in FSx in the AWS region
+  that you will use to launch your training job. If you use the Jupyter
+  notebooks provided, create a SageMaker notebook instance in the same
+  region as the bucket that contains your input data. For more
+  information about storing your training data, refer to
+  the `SageMaker Python SDK data
+  inputs <https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs>`__ documentation.
+
+Use the API guides for each framework to see
+examples of adapted training scripts, and use them as a reference to convert your own.
+Then, use one of the example notebooks as your template to launch a training job.
+You’ll need to swap your training script with the one that came with the
+notebook and modify any input functions as necessary.
+Once you have launched a training job, you can monitor it using CloudWatch.
+
+Then you can see how to deploy your trained model to an endpoint by
+following one of the example notebooks for deploying a model. Finally,
+you can follow an example notebook to test inference on your deployed
+model.
+
+.. toctree::
+    :maxdepth: 2
+
+    smd_data_parallel_pytorch
+    smd_data_parallel_tensorflow
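This page pairs with the "add data parallelism support" changelog entry above. A minimal launcher sketch, assuming a train.py already adapted to use SDP; the script name, bucket, and role ARN are placeholders:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # placeholder script, already adapted to use SDP
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    framework_version="1.6.0",
    py_version="py36",       # py36 is valid for pytorch 1.6.0 per the changelog
    instance_count=2,
    # SDP targets large multi-GPU instances such as ml.p3.16xlarge or ml.p4d.24xlarge.
    instance_type="ml.p3.16xlarge",
    # Enables the smdistributed dataparallel option added in v2.19.0.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://my-bucket/training-data")  # input data in the same region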
