
Commit e961fff

Merge branch 'master' into master
2 parents: f71b8e0 + 9dead38

File tree

157 files changed: +18568 additions, -3163 deletions


.gitignore

Lines changed: 3 additions & 1 deletion
@@ -25,4 +25,6 @@ venv/
 *~
 .pytest_cache/
 *.swp
-.docker/
+.docker/
+env/
+.vscode/

.pydocstylerc

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 [pydocstyle]
 inherit = false
-ignore = D104,D107,D202,D203,D205,D212,D213,D214,D400,D401,D404,D406,D407,D411,D413,D414,D415,D417
+ignore = D104,D107,D202,D203,D212,D213,D214,D400,D401,D404,D406,D407,D411,D413,D414,D415,D417
 match = (?!record_pb2).*\.py

CHANGELOG.md

Lines changed: 45 additions & 0 deletions
@@ -1,5 +1,50 @@
 # Changelog
 
+## v2.20.0 (2020-12-16)
+
+### Features
+
+* add dataset definition support for processing jobs
+
+### Bug Fixes and Other Changes
+
+* include workflow integ tests with clarify and debugger enabled
+* only run DataParallel and EdgePackaging tests in supported regions
+
+### Documentation Changes
+
+* fix smp code example, add note for CUDA 11 to sdp
+* adding note about CUDA 11 to SMP. Small title update PyTorch
+
+## v2.19.0 (2020-12-08)
+
+### Features
+
+* add tensorflow 1.15.4 and 2.3.1 as valid versions
+* add py36 as valid python version for pytorch 1.6.0
+* auto-select container version for p4d and smdistributed
+* add edge packaging job support
+* Add Clarify Processor, Model Bias, Explainability, and Quality Monitors support. (#494)
+* add model parallelism support
+* add data parallelism support (#454) (#511)
+* support creating and updating profiler in training job (#444) (#526)
+
+### Bug Fixes and Other Changes
+
+* bump boto3 and smdebug_rulesconfig versions for reinvent and enable data parallel integ tests
+* run UpdateTrainingJob tests only during allowed secondary status
+* Remove workarounds and apply fixes to Clarify and MM integ tests
+* add p4d to smdataparallel supported instances
+* Mount metadata directory when starting local mode docker container
+* add integ test for profiler
+* Re-enable model monitor integration tests.
+
+### Documentation Changes
+
+* add SageMaker distributed libraries documentation
+* update documentation for the new SageMaker Debugger APIs
+* minor updates to doc strings
+
 ## v2.18.0 (2020-12-03)
 
 ### Features

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.18.1.dev0
+2.20.1.dev0

doc/_static/theme_overrides.css

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+/* override table width restrictions */
+.wy-table-responsive table td, .wy-table-responsive table th {
+    white-space: normal;
+}
+
+.wy-table-responsive {
+    margin-bottom: 24px;
+    max-width: 100%;
+    overflow: visible;
+}
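
This stylesheet only takes effect if the Sphinx build picks it up. The commit does not show how the file is registered, so the following is a minimal sketch, assuming the project's doc/conf.py uses Sphinx's html_css_files option and serves static assets from doc/_static; the actual wiring in this repository may differ:

# doc/conf.py (sketch, not part of this commit)
html_static_path = ["_static"]
# Loads doc/_static/theme_overrides.css after the theme's own CSS.
html_css_files = ["theme_overrides.css"]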

doc/api/index.rst

Lines changed: 1 addition & 0 deletions
@@ -9,5 +9,6 @@ The SageMaker Python SDK consists of a variety classes for preparing data, train
 
    prep_data/feature_store
    training/index
+   training/distributed
    inference/index
    utility/index

doc/api/training/debugger.rst

Lines changed: 75 additions & 3 deletions
@@ -1,7 +1,79 @@
 Debugger
 --------
 
-.. automodule:: sagemaker.debugger
-   :members:
-   :undoc-members:
+Amazon SageMaker Debugger provides full visibility
+into training jobs of state-of-the-art machine learning models.
+This SageMaker Debugger module provides high-level methods
+to set up Debugger configurations to
+monitor, profile, and debug your training job.
+Configure the Debugger-specific parameters when constructing
+a SageMaker estimator to gain visibility and insights
+into your training job.
+
+.. currentmodule:: sagemaker.debugger
+
+.. autoclass:: get_rule_container_image_uri
+   :show-inheritance:
+
+.. autoclass:: get_default_profiler_rule
+   :show-inheritance:
+
+.. class:: sagemaker.debugger.rule_configs
+
+   A helper module to configure the SageMaker Debugger built-in rules with
+   the :class:`~sagemaker.debugger.Rule` classmethods and
+   the :class:`~sagemaker.debugger.ProfilerRule` classmethods.
+
+   For a full list of built-in rules, see
+   `List of Debugger Built-in Rules
+   <https://docs.aws.amazon.com/sagemaker/latest/dg/debugger-built-in-rules.html>`_.
+
+   This module is imported from the Debugger client library for rule configuration.
+   For more information, see
+   `Amazon SageMaker Debugger RulesConfig
+   <https://github.com/awslabs/sagemaker-debugger-rulesconfig>`_.
+
+.. autoclass:: RuleBase
+   :show-inheritance:
+
+.. autoclass:: Rule
+   :show-inheritance:
+   :inherited-members:
+
+.. autoclass:: ProfilerRule
+   :show-inheritance:
+   :inherited-members:
+
+.. autoclass:: CollectionConfig
+   :show-inheritance:
+
+.. autoclass:: DebuggerHookConfig
    :show-inheritance:
+
+.. autoclass:: TensorBoardOutputConfig
+   :show-inheritance:
+
+.. autoclass:: ProfilerConfig
+   :show-inheritance:
+
+.. autoclass:: FrameworkProfile
+   :show-inheritance:
+
+.. autoclass:: DetailedProfilingConfig
+   :show-inheritance:
+
+.. autoclass:: DataloaderProfilingConfig
+   :show-inheritance:
+
+.. autoclass:: PythonProfilingConfig
+   :show-inheritance:
+
+.. autoclass:: PythonProfiler
+   :show-inheritance:
+
+.. autoclass:: cProfileTimer
+   :show-inheritance:
+
+.. automodule:: sagemaker.debugger.metrics_config
+   :members: StepRange, TimeRange
+   :undoc-members:
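
The new Debugger page tells readers to configure Debugger-specific parameters on the estimator. The following is a minimal sketch of that flow, combining a built-in rule from rule_configs with the ProfilerConfig and FrameworkProfile classes documented above; the entry point, IAM role, and S3 path are placeholders, not values taken from this commit:

from sagemaker.debugger import (
    FrameworkProfile,
    ProfilerConfig,
    Rule,
    rule_configs,
)
from sagemaker.pytorch import PyTorch

# Attach a built-in Debugger rule and a profiler configuration to a training job.
estimator = PyTorch(
    entry_point="train.py",                              # placeholder training script
    role="arn:aws:iam::111122223333:role/ExampleRole",   # placeholder IAM role
    framework_version="1.6.0",
    py_version="py36",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    rules=[Rule.sagemaker(rule_configs.vanishing_gradient())],
    profiler_config=ProfilerConfig(
        system_monitor_interval_millis=500,
        framework_profile_params=FrameworkProfile(),
    ),
)
estimator.fit("s3://example-bucket/training-data")       # placeholder input location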

doc/api/training/distributed.rst

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+Distributed Training APIs
+-------------------------
+SageMaker distributed training libraries offer both data parallel and model parallel training strategies.
+They combine software and hardware technologies to improve inter-GPU and inter-node communications.
+They extend SageMaker’s training capabilities with built-in options that require only small code changes to your training scripts.
+
+.. toctree::
+   :maxdepth: 3
+
+   smd_data_parallel
+   smd_model_parallel

doc/api/training/index.rst

Lines changed: 9 additions & 3 deletions
@@ -3,7 +3,13 @@ Training APIs
 #############
 
 .. toctree::
-   :maxdepth: 1
-   :glob:
+   :maxdepth: 4
 
-   *
+   analytics
+   automl
+   debugger
+   estimators
+   algorithm
+   tuner
+   parameter
+   processing

doc/api/training/processing.rst

Lines changed: 5 additions & 0 deletions
@@ -10,3 +10,8 @@ Processing
    :members:
    :undoc-members:
    :show-inheritance:
+
+.. automodule:: sagemaker.clarify
+   :members:
+   :undoc-members:
+   :show-inheritance:
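
The processing page now also pulls in sagemaker.clarify, the processor added in v2.19.0. Below is a brief sketch of a pre-training bias analysis with it; the IAM role, bucket, and column names are placeholders, and the exact keyword arguments are those documented by the automodule entry above:

from sagemaker import clarify

processor = clarify.SageMakerClarifyProcessor(
    role="arn:aws:iam::111122223333:role/ExampleRole",  # placeholder IAM role
    instance_count=1,
    instance_type="ml.c5.xlarge",
)

data_config = clarify.DataConfig(
    s3_data_input_path="s3://example-bucket/train.csv",   # placeholder dataset
    s3_output_path="s3://example-bucket/clarify-output",  # placeholder output prefix
    label="target",
    headers=["target", "age", "gender"],                  # placeholder column names
    dataset_type="text/csv",
)
bias_config = clarify.BiasConfig(
    label_values_or_threshold=[1],
    facet_name="gender",
)

# Runs a pre-training bias report as a SageMaker Processing job.
processor.run_pre_training_bias(data_config, bias_config)
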
Lines changed: 72 additions & 0 deletions
@@ -0,0 +1,72 @@
+###################################
+Distributed data parallel
+###################################
+
+SageMaker distributed data parallel (SDP) extends SageMaker’s training
+capabilities on deep learning models with near-linear scaling efficiency,
+achieving fast time-to-train with minimal code changes.
+
+- SDP optimizes your training job for AWS network infrastructure and EC2 instance topology.
+- SDP takes advantage of gradient updates to communicate between nodes with a custom AllReduce algorithm.
+
+When training a model on a large amount of data, machine learning practitioners
+often turn to distributed training to reduce the time to train.
+In some cases, where time is of the essence,
+the business requirement is to finish training as quickly as possible or at
+least within a constrained time period.
+Then, distributed training is scaled to use a cluster of multiple nodes,
+meaning not just multiple GPUs in a computing instance, but multiple instances
+with multiple GPUs. As the cluster size increases, performance drops
+significantly. This drop is caused primarily by the communications
+overhead between nodes in a cluster.
+
+.. important::
+   SDP only supports training jobs using CUDA 11. When you define a PyTorch or TensorFlow
+   ``Estimator`` with the ``dataparallel`` parameter ``enabled`` set to ``True``,
+   it uses CUDA 11. When you extend or customize your own training image,
+   you must use a CUDA 11 base image. See
+   `SageMaker Python SDK's SDP APIs
+   <https://docs.aws.amazon.com/sagemaker/latest/dg/data-parallel-use-api.html#data-parallel-use-python-skd-api>`__
+   for more information.
+
+.. rubric:: Customize your training script
+
+To customize your own training script, you will need the following:
+
+- You must provide TensorFlow / PyTorch training scripts that are
+  adapted to use SDP.
+- Your input data must be in an S3 bucket or in FSx in the AWS region
+  that you will use to launch your training job. If you use the Jupyter
+  notebooks provided, create a SageMaker notebook instance in the same
+  region as the bucket that contains your input data. For more
+  information about storing your training data, refer to
+  the `SageMaker Python SDK data
+  inputs <https://sagemaker.readthedocs.io/en/stable/overview.html#use-file-systems-as-training-inputs>`__ documentation.
+
+Use the API guides for each framework to see
+examples of training scripts that can be used to convert your training scripts.
+Then, use one of the example notebooks as your template to launch a training job.
+You’ll need to swap your training script with the one that came with the
+notebook and modify any input functions as necessary.
+Once you have launched a training job, you can monitor it using CloudWatch.
+
+Then you can see how to deploy your trained model to an endpoint by
+following one of the example notebooks for deploying a model. Finally,
+you can follow an example notebook to test inference on your deployed
+model.
+
+.. toctree::
+   :maxdepth: 2
+
+   smd_data_parallel_pytorch
+   smd_data_parallel_tensorflow
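
The important note above says SDP is switched on through the estimator's ``dataparallel`` option. Here is a minimal sketch of launching such a job with the PyTorch estimator; the script, IAM role, and S3 path are placeholders, and the instance type must be one of the SDP-supported multi-GPU types:

from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train_ddp.py",                          # placeholder SDP-adapted script
    role="arn:aws:iam::111122223333:role/ExampleRole",   # placeholder IAM role
    framework_version="1.6.0",
    py_version="py36",
    instance_count=2,
    instance_type="ml.p3.16xlarge",                      # e.g. ml.p3.16xlarge or ml.p4d.24xlarge
    # Enabling dataparallel selects an SDP-compatible (CUDA 11) container.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)
estimator.fit("s3://example-bucket/training-data")       # placeholder input location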
