Commit 1d68db8

Merge branch 'master' into pt_171_smddp

2 parents: bcbca31 + 5edee07

171 files changed, +5770 -2041 lines changed

.pydocstylerc

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
 [pydocstyle]
 inherit = false
-ignore = D104,D107,D202,D203,D205,D212,D213,D214,D400,D401,D404,D406,D407,D411,D413,D414,D415,D417
+ignore = D104,D107,D202,D203,D213,D214,D400,D401,D404,D406,D407,D411,D413,D414,D415,D417
 match = (?!record_pb2).*\.py
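Dropping D205 and D212 from the ignore list means pydocstyle now enforces both checks: D205 (1 blank line required between summary line and description) and D212 (multi-line docstring summary should start at the first line). The CHANGELOG entries for v2.21.0 and v2.23.2 below record the same change. A minimal sketch of a docstring that satisfies the stricter convention; the function itself is made up for illustration:

```python
def per_worker_batch_size(global_batch_size, world_size):
    """Return the per-worker batch size for a given world size.

    D212 is satisfied because the summary sits on the first line of the
    docstring; D205 is satisfied because exactly one blank line separates
    the summary from this description.
    """
    return max(global_batch_size // world_size, 1)
```

Running `pydocstyle` over a module written this way reports no D205/D212 violations, which is what the tightened config now requires.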

CHANGELOG.md

Lines changed: 177 additions & 0 deletions
@@ -1,5 +1,182 @@
 # Changelog
 
+## v2.24.5 (2021-02-12)
+
+### Bug Fixes and Other Changes
+
+ * test_tag/test_tags method assert fix in association tests
+
+### Documentation Changes
+
+ * removing mention of TF 2.4 from SM distributed model parallel docs
+ * adding details about mpi options, other small updates
+
+## v2.24.4 (2021-02-09)
+
+### Bug Fixes and Other Changes
+
+ * add integration test for listing artifacts by type
+ * List Associations integ tests
+
+## v2.24.3 (2021-02-04)
+
+### Bug Fixes and Other Changes
+
+ * Remove pytest fixture and fix test_tag/s method
+
+## v2.24.2 (2021-02-03)
+
+### Bug Fixes and Other Changes
+
+ * use 3.5 version of get-pip.py
+ * SM DDP release notes/changelog files
+
+### Documentation Changes
+
+ * adding versioning to sm distributed data parallel docs
+
+## v2.24.1 (2021-01-28)
+
+### Bug Fixes and Other Changes
+
+ * fix collect-tests tox env
+ * create profiler specific unsupported regions
+ * Update smd_model_parallel_pytorch.rst
+
+## v2.24.0 (2021-01-22)
+
+### Features
+
+ * add support for Std:Join for pipelines
+ * Map image name to image uri
+ * friendly names for short URIs
+
+### Bug Fixes and Other Changes
+
+ * increase allowed time for search to get updated
+ * refactor distribution config construction
+
+### Documentation Changes
+
+ * Add SMP 1.2.0 API docs
+
+## v2.23.6 (2021-01-20)
+
+### Bug Fixes and Other Changes
+
+ * add artifact, action, context to virsualizer
+
+## v2.23.5 (2021-01-18)
+
+### Bug Fixes and Other Changes
+
+ * increase time allowed for trial components to index
+
+## v2.23.4.post0 (2021-01-14)
+
+### Documentation Changes
+
+ * update predict_fn implementation for PyTorch EIA 1.5.1
+
+## v2.23.4 (2021-01-13)
+
+### Bug Fixes and Other Changes
+
+ * remove captureWarninig setting
+
+## v2.23.3 (2021-01-12)
+
+### Bug Fixes and Other Changes
+
+ * improve optional dependency error message
+ * add debugger rule container account in PDT
+ * assert step execution first in pipeline test
+ * add service inserted fields to generated Hive DDL
+
+### Documentation Changes
+
+ * fix description for max_wait
+ * use correct classpath in V2 alias documentation.
+ * Bad arg name in feat-store ingestion manager
+
+## v2.23.2 (2021-01-06)
+
+### Bug Fixes and Other Changes
+
+ * remove shell=True in subprocess.check_output
+ * use SecurityConfig dict key
+
+### Documentation Changes
+
+ * remove D212 from ignore to comply with PEP257 standards
+
+## v2.23.1 (2020-12-29)
+
+### Bug Fixes and Other Changes
+
+ * update git utils temp file
+ * Allow online store only FeatureGroups
+
+### Documentation Changes
+
+ * inform contributors when not to mark integration tests as canaries
+ * adding change log for smd model parallel
+
+## v2.23.0 (2020-12-23)
+
+### Features
+
+ * Add support for actions in debugger rules.
+
+### Bug Fixes and Other Changes
+
+ * include sparkml 2.4 in image uri config properly
+ * Mount metadata dir only if it exists
+ * allow urllib3 1.26
+
+## v2.22.0 (2020-12-22)
+
+### Features
+
+ * Support local mode for Amazon SageMaker Processing jobs
+
+### Bug Fixes and Other Changes
+
+ * Add API enhancements for SMP
+ * adjust naming convention; fix links
+ * lower value used in featurestore test
+
+### Documentation Changes
+
+ * Update GTDD instructions
+
+## v2.21.0 (2020-12-21)
+
+### Features
+
+ * remove D205 to enable PEP257 Docstring Conventions
+
+### Bug Fixes and Other Changes
+
+ * Pin smdebug-rulesconfig to 1.0.0
+ * use itertuples to ingest pandas dataframe to FeatureStore
+
+## v2.20.0 (2020-12-16)
+
+### Features
+
+ * add dataset definition support for processing jobs
+
+### Bug Fixes and Other Changes
+
+ * include workflow integ tests with clarify and debugger enabled
+ * only run DataParallel and EdgePackaging tests in supported regions
+
+### Documentation Changes
+
+ * fix smp code example, add note for CUDA 11 to sdp
+ * adding note about CUDA 11 to SMP. Small title update PyTorch
+
 ## v2.19.0 (2020-12-08)
 
 ### Features

CONTRIBUTING.md

Lines changed: 1 addition & 0 deletions
@@ -111,6 +111,7 @@ If you are writing or modifying a test that creates a SageMaker job (training, t
 1. Run all the unit tests as per [Run the Unit Tests](#run-the-unit-tests), and verify that all checks and tests pass.
 1. Note that this also runs tools that may be necessary for the automated build to pass (ex: code reformatting by 'black').
 1. If your changes include documentation changes, please see the [Documentation Guidelines](#documentation-guidelines).
+1. If you include integration tests, do not mark them as canaries if they will not run in all regions.
 
 
 ### Commit Your Change

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-2.19.1.dev0
+2.24.6.dev0

buildspec-deploy.yml

Lines changed: 10 additions & 1 deletion
@@ -3,7 +3,16 @@ version: 0.2
 phases:
   build:
     commands:
-      - PACKAGE_FILE="$CODEBUILD_SRC_DIR_ARTIFACT_1/sagemaker-*.tar.gz"
+      # prepare the release (update versions, changelog etc.)
+      - git-release --prepare
+
+      # generate the distribution package
+      - python3 setup.py sdist
+
+      # publish the release to github
+      - git-release --publish
+
+      - PACKAGE_FILE="dist/sagemaker-*.tar.gz"
       - PYPI_USER=$(aws secretsmanager get-secret-value --secret-id /codebuild/pypi/user --query SecretString --output text)
       - PYPI_PASSWORD=$(aws secretsmanager get-secret-value --secret-id /codebuild/pypi/password --query SecretString --output text)
       - GPG_PRIVATE_KEY=$(aws secretsmanager get-secret-value --secret-id /codebuild/gpg/private_key --query SecretString --output text)
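With this change the deploy job no longer consumes the `ARTIFACT_1` output of the release build; it prepares the release, builds the sdist itself, and resolves `PACKAGE_FILE` from the local `dist/` directory via a shell glob. A small sketch of how that wildcard resolves, using a made-up version string in place of whatever `python3 setup.py sdist` actually produces:

```python
# Sketch: resolving PACKAGE_FILE="dist/sagemaker-*.tar.gz" once an sdist
# has been written into dist/. The version string below is hypothetical.
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    dist_dir = os.path.join(workdir, "dist")
    os.makedirs(dist_dir)
    # Stand-in for the artifact that `python3 setup.py sdist` would produce.
    open(os.path.join(dist_dir, "sagemaker-2.24.6.dev0.tar.gz"), "w").close()

    # The shell expands the wildcard the same way glob does here.
    matches = glob.glob(os.path.join(dist_dir, "sagemaker-*.tar.gz"))
    package_file = matches[0] if matches else None
```

The glob only matches after the sdist step has run, which is why the build/publish commands were moved ahead of the `PACKAGE_FILE` assignment.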

buildspec-release.yml

Lines changed: 1 addition & 16 deletions
@@ -3,9 +3,6 @@ version: 0.2
 phases:
   build:
     commands:
-      # prepare the release (update versions, changelog etc.)
-      - git-release --prepare
-
       # run linters
       - tox -e flake8,pylint
 
@@ -21,16 +18,4 @@ phases:
         tox -e py36,py37,py38 -- tests/unit
 
       # run a subset of the integration tests
-      - IGNORE_COVERAGE=- tox -e py36 -- tests/integ -m canary_quick -n 64 --boxed --reruns 2
-
-      # generate the distribution package
-      - python3 setup.py sdist
-
-      # publish the release to github
-      - git-release --publish
-
-artifacts:
-  files:
-    - dist/sagemaker-*.tar.gz
-  name: ARTIFACT_1
-  discard-paths: yes
+      - IGNORE_COVERAGE=- tox -e py36 -- tests/integ -m "not (local_mode or slow_test)" -n 32 --boxed --reruns 2

buildspec-slowtests.yml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+version: 0.2
+
+phases:
+  pre_build:
+    commands:
+      - start-dockerd
+
+  build:
+    commands:
+      - IGNORE_COVERAGE=-
+
+      # slow tests
+      - start_time=`date +%s`
+      - execute-command-if-has-matching-changes "tox -e py38 -- tests/integ -m slow_test -n 16 --durations 0" "tests/integ" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "setup.py" "setup.cfg" "buildspec-slowtests.yml"
+      - ./ci-scripts/displaytime.sh 'py38 slow tests' $start_time

buildspec.yml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ phases:
 
       - start_time=`date +%s`
       - |
-        execute-command-if-has-matching-changes "env -u AWS_DEFAULT_REGION tox -e py38 -- tests/integ -m \"not local_mode and not cron\" -n 384 --reruns 3 --reruns-delay 15 --durations 50 --boto-config '{\"region_name\": \"us-east-2\"}'" "tests/integ" "tests/scripts" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "src/sagemaker/image_uri_config/*.json" "setup.py" "setup.cfg" "buildspec.yml"
+        execute-command-if-has-matching-changes "env -u AWS_DEFAULT_REGION tox -e py38 -- tests/integ -m \"not local_mode and not cron and not slow_test\" -n 384 --reruns 3 --reruns-delay 15 --durations 50 --boto-config '{\"region_name\": \"us-east-2\"}'" "tests/integ" "tests/scripts" "tests/data" "tests/conftest.py" "tests/__init__.py" "src/*.py" "src/sagemaker/image_uri_config/*.json" "setup.py" "setup.cfg" "buildspec.yml"
       - ./ci-scripts/displaytime.sh 'py38 tests/integ' $start_time
 
 post_build:
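Both this file and buildspec-slowtests.yml gate their test command on `execute-command-if-has-matching-changes`, which takes a command followed by a list of path patterns and runs the command only when the changed files match one of them. That script is not part of this diff, so the sketch below only illustrates the matching idea with `fnmatch`; `has_matching_changes` is a stand-in name, not the real helper:

```python
# Illustrative stand-in for the matching logic behind
# execute-command-if-has-matching-changes: decide whether any changed
# file falls under any of the configured path patterns.
from fnmatch import fnmatch

def has_matching_changes(changed_files, patterns):
    """Return True if any changed file matches any pattern."""
    for path in changed_files:
        for pattern in patterns:
            # Patterns may be directory prefixes ("tests/integ"),
            # exact files ("setup.py"), or globs ("src/*.py").
            if path == pattern or path.startswith(pattern + "/") or fnmatch(path, pattern):
                return True
    return False

patterns = ["tests/integ", "tests/data", "tests/conftest.py", "src/*.py",
            "setup.py", "setup.cfg", "buildspec-slowtests.yml"]
```

Under this logic a change to `src/version.py` (matches `src/*.py`) or `tests/integ/test_foo.py` (under `tests/integ`) triggers the tests, while a docs-only change such as `doc/index.rst` does not.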

doc/api/training/smd_data_parallel_pytorch.rst renamed to doc/api/training/sdp_versions/v1.0.0/smd_data_parallel_pytorch.rst

Lines changed: 22 additions & 22 deletions
@@ -1,6 +1,6 @@
-####################
-PyTorch Guide to SDP
-####################
+##############################################################
+PyTorch Guide to SageMaker's distributed data parallel library
+##############################################################
 
 .. admonition:: Contents
 
@@ -13,16 +13,16 @@ Modify a PyTorch training script to use SageMaker data parallel
 ======================================================================
 
 The following steps show you how to convert a PyTorch training script to
-utilize SageMaker Distributed Data Parallel (SDP).
+utilize SageMaker's distributed data parallel library.
 
-The SDP APIs are designed to be close to PyTorch Distributed Data
-Parallel (DDP) APIs. Please see `SageMaker Distributed Data Parallel
-PyTorch API documentation <http://#>`__ for additional details on each
-API SDP offers for PyTorch.
+The distributed data parallel library APIs are designed to be close to PyTorch Distributed Data
+Parallel (DDP) APIs.
+See `SageMaker distributed data parallel PyTorch examples <https://sagemaker-examples.readthedocs.io/en/latest/training/distributed_training/index.html#pytorch-distributed>`__ for additional details on how to implement the data parallel library
+API offered for PyTorch.
 
 
-- First import SDP’s PyTorch client and initialize it. You also import
-  the SDP module for distributed training.
+- First import the distributed data parallel library’s PyTorch client and initialize it. You also import
+  the distributed data parallel library module for distributed training.
 
 .. code:: python
 
@@ -33,7 +33,7 @@ API SDP offers for PyTorch.
     dist.init_process_group()
 
 
-- Pin each GPU to a single SDP process with ``local_rank`` - this
+- Pin each GPU to a single distributed data parallel library process with ``local_rank`` - this
   refers to the relative rank of the process within a given node.
   ``smdistributed.dataparallel.torch.get_local_rank()`` API provides
   you the local rank of the device. The leader node will be rank 0, and
@@ -45,12 +45,12 @@ API SDP offers for PyTorch.
     torch.cuda.set_device(dist.get_local_rank())
 
 
-- Then wrap the PyTorch model with SDP’s DDP.
+- Then wrap the PyTorch model with the distributed data parallel library’s DDP.
 
 .. code:: python
 
     model = ...
-    # Wrap model with SDP DistributedDataParallel
+    # Wrap model with SageMaker's DistributedDataParallel
     model = DDP(model)
 
 
@@ -82,17 +82,17 @@ API SDP offers for PyTorch.
 
 
 All put together, the following is an example PyTorch training script
-you will have for distributed training with SDP:
+you will have for distributed training with the distributed data parallel library:
 
 .. code:: python
 
-    # SDP: Import SDP PyTorch API
+    # Import distributed data parallel library PyTorch API
     import smdistributed.dataparallel.torch.distributed as dist
 
-    # SDP: Import SDP PyTorch DDP
+    # Import distributed data parallel library PyTorch DDP
     from smdistributed.dataparallel.torch.parallel.distributed import DistributedDataParallel as DDP
 
-    # SDP: Initialize SDP
+    # Initialize distributed data parallel library
    dist.init_process_group()
 
    class Net(nn.Module):
@@ -109,25 +109,25 @@ you will have for distributed training with SDP:
 
    def main():
 
-        # SDP: Scale batch size by world size
+        # Scale batch size by world size
        batch_size //= dist.get_world_size() // 8
        batch_size = max(batch_size, 1)
 
        # Prepare dataset
        train_dataset = torchvision.datasets.MNIST(...)
 
-        # SDP: Set num_replicas and rank in DistributedSampler
+        # Set num_replicas and rank in DistributedSampler
        train_sampler = torch.utils.data.distributed.DistributedSampler(
                train_dataset,
                num_replicas=dist.get_world_size(),
                rank=dist.get_rank())
 
        train_loader = torch.utils.data.DataLoader(..)
 
-        # SDP: Wrap the PyTorch model with SDP’s DDP
+        # Wrap the PyTorch model with distributed data parallel library’s DDP
        model = DDP(Net().to(device))
 
-        # SDP: Pin each GPU to a single SDP process.
+        # Pin each GPU to a single distributed data parallel library process.
        torch.cuda.set_device(local_rank)
        model.cuda(local_rank)
@@ -140,7 +140,7 @@ you will have for distributed training with SDP:
            test(...)
        scheduler.step()
 
-    # SDP: Save model on master node.
+    # Save model on master node.
    if dist.get_rank() == 0:
        torch.save(...)
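Two pieces of the training script in the diff above can be checked without GPUs or ``smdistributed`` installed: the batch-size line divides the global batch size by ``dist.get_world_size() // 8`` (presumably the number of nodes, assuming 8 GPUs per node), and ``DistributedSampler`` with ``num_replicas`` and ``rank`` gives each process a disjoint shard of the dataset. A dependency-free sketch of both ideas; the strided split only mirrors what ``DistributedSampler`` does conceptually (the real sampler also shuffles and pads so all shards are equal length):

```python
# Conceptual sketch of DistributedSampler's strided sharding and of the
# batch-size scaling line from the example script above.
def shard_indices(dataset_len, num_replicas, rank):
    """Return the dataset indices assigned to one replica (strided split)."""
    return list(range(rank, dataset_len, num_replicas))

def scale_batch_size(batch_size, world_size, gpus_per_node=8):
    """Mirror `batch_size //= dist.get_world_size() // 8`: divide the
    global batch size by the (assumed) number of nodes, floor at 1."""
    nodes = world_size // gpus_per_node
    return max(batch_size // nodes, 1)

world_size = 16  # e.g. two hypothetical 8-GPU instances
replicas = world_size // 8
shards = [shard_indices(10, replicas, rank) for rank in range(replicas)]
```

With a 10-element dataset and two replicas, rank 0 gets the even indices and rank 1 the odd ones, so together the shards cover the dataset exactly once per epoch.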
