
change: move script mode branch to master #234

Merged

merged 87 commits into aws:master from mvs-script-mode-to-master on Sep 19, 2019

Conversation

mvsusp
Contributor

@mvsusp mvsusp commented Sep 19, 2019

Description of changes:

  • move script mode branch to master

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

icywang86rui and others added 30 commits September 27, 2018 10:23
* Script mode with CPU py2 and py3 Dockerfiles

* Migrate to sagemaker-containers 2.1
* Remove serving related packages and code from container
* Add py3 container
* Add integ and unit tests for script mode
* Remove non-ASCII characters from README

* Changes based on pr comments

* Move conftest to test root dir

* Add default values for test args

* add docker-compose to test requirement
* Add tox.ini and configure coverage and flake runs

* Add more unit tests
* Configure unit tests to run with both py2 and py3
* Add flake checks
* Fix broken integ tests

* Add import style check

* Add .flake8

* Add source module in coverage command

* Add newlines
* Add mnist sagemaker tests

* Use account-id instead of ecr-image

* Merge gpu and cpu sagemaker tests

* remove _run_mnist_training
* Add Script Mode example
* Add benchmarking script
* edited tf script mode notebook
* Implement distributed support

* Launch a parameter server if the user sets sagemaker_parameter_server_enabled to True (see the sketch after this commit list)
* Add integ tests
* Add unit tests
* Add distributed sagemaker integ test
* Add 1.11.0 and modify Dockerfile to reduce image size
* Add CI configuration files
* Setting S3 environment variables before training starts

* Remove S3 environment variable setting in test training script

* Add unit tests
* Force framework libraries to re-install
* Update sagemaker containers
* Unset CUDA_VISIBLE_DEVICES for worker processes

* Add comments
The tests all passed; not sure why the SageMaker tests are not reporting success.
* Add Keras support
* Create parameter server in different thread
* Fixing some integ tests
This test is only configured to run with 'local'. Change it to use the correct instance type accordingly.
This change is needed for the container release; it only disables tests.
* Add S3 plugin tests

TensorFlow's S3 plugin doesn't work well with S3's eventual consistency model, so
we have seen training jobs fail when exporting checkpoints or models to S3.
Recently we released our prod containers with an S3 plugin patch, which should
reduce or eliminate such errors.

The added test writes a checkpoint to S3 after every training step; it fails
with vanilla TensorFlow (a sketch of such a test follows this commit list).

* Remove distributed_mnist.py

* Fix line too long
This test shouldn't save checkpoints, since the two hosts are just running
training jobs independently and the checkpoints interfere with each other.
Change the test to use the Keras mnist script here.

This change also moves the saved model path to /opt/ml/opt so we can use
the estimator.model_data path to assert that the model exists.
* Use the test argument framework_version in all tests

* Make flake8 happy
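
A hedged illustration of the distributed-training commits above: in the SageMaker Python SDK of that era (1.x), a script mode TensorFlow estimator requests a parameter server through the distributions argument, which the container sees as the sagemaker_parameter_server_enabled hyperparameter. The entry point, role ARN, instance settings, and S3 locations below are placeholders, not values from this PR.

from sagemaker.tensorflow import TensorFlow

# Sketch only: request a parameter server for a two-host script mode job.
estimator = TensorFlow(
    entry_point='mnist.py',                               # placeholder script
    role='arn:aws:iam::123456789012:role/SageMakerRole',  # placeholder role
    train_instance_count=2,
    train_instance_type='ml.c5.xlarge',
    framework_version='1.11.0',
    py_version='py3',
    script_mode=True,
    # Maps to the sagemaker_parameter_server_enabled hyperparameter that the
    # container checks before launching a parameter server.
    distributions={'parameter_server': {'enabled': True}},
)
estimator.fit('s3://my-bucket/mnist-data')                # placeholder input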
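
The S3 plugin test itself is not shown in this conversation. Below is a minimal sketch, assuming a TF 1.x Estimator, of forcing a checkpoint write after every training step against an S3 model_dir; the model, input data, and bucket name are placeholders.

import numpy as np
import tensorflow as tf

def model_fn(features, labels, mode):
    # Minimal linear model so the sketch is self-contained.
    predictions = tf.layers.dense(features['x'], 1)
    loss = tf.losses.mean_squared_error(labels, predictions)
    train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

def input_fn():
    x = np.random.rand(64, 1).astype(np.float32)
    return {'x': tf.constant(x)}, tf.constant(2 * x)

# save_checkpoints_steps=1 forces a checkpoint after every step, which
# exercises the S3 filesystem plugin with frequent, closely spaced writes.
config = tf.estimator.RunConfig(
    model_dir='s3://my-bucket/checkpoint-test',  # placeholder bucket
    save_checkpoints_steps=1,
)
tf.estimator.Estimator(model_fn, config=config).train(input_fn, steps=10)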
@sagemaker-bot
Collaborator

AWS CodeBuild CI Report

  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@mvsusp mvsusp changed the title [WIP] change: move script mode branch to master change: move script mode branch to master Sep 19, 2019
@mvsusp mvsusp requested a review from chuyang-deng September 19, 2019 18:05
@mvsusp mvsusp merged commit 12fd7ef into aws:master Sep 19, 2019
@mvsusp mvsusp deleted the mvs-script-mode-to-master branch September 19, 2019 18:11
# If the training job is part of the multiple training jobs for tuning, we need to append the training job name to
# model_dir in case they read from/write to the same object
if '_tuning_objective_metric' in hyperparameters:
    model_dir = _model_dir_with_training_job(hyperparameters.get('model_dir'), env.job_name)

It seems that hyperparameters does not have 'model_dir' while running my tuning job. Should hyperparameters here be user_hyperparameters instead? @mvsusp
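
_model_dir_with_training_job itself is not shown in this conversation; a minimal sketch of the behavior the code comment describes (appending the training job name so concurrent tuning jobs do not read from or write to the same model_dir) could look like the following. This is an illustration, not the toolkit's actual implementation.

def _model_dir_with_training_job(model_dir, job_name):
    # Illustration only: give each training job in a tuning run its own
    # prefix under the shared model_dir.
    if model_dir:
        return '{}/{}'.format(model_dir.rstrip('/'), job_name)
    return model_dir

# e.g. _model_dir_with_training_job('s3://my-bucket/model', 'tuning-job-001')
# returns 's3://my-bucket/model/tuning-job-001'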

10 participants