Skip to content

feature: add tensorflow training 1.15.2 py37 support #1458

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
May 6, 2020
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion doc/using_tf.rst
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ To train a TensorFlow model by using the SageMaker Python SDK:
Prepare a Script Mode Training Script
======================================

Your TensorFlow training script must be a Python 2.7- or 3.6-compatible source file.
Your TensorFlow training script must be a Python 2.7-, 3.6- or 3.7-compatible source file.

The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following:

Expand Down Expand Up @@ -143,6 +143,11 @@ To use Script Mode, set at least one of these args
- ``py_version='py3'``
- ``script_mode=True``

To use Python 3.7, please specify both of the args:

- ``py_version='py37'``
- ``framework_version='1.15.2'``

When using Script Mode, your training script needs to accept the following args:

- ``model_dir``
Expand Down
8 changes: 6 additions & 2 deletions src/sagemaker/fw_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@
"{} framework does not support version {}. Please use one of the following: {}."
)

VALID_PY_VERSIONS = ["py2", "py3"]
VALID_PY_VERSIONS = ["py2", "py3", "py37"]
VALID_EIA_FRAMEWORKS = [
"tensorflow",
"tensorflow-serving",
Expand All @@ -71,6 +71,7 @@
"pytorch-serving",
]
PY2_RESTRICTED_EIA_FRAMEWORKS = ["pytorch-serving"]
PY37_SUPPORTED_FRAMEWORKS = ["tensorflow-scriptmode"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this? should we just return image not found if the given combination (Framework, Framework version, Python version) is not support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we strip this line (L74) Python SDK will have to hit ECR service to find out if the combination exists.

If we do nothing to check for the py37 support before constructing the image uri, here will throw KeyError (key "py37" not found).

Updating the MERGED_FRAMEWORKS_LOWEST_VERSIONS dict to let each entry have a "py_version" key is also doable to prevent Python SDK hit the ECR service, but I think this dict is for checking whether the image is DLC merged frameworks instead of checking Python version support.

VALID_ACCOUNTS_BY_REGION = {
"us-gov-west-1": "246785580436",
"us-iso-east-1": "744548109606",
Expand Down Expand Up @@ -103,7 +104,7 @@
}

MERGED_FRAMEWORKS_LOWEST_VERSIONS = {
"tensorflow-scriptmode": {"py3": [1, 13, 1], "py2": [1, 14, 0]},
"tensorflow-scriptmode": {"py3": [1, 13, 1], "py2": [1, 14, 0], "py37": [1, 15, 2]},
"tensorflow-serving": [1, 13, 0],
"tensorflow-serving-eia": [1, 14, 0],
"mxnet": {"py3": [1, 4, 1], "py2": [1, 6, 0]},
Expand Down Expand Up @@ -257,6 +258,9 @@ def create_image_uri(
if py_version and py_version not in VALID_PY_VERSIONS:
raise ValueError("invalid py_version argument: {}".format(py_version))

if py_version == "py37" and framework not in PY37_SUPPORTED_FRAMEWORKS:
raise ValueError("{} does not support Python 3.7 yet.".format(framework))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe s/yet/at this time. There might some some frameworks that never receive Python 3.7 support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


if _accelerator_type_valid_for_framework(
framework=framework,
py_version=py_version,
Expand Down
2 changes: 1 addition & 1 deletion src/sagemaker/tensorflow/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -69,7 +69,7 @@ The latest containers include the following Python packages:
| tensorflow | 1.15.2 | 2.1.0 |
+--------------------------------+---------------+---------------+

Script Mode TensorFlow Docker images support both Python 2.7 and Python 3.6. The Docker images extend Ubuntu 16.04.
Script Mode TensorFlow Docker images support Python 2.7 and Python 3.6, Python 3.7 for TensorFlow version 1.15.2. The Docker images extend Ubuntu 16.04.

You can select version of TensorFlow by passing a ``framework_version`` keyword arg to the TensorFlow Estimator constructor. Currently supported versions are listed in the table above. You can also set ``framework_version`` to only specify major and minor version, e.g ``'1.6'``, which will cause your training script to be run on the latest supported patch version of that minor version, which in this example would be 1.6.0.
Alternatively, you can build your own image by following the instructions in the SageMaker TensorFlow containers
Expand Down
4 changes: 3 additions & 1 deletion tests/integ/test_tf_script_mode.py
Original file line number Diff line number Diff line change
Expand Up @@ -131,7 +131,9 @@ def test_mnist_distributed(sagemaker_session, instance_type, tf_full_version):
train_instance_count=2,
train_instance_type=instance_type,
sagemaker_session=sagemaker_session,
py_version=tests.integ.PYTHON_VERSION,
py_version="py37"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about other places in this file where py_veresion=tests.integ.PYTHON_VERSION?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was going to update them all once the container is released. But probably it's better to update them all at once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm...maybe it would be cleaner to put this into a fixture? i.e.

@pytest.fixture
def py_version():
    return "py37" if tf_full_version == TensorFlow._LATEST_1X_VERSION else tests.integ.PYTHON_VERSION

and then just use py_version in all the tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Yes. This would be much better.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I think you can leave it in the TF script mode test file, since that's the only file that uses it, and the if condition is rather TF-specific

if tf_full_version == TensorFlow._LATEST_1X_VERSION
else tests.integ.PYTHON_VERSION,
script_mode=True,
framework_version=tf_full_version,
distributions=PARAMETER_SERVER_DISTRIBUTION,
Expand Down
18 changes: 18 additions & 0 deletions tests/unit/test_fw_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -297,6 +297,24 @@ def test_create_image_uri_cn_northwest_1():
}


def test_create_image_uri_py37_invalid_framework():
error_message = "{} does not support Python 3.7 yet.".format(MOCK_FRAMEWORK)

with pytest.raises(ValueError) as error:
fw_utils.create_image_uri(REGION, MOCK_FRAMEWORK, "ml.m4.xlarge", "1.4.0", "py37")
assert error_message in str(error)


def test_create_image_uri_py37():
image_uri = fw_utils.create_image_uri(
REGION, "tensorflow-scriptmode", "ml.m4.xlarge", "1.15.2", "py37"
)
assert (
image_uri
== "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-cpu-py37"
)


def test_tf_eia_images():
image_uri = fw_utils.create_image_uri(
"us-west-2",
Expand Down