Skip to content

feature: add tensorflow training 1.15.2 py37 support #1458

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
May 6, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 6 additions & 1 deletion doc/using_tf.rst
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ To train a TensorFlow model by using the SageMaker Python SDK:
Prepare a Script Mode Training Script
======================================

Your TensorFlow training script must be a Python 2.7- or 3.6-compatible source file.
Your TensorFlow training script must be a Python 2.7-, 3.6- or 3.7-compatible source file.

The training script is very similar to a training script you might run outside of SageMaker, but you can access useful properties about the training environment through various environment variables, including the following:

Expand Down Expand Up @@ -143,6 +143,11 @@ To use Script Mode, set at least one of these args
- ``py_version='py3'``
- ``script_mode=True``

To use Python 3.7, please specify both of the args:

- ``py_version='py37'``
- ``framework_version='1.15.2'``

When using Script Mode, your training script needs to accept the following args:

- ``model_dir``
Expand Down
8 changes: 6 additions & 2 deletions src/sagemaker/fw_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -62,7 +62,7 @@
"{} framework does not support version {}. Please use one of the following: {}."
)

VALID_PY_VERSIONS = ["py2", "py3"]
VALID_PY_VERSIONS = ["py2", "py3", "py37"]
VALID_EIA_FRAMEWORKS = [
"tensorflow",
"tensorflow-serving",
Expand All @@ -71,6 +71,7 @@
"pytorch-serving",
]
PY2_RESTRICTED_EIA_FRAMEWORKS = ["pytorch-serving"]
PY37_SUPPORTED_FRAMEWORKS = ["tensorflow-scriptmode"]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this? should we just return image not found if the given combination (Framework, Framework version, Python version) is not support?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we strip this line (L74) Python SDK will have to hit ECR service to find out if the combination exists.

If we do nothing to check for the py37 support before constructing the image uri, here will throw KeyError (key "py37" not found).

Updating the MERGED_FRAMEWORKS_LOWEST_VERSIONS dict to let each entry have a "py_version" key is also doable to prevent Python SDK hit the ECR service, but I think this dict is for checking whether the image is DLC merged frameworks instead of checking Python version support.

VALID_ACCOUNTS_BY_REGION = {
"us-gov-west-1": "246785580436",
"us-iso-east-1": "744548109606",
Expand Down Expand Up @@ -103,7 +104,7 @@
}

MERGED_FRAMEWORKS_LOWEST_VERSIONS = {
"tensorflow-scriptmode": {"py3": [1, 13, 1], "py2": [1, 14, 0]},
"tensorflow-scriptmode": {"py3": [1, 13, 1], "py2": [1, 14, 0], "py37": [1, 15, 2]},
"tensorflow-serving": [1, 13, 0],
"tensorflow-serving-eia": [1, 14, 0],
"mxnet": {"py3": [1, 4, 1], "py2": [1, 6, 0]},
Expand Down Expand Up @@ -257,6 +258,9 @@ def create_image_uri(
if py_version and py_version not in VALID_PY_VERSIONS:
raise ValueError("invalid py_version argument: {}".format(py_version))

if py_version == "py37" and framework not in PY37_SUPPORTED_FRAMEWORKS:
raise ValueError("{} does not support Python 3.7 at this time.".format(framework))

if _accelerator_type_valid_for_framework(
framework=framework,
py_version=py_version,
Expand Down
76 changes: 38 additions & 38 deletions src/sagemaker/tensorflow/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -31,45 +31,45 @@ SageMaker TensorFlow Docker containers

The latest containers include the following Python packages:

+--------------------------------+---------------+---------------+
| Dependencies | TF 1.15.2 | TF 2.1 |
+--------------------------------+---------------+---------------+
| awscli | 1.18.1 | 1.18.3 |
+--------------------------------+---------------+---------------+
| boto3 | 1.12.1 | 1.12.3 |
+--------------------------------+---------------+---------------+
| botocore | 1.15.1 | 1.15.3 |
+--------------------------------+---------------+---------------+
| h5py | 2.10.0 | 2.10.0 |
+--------------------------------+---------------+---------------+
| horovod | 0.18.2 | 0.18.2 |
+--------------------------------+---------------+---------------+
| keras | 2.3.1 | 2.3.1 |
+--------------------------------+---------------+---------------+
| mpi4py | 3.0.2 | 3.0.3 |
+--------------------------------+---------------+---------------+
| numpy | 1.18.1 | 1.18.1 |
+--------------------------------+---------------+---------------+
| pandas | 0.24.2 | 1.0.1 |
+--------------------------------+---------------+---------------+
| pip | 20.0.2 | 20.0.2 |
+--------------------------------+---------------+---------------+
| Pillow | 6.2.1 | 7.0.0 |
+--------------------------------+---------------+---------------+
| Python | 2.7 or 3.6 | 2.7 or 3.6 |
+--------------------------------+---------------+---------------+
| requests | 2.22.0 | 2.22.0 |
+--------------------------------+---------------+---------------+
| sagemaker-containers | 2.7.0 | 2.8.0 |
+--------------------------------+---------------+---------------+
| sagemaker-tensorflow-container | 1.15.0.1.1.0 | 2.0.0.1.1.0 |
+--------------------------------+---------------+---------------+
| scipy | 1.2.2 | 1.4.1 |
+--------------------------------+---------------+---------------+
| tensorflow | 1.15.2 | 2.1.0 |
+--------------------------------+---------------+---------------+
+--------------------------------+--------------------+---------------+
| Dependencies | TF 1.15.2 | TF 2.1 |
+--------------------------------+--------------------+---------------+
| awscli | 1.18.1 | 1.18.3 |
+--------------------------------+--------------------+---------------+
| boto3 | 1.12.1 | 1.12.3 |
+--------------------------------+--------------------+---------------+
| botocore | 1.15.1 | 1.15.3 |
+--------------------------------+--------------------+---------------+
| h5py | 2.10.0 | 2.10.0 |
+--------------------------------+--------------------+---------------+
| horovod | 0.18.2 | 0.18.2 |
+--------------------------------+--------------------+---------------+
| keras | 2.3.1 | 2.3.1 |
+--------------------------------+--------------------+---------------+
| mpi4py | 3.0.2 | 3.0.3 |
+--------------------------------+--------------------+---------------+
| numpy | 1.18.1 | 1.18.1 |
+--------------------------------+--------------------+---------------+
| pandas | 0.24.2 | 1.0.1 |
+--------------------------------+--------------------+---------------+
| pip | 20.0.2 | 20.0.2 |
+--------------------------------+--------------------+---------------+
| Pillow | 6.2.1 | 7.0.0 |
+--------------------------------+--------------------+---------------+
| Python | 2.7, 3.6 or 3.7 | 2.7 or 3.6 |
+--------------------------------+--------------------+---------------+
| requests | 2.22.0 | 2.22.0 |
+--------------------------------+--------------------+---------------+
| sagemaker-containers | 2.7.0 | 2.8.0 |
+--------------------------------+--------------------+---------------+
| sagemaker-tensorflow-container | 1.15.0.1.1.0 | 2.0.0.1.1.0 |
+--------------------------------+--------------------+---------------+
| scipy | 1.2.2 | 1.4.1 |
+--------------------------------+--------------------+---------------+
| tensorflow | 1.15.2 | 2.1.0 |
+--------------------------------+--------------------+---------------+

Script Mode TensorFlow Docker images support both Python 2.7 and Python 3.6. The Docker images extend Ubuntu 16.04.
Script Mode TensorFlow Docker images support Python 2.7 and Python 3.6, Python 3.7 for TensorFlow version 1.15.2. The Docker images extend Ubuntu 16.04.

You can select version of TensorFlow by passing a ``framework_version`` keyword arg to the TensorFlow Estimator constructor. Currently supported versions are listed in the table above. You can also set ``framework_version`` to only specify major and minor version, e.g ``'1.6'``, which will cause your training script to be run on the latest supported patch version of that minor version, which in this example would be 1.6.0.
Alternatively, you can build your own image by following the instructions in the SageMaker TensorFlow containers
Expand Down
27 changes: 18 additions & 9 deletions tests/integ/test_tf_script_mode.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,16 @@
TAGS = [{"Key": "some-key", "Value": "some-value"}]


def test_mnist_with_checkpoint_config(sagemaker_session, instance_type, tf_full_version):
@pytest.fixture(scope="module")
def py_version(tf_full_version):
return (
"py37" if tf_full_version == TensorFlow._LATEST_1X_VERSION else tests.integ.PYTHON_VERSION
)


def test_mnist_with_checkpoint_config(
sagemaker_session, instance_type, tf_full_version, py_version
):
checkpoint_s3_uri = "s3://{}/checkpoints/tf-{}".format(
sagemaker_session.default_bucket(), sagemaker_timestamp()
)
Expand All @@ -52,7 +61,7 @@ def test_mnist_with_checkpoint_config(sagemaker_session, instance_type, tf_full_
sagemaker_session=sagemaker_session,
script_mode=True,
framework_version=tf_full_version,
py_version=tests.integ.PYTHON_VERSION,
py_version=py_version,
metric_definitions=[{"Name": "train:global_steps", "Regex": r"global_step\/sec:\s(.*)"}],
checkpoint_s3_uri=checkpoint_s3_uri,
checkpoint_local_path=checkpoint_local_path,
Expand Down Expand Up @@ -82,7 +91,7 @@ def test_mnist_with_checkpoint_config(sagemaker_session, instance_type, tf_full_
assert actual_training_checkpoint_config == expected_training_checkpoint_config


def test_server_side_encryption(sagemaker_session, tf_full_version):
def test_server_side_encryption(sagemaker_session, tf_full_version, py_version):
with kms_utils.bucket_with_encryption(sagemaker_session, ROLE) as (bucket_with_kms, kms_key):
output_path = os.path.join(
bucket_with_kms, "test-server-side-encryption", time.strftime("%y%m%d-%H%M")
Expand All @@ -97,7 +106,7 @@ def test_server_side_encryption(sagemaker_session, tf_full_version):
sagemaker_session=sagemaker_session,
script_mode=True,
framework_version=tf_full_version,
py_version=tests.integ.PYTHON_VERSION,
py_version=py_version,
code_location=output_path,
output_path=output_path,
model_dir="/opt/ml/model",
Expand All @@ -124,14 +133,14 @@ def test_server_side_encryption(sagemaker_session, tf_full_version):


@pytest.mark.canary_quick
def test_mnist_distributed(sagemaker_session, instance_type, tf_full_version):
def test_mnist_distributed(sagemaker_session, instance_type, tf_full_version, py_version):
estimator = TensorFlow(
entry_point=SCRIPT,
role=ROLE,
train_instance_count=2,
train_instance_type=instance_type,
sagemaker_session=sagemaker_session,
py_version=tests.integ.PYTHON_VERSION,
py_version=py_version,
script_mode=True,
framework_version=tf_full_version,
distributions=PARAMETER_SERVER_DISTRIBUTION,
Expand All @@ -149,7 +158,7 @@ def test_mnist_distributed(sagemaker_session, instance_type, tf_full_version):
)


def test_mnist_async(sagemaker_session, cpu_instance_type):
def test_mnist_async(sagemaker_session, cpu_instance_type, tf_full_version, py_version):
estimator = TensorFlow(
entry_point=SCRIPT,
role=ROLE,
Expand Down Expand Up @@ -191,14 +200,14 @@ def test_mnist_async(sagemaker_session, cpu_instance_type):
_assert_model_name_match(sagemaker_session.sagemaker_client, endpoint_name, model_name)


def test_deploy_with_input_handlers(sagemaker_session, instance_type, tf_full_version):
def test_deploy_with_input_handlers(sagemaker_session, instance_type, tf_full_version, py_version):
estimator = TensorFlow(
entry_point="training.py",
source_dir=TFS_RESOURCE_PATH,
role=ROLE,
train_instance_count=1,
train_instance_type=instance_type,
py_version=tests.integ.PYTHON_VERSION,
py_version=py_version,
sagemaker_session=sagemaker_session,
script_mode=True,
framework_version=tf_full_version,
Expand Down
18 changes: 18 additions & 0 deletions tests/unit/test_fw_utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -297,6 +297,24 @@ def test_create_image_uri_cn_northwest_1():
}


def test_create_image_uri_py37_invalid_framework():
error_message = "{} does not support Python 3.7 at this time.".format(MOCK_FRAMEWORK)

with pytest.raises(ValueError) as error:
fw_utils.create_image_uri(REGION, MOCK_FRAMEWORK, "ml.m4.xlarge", "1.4.0", "py37")
assert error_message in str(error)


def test_create_image_uri_py37():
image_uri = fw_utils.create_image_uri(
REGION, "tensorflow-scriptmode", "ml.m4.xlarge", "1.15.2", "py37"
)
assert (
image_uri
== "763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-cpu-py37"
)


def test_tf_eia_images():
image_uri = fw_utils.create_image_uri(
"us-west-2",
Expand Down