[SageMaker Data Parallel] Upgrade the TF2 version to 2.4.1 and also update the py_version to 3.7 #2069

Merged 3 commits on Mar 10, 2021
@@ -1,6 +1,6 @@
ARG region

-FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04
+FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04

RUN pip --no-cache-dir --no-cache install \
scikit-learn==0.23.1 \
@@ -8,7 +8,7 @@
"\n",
"SMDataParallel is a new capability in Amazon SageMaker to train deep learning models faster and cheaper. SMDataParallel is a distributed data parallel training framework for TensorFlow, PyTorch, and MXNet.\n",
"\n",
-"This notebook example shows how to use SMDataParallel with TensorFlow(version 2.3.1) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train a BERT model using [Amazon FSx for Lustre file-system](https://aws.amazon.com/fsx/lustre/) as data source.\n",
+"This notebook example shows how to use SMDataParallel with TensorFlow(version 2.4.1) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train a BERT model using [Amazon FSx for Lustre file-system](https://aws.amazon.com/fsx/lustre/) as data source.\n",
"\n",
"The outline of steps is as follows:\n",
"\n",
@@ -244,8 +244,8 @@
" role=role,\n",
" image_uri=docker_image,\n",
" source_dir='deep-learning-models/models/nlp',\n",
-" framework_version='2.3.1',\n",
-" py_version='py3',\n",
+" framework_version='2.4.1',\n",
+" py_version='py37',\n",
" instance_count=instance_count,\n",
" instance_type=instance_type,\n",
" sagemaker_session=sagemaker_session,\n",
@@ -315,4 +315,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
-}
+}
@@ -1,6 +1,6 @@
ARG region

-FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.3.1-gpu-py37-cu110-ubuntu18.04
+FROM 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04

RUN pip --no-cache-dir --no-cache install \
Cython \
@@ -8,7 +8,7 @@
"\n",
"SMDataParallel is a new capability in Amazon SageMaker to train deep learning models faster and cheaper. SMDataParallel is a distributed data parallel training framework for TensorFlow, PyTorch, and MXNet.\n",
"\n",
-"This notebook example shows how to use SMDataParallel with TensorFlow(version 2.3.1) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train a MaskRCNN model on [COCO 2017 dataset](https://cocodataset.org/#home) using [Amazon FSx for Lustre file-system](https://aws.amazon.com/fsx/lustre/) as data source.\n",
+"This notebook example shows how to use SMDataParallel with TensorFlow(version 2.4.1) on [Amazon SageMaker](https://aws.amazon.com/sagemaker/) to train a MaskRCNN model on [COCO 2017 dataset](https://cocodataset.org/#home) using [Amazon FSx for Lustre file-system](https://aws.amazon.com/fsx/lustre/) as data source.\n",
"\n",
"The outline of steps is as follows:\n",
"\n",
@@ -238,8 +238,8 @@
" role=role,\n",
" image_uri=docker_image,\n",
" source_dir='.',\n",
-" framework_version='2.3.1',\n",
-" py_version='py3',\n",
+" framework_version='2.4.1',\n",
+" py_version='py37',\n",
" instance_count=instance_count,\n",
" instance_type=instance_type,\n",
" sagemaker_session=sagemaker_session,\n",
@@ -323,4 +323,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
-}
+}
@@ -104,7 +104,7 @@
" entry_point='train_tensorflow_smdataparallel_mnist.py',\n",
" role=role,\n",
" py_version='py37',\n",
-" framework_version='2.3.1',\n",
+" framework_version='2.4.1',\n",
" # For training with multinode distributed training, set this count. Example: 2\n",
" instance_count=2,\n",
" # For training with p3dn instance use - ml.p3dn.24xlarge, with p4dn instance use - ml.p4d.24xlarge\n",
@@ -170,4 +170,4 @@
},
"nbformat": 4,
"nbformat_minor": 4
-}
+}
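
Review note: the change keeps the estimator arguments (`framework_version`, `py_version`) in step with the tag of the base image used in the Dockerfiles. A minimal sketch of how that consistency could be checked is below. The helper names `parse_image_tag` and `check_estimator_versions` are hypothetical and not part of this PR or of the SageMaker SDK, and the tag layout is an assumption inferred only from the two image URIs that appear in this diff.

```python
import re

# Assumed tag layout, inferred from the image URIs in this diff:
#   <framework_version>-<device>-<py_version>[-<cuda>]-<os>
TAG_PATTERN = re.compile(
    r"^(?P<framework_version>\d+\.\d+\.\d+)"
    r"-(?P<device>cpu|gpu)"
    r"-(?P<py_version>py\d+)"
    r"(?:-(?P<cuda>cu\d+))?"
    r"-(?P<os>.+)$"
)


def parse_image_tag(image_uri):
    """Split the tag of a container image URI into named fields (hypothetical helper)."""
    _, _, tag = image_uri.rpartition(":")
    match = TAG_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognized image tag: {tag!r}")
    return match.groupdict()


def check_estimator_versions(image_uri, framework_version, py_version):
    """Return a list of mismatches between the image tag and the estimator arguments."""
    fields = parse_image_tag(image_uri)
    problems = []
    if fields["framework_version"] != framework_version:
        problems.append(
            f"framework_version {framework_version!r} != image tag {fields['framework_version']!r}"
        )
    if fields["py_version"] != py_version:
        problems.append(
            f"py_version {py_version!r} != image tag {fields['py_version']!r}"
        )
    return problems


# Base image from the Dockerfiles after this change.
BASE_IMAGE = (
    "763104351884.dkr.ecr.us-west-2.amazonaws.com/"
    "tensorflow-training:2.4.1-gpu-py37-cu110-ubuntu18.04"
)

# Estimator values after this change: no mismatches.
print(check_estimator_versions(BASE_IMAGE, "2.4.1", "py37"))  # → []

# Estimator values before this change: both arguments are flagged.
for problem in check_estimator_versions(BASE_IMAGE, "2.3.1", "py3"):
    print(problem)
```

The check only compares strings against the base image tag; it does not call SageMaker and makes no claim about how the SDK itself validates these arguments when `image_uri` is supplied.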