Upgrade to TensorFlow 2.9 #1182
DO_NOT_SUBMIT: Wait until new base image with TensorFlow 2.9.1 is out. http://b/207851560
http://b/238238619
@kkraus14, do you know why RapidsAI doesn't support CUDA 11.3? If not, do you know who at NVIDIA might know?
@rosbo I am no longer an NVIDIA employee, but I believe the RAPIDS libraries now require CUDA 11.5+ to build but can run on CUDA 11.0+ using CUDA Enhanced Compatibility. The conda packages have a constraint of |
@rosbo Recently there will be If the final release ( Also, please take care of TPU tensorflow as well. It has been |
Hi @innat, going from rc0 to stable usually takes several weeks. We want to wait until 2.10 stable is out before doing that upgrade. We also need to wait until the Google Cloud Deep Learning Container project releases a new container image with 2.10 stable, since this is the base image we use. We are aiming to release 2.9.1 as soon as possible, so we won't wait for 2.10. We will try our best to upgrade to 2.10 as soon as our base image dependency releases a stable version for it. For TPU support, we will upgrade to 2.9.1 shortly after upgrading the CPU/GPU image.
Dockerfile.tmpl
Outdated
@@ -167,8 +173,9 @@ RUN pip install pysal && \
# Use `conda install -c h2oai h2o` once Python 3.7 version is released to conda.
apt-get install -y default-jre-headless && \
pip install -f https://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o && \
pip install tensorflow-gcs-config==2.6.0 && \
pip install tensorflow-addons==0.14.0 && \
pip install tensorflow-gcs-config==${TENSORFLOW_VERSION} && \
No version for 2.9.0 but it has one for 2.9.1. Use 2.9.1 once a new base image is released.
This is the newest version that supports both CUDA 11.X & Python 3.7
Dockerfile.tmpl
Outdated
pip install tensorflow-addons==0.14.0 && \
pip install tensorflow-gcs-config==${TENSORFLOW_VERSION} && \
# TODO(b/207851560) Upgrade to 0.17.1 once the base image with TensorFlow 2.9.1 is out.
pip install tensorflow-addons==0.17.0 && \
pip install tensorflow_decision_forests==0.2.0 && \
With TensorFlow 2.9.1, you can use tensorflow_decision_forests==0.2.7: https://github.com/tensorflow/decision-forests/blob/af7633447acc2146ddbf0547165098ff592711e6/configure/setup.py#L31
Dockerfile.tmpl
Outdated
pip install tensorflow-addons==0.14.0 && \
pip install tensorflow-gcs-config==${TENSORFLOW_VERSION} && \
# TODO(b/207851560) Upgrade to 0.17.1 once the base image with TensorFlow 2.9.1 is out.
pip install tensorflow-addons==0.17.0 && \
Should be upgraded to 0.17.1
Both CPU & GPU images built successfully! There are some failing tests though.
Update: allennlp is downgrading pytorch, which is causing some of the test failures. There also appear to be a few numpy binary compatibility issues (e.g. nnabla).
Still running into conflicts between test_dlib and test_catalyst: if I use LD_PRELOAD to make test_dlib pass, test_catalyst fails.
LD_PRELOAD=/opt/conda/lib/libmkl_core.so:/opt/conda/lib/libmkl_sequential.so test_catalyst.py or test_dlib.py
tensorflow-gcs-config also seems to have undefined symbol issues, despite both it and tensorflow being on v2.9.1:
NotImplementedError: unable to open file: _gcs_config_ops.so, from paths: ['/opt/conda/lib/python3.7/site-packages/tensorflow_gcs_config/_gcs_config_ops.so']
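One way to get more detail on a shared object that refuses to load is to open it directly and read the dynamic loader's message. The snippet below is only a diagnostic sketch: `try_load` is a hypothetical helper, not part of this PR, and loading the file outside of TensorFlow may also report symbols that TensorFlow itself would normally provide, so treat the message as a hint rather than a verdict.

```python
import ctypes


def try_load(path):
    """Try to dlopen a shared library and return (loaded, error_message)."""
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        # The dynamic loader's message usually names the missing file or symbol.
        return False, str(exc)
    return True, ""


if __name__ == "__main__":
    # Path copied from the error message in this thread.
    so_path = (
        "/opt/conda/lib/python3.7/site-packages/"
        "tensorflow_gcs_config/_gcs_config_ops.so"
    )
    loaded, error = try_load(so_path)
    print("loaded OK" if loaded else "failed to load: " + error)
```

Running this inside the built image (rather than on the host) keeps the library search path the same as the one the failing test sees.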
Another update:
FYI, the stable version of tf 2.10 has been released.
We're still waiting on our base image to release a stable version of 2.9.1; we'll upgrade to 2.10 once they upgrade later.
Did you already try installing the Should add there are also |
Thanks @jakirkham! My last couple of commits will hopefully fix this. Even when I did install libomp/openmp, it wasn't working because libtorchaudio and libtorchtext for some reason did not declare a dependency on libomp, so it was never loaded (even though they look for symbols from it at runtime). That's why LD_PRELOAD was working but LD_LIBRARY_PATH was not. So I installed openmp (which comes with libomp) and used patchelf to add an explicit dependency from the pytorch *.so files to libomp.so. It seems to pass tests locally, and now I'm waiting on the official branch build to see if it fixes it. We also had to build from source because the prebuilt binaries were not compatible with the rest of our installed environment (e.g. differing CUDA versions).
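The patchelf step described above can be sketched roughly as follows. This is an illustration under assumptions, not the change made in this PR: the helper names and the `/path/to/...` locations are hypothetical, and the exact libomp file name may differ in the image. `--add-needed` is the patchelf option that adds a DT_NEEDED entry, so the loader pulls in libomp whenever the patched library is loaded.

```python
import subprocess


def add_needed_command(library_path, needed="libomp.so"):
    """Build the patchelf command that adds `needed` as a DT_NEEDED entry."""
    return ["patchelf", "--add-needed", needed, library_path]


def add_needed(library_path, needed="libomp.so", dry_run=True):
    """Return the command; run it only when dry_run is False."""
    command = add_needed_command(library_path, needed)
    if not dry_run:
        # Requires the patchelf binary on PATH; raises CalledProcessError on failure.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical paths; the real locations depend on how torchaudio/torchtext were built.
    for lib in ["/path/to/libtorchaudio.so", "/path/to/libtorchtext.so"]:
        print(" ".join(add_needed(lib)))
```

The dry-run default only prints the commands, which makes it easy to review them before patching anything in the image.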
Thanks for the context (and working on this generally) Dustin! 🙏 Gotcha, we've tried to improve CUDA version compatibility of conda-forge packages, so that shouldn't be an issue. That said, I would be happy to learn more about the particular issues that you are running into. If you are open to it, I can provide a few suggestions in the diff to move in the direction of using conda-forge more. Though I completely understand if there are other approaches preferred by this group.
Dockerfile.tmpl
Outdated
@@ -81,7 +89,8 @@ RUN conda config --add channels nvidia && \

# b/232247930: uninstall pyarrow to avoid double installation with the GPU specific version.
RUN pip uninstall -y pyarrow && \
conda install cudf=21.10 cuml=21.10 cudatoolkit=$CUDA_MAJOR_VERSION.$CUDA_MINOR_VERSION && \
conda install -c conda-forge mamba && \
mamba install -y cudf cuml cudatoolkit==11.2.2 && \
@jakirkham I'm def interested in any suggestions. My patchelf fixes did fix some of the pytorch issues for us, but the final test failure I'm running into now is that there are two libcusolver.so's, caused by the cudatoolkit patch version upgrade. When we do: It replaces the built in cudatoolkit: However our pytorch build (and possibly other things in our build) are built against 11.2.2. This is the error pytorch gives if it tries to load the libcusolver from 11.2.72:
When I tried to prevent the cudatoolkit version from changing I get an unsat error from mamba: |
[FYI]: TensorFlow 2.11 now has a stable release. (Things are moving pretty fast! o_0)
@djherbis |
@innat I'll see if the deeplearning tensorflow base image has upgraded again; maybe we'll get lucky and rapids + tensorflow + pytorch will work together. The TPU 1VM image uses a different Dockerfile than the other ones, and I only installed the minimal packages needed for the TPU. So it doesn't have rapids, the deeplearning base image, etc. installed, which are what cause the upgrade issues on this image (the one used for GPU/CPU Notebooks).
docker run -it gcr.io/deeplearning-platform-release/tf2-gpu.2-10:m100 bash
root@b3c9451d19f8:/# conda list | grep cuda
cudatoolkit 11.2.2 hbe64b41_10 conda-forge
Looks like the newest available image is still using 11.2.2 :( We need 11.2.72.
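If this check needs repeating for each new base image tag, a small parser over the `conda list` text can do the comparison. The sketch below is only illustrative: `conda_package_version` is a hypothetical helper, and it assumes the usual whitespace-separated `name version build channel` columns.

```python
def conda_package_version(conda_list_output, package):
    """Return the version column for `package` from `conda list` text, or None."""
    for line in conda_list_output.splitlines():
        fields = line.split()
        # Skip blank lines and the '#' header lines that `conda list` prints.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        if fields[0] == package:
            return fields[1]
    return None


if __name__ == "__main__":
    # Sample line taken from the output quoted above.
    sample = "cudatoolkit 11.2.2 hbe64b41_10 conda-forge"
    found = conda_package_version(sample, "cudatoolkit")
    wanted = "11.2.72"
    print("found", found, "wanted", wanted, "match:", found == wanted)
```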
Upgrade our base image to gcr.io/deeplearning-platform-release/tf2-cpu.2-9. This image includes:
Also upgrade PyTorch to 1.12.
DO_NOT_SUBMIT: Wait until the m95 version with TensorFlow 2.9.1 (stable) is out.
http://b/207851560
http://b/238238619