Upgrade to TensorFlow 2.9 #1182

Merged
merged 48 commits into from
Feb 9, 2023
rosbo
Contributor

@rosbo rosbo commented Aug 3, 2022

Upgrade our base image to gcr.io/deeplearning-platform-release/tf2-cpu.2-9. This image includes:

  • TensorFlow 2.9
  • CUDA 11.3

Also upgrade PyTorch to 1.12

DO_NOT_SUBMIT: Wait until the m95 version with TensorFlow 2.9.1 (stable) is out.

http://b/207851560
http://b/238238619

rosbo added 3 commits August 2, 2022 17:41
DO_NOT_SUBMIT: Wait until new base image with TensorFlow 2.9.1 is out.

http://b/207851560
http://b/238238619
@rosbo
Contributor Author

rosbo commented Aug 3, 2022

@kkraus14, do you know why RapidsAI doesn't support CUDA 11.3?

If not, do you know who at NVIDIA might know?

[screenshot]

@kkraus14

kkraus14 commented Aug 4, 2022

@rosbo I am no longer an NVIDIA employee, but I believe the RAPIDS libraries now require CUDA 11.5+ to build but can run on CUDA 11.0+ using CUDA Enhanced Compatibility. The conda packages have a constraint of cudatoolkit >=11,<12.0a0. The web page you screenshotted is unfortunately incorrect.
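
To make the `cudatoolkit >=11,<12.0a0` constraint concrete, here is a minimal sketch that checks which CUDA toolkit versions fall inside that range. It is not conda's own version-matching logic: the helper names are hypothetical, and the pre-release bound `12.0a0` is simplified to "major version below 12".

```python
def parse_version(text):
    """Turn a dotted numeric version such as '11.2.72' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_rapids_constraint(cuda_version):
    """Approximate the conda constraint `cudatoolkit >=11,<12.0a0`.

    Assumption: `<12.0a0` is treated as "major version < 12", which ignores
    how conda orders pre-release tags.
    """
    version = parse_version(cuda_version)
    return (11,) <= version < (12,)


# Print which candidate versions fall inside the range.
for candidate in ("10.2", "11.0", "11.3", "11.2.72", "12.0"):
    print(candidate, satisfies_rapids_constraint(candidate))
```

Under this reading, CUDA 11.3 falls inside the range, which matches the comment that the packages can run on CUDA 11.0+.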

@innat

innat commented Aug 4, 2022

@rosbo

> @kkraus14, do you know why RapidsAI doesn't support CUDA 11.3?
> If not, do you know who at NVIDIA might know?

cc. @titericz

@innat

innat commented Aug 4, 2022

@rosbo
Thanks for upgrading to TensorFlow 2.9.

TF 2.10 is coming soon; RC0 was released a few hours ago.

If the final 2.10 release comes out while this PR is still in progress, and if it doesn't add much implementation cost, please consider upgrading to TF 2.10 🙏. The Bug Fixes and Other Changes section doesn't look too scary.

Also, please take care of the TPU TensorFlow version as well. It has been 2.4 since the beginning and has not been updated. The TensorFlow version on both accelerators should match. Please consider this. Thanks.

@rosbo
Contributor Author

rosbo commented Aug 4, 2022

Hi @innat,

Going from rc0 to stable usually takes several weeks. We want to wait until 2.10 stable is out before doing the upgrade. We also need to wait until the Google Cloud Deep Learning Container image releases a new container image with 2.10 stable. This is the base image we use.

We are aiming to release 2.9.1 as soon as possible, so we won't wait for 2.10. We will try our best to upgrade to 2.10 as soon as our base image dependency releases a stable version for it.

For the TPU support, we will upgrade to 2.9.1 shortly after upgrading the CPU/GPU image.

Dockerfile.tmpl Outdated
@@ -167,8 +173,9 @@ RUN pip install pysal && \
# Use `conda install -c h2oai h2o` once Python 3.7 version is released to conda.
apt-get install -y default-jre-headless && \
pip install -f https://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o && \
pip install tensorflow-gcs-config==2.6.0 && \
pip install tensorflow-addons==0.14.0 && \
pip install tensorflow-gcs-config==${TENSORFLOW_VERSION} && \
Contributor Author

There is no tensorflow-gcs-config release for 2.9.0, but there is one for 2.9.1. Use 2.9.1 once a new base image is released.

Dockerfile.tmpl Outdated
pip install tensorflow-addons==0.14.0 && \
pip install tensorflow-gcs-config==${TENSORFLOW_VERSION} && \
# TODO(b/207851560) Upgrade to 0.17.1 once the base image with TensorFlow 2.9.1 is out.
pip install tensorflow-addons==0.17.0 && \
pip install tensorflow_decision_forests==0.2.0 && \

Dockerfile.tmpl Outdated
pip install tensorflow-addons==0.14.0 && \
pip install tensorflow-gcs-config==${TENSORFLOW_VERSION} && \
# TODO(b/207851560) Upgrade to 0.17.1 once the base image with TensorFlow 2.9.1 is out.
pip install tensorflow-addons==0.17.0 && \
Contributor Author

Should be upgraded to 0.17.1

@djherbis
Contributor

Both CPU & GPU images built successfully! There are some failing tests though.

@djherbis
Contributor

Update: allennlp is downgrading pytorch which is causing some of the test failures.

There also appear to be a few NumPy binary compatibility issues (e.g. nnabla).

Still running into conflicts between test_dlib and test_catalyst (if I use LD_PRELOAD to make test_dlib pass, test_catalyst fails)

LD_PRELOAD=/opt/conda/lib/libmkl_core.so:/opt/conda/lib/libmkl_sequential.so

test_catalyst.py
Intel MKL FATAL ERROR: cannot load libmkl_vml_avx512.so.2 or libmkl_vml_def.so.2.

or

test_dlib.py
Intel MKL FATAL ERROR: Cannot load libmkl_avx512.so.2 or libmkl_def.so.2.

tensorflow-gcs-config also seems to have undefined symbol issues, despite both it and TensorFlow being on v2.9.1:

NotImplementedError: unable to open file: _gcs_config_ops.so, from paths: ['/opt/conda/lib/python3.7/site-packages/tensorflow_gcs_config/_gcs_config_ops.so']
caused by: ['/opt/conda/lib/python3.7/site-packages/tensorflow_gcs_config/_gcs_config_ops.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringB5cxx11ERKNS_15OpKernelContextEb']
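
The `B5cxx11` fragment in the missing symbol above is the Itanium C++ mangling of the `[abi:cxx11]` tag, which libstdc++ attaches to functions that involve its newer `std::string` layout. One possible reading, which is an assumption and not something confirmed in this thread, is that `tensorflow-gcs-config` and the installed TensorFlow were built with different `_GLIBCXX_USE_CXX11_ABI` settings. The sketch below uses a hypothetical helper name and a plain substring heuristic to flag such symbols.

```python
# Itanium-ABI mangled names encode an ABI tag as "B<length><tag>", so the
# `[abi:cxx11]` tag shows up as the literal text "B5cxx11".
CXX11_ABI_TAG = "B5cxx11"


def uses_cxx11_abi_tag(mangled_symbol):
    """Heuristic: report whether a mangled symbol carries the cxx11 ABI tag.

    This is a substring check, not a real demangler, so it could in principle
    match an identifier that merely happens to contain the same text.
    """
    return CXX11_ABI_TAG in mangled_symbol


# Symbol copied from the error message above.
symbol = "_ZNK10tensorflow8OpKernel11TraceStringB5cxx11ERKNS_15OpKernelContextEb"
print(uses_cxx11_abi_tag(symbol))
```

If TensorFlow is importable, comparing the result against `tf.sysconfig.CXX11_ABI_FLAG` may help confirm or rule out an ABI mismatch.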

@djherbis
Contributor

djherbis commented Sep 1, 2022

Another update:

  • I'm going to drop a couple packages which are hurting the upgrade path.
  • I'm waiting on another tensorflow upgrade which should fix the add-ons ABI breakages (🤞 it should be out in a week).

@innat

innat commented Sep 6, 2022

> Hi @innat,
>
> Going from rc0 to stable usually takes several weeks. We want to wait until 2.10 stable is out before doing the upgrade. We also need to wait until the Google Cloud Deep Learning Container image releases a new container image with 2.10 stable. This is the base image we use.
>
> We are aiming to release 2.9.1 as soon as possible, so we won't wait for 2.10. We will try our best to upgrade to 2.10 as soon as our base image dependency releases a stable version for it.
>
> For the TPU support, we will upgrade to 2.9.1 shortly after upgrading the CPU/GPU image.

FYI, stable version of tf 2.10 is released.
https://github.com/tensorflow/tensorflow/releases/tag/v2.10.0

@djherbis
Contributor

djherbis commented Sep 6, 2022

> > Hi @innat,
> > Going from rc0 to stable usually takes several weeks. We want to wait until 2.10 stable is out before doing the upgrade. We also need to wait until the Google Cloud Deep Learning Container image releases a new container image with 2.10 stable. This is the base image we use.
> > We are aiming to release 2.9.1 as soon as possible, so we won't wait for 2.10. We will try our best to upgrade to 2.10 as soon as our base image dependency releases a stable version for it.
> > For the TPU support, we will upgrade to 2.9.1 shortly after upgrading the CPU/GPU image.
>
> FYI, stable version of tf 2.10 is released. https://github.com/tensorflow/tensorflow/releases/tag/v2.10.0

We're still waiting on our base image to release a stable version of 2.9.1; we'll upgrade to 2.10 later, once the base image does.
Our goal for this PR is still to get up to 2.9, hopefully this week if the base image releases in time.

@jakirkham

Did you already try installing the libgomp package from conda-forge? This hopefully should avoid needing to use LD_PRELOAD.

Should add there are also pytorch packages from conda-forge. One can install them using this package syntax 'pytorch=*=*cuda*'. Maybe this saves some headaches building from source?

@djherbis
Contributor

djherbis commented Nov 8, 2022

> Did you already try installing the libgomp package from conda-forge? This hopefully should avoid needing to use LD_PRELOAD.
>
> Should add there are also pytorch packages from conda-forge. One can install them using this package syntax 'pytorch=*=*cuda*'. Maybe this saves some headaches building from source?

Thanks @jakirkham! My last couple commits are actually hopefully going to fix this.

Even when I did install libomp/openmp, it wasn't working because libtorchaudio and libtorchtext, for some reason, did not declare a dependency on libomp, so it was never loaded (even though they look for symbols from it at runtime). That's why LD_PRELOAD worked but LD_LIBRARY_PATH did not. So I installed openmp (which comes with libomp) and used patchelf to add an explicit dependency from the pytorch *.so files to libomp.so. It seems to pass tests locally; now I'm waiting on the official branch build to see if it fixes it.

We also had to build from source because the prebuilt packages were not compatible with the rest of our installed environment (e.g. differing CUDA versions).
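
The exact commands used in this PR are not shown in the thread. The following is a minimal sketch, assuming patchelf's `--add-needed` option and hypothetical library paths, of how an explicit dependency on libomp could be recorded in a shared object. It only builds and prints the commands; it does not run them.

```python
def patchelf_add_needed_command(shared_object, needed="libomp.so"):
    """Build (but do not run) a patchelf command that records `needed`
    as an explicit DT_NEEDED dependency of `shared_object`."""
    return ["patchelf", "--add-needed", needed, shared_object]


# Hypothetical paths, for illustration only; the real locations depend on
# where torchaudio and torchtext are installed.
libraries = ("/path/to/libtorchaudio.so", "/path/to/libtorchtext.so")

for library in libraries:
    print(" ".join(patchelf_add_needed_command(library)))
```

After patching, `patchelf --print-needed <file>` lists the recorded dependencies, which is one way to check that libomp.so now appears.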

@jakirkham

Thanks for the context (and working on this generally) Dustin! 🙏

Gotcha, we've tried to improve CUDA version compatibility of conda-forge packages, so that shouldn't be an issue. That said, I would be happy to learn more about the particular issues that you are running into.

If you are open to it, can provide a few suggestions in the diff to move in the direction of using conda-forge more. Though completely understand if there are other approaches preferred by this group.

Dockerfile.tmpl Outdated
@@ -81,7 +89,8 @@ RUN conda config --add channels nvidia && \

# b/232247930: uninstall pyarrow to avoid double installation with the GPU specific version.
RUN pip uninstall -y pyarrow && \
conda install cudf=21.10 cuml=21.10 cudatoolkit=$CUDA_MAJOR_VERSION.$CUDA_MINOR_VERSION && \
conda install -c conda-forge mamba && \
mamba install -y cudf cuml cudatoolkit==11.2.2 && \
Contributor

#15 334.5 Looking for: ['cudf', 'cuml', 'cudatoolkit==11.2.2']
#15 334.5
#15 334.5 Pinned packages:
#15 334.5 - python 3.7.*
#15 334.5
#15 334.5 Encountered problems while solving:
#15 334.5 - nothing provides cuda92 needed by libcuml-0.5.0-cuda9.2_3

@djherbis
Contributor

djherbis commented Nov 9, 2022

@jakirkham I'm def interested in any suggestions. My patchelf fixes did fix some of the pytorch issues for us, but the final test failure I'm running into now is that there are two libcusolver.so's, caused by the cudatoolkit patch version upgrade.

When we do:
mamba install -y cudf cuml cudatoolkit

It replaces the built in cudatoolkit:
#15 293.8 - cudatoolkit 11.2.2 hbe64b41_10 conda-forge
#15 293.8 + cudatoolkit 11.2.72 h2bc3f7f_0 nvidia/linux-64 979MB

However our pytorch build (and possibly other things in our build) are built against 11.2.2.
One thing I maliciously tried locally was replacing the libcusolver.so version and that seemed to work... but I have no idea the consequences of doing that.

This is the error pytorch gives if it tries to load the libcusolver from 11.2.72:

======================================================================
ERROR: test_linalg (test_pytorch.TestPyTorch)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/input/tests/test_pytorch.py", line 23, in test_linalg
    result = torch.linalg.solve(A, B)
RuntimeError: Error in dlopen: /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_linalg.so: undefined symbol: cusolverDnXsytrs, version libcusolver.so.11
----------------------------------------------------------------------

When I tried to prevent the cudatoolkit version from changing, I got an unsat error from mamba:
https://github.com/Kaggle/docker-python/pull/1182/files#r1018440765
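
When two copies of a library such as libcusolver end up in one image, a quick scan of the install prefix can show where they live. This is a minimal sketch with a hypothetical helper name. It assumes that listing distinct real files per library name is enough to spot the conflict, and it runs against a throwaway directory tree, not a real conda prefix.

```python
import os
import re
import tempfile
from collections import defaultdict

# Matches e.g. "libcusolver.so", "libcusolver.so.11", "libcusolver.so.11.1.0.152".
SHARED_LIB = re.compile(r"^(lib[^/]+?\.so)(\.[\d.]+)?$")


def find_duplicate_libraries(root, name_prefix="libcusolver"):
    """Return {base name: sorted real paths} for libraries under `root`
    that resolve to more than one distinct file (symlinks are collapsed)."""
    real_files = defaultdict(set)
    for directory, _subdirs, files in os.walk(root):
        for filename in files:
            match = SHARED_LIB.match(filename)
            if match and filename.startswith(name_prefix):
                path = os.path.realpath(os.path.join(directory, filename))
                real_files[match.group(1)].add(path)
    return {name: sorted(paths) for name, paths in real_files.items() if len(paths) > 1}


# Demo on a temporary tree; the sub-directories are made up for illustration.
with tempfile.TemporaryDirectory() as root:
    for sub in ("a/lib", "b/lib"):
        os.makedirs(os.path.join(root, sub))
        open(os.path.join(root, sub, "libcusolver.so.11"), "w").close()
    duplicates = find_duplicate_libraries(root)
    print(duplicates)
```

On a real image one would point `root` at the environment prefix (for example `/opt/conda`, as seen in the paths above) and then decide which copy the failing extension should load.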

@innat

innat commented Dec 2, 2022

[FYI]: TensorFlow 2.11 now has a stable release.

(Things are moving pretty fast! o_0)
(I wonder if there's any better alternative to manage packages in such complex environments (kaggle).)

@innat

innat commented Dec 6, 2022

@djherbis
Kaggle now has two types of TPU (one of them is local). On this VM, the TensorFlow version is 2.10. Could you explain how that is?

@djherbis
Contributor

djherbis commented Dec 6, 2022

@innat I'll see if the deeplearning tensorflow base image has upgraded again, maybe we'll get lucky and rapids + tensorflow + pytorch will work together.

For the TPU 1VM image, it's using a different Dockerfile than the other ones, and I only installed the minimal packages needed for the TPU. So it doesn't have RAPIDS or the deeplearning base image installed, which are what cause the upgrade issues on this image (the one used for GPU/CPU notebooks).

@djherbis
Contributor

djherbis commented Dec 6, 2022

docker run -it gcr.io/deeplearning-platform-release/tf2-gpu.2-10:m100 bash
root@b3c9451d19f8:/# conda list | grep cuda
cudatoolkit               11.2.2              hbe64b41_10    conda-forge

Looks like the newest available image is still using 11.2.2 :( we need 11.2.72
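
The manual check above could be scripted along these lines. This is a minimal sketch with a hypothetical helper name. It parses text in the column layout that `conda list` printed above and compares the cudatoolkit version with the one this thread says is needed.

```python
def cudatoolkit_version(conda_list_output):
    """Return the cudatoolkit version found in `conda list` text, or None."""
    for line in conda_list_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "cudatoolkit":
            return fields[1]
    return None


# Output line copied from the comment above.
sample = "cudatoolkit               11.2.2              hbe64b41_10    conda-forge"

found = cudatoolkit_version(sample)
required = "11.2.72"  # the version the comment above says is needed
print(found, found == required)
```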

@djherbis djherbis mentioned this pull request Jan 24, 2023
@djherbis djherbis merged commit 739e1b0 into main Feb 9, 2023
@djherbis djherbis deleted the upgrade-tf2.9 branch February 9, 2023 14:07