Upgrade to TensorFlow 2.9 #1182

rosbo · 2022-08-03T21:02:06Z

Upgrade our base image to gcr.io/deeplearning-platform-release/tf2-cpu.2-9. This image includes:

TensorFlow 2.9
CUDA 11.3

Also upgrade PyTorch to 1.12

DO_NOT_SUBMIT: Wait until the m95 version with TensorFlow 2.9.1 (stable) is out.

http://b/207851560
http://b/238238619


rosbo · 2022-08-03T21:05:03Z

@kkraus14, do you know why RapidsAI doesn't support CUDA 11.3?

If not, do you know who at NVIDIA might know?

kkraus14 · 2022-08-04T04:43:36Z

@rosbo I am no longer an NVIDIA employee, but I believe the RAPIDS libraries now require CUDA 11.5+ to build but can run on CUDA 11.0+ using CUDA Enhanced Compatibility. The conda packages have a constraint of cudatoolkit >=11,<12.0a0. The web page you screenshot is unfortunately incorrect.
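To make the quoted pin concrete, here is a minimal Python sketch of which CUDA versions `cudatoolkit >=11,<12.0a0` admits. The helper names are hypothetical, and the comparison is a simplification for plain numeric versions only: it treats the `12.0a0` upper bound as "anything below 12". The real check is done by the conda solver.

```python
def parse_version(text):
    """Turn a plain numeric version such as '11.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies_cudatoolkit_pin(cuda_version):
    """Approximate the conda pin `cudatoolkit >=11,<12.0a0` for numeric versions."""
    version = parse_version(cuda_version)
    # Tuple comparison: (11,) <= (11, 3) < (12,), while (12, 0) is not < (12,).
    return (11,) <= version < (12,)


if __name__ == "__main__":
    for candidate in ("10.2", "11.0", "11.3", "11.5", "12.0"):
        print(candidate, satisfies_cudatoolkit_pin(candidate))
```

Under this reading, CUDA 11.3 falls inside the range, which matches the statement that the packages can run on CUDA 11.0+.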

innat · 2022-08-04T18:39:26Z

@rosbo

> @kkraus14, do you know why RapidsAI doesn't support CUDA 11.3?
> If not, do you know who at NVIDIA might know?

cc. @titericz

innat · 2022-08-04T18:52:24Z

@rosbo
Thanks for upgrading to TensorFlow 2.9.

TF 2.10 is coming soon; RC0 was released a few hours ago.

If the final 2.10 release comes out while this PR is still in progress, and it doesn't add much implementation cost, please consider upgrading to TF 2.10 🙏. The Bug Fixes and Other Changes section doesn't look too scary.

Also, please take care of the TPU TensorFlow version as well. It has been at 2.4 since the beginning and has not been updated yet. The TensorFlow version on both accelerators should match. Please consider this. Thanks.

rosbo · 2022-08-04T20:58:04Z

Hi @innat,

Going from rc0 to stable usually takes several weeks. We want to wait until 2.10 stable is out before doing the upgrade. We also need to wait until the Google Cloud Deep Learning Container image releases a new container image with 2.10 stable. This is the base image we use.

We are aiming to release 2.9.1 as soon as possible, so we won't wait for 2.10. We will try our best to upgrade to 2.10 as soon as our base image dependency releases a stable version for it.

For the TPU support, we will upgrade to 2.9.1 shortly after upgrading the CPU/GPU image.

rosbo · 2022-08-04T22:46:06Z

Dockerfile.tmpl

@@ -167,8 +173,9 @@ RUN pip install pysal && \
    # Use `conda install -c h2oai h2o` once Python 3.7 version is released to conda.
    apt-get install -y default-jre-headless && \
    pip install -f https://h2o-release.s3.amazonaws.com/h2o/latest_stable_Py.html h2o && \
-    pip install tensorflow-gcs-config==2.6.0 && \
-    pip install tensorflow-addons==0.14.0 && \
+    pip install tensorflow-gcs-config==${TENSORFLOW_VERSION} && \


There is no 2.9.0 release of this package, but there is a 2.9.1 release. Use 2.9.1 once a new base image is released.

Dockerfile.tmpl

This is the newest version that supports both CUDA 11.X & Python 3.7

rosbo · 2022-08-17T00:04:18Z

Dockerfile.tmpl

-    pip install tensorflow-addons==0.14.0 && \
+    pip install tensorflow-gcs-config==${TENSORFLOW_VERSION} && \
+    # TODO(b/207851560) Upgrade to 0.17.1 once the base image with TensorFlow 2.9.1 is out.
+    pip install tensorflow-addons==0.17.0 && \
    pip install tensorflow_decision_forests==0.2.0 && \


With TensorFlow 2.9.1, you can use tensorflow_decision_forests==0.2.7: https://github.com/tensorflow/decision-forests/blob/af7633447acc2146ddbf0547165098ff592711e6/configure/setup.py#L31

rosbo · 2022-08-17T00:05:18Z

Dockerfile.tmpl

-    pip install tensorflow-addons==0.14.0 && \
+    pip install tensorflow-gcs-config==${TENSORFLOW_VERSION} && \
+    # TODO(b/207851560) Upgrade to 0.17.1 once the base image with TensorFlow 2.9.1 is out.
+    pip install tensorflow-addons==0.17.0 && \


Should be upgraded to 0.17.1

djherbis · 2022-08-17T14:35:45Z

Both CPU & GPU images built successfully! There are some failing tests though.

djherbis · 2022-08-23T13:03:29Z

Update: allennlp is downgrading pytorch, which is causing some of the test failures.

There also appear to be a few numpy binary compatibility issues (e.g. nnabla).

I'm still running into conflicts between test_dlib and test_catalyst (if I use LD_PRELOAD to make test_dlib pass, test_catalyst fails):

LD_PRELOAD=/opt/conda/lib/libmkl_core.so:/opt/conda/lib/libmkl_sequential.so

test_catalyst.py
Intel MKL FATAL ERROR: cannot load libmkl_vml_avx512.so.2 or libmkl_vml_def.so.2.

or

test_dlib.py
Intel MKL FATAL ERROR: Cannot load libmkl_avx512.so.2 or libmkl_def.so.2.

tensorflow-gcs-config also seems to have undefined symbol issues, despite both it and tensorflow being on v2.9.1:

NotImplementedError: unable to open file: _gcs_config_ops.so, from paths: ['/opt/conda/lib/python3.7/site-packages/tensorflow_gcs_config/_gcs_config_ops.so']
caused by: ['/opt/conda/lib/python3.7/site-packages/tensorflow_gcs_config/_gcs_config_ops.so: undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringB5cxx11ERKNS_15OpKernelContextEb']
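As a reading aid for the error above, here is a small Python sketch that pulls the library path and the missing symbol out of the loader message. The function name and the returned dictionary layout are hypothetical. The `B5cxx11` fragment is how the Itanium C++ name mangling spells the `[abi:cxx11]` tag, so the sketch also reports whether that tag is present. The tag on its own does not prove an ABI mismatch; it only shows that the missing function is tied to libstdc++'s newer string ABI.

```python
import re

# Error text copied from the traceback above.
MESSAGE = (
    "/opt/conda/lib/python3.7/site-packages/tensorflow_gcs_config/_gcs_config_ops.so: "
    "undefined symbol: _ZNK10tensorflow8OpKernel11TraceStringB5cxx11ERKNS_15OpKernelContextEb"
)

UNDEFINED_SYMBOL = re.compile(
    r"(?P<library>\S+\.so[\w.]*): undefined symbol: (?P<symbol>\w+)"
)


def parse_undefined_symbol(message):
    """Extract the library and the missing symbol from a dlopen-style error."""
    match = UNDEFINED_SYMBOL.search(message)
    if match is None:
        return None
    symbol = match.group("symbol")
    return {
        "library": match.group("library"),
        "symbol": symbol,
        # "B5cxx11" is the mangled form of the [abi:cxx11] tag.
        "has_cxx11_abi_tag": "B5cxx11" in symbol,
    }


if __name__ == "__main__":
    print(parse_undefined_symbol(MESSAGE))
```

Demangled, the symbol reads roughly as `tensorflow::OpKernel::TraceString[abi:cxx11](tensorflow::OpKernelContext const&, bool) const`, i.e. the op library expects a method that the loaded TensorFlow library does not export.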

djherbis · 2022-09-01T17:10:00Z

Another update:

I'm going to drop a couple packages which are hurting the upgrade path.
I'm waiting on another tensorflow upgrade which should fix the add-ons ABI breakages (🤞 it should be out in a week).

innat · 2022-09-06T20:21:38Z

> Hi @innat,
>
> Going from rc0 to stable usually takes several weeks. We want to wait until 2.10 stable is out before doing the upgrade. We also need to wait until the Google Cloud Deep Learning Container image releases a new container image with 2.10 stable. This is the base image we use.
>
> We are aiming to release 2.9.1 as soon as possible, so we won't wait for 2.10. We will try our best to upgrade to 2.10 as soon as our base image dependency releases a stable version for it.
>
> For the TPU support, we will upgrade to 2.9.1 shortly after upgrading the CPU/GPU image.

FYI, the stable version of TF 2.10 has been released.
https://github.com/tensorflow/tensorflow/releases/tag/v2.10.0

djherbis · 2022-09-06T20:31:22Z

> > Hi @innat,
> > Going from rc0 to stable usually takes several weeks. We want to wait until 2.10 stable is out before doing the upgrade. We also need to wait until the Google Cloud Deep Learning Container image releases a new container image with 2.10 stable. This is the base image we use.
> > We are aiming to release 2.9.1 as soon as possible, so we won't wait for 2.10. We will try our best to upgrade to 2.10 as soon as our base image dependency releases a stable version for it.
> > For the TPU support, we will upgrade to 2.9.1 shortly after upgrading the CPU/GPU image.
>
> FYI, the stable version of TF 2.10 has been released. https://github.com/tensorflow/tensorflow/releases/tag/v2.10.0

We're still waiting on our base image to release a stable version of 2.9.1; we'll upgrade to 2.10 later, once the base image does.
Our goal for this PR is still to get up to 2.9, hopefully this week if the base image releases in time.

jakirkham · 2022-11-08T04:58:22Z

Did you already try installing the libgomp package from conda-forge? This hopefully should avoid needing to use LD_PRELOAD.

Should add there are also pytorch packages from conda-forge. One can install them using this package syntax 'pytorch=*=*cuda*'. Maybe this saves some headaches building from source?

djherbis · 2022-11-08T13:38:44Z

> Did you already try installing the libgomp package from conda-forge? This hopefully should avoid needing to use LD_PRELOAD.
>
> Should add there are also pytorch packages from conda-forge. One can install them using this package syntax 'pytorch=*=*cuda*'. Maybe this saves some headaches building from source?

Thanks @jakirkham! My last couple commits are actually hopefully going to fix this.

Even when I did install libomp/openmp, it wasn't working because libtorchaudio and libtorchtext, for some reason, did not declare a dependency on libomp, so it was never loaded (even though they were looking for symbols from it at runtime). That's why LD_PRELOAD worked but LD_LIBRARY_PATH did not. So I installed openmp (which comes with libomp) and used patchelf to add an explicit dependency from the pytorch *.so files to libomp.so. It seems to pass tests locally; now I'm waiting on the official branch build to see if it fixes it.

We also had to build from source because the prebuilt packages were not compatible with the rest of our installed environment (e.g. differing CUDA versions).
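For readers unfamiliar with the patchelf approach described in this comment, here is a hedged Python sketch of what such a step could look like. It is not the code from this PR: the function names and the example path are hypothetical. It relies only on patchelf's `--add-needed` and `--print-needed` options, and by default it returns the command instead of running it.

```python
import subprocess


def add_needed_command(shared_object, dependency="libomp.so"):
    """Build the patchelf call that records `dependency` as a DT_NEEDED entry."""
    return ["patchelf", "--add-needed", dependency, shared_object]


def print_needed_command(shared_object):
    """Build the patchelf call that lists the DT_NEEDED entries of a library."""
    return ["patchelf", "--print-needed", shared_object]


def patch_library(shared_object, dependency="libomp.so", dry_run=True):
    """Return the patch command; execute it only when dry_run is False."""
    command = add_needed_command(shared_object, dependency)
    if not dry_run:
        # Requires patchelf on PATH and write access to the file.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical path, shown for illustration only.
    target = "/path/to/site-packages/torchaudio/lib/libtorchaudio.so"
    print(" ".join(patch_library(target)))
    print(" ".join(print_needed_command(target)))
```

Once the dependency is recorded in the library itself, the dynamic loader pulls in libomp.so whenever the library is loaded, which is why neither LD_PRELOAD nor a per-test workaround is needed afterwards.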

jakirkham · 2022-11-09T19:13:45Z

Thanks for the context (and working on this generally) Dustin! 🙏

Gotcha we've tried to improve CUDA version compatibility of conda-forge packages. So that shouldn't be an issue. That said, would be happy to learn more about the particular issues that you are running into.

If you are open to it, can provide a few suggestions in the diff to move in the direction of using conda-forge more. Though completely understand if there are other approaches preferred by this group.

djherbis · 2022-11-09T21:40:35Z

Dockerfile.tmpl

@@ -81,7 +89,8 @@ RUN conda config --add channels nvidia && \

 # b/232247930: uninstall pyarrow to avoid double installation with the GPU specific version.
 RUN pip uninstall -y pyarrow && \
-    conda install cudf=21.10 cuml=21.10 cudatoolkit=$CUDA_MAJOR_VERSION.$CUDA_MINOR_VERSION && \
+    conda install -c conda-forge mamba && \
+    mamba install -y cudf cuml cudatoolkit==11.2.2 && \


#15 334.5 Looking for: ['cudf', 'cuml', 'cudatoolkit==11.2.2']
#15 334.5
#15 334.5 Pinned packages:
#15 334.5 - python 3.7.*
#15 334.5
#15 334.5 Encountered problems while solving:
#15 334.5 - nothing provides cuda92 needed by libcuml-0.5.0-cuda9.2_3

djherbis · 2022-11-09T21:45:36Z

@jakirkham I'm def interested in any suggestions. My patchelf fixes did fix some of the pytorch issues for us, but the final test failure I'm running into now is that there are two libcusolver.so's, caused by the cudatoolkit patch version upgrade.

When we do:
mamba install -y cudf cuml cudatoolkit

It replaces the built in cudatoolkit:
#15 293.8 - cudatoolkit 11.2.2 hbe64b41_10 conda-forge
#15 293.8 + cudatoolkit 11.2.72 h2bc3f7f_0 nvidia/linux-64 979MB

However our pytorch build (and possibly other things in our build) are built against 11.2.2.
One thing I maliciously tried locally was replacing the libcusolver.so version and that seemed to work... but I have no idea the consequences of doing that.

This is the error pytorch gives if it tries to load the libcusolver from 11.2.72:

======================================================================
ERROR: test_linalg (test_pytorch.TestPyTorch)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/input/tests/test_pytorch.py", line 23, in test_linalg
    result = torch.linalg.solve(A, B)
RuntimeError: Error in dlopen: /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_cuda_linalg.so: undefined symbol: cusolverDnXsytrs, version libcusolver.so.11
----------------------------------------------------------------------

When I tried to prevent the cudatoolkit version from changing I get an unsat error from mamba:
https://github.com/Kaggle/docker-python/pull/1182/files#r1018440765
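One way to confirm that more than one libcusolver copy is present in an image is to scan the conda prefix for matching files. The Python sketch below is a hypothetical diagnostic, not something taken from this PR; the function name is made up, and `/opt/conda` is assumed as the search root only because it appears in the traceback above.

```python
from pathlib import Path


def find_library_copies(root, pattern="libcusolver.so*"):
    """Return the distinct real files under `root` whose name matches `pattern`."""
    real_files = set()
    for path in Path(root).rglob(pattern):
        if path.is_file():
            # resolve() follows symlinks, so an alias and its target count once.
            real_files.add(path.resolve())
    return sorted(real_files)


if __name__ == "__main__":
    # Assumed search root; adjust to the environment being inspected.
    for library in find_library_copies("/opt/conda"):
        print(library)
```

If the scan prints more than one real file, the loader may resolve `libcusolver.so.11` to a copy that lacks the symbols another package was built against.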

innat · 2022-12-02T18:44:58Z

[FYI]: TensorFlow 2.11 now has a stable release.

(Things are moving pretty fast! o_0)
(I wonder if there's any better alternative to manage packages in such complex environments (kaggle).)

innat · 2022-12-06T19:20:21Z

@djherbis
Kaggle recently started offering two types of TPU (one of which is local). On that VM, the TensorFlow version is 2.10. Could you explain how that is possible?

djherbis · 2022-12-06T19:44:07Z

@innat I'll see if the deeplearning tensorflow base image has upgraded again, maybe we'll get lucky and rapids + tensorflow + pytorch will work together.

The TPU 1VM image uses a different Dockerfile than the other ones, and I only installed the minimal packages needed for the TPU. It doesn't have rapids or the deeplearning base image installed, which are what cause the upgrade issues on this image (the one used for GPU/CPU Notebooks).

djherbis · 2022-12-06T22:11:38Z

docker run -it gcr.io/deeplearning-platform-release/tf2-gpu.2-10:m100 bash
root@b3c9451d19f8:/# conda list | grep cuda
cudatoolkit               11.2.2              hbe64b41_10    conda-forge

Looks like the newest available image is still using 11.2.2 :( we need 11.2.72

rosbo added 3 commits August 2, 2022 17:41

Upgrade to TensorFlow 2.9

932bdeb

DO_NOT_SUBMIT: Wait until new base image with TensorFlow 2.9.1 is out. http://b/207851560

Upgrade PyTorch to 1.12

019234e

http://b/238238619

Set CUDA_MINOR_VERSION to 3

9fda507

Add conda libs to LD_LIBRARY_PATH

4bd4496

rosbo commented Aug 4, 2022


djherbis added 3 commits August 10, 2022 15:54

Use DLVM image m95

4896dad

The tag name is m95_release for some reason

49aafb7

Merge branch 'main' into upgrade-tf2.9

c8649e5

djherbis reviewed Aug 10, 2022


djherbis added 9 commits August 10, 2022 17:10

Reorder FROM & ENV so that build works correctly

776a50a

They fixed the tag to m95

d674a42

Use LIBRARY_PATH instead of LD_LIBRARY_PATH for linking

8356a51

We need both LIBRARY_PATH and LD_LIBRARY_PATH

855fe8f

LIBRARY_PATH & LD_LIBRARY_PATH for linking

85e1c42

ncurses.h is required to install torch audio

fb69472

Bump cudf/cuml to 21.12

74046c2

This is the newest version that supports both CUDA 11.X & Python 3.7

Update Dockerfile.tmpl

ba6160c

remove extra $

df511c3

rosbo commented Aug 17, 2022


update tensorflow decision forest and addons

a86e055

Update Dockerfile.tmpl

8f31a85

Update Dockerfile.tmpl

87d59c9

djherbis reviewed Nov 9, 2022


djherbis added 4 commits November 21, 2022 16:24

force cudatoolkit 11.2.2

abaa7f0

Use mamba for all conda installs

962daca

Undo force cudatoolkit, causes inconsistent env

6732bc3

Use mamba & include cuda upgrades in build

2ec8ae9

djherbis mentioned this pull request Jan 24, 2023

Torch Tensor RT 1.3.0 #1210

Closed

djherbis added 7 commits January 30, 2023 12:28

Disable rapidsai until compatible with tf cudatoolkit

0910998

Merge branch 'main' into upgrade-tf2.9

3dfb0b0

Update Dockerfile.tmpl

902a5ff

Update Dockerfile.tmpl

f9a5136

Update config.txt

537bd6e

Fix gitextensions bug

f52379a

try 2.11

d527f3e

innat mentioned this pull request Feb 7, 2023

kerasNLP at Kaggle keras-team/keras-hub#726

Closed

djherbis added 4 commits February 7, 2023 12:58

Update config.txt

3aee977

Update Dockerfile.tmpl

5982053

Update Dockerfile.tmpl

c654412

remove test_repids.py

8f810af

djherbis merged commit 739e1b0 into main Feb 9, 2023

djherbis deleted the upgrade-tf2.9 branch February 9, 2023 14:07
