Use Colab as a base image. #1444

Merged 12 commits on Nov 27, 2024.

594 changes: 75 additions & 519 deletions Dockerfile.tmpl

Large diffs are not rendered by default.

60 changes: 0 additions & 60 deletions Jenkinsfile
@@ -21,66 +21,6 @@ pipeline {
}

stages {
stage('Pre-build Packages from Source') {
parallel {
stage('torch') {
options {
timeout(time: 300, unit: 'MINUTES')
}
steps {
sh '''#!/bin/bash
set -exo pipefail
source config.txt
cd packages/
./build_package --base-image $BASE_IMAGE_REPO/$GPU_BASE_IMAGE_NAME:$BASE_IMAGE_TAG \
--package torch \
--version $TORCH_VERSION \
--build-arg TORCHAUDIO_VERSION=$TORCHAUDIO_VERSION \
--build-arg TORCHVISION_VERSION=$TORCHVISION_VERSION \
--build-arg CUDA_MAJOR_VERSION=$CUDA_MAJOR_VERSION \
--build-arg CUDA_MINOR_VERSION=$CUDA_MINOR_VERSION \
--push
'''
}
}
stage('lightgbm') {
options {
timeout(time: 10, unit: 'MINUTES')
}
steps {
sh '''#!/bin/bash
set -exo pipefail
source config.txt
cd packages/
./build_package --base-image $BASE_IMAGE_REPO/$GPU_BASE_IMAGE_NAME:$BASE_IMAGE_TAG \
--package lightgbm \
--version $LIGHTGBM_VERSION \
--build-arg CUDA_MAJOR_VERSION=$CUDA_MAJOR_VERSION \
--build-arg CUDA_MINOR_VERSION=$CUDA_MINOR_VERSION \
--push
'''
}
}
stage('jaxlib') {
options {
timeout(time: 300, unit: 'MINUTES')
}
steps {
sh '''#!/bin/bash
set -exo pipefail
source config.txt
cd packages/
./build_package --base-image $BASE_IMAGE_REPO/$GPU_BASE_IMAGE_NAME:$BASE_IMAGE_TAG \
--package jaxlib \
--version $JAX_VERSION \
--build-arg CUDA_MAJOR_VERSION=$CUDA_MAJOR_VERSION \
--build-arg CUDA_MINOR_VERSION=$CUDA_MINOR_VERSION \
--push
'''
}
}
}
}
Contributor Author

These are no longer needed since they are installed in the base image.

stage('Build/Test/Diff') {
parallel {
stage('CPU') {
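The review comment on the Jenkinsfile says the pre-build stages can be dropped because the base image already ships those packages. Below is a minimal sketch of how one might check that from inside the image. It assumes Python 3.8+ and that the distribution names match the `--package` values of the removed stages (`torch`, `lightgbm`, `jaxlib`). The helper name `installed_versions` is hypothetical and not part of this PR.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumption: these are the distribution names of the removed pre-build stages.
    for name, version in installed_versions(["torch", "lightgbm", "jaxlib"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

A `None` entry would suggest the package still needs to be built or installed on top of the base image.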
4 changes: 1 addition & 3 deletions clean-layer.sh
@@ -19,6 +19,4 @@ apt-get clean
# Ensures the current working directory won't be deleted
cd /usr/local/src/
# Delete source files used for building binaries
rm -rf /usr/local/src/*
# Delete conda downloaded tarballs
conda clean -y --tarballs
rm -rf /usr/local/src/*
11 changes: 1 addition & 10 deletions config.txt
@@ -1,11 +1,2 @@
BASE_IMAGE_REPO=gcr.io/deeplearning-platform-release
BASE_IMAGE_TAG=m122
CPU_BASE_IMAGE_NAME=tf2-cpu.2-16.py310
GPU_BASE_IMAGE_NAME=tf2-gpu.2-16.py310
LIGHTGBM_VERSION=4.2.0
TORCH_VERSION=2.4.0
TORCHAUDIO_VERSION=2.4.0
TORCHVISION_VERSION=0.19.0
JAX_VERSION=0.4.26
CUDA_MAJOR_VERSION=12
CUDA_MINOR_VERSION=3
CUDA_MINOR_VERSION=2
Contributor

I assume this downgrade is to align with Colab?

Contributor Author

Yes! Colab is on 12.2, so we are too now. Sucks to go backwards, but better that we're aligned.

139 changes: 139 additions & 0 deletions kaggle_requirements.txt
@@ -0,0 +1,139 @@
altair>=5.4.0
Contributor

Super optional: this seems sort of alphabetically ordered. Is that intentional?
Do we want to add a comment at the top to enforce it? NGL, it is easier to read.

Contributor Author

Yeah, I originally created this list by doing pip freeze on Colab and on Kaggle, then diffing the results and taking the diff subset, and then only including explicitly mentioned packages from our Dockerfile.

It is roughly alphabetical because of pip freeze (capitalized names sort above lowercase ones; case probably shouldn't matter, and I'm not sure why some packages are capitalized). I'll add a comment in my follow-up.
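The procedure described in this comment (freeze both environments, diff, keep only packages named in the Dockerfile) can be sketched roughly as follows. This is an illustration under assumptions, not the script the author used: the function names and the sample data (including all version numbers) are made up. Names are normalized per PEP 503, which treats package names as case-insensitive, so sorting the result also avoids the capitalized-first ordering mentioned above.

```python
import re


def canonical(name):
    """PEP 503 normalization: lowercase, runs of '-', '_', '.' become '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_freeze(text):
    """Return the set of canonical package names in `pip freeze` output."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or line.startswith("-e "):
            continue
        # Handles both "name==1.2.3" and "name @ file:///..." entries.
        name = re.split(r"==|\s@\s", line, maxsplit=1)[0]
        names.add(canonical(name))
    return names


def kaggle_only(kaggle_freeze, colab_freeze, dockerfile_packages):
    """Packages installed on Kaggle but not Colab, limited to explicit ones."""
    missing = parse_freeze(kaggle_freeze) - parse_freeze(colab_freeze)
    explicit = {canonical(p) for p in dockerfile_packages}
    return sorted(missing & explicit)


# Made-up sample data; the versions are placeholders, not real pins.
kaggle = "Babel==1.0.0\nannoy==1.0.0\nnumpy==1.0.0\nsome-transitive-dep==1.0.0\n"
colab = "numpy==1.0.0\n"
print(kaggle_only(kaggle, colab, ["Babel", "annoy"]))
```

Restricting the result to explicitly requested packages keeps transitive dependencies out of the requirements file, so the resolver stays free to choose their versions.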

Babel
Boruta
Cartopy
ImageHash
Janome
PyArabic
PyUpSet
Pympler
Rtree
shapely<2
SimpleITK
TPOT
Theano
Wand
annoy
arrow
bayesian-optimization
boto3
catboost
category-encoders
cesium
comm
cytoolz
dask-expr
datasets
datashader
deap
dipy
docker
easyocr
eli5
emoji
fasttext
featuretools
fiona
fury
fuzzywuzzy
geojson
# geopandas > v0.14.4 breaks learn tools
geopandas==v0.14.4
google-cloud-aiplatform
# google-cloud-automl 2.0.0 introduced incompatible API changes, need to pin to 1.0.1
google-cloud-automl==1.0.1
# b/315753846: Unpin translate package.
google-cloud-translate==3.12.1
google-cloud-videointelligence
google-cloud-vision
gpxpy
h2o
haversine
hep-ml
igraph
ipympl
ipywidgets==8.1.5
isoweek
jedi
# b/276358430: fix Jupyter lsp freezing up the jupyter server
jupyter-lsp==1.5.1
# b/333854354: pin jupyter-server to version 2.12.5; later versions break LSP (b/333854354)
jupyter_server==2.12.5
jupyterlab
jupyterlab-lsp
kaggle-environments
kagglehub>=0.3.4
# Keras 3.6 broke test_keras.py > test_train > keras.datasets.mnist.load_data():
# See https://github.com/keras-team/keras/commit/dcefb139863505d166dd1325066f329b3033d45a
keras<3.6
keras-cv
keras-nlp
keras-tuner
kornia
langid
leven
# b/328788268: libpysal 4.10 seems to fail with "module 'shapely' has no attribute 'Geometry'. Did you mean: 'geometry'"
libpysal<=4.9.2
lime
line_profiler
mamba
mlcrate
mne
mpld3
nbdev
nilearn
olefile
onnx
openslide-bin
openslide-python
optuna
pandas-profiling
pandasql
papermill
path
path.py
pdf2image
plotly-express
preprocessing
pudb
pyLDAvis
pycryptodome
pydegensac
pydicom
pydub
pyemd
pyexcel-ods
pymc3
pymongo
pypdf
pytesseract
python-lsp-server
pytorch-ignite
pytorch-lightning
qgrid
qtconsole
ray
rgf-python
s3fs
scikit-learn-intelex
scikit-multilearn
scikit-optimize
scikit-plot
scikit-surprise
git+https://github.com/facebookresearch/segment-anything.git
shap
squarify
tensorflow-cloud
tensorflow-io
tensorflow-text
tensorflow_decision_forests
timm
torchinfo
torchmetrics
tsfresh
vtk
wandb
wavio
xgboost==2.0.3
Contributor
@calderjo, Dec 2, 2024

Maybe include the "b/350573866: xgboost v2.1.0 breaks learntools" comment from before.

Contributor Author

Agreed, I've been adding back pins and bug info in #1448

xvfbwrapper
ydata-profiling
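The review thread on `xgboost==2.0.3` asks for pins to carry the bug reference that explains them. One way to keep that convention from slipping is a small lint over the requirements file. This is a sketch only: the helper name `unexplained_pins` and the rule (a version-capping constraint must be directly preceded by a comment line) are assumptions, not something this PR adds.

```python
import re

# Constraints that cap or fix the version: ==, ~=, < and <=.
# Lower bounds such as >= are deliberately not treated as pins here.
CAPPING = re.compile(r"==|~=|<=?")


def unexplained_pins(requirements_text):
    """Return capped requirement lines not directly preceded by a comment."""
    flagged = []
    previous = ""
    for raw in requirements_text.splitlines():
        line = raw.strip()
        is_requirement = bool(line) and not line.startswith("#")
        if is_requirement and CAPPING.search(line):
            if not previous.startswith("#"):
                flagged.append(line)
        previous = line
    return flagged


# Lines taken from the kaggle_requirements.txt diff above.
sample = """\
# geopandas > v0.14.4 breaks learn tools
geopandas==v0.14.4
google-cloud-aiplatform
ipywidgets==8.1.5
kagglehub>=0.3.4
xgboost==2.0.3
"""
print(unexplained_pins(sample))
```

On this sample the check flags `ipywidgets==8.1.5` and `xgboost==2.0.3`, which are the two capped lines without an explanatory comment.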
8 changes: 5 additions & 3 deletions test
@@ -3,7 +3,7 @@ set -e

IMAGE_TAG='kaggle/python-build'
IMAGE_TAG_OVERRIDE=''
ADDITONAL_OPTS=''
ADDITONAL_OPTS='--runtime runc ' # Use the CPU runtime by default
Contributor Author

The Colab image fails when run with the nvidia runtime (which Jenkins uses by default) on a CPU VM.

PATTERN='test*.py'

usage() {
@@ -69,8 +69,6 @@ readonly ADDITONAL_OPTS
readonly PATTERN

set -x
docker run --rm --net=none -v /tmp/python-build:/tmp/python-build "$IMAGE_TAG" rm -rf /tmp/python-build/*
docker rm jupyter_test || true
mkdir -p /tmp/python-build/tmp
mkdir -p /tmp/python-build/devshm
mkdir -p /tmp/python-build/working
@@ -97,6 +95,9 @@ fi
# Note about `--hostname localhost` (b/158137436)
# hostname defaults to the container name which fails DNS name
# resolution with --net=none (required to keep tests hermetic). See details in bug.
#
# Note about CLOUDSDK_CONFIG=/tmp/.config/gcloud
# We use the /tmp dir since the filesystem is --read-only and we need writable space for gcloud configs.
docker run --rm -t --read-only --net=none \
-e HOME=/tmp -e KAGGLE_DATA_PROXY_TOKEN=test-key \
-e KAGGLE_USER_SECRETS_TOKEN_KEY=test-secrets-key \
@@ -105,6 +106,7 @@ docker run --rm -t --read-only --net=none \
-e KAGGLE_DATA_PROXY_PROJECT=test \
-e TF_FORCE_GPU_ALLOW_GROWTH=true \
-e XLA_PYTHON_CLIENT_PREALLOCATE=false \
-e CLOUDSDK_CONFIG=/tmp/.config/gcloud \
--hostname localhost \
--shm-size=2g \
-v $PWD:/input:ro -v /tmp/python-build/working:/working \
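The `--runtime runc` default in the `test` script above pins the container runtime explicitly instead of inheriting the daemon default. A rough sketch of the same idea, written in Python for illustration: `docker_run_args` and its `gpu` parameter are hypothetical, and the use of `--runtime nvidia` for the GPU case is an assumption about how the GPU path is selected.

```python
import shlex


def docker_run_args(image, gpu=False):
    """Build a `docker run` argument list with an explicit runtime."""
    # `runc` is Docker's standard (CPU) runtime; `nvidia` is the runtime
    # registered by the NVIDIA Container Toolkit.
    runtime = "nvidia" if gpu else "runc"
    return ["docker", "run", "--rm", "--runtime", runtime, image]


# CPU is the default, mirroring the new ADDITONAL_OPTS default in the script.
print(shlex.join(docker_run_args("kaggle/python-build")))
```

Choosing the runtime explicitly means a CPU-only test run behaves the same whether or not the host daemon is configured to default to the nvidia runtime.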
1 change: 1 addition & 0 deletions tests/test_cuml.py
@@ -6,6 +6,7 @@
class TestCuml(unittest.TestCase):
@gpu_test
@p100_exempt # b/342143152: cuML(>=24.4v) is inompatible with p100 GPUs.
@unittest.skip("b/381287748 cuML is not installed in Colab.")
def test_pca_fit_transform(self):
import unittest
import numpy as np
7 changes: 4 additions & 3 deletions tests/test_fastai.py
@@ -27,8 +27,9 @@ def test_tabular(self):
"/input/tests/data/train.csv",
cont_names=["pixel"+str(i) for i in range(784)],
y_names='label',
procs=[FillMissing, Categorify, Normalize])
procs=[FillMissing, Categorify, Normalize])
learn = tabular_learner(dls, layers=[200, 100])
learn.fit_one_cycle(n_epoch=1)
with learn.no_bar():
Contributor Author

fastprogress was trying to render HTML while running in the background, which caused issues, so I turned off progress bar rendering.

learn.fit_one_cycle(n_epoch=1)

self.assertGreater(learn.smooth_loss, 0)
self.assertGreater(learn.smooth_loss, 0)
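The `with learn.no_bar():` change in the test above relies on a context manager that switches progress rendering off for the duration of the block and restores it afterwards. The pattern can be sketched in isolation like this. The `Trainer` class is a hypothetical stand-in and is not how fastai implements `no_bar`.

```python
from contextlib import contextmanager


class Trainer:
    """Hypothetical stand-in for a learner; not fastai's implementation."""

    def __init__(self):
        self.show_progress = True
        self.rendered = []  # records what a progress bar would have drawn

    @contextmanager
    def no_bar(self):
        """Temporarily disable progress rendering, restoring it on exit."""
        previous = self.show_progress
        self.show_progress = False
        try:
            yield self
        finally:
            # Restore the earlier setting even if training raises.
            self.show_progress = previous

    def fit(self, epochs):
        for epoch in range(epochs):
            if self.show_progress:
                self.rendered.append(f"epoch {epoch}")


trainer = Trainer()
with trainer.no_bar():
    trainer.fit(epochs=1)
print(trainer.rendered, trainer.show_progress)
```

Scoping the change to a `with` block keeps the default behavior intact for any other code that uses the same object.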
2 changes: 2 additions & 0 deletions tests/test_lightgbm.py
@@ -34,7 +34,9 @@ def test_cpu(self):

self.assertEqual(1, gbm.best_iteration)

# TODO(b/381256047): Colab needs to install GPU-enabled lightgbm.
Contributor

Do you consider this a launch blocker?

Contributor Author

Given how popular lightgbm (on GPU) is, I'm planning to have it be a launch blocker.
I'm working with the Colab team to get it in there :)

Contributor Author

Re-added it in #1451.

Colab is also working on adding it to the base image, so we can drop it again once it's out in their image.

@gpu_test
@unittest.skip("Skipping this test until b/381256047 is resolved.")
def test_gpu(self):
lgb_train, lgb_eval = self.load_datasets()

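The `@unittest.skip(...)` decorators added to `tests/test_cuml.py` and `tests/test_lightgbm.py` disable a test while keeping the reason visible in test reports. A small self-contained sketch of that behavior follows. The test class and the bug ID in the reason string are placeholders, not taken from this PR.

```python
import unittest


class TestGpuFeature(unittest.TestCase):
    def test_cpu(self):
        self.assertEqual(2, 1 + 1)

    # Placeholder tracking ID; the reason string is what appears in reports.
    @unittest.skip("b/000000000: GPU build not in the base image yet.")
    def test_gpu(self):
        self.fail("never runs while the skip decorator is in place")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestGpuFeature)
result = unittest.TestResult()
suite.run(result)

# Each entry in result.skipped is a (test, reason) pair.
for test, reason in result.skipped:
    print(f"SKIPPED {test.id()}: {reason}")
```

Because skipped tests are recorded separately from failures, the run still counts as successful, and the reason string makes it easy to find the test again once the tracking bug is resolved.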