-
Notifications
You must be signed in to change notification settings - Fork 608
Use cached PyTorch wheels on MacOS jobs #9484
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9484
Note: Links to docs will display an error until the docs builds have been completed. ❌ 5 New Failures, 2 PendingAs of commit 34fed00 with merge base d16b867 ( NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Testing with GitHub MacOS runner https://github.com/pytorch/executorch/actions/runs/13988905667 |
.ci/scripts/utils.sh
Outdated
TORCH_RELEASE=$(cat version.txt) | ||
TORCH_SHORT_HASH=${TORCH_VERSION:0:7} | ||
TORCH_WHEEL_PATH="cached_artifacts/pytorch/executorch/pytorch_wheels/${SYSTEM_NAME}/${PYTHON_VERSION}" | ||
TORCH_WHEEL_NAME="torch-${TORCH_RELEASE}%2Bgit${TORCH_SHORT_HASH}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-${PLATFORM:-}.whl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm open to suggestions here on how to figure out the name of the required PyTorch wheel. When building from source, its name is like torch-2.7.0a0+git295f2ed-cp311-cp311-macosx_14_0_arm64.whl
. Maybe there is a way to get this from PyTorch setup.py
without actually building the wheel.
fi | ||
else | ||
echo "Use cached wheel at ${CACHE_TORCH_WHEEL}" | ||
fi | ||
|
||
# Grab the pinned audio and vision commits from PyTorch | ||
TORCHAUDIO_VERSION=$(cat .github/ci_commit_pins/audio.txt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also cache audio, vision, and other wheels, but the gain is probably smaller because it's fast to build them. This can come in subsequent PRs.
.ci/scripts/utils.sh
Outdated
# Cache PyTorch wheel is only needed on MacOS, Linux CI already has this as part | ||
# of the Docker image | ||
if [[ "${SYSTEM_NAME}" == "Darwin" ]]; then | ||
pip install "${CACHE_TORCH_WHEEL}" || TORCH_WHEEL_NOT_FOUND=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you log when no cache is found, and log the wheel name?
.ci/scripts/utils.sh
Outdated
# Only AWS runners have access to S3 | ||
if command -v aws && [[ -z "${GITHUB_RUNNER:-}" ]]; then | ||
for WHEEL_PATH in dist/*.whl; do | ||
WHEEL_NAME=$(basename "${WHEEL_PATH}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log the name of the wheel that's being uploaded
|
||
# Found no such wheel, we will build it from source then | ||
if [[ "${TORCH_WHEEL_NOT_FOUND:-0}" == "1" ]]; then | ||
USE_DISTRIBUTED=1 python setup.py bdist_wheel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log that we're building from source
.ci/scripts/utils.sh
Outdated
@@ -62,10 +62,38 @@ install_pytorch_and_domains() { | |||
git checkout "${TORCH_VERSION}" | |||
git submodule update --init --recursive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can move this command (cloning all submodules) when we haven't found the cache entry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this will reduce even further
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good catch, I only need the version.txt from PyTorch
Thank you for doing this! See inline comments
Orthogonally I did some work on reducing mac runner jobs, see context here https://fb.workplace.com/groups/pytorch.edge2.team/posts/1161746828414501 |
|
||
CACHE_TORCH_WHEEL="https://gha-artifacts.s3.us-east-1.amazonaws.com/${TORCH_WHEEL_PATH}/${TORCH_WHEEL_NAME}" | ||
# Cache PyTorch wheel is only needed on MacOS, Linux CI already has this as part | ||
# of the Docker image |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need to set default value for TORCH_WHEEL_NOT_FOUND (to handle non Darwin case)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, this function is currently used only on MacOS, but I remember reading that we can now build ExecuTorch on Windows too
.ci/scripts/utils.sh
Outdated
SYSTEM_NAME=$(uname) | ||
if [[ "${SYSTEM_NAME}" == "Darwin" ]]; then | ||
PLATFORM=$(python -c 'import sysconfig; import platform; v=platform.mac_ver()[0].split(".")[0]; platform=sysconfig.get_platform().split("-"); platform[1]=f"{v}_0"; print("_".join(platform))') | ||
fi | ||
PYTHON_VERSION=$(python -c 'import platform; v=platform.python_version_tuple(); print(f"{v[0]}{v[1]}")') | ||
TORCH_RELEASE=$(cat version.txt) | ||
TORCH_SHORT_HASH=${TORCH_VERSION:0:7} | ||
TORCH_WHEEL_PATH="cached_artifacts/pytorch/executorch/pytorch_wheels/${SYSTEM_NAME}/${PYTHON_VERSION}" | ||
TORCH_WHEEL_NAME="torch-${TORCH_RELEASE}%2Bgit${TORCH_SHORT_HASH}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-${PLATFORM:-}.whl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we have local variables instead of global env variables
local system_name, torch_release etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just FYI, after updating these variables with local
, I look around and find some feature request on shellcheck about this koalaman/shellcheck#468, but it hasn't been implemented yet (probably not anytime soon)
if [[ "${system_name}" == "Darwin" ]]; then | ||
pip install "${cached_torch_wheel}" || torch_wheel_not_found=1 | ||
else | ||
torch_wheel_not_found=1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this else statement and just set the local torch_wheel_not_found=1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I want it to default to 0 (found the wheel) because pip install "${cached_torch_wheel}" || torch_wheel_not_found=1
will set it to 1 when the pip command fails or not MacOS
One of the current drawback of using pinned PyTorch commit on CI is that we need to build PyTorch wheel on all MacOS jobs because it doesn't have Docker image. Building PyTorch wheel is usually not too bad because we have sccache in place to make the compilation faster. However, it's still slower than using a prebuilt wheel, and sccache is also not available on GitHub MacOS runner
macos-latest-xlarge
(no access to S3).As all MacOS jobs are building exactly the same PyTorch wheel, the proposal here is to cache the wheel on S3
gha-artifacts
bucket which is publicly readable, i.e. https://gha-artifacts.s3.us-east-1.amazonaws.com/cached_artifacts/pytorch/executorch/pytorch_wheels/Darwin/311/torch-2.7.0a0%2Bgit295f2ed-cp311-cp311-macosx_14_0_arm64.whl. The job can check for matching wheel from S3 and use it instead. If there is no such wheel, it will continue building PyTorch normally. Once a new wheel is built and if the runner has write access to S3, it will upload the wheel so that other jobs can pick it up going forward.Testing
All CI jobs pass (failures are pre-existing from trunk). Here are some quick number on how this helps reduce the durations of different MacOS jobs.
Overall, I'm seeing the duration for all MacOS jobs reducing by close to 2x. This is very useful to reduce the cost running MacOS jobs (remember the budget request to OSS team because of the $$$ GitHub MacOS runners)