Skip to content

Use cached PyTorch wheels on MacOS jobs #9484

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Mar 24, 2025
Merged

Conversation

huydhn
Copy link
Contributor

@huydhn huydhn commented Mar 21, 2025

One of the current drawback of using pinned PyTorch commit on CI is that we need to build PyTorch wheel on all MacOS jobs because it doesn't have Docker image. Building PyTorch wheel is usually not too bad because we have sccache in place to make the compilation faster. However, it's still slower than using a prebuilt wheel, and sccache is also not available on GitHub MacOS runner macos-latest-xlarge (no access to S3).

As all MacOS jobs are building exactly the same PyTorch wheel, the proposal here is to cache the wheel on S3 gha-artifacts bucket which is publicly readable, i.e. https://gha-artifacts.s3.us-east-1.amazonaws.com/cached_artifacts/pytorch/executorch/pytorch_wheels/Darwin/311/torch-2.7.0a0%2Bgit295f2ed-cp311-cp311-macosx_14_0_arm64.whl. The job can check for matching wheel from S3 and use it instead. If there is no such wheel, it will continue building PyTorch normally. Once a new wheel is built and if the runner has write access to S3, it will upload the wheel so that other jobs can pick it up going forward.

Testing

All CI jobs pass (failures are pre-existing from trunk). Here are some quick number on how this helps reduce the durations of different MacOS jobs.

  • Apple workflow:
  • Apple perf workflow:
  • All MacOS jobs in pull and trunk:
    • BEFORE ~417 on commit b195ed9 → AFTER ~268m

Overall, I'm seeing the duration for all MacOS jobs reducing by close to 2x. This is very useful to reduce the cost running MacOS jobs (remember the budget request to OSS team because of the $$$ GitHub MacOS runners)

@huydhn huydhn added the module: ci Issues related to continuous integration label Mar 21, 2025
Copy link

pytorch-bot bot commented Mar 21, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9484

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 2 Pending

As of commit 34fed00 with merge base d16b867 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 21, 2025
@huydhn
Copy link
Contributor Author

huydhn commented Mar 21, 2025

Testing with GitHub MacOS runner https://github.com/pytorch/executorch/actions/runs/13988905667

@huydhn huydhn temporarily deployed to upload-benchmark-results March 21, 2025 22:38 — with GitHub Actions Inactive
@huydhn huydhn temporarily deployed to upload-benchmark-results March 22, 2025 02:11 — with GitHub Actions Inactive
@huydhn huydhn requested review from shoumikhin and guangy10 March 22, 2025 03:02
@huydhn huydhn marked this pull request as ready for review March 22, 2025 03:03
TORCH_RELEASE=$(cat version.txt)
TORCH_SHORT_HASH=${TORCH_VERSION:0:7}
TORCH_WHEEL_PATH="cached_artifacts/pytorch/executorch/pytorch_wheels/${SYSTEM_NAME}/${PYTHON_VERSION}"
TORCH_WHEEL_NAME="torch-${TORCH_RELEASE}%2Bgit${TORCH_SHORT_HASH}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-${PLATFORM:-}.whl"
Copy link
Contributor Author

@huydhn huydhn Mar 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to suggestions here on how to figure out the name of the required PyTorch wheel. When building from source, its name is like torch-2.7.0a0+git295f2ed-cp311-cp311-macosx_14_0_arm64.whl. Maybe there is a way to get this from PyTorch setup.py without actually building the wheel.

fi
else
echo "Use cached wheel at ${CACHE_TORCH_WHEEL}"
fi

# Grab the pinned audio and vision commits from PyTorch
TORCHAUDIO_VERSION=$(cat .github/ci_commit_pins/audio.txt)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also cache audio, vision, and other wheels, but the gain is probably smaller because it's fast to build them. This can come in subsequent PRs.

@huydhn huydhn requested a review from mergennachin March 22, 2025 03:12
# Cache PyTorch wheel is only needed on MacOS, Linux CI already has this as part
# of the Docker image
if [[ "${SYSTEM_NAME}" == "Darwin" ]]; then
pip install "${CACHE_TORCH_WHEEL}" || TORCH_WHEEL_NOT_FOUND=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you log when no cache is found, and log the wheel name?

# Only AWS runners have access to S3
if command -v aws && [[ -z "${GITHUB_RUNNER:-}" ]]; then
for WHEEL_PATH in dist/*.whl; do
WHEEL_NAME=$(basename "${WHEEL_PATH}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log the name of the wheel that's being uploaded


# Found no such wheel, we will build it from source then
if [[ "${TORCH_WHEEL_NOT_FOUND:-0}" == "1" ]]; then
USE_DISTRIBUTED=1 python setup.py bdist_wheel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log that we're building from source

@@ -62,10 +62,38 @@ install_pytorch_and_domains() {
git checkout "${TORCH_VERSION}"
git submodule update --init --recursive
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can move this command (cloning all submodules) when we haven't found the cache entry?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this will reduce even further

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good catch, I only need the version.txt from PyTorch

@mergennachin
Copy link
Contributor

Thank you for doing this!

See inline comments

This is very useful to reduce the cost running MacOS jobs (remember the budget request to OSS team because of the $$$ GitHub MacOS runners)

Orthogonally I did some work on reducing mac runner jobs, see context here https://fb.workplace.com/groups/pytorch.edge2.team/posts/1161746828414501


CACHE_TORCH_WHEEL="https://gha-artifacts.s3.us-east-1.amazonaws.com/${TORCH_WHEEL_PATH}/${TORCH_WHEEL_NAME}"
# Cache PyTorch wheel is only needed on MacOS, Linux CI already has this as part
# of the Docker image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you need to set default value for TORCH_WHEEL_NOT_FOUND (to handle non Darwin case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, this function is currently used only on MacOS, but I remember reading that we can now build ExecuTorch on Windows too

Comment on lines 65 to 73
SYSTEM_NAME=$(uname)
if [[ "${SYSTEM_NAME}" == "Darwin" ]]; then
PLATFORM=$(python -c 'import sysconfig; import platform; v=platform.mac_ver()[0].split(".")[0]; platform=sysconfig.get_platform().split("-"); platform[1]=f"{v}_0"; print("_".join(platform))')
fi
PYTHON_VERSION=$(python -c 'import platform; v=platform.python_version_tuple(); print(f"{v[0]}{v[1]}")')
TORCH_RELEASE=$(cat version.txt)
TORCH_SHORT_HASH=${TORCH_VERSION:0:7}
TORCH_WHEEL_PATH="cached_artifacts/pytorch/executorch/pytorch_wheels/${SYSTEM_NAME}/${PYTHON_VERSION}"
TORCH_WHEEL_NAME="torch-${TORCH_RELEASE}%2Bgit${TORCH_SHORT_HASH}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-${PLATFORM:-}.whl"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we have local variables instead of global env variables

local system_name, torch_release etc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI, after updating these variables with local, I look around and find some feature request on shellcheck about this koalaman/shellcheck#468, but it hasn't been implemented yet (probably not anytime soon)

if [[ "${system_name}" == "Darwin" ]]; then
pip install "${cached_torch_wheel}" || torch_wheel_not_found=1
else
torch_wheel_not_found=1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this else statement and just set the local torch_wheel_not_found=1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I want it to default to 0 (found the wheel) because pip install "${cached_torch_wheel}" || torch_wheel_not_found=1 will set it to 1 when the pip command fails or not MacOS

@huydhn huydhn merged commit 5c5b84e into main Mar 24, 2025
165 of 171 checks passed
@huydhn huydhn deleted the cache-macos-pytorch-build-artifact branch March 24, 2025 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: ci Issues related to continuous integration topic: not user facing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants