Use cached PyTorch wheels on MacOS jobs #9484

huydhn · 2025-03-21T04:33:32Z

One of the current drawbacks of using a pinned PyTorch commit on CI is that we need to build the PyTorch wheel on all MacOS jobs because they don't have a Docker image. Building the PyTorch wheel is usually not too bad because we have sccache in place to make the compilation faster. However, it's still slower than using a prebuilt wheel, and sccache is also not available on the GitHub MacOS runner macos-latest-xlarge (no access to S3).

As all MacOS jobs build exactly the same PyTorch wheel, the proposal here is to cache the wheel in the S3 gha-artifacts bucket, which is publicly readable, e.g. https://gha-artifacts.s3.us-east-1.amazonaws.com/cached_artifacts/pytorch/executorch/pytorch_wheels/Darwin/311/torch-2.7.0a0%2Bgit295f2ed-cp311-cp311-macosx_14_0_arm64.whl. The job can check S3 for a matching wheel and use it instead. If there is no such wheel, it will continue building PyTorch normally. Once a new wheel is built, and if the runner has write access to S3, it will upload the wheel so that other jobs can pick it up going forward.
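The flow described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name and its four arguments are hypothetical, and the install, build, and upload steps are passed in as commands so the control flow can be exercised without touching S3.

```shell
# Sketch of the cache-or-build flow. The command arguments are hypothetical
# stand-ins for the real install, build, and upload steps.
install_cached_or_build() {
  local install_cmd="$1"   # e.g. pip install of the cached wheel URL
  local build_cmd="$2"     # e.g. building the wheel from source
  local upload_cmd="$3"    # e.g. copying the new wheel to the cache
  local can_upload="$4"    # "1" if this runner may write to the cache

  # Try the cached wheel first; a successful install means we are done.
  if ${install_cmd}; then
    echo "cache hit: installed prebuilt wheel"
    return 0
  fi

  # No matching wheel, so fall back to the normal source build.
  echo "cache miss: building from source"
  ${build_cmd} || return 1

  # Only runners with write access publish the wheel for later jobs.
  if [ "${can_upload}" = "1" ]; then
    if ${upload_cmd}; then
      echo "uploaded new wheel to the cache"
    fi
  fi
  return 0
}
```

For example, `install_cached_or_build false true true 1` simulates a cache miss on a runner with write access, using `true` and `false` as placeholder commands.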

Testing

All CI jobs pass (failures are pre-existing from trunk). Here are some quick numbers on how this helps reduce the duration of different MacOS jobs.

Apple workflow:
- build-benchmark-app: BEFORE ~80m → AFTER ~44m
- build-frameworks-ios: BEFORE ~80m → AFTER ~44m
- build-demo-ios: BEFORE ~55m → AFTER ~23m
Apple perf workflow:
- build-benchmark-app: BEFORE ~80m → AFTER ~48m
- export model (llama): BEFORE ~30m → AFTER ~13m
All MacOS jobs in pull and trunk:
- BEFORE ~417m on commit b195ed9 → AFTER ~268m

Overall, I'm seeing the duration of individual MacOS jobs drop by close to 2x, and the total across pull and trunk drop by about 36% (~417m to ~268m). This is very useful for reducing the cost of running MacOS jobs (remember the budget request to the OSS team because of the $$$ GitHub MacOS runners).

pytorch-bot · 2025-03-21T04:33:36Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9484

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures, 2 Pending

As of commit 34fed00 with merge base d16b867:

NEW FAILURES - The following jobs have failed:

Lint / android-java-format / linux-job (gh)
RuntimeError: Command docker exec -t 520248573b0f8c4744e1fd4bc9b3a424a10ca7b6a0e41ba280c80d4b99e50b7f /exec failed with exit code 1
pull / android / build-llm-demo / linux-job (gh)
RuntimeError: Command docker exec -t ff027f9f6720d2299b409ff7c7a645d56e18493e5852cde0926bbb49586aaebd /exec failed with exit code 127
pull / test-moshi-linux / linux-job (gh)
RuntimeError: Command docker exec -t f2292a01cf969be895957a240ca53ae784916c6451279f67f0ac5107ade81a9e /exec failed with exit code 127
pull / unittest / macos / macos-job (gh)
backends/xnnpack/test/passes/test_convert_to_linear.py::TestConvertToLinear::test_fp32_convert_to_linear
pull / unittest-arm / linux-job (gh)
RuntimeError: Command docker exec -t 91c14efa988a76397964a6f7320dce65fe7ee6c7b92856e84161275874221411 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

huydhn · 2025-03-21T09:43:55Z

Testing with GitHub MacOS runner https://github.com/pytorch/executorch/actions/runs/13988905667

huydhn · 2025-03-22T03:09:30Z

.ci/scripts/utils.sh

+  TORCH_RELEASE=$(cat version.txt)
+  TORCH_SHORT_HASH=${TORCH_VERSION:0:7}
+  TORCH_WHEEL_PATH="cached_artifacts/pytorch/executorch/pytorch_wheels/${SYSTEM_NAME}/${PYTHON_VERSION}"
+  TORCH_WHEEL_NAME="torch-${TORCH_RELEASE}%2Bgit${TORCH_SHORT_HASH}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-${PLATFORM:-}.whl"


I'm open to suggestions here on how to figure out the name of the required PyTorch wheel. When building from source, its name is like torch-2.7.0a0+git295f2ed-cp311-cp311-macosx_14_0_arm64.whl. Maybe there is a way to get this from PyTorch setup.py without actually building the wheel.
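For reference, the name used in the diff above is assembled from four pieces. A minimal sketch of that assembly follows, assuming the caller already knows each piece; the function name `torch_wheel_name` is hypothetical, and the commit hash in the example call is the known 7-character prefix padded with zeros purely for illustration.

```shell
# Sketch: assemble the cached wheel file name from its parts.
# All four values are supplied by the caller; nothing is read from disk.
torch_wheel_name() {
  local release="$1"     # contents of PyTorch's version.txt, e.g. 2.7.0a0
  local commit="$2"      # pinned PyTorch commit hash
  local py="$3"          # Python version digits, e.g. 311
  local platform="$4"    # platform tag, e.g. macosx_14_0_arm64
  local short_hash

  # Keep the first 7 characters, like ${TORCH_VERSION:0:7} in the diff.
  short_hash=$(printf '%.7s' "${commit}")

  # "+" is written as %2B because the name is used inside a URL.
  echo "torch-${release}%2Bgit${short_hash}-cp${py}-cp${py}-${platform}.whl"
}

# Illustrative call; the zero padding is not a real commit hash.
torch_wheel_name "2.7.0a0" "295f2ed000000" "311" "macosx_14_0_arm64"
```

The sketch only mirrors the string formatting; it does not answer the open question of deriving the name from setup.py.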

huydhn · 2025-03-22T03:11:50Z

.ci/scripts/utils.sh

+    fi
+  else
+    echo "Use cached wheel at ${CACHE_TORCH_WHEEL}"
+  fi

  # Grab the pinned audio and vision commits from PyTorch
  TORCHAUDIO_VERSION=$(cat .github/ci_commit_pins/audio.txt)


We can also cache audio, vision, and other wheels, but the gain is probably smaller because it's fast to build them. This can come in subsequent PRs.

mergennachin · 2025-03-24T15:03:27Z

.ci/scripts/utils.sh

+  # Cache PyTorch wheel is only needed on MacOS, Linux CI already has this as part
+  # of the Docker image
+  if [[ "${SYSTEM_NAME}" == "Darwin" ]]; then
+    pip install "${CACHE_TORCH_WHEEL}" || TORCH_WHEEL_NOT_FOUND=1


can you log when no cache is found, and log the wheel name?

mergennachin · 2025-03-24T15:05:46Z

.ci/scripts/utils.sh

+    # Only AWS runners have access to S3
+    if command -v aws && [[ -z "${GITHUB_RUNNER:-}" ]]; then
+      for WHEEL_PATH in dist/*.whl; do
+        WHEEL_NAME=$(basename "${WHEEL_PATH}")


log the name of the wheel that's being uploaded

mergennachin · 2025-03-24T15:06:04Z

.ci/scripts/utils.sh

+
+  # Found no such wheel, we will build it from source then
+  if [[ "${TORCH_WHEEL_NOT_FOUND:-0}" == "1" ]]; then
+    USE_DISTRIBUTED=1 python setup.py bdist_wheel


log that we're building from source

mergennachin · 2025-03-24T15:12:53Z

.ci/scripts/utils.sh

@@ -62,10 +62,38 @@ install_pytorch_and_domains() {
  git checkout "${TORCH_VERSION}"
  git submodule update --init --recursive


Can we move this command (cloning all submodules) so that it only runs when we haven't found the cache entry?

I think this will reduce the time even further.

Yeah, good catch, I only need the version.txt from PyTorch

mergennachin · 2025-03-24T15:13:38Z

Thank you for doing this!

See inline comments

This is very useful to reduce the cost running MacOS jobs (remember the budget request to OSS team because of the $$$ GitHub MacOS runners)

Orthogonally I did some work on reducing mac runner jobs, see context here https://fb.workplace.com/groups/pytorch.edge2.team/posts/1161746828414501

mergennachin · 2025-03-24T15:21:44Z

.ci/scripts/utils.sh

+
+  CACHE_TORCH_WHEEL="https://gha-artifacts.s3.us-east-1.amazonaws.com/${TORCH_WHEEL_PATH}/${TORCH_WHEEL_NAME}"
+  # Cache PyTorch wheel is only needed on MacOS, Linux CI already has this as part
+  # of the Docker image


Don't you need to set a default value for TORCH_WHEEL_NOT_FOUND (to handle the non-Darwin case)?

True, this function is currently used only on MacOS, but I remember reading that we can now build ExecuTorch on Windows too

mergennachin · 2025-03-24T15:23:21Z

.ci/scripts/utils.sh

+  SYSTEM_NAME=$(uname)
+  if [[ "${SYSTEM_NAME}" == "Darwin" ]]; then
+    PLATFORM=$(python -c 'import sysconfig; import platform; v=platform.mac_ver()[0].split(".")[0]; platform=sysconfig.get_platform().split("-"); platform[1]=f"{v}_0"; print("_".join(platform))')
+  fi
+  PYTHON_VERSION=$(python -c 'import platform; v=platform.python_version_tuple(); print(f"{v[0]}{v[1]}")')
+  TORCH_RELEASE=$(cat version.txt)
+  TORCH_SHORT_HASH=${TORCH_VERSION:0:7}
+  TORCH_WHEEL_PATH="cached_artifacts/pytorch/executorch/pytorch_wheels/${SYSTEM_NAME}/${PYTHON_VERSION}"
+  TORCH_WHEEL_NAME="torch-${TORCH_RELEASE}%2Bgit${TORCH_SHORT_HASH}-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-${PLATFORM:-}.whl"


Should we use local variables instead of global env variables?

local system_name, torch_release, etc.?

Just FYI, after updating these variables to use local, I looked around and found a feature request on shellcheck about this (koalaman/shellcheck#468), but it hasn't been implemented yet (and probably won't be anytime soon).
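To illustrate why the switch to local matters in a shared utils script, here is a small sketch with made-up function names (`set_global` and `set_local` are not part of the PR). A plain assignment inside a function stays visible to the caller, while a `local` one does not.

```shell
# Without `local`, the assignment leaks into the caller's shell.
set_global() {
  system_name="Darwin"
}

# With `local`, the variable disappears when the function returns.
set_local() {
  local torch_release="2.7.0a0"
  echo "inside: ${torch_release}"
}

set_global
set_local
echo "system_name after call: ${system_name:-<unset>}"
echo "torch_release after call: ${torch_release:-<unset>}"
```

After both calls, `system_name` is still set in the calling shell while `torch_release` is not, which is the behavior the review comment is asking for.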

mergennachin · 2025-03-24T20:03:16Z

.ci/scripts/utils.sh

+  if [[ "${system_name}" == "Darwin" ]]; then
+    pip install "${cached_torch_wheel}" || torch_wheel_not_found=1
+  else
+    torch_wheel_not_found=1


Remove this else statement and just set the local torch_wheel_not_found=1?

Oh, I want it to default to 0 (found the wheel). It is then set to 1 when the pip command fails (via pip install "${cached_torch_wheel}" || torch_wheel_not_found=1) or when the job is not running on MacOS.
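The behavior discussed in this thread can be sketched as below. This is an illustration under assumptions rather than the merged code: the function name is hypothetical, and the pip call is replaced by a command passed as an argument so each case can be tried without a network.

```shell
# Sketch: the flag defaults to 0 and only flips to 1 on a failed install
# or on a non-Darwin system.
resolve_not_found_flag() {
  local system_name="$1"
  local install_cmd="$2"   # hypothetical stand-in for the pip install step
  local torch_wheel_not_found=0

  if [ "${system_name}" = "Darwin" ]; then
    # A failing install flips the flag; a successful one leaves it at 0.
    ${install_cmd} || torch_wheel_not_found=1
  else
    # Other systems never try the cache in this sketch.
    torch_wheel_not_found=1
  fi
  echo "${torch_wheel_not_found}"
}
```

Calling it with `Darwin true`, `Darwin false`, and `Linux true` covers the three cases: cached wheel installed, install failed, and not on MacOS.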

Upload PyTorch MacOS wheel to S3

eaed605

huydhn added the module: ci label Mar 21, 2025

facebook-github-bot added the CLA Signed label Mar 21, 2025

No ACL

3071d26

huydhn added the topic: not user facing label Mar 21, 2025

huydhn added 4 commits March 21, 2025 01:18

Implement reading from the cache

947c39c

Fix typo

9cbe26a

Another attempt

1576f32

Fix a bug

fbd4949

huydhn added 2 commits March 21, 2025 14:04

Merge branch 'main' into cache-macos-pytorch-build-artifact

b47eb88

Fix the platform version

bc6baf3

huydhn temporarily deployed to upload-benchmark-results March 21, 2025 22:38 — with GitHub Actions Inactive

huydhn temporarily deployed to upload-benchmark-results March 22, 2025 02:11 — with GitHub Actions Inactive

huydhn requested review from shoumikhin and guangy10 March 22, 2025 03:02

huydhn marked this pull request as ready for review March 22, 2025 03:03

huydhn requested a review from mergennachin March 22, 2025 03:12

shoumikhin approved these changes Mar 22, 2025

mergennachin approved these changes Mar 24, 2025

mergennachin reviewed Mar 24, 2025

huydhn added 2 commits March 24, 2025 12:06

Address review comments

3b1938f

Another tweak

34fed00

mergennachin approved these changes Mar 24, 2025

huydhn merged commit 5c5b84e into main Mar 24, 2025
165 of 171 checks passed

huydhn deleted the cache-macos-pytorch-build-artifact branch March 24, 2025 20:26

huydhn mentioned this pull request Apr 23, 2025

fbgemm packages are compiled in torchinductor torchbench tests pytorch/pytorch#152024

Open
