Skip to content

Fix libflatccrt race #10918

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 16, 2025
Merged

Fix libflatccrt race #10918

merged 1 commit into from
May 16, 2025

Conversation

jathu
Copy link
Contributor

@jathu jathu commented May 15, 2025

Summary

Seems like there is a race in building libflatccrt.a. This issue has existed for a while: #7300. It was temporarily mitigated in #7570 by just reducing the parallelism.

In this diff I attempt to fix it. This is just my assumption of what is wrong. Given flatccrt builds a debug version with a _d suffix, if the target isn't depended on (i.e. some target don't use the conditional target name) then the order of how the lib is built causes a race. So for now, always use the non-debug version.

Given it's a race, I was never able to repro the issue locally — I can't guarantee this is the problem. However, it seems my recent changes in #10855 has increased the frequency of the problem in CI.

Test plan

CI

cc @larryliu0820

Copy link

pytorch-bot bot commented May 15, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/10918

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Pending, 1 Unrelated Failure

As of commit 6a12fa8 with merge base d0464f8 (image):

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 15, 2025
@jathu jathu added ciflow/trunk ciflow/binaries and removed CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. labels May 15, 2025
@jathu jathu force-pushed the jathu/fix-flatccrt branch from bf0af5e to e56f8d2 Compare May 15, 2025 19:13
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 15, 2025
@jathu jathu force-pushed the jathu/fix-flatccrt branch from e56f8d2 to 3d78257 Compare May 15, 2025 19:28
@jathu jathu changed the title [WIP] Fix flatccrt race Fix flatccrt race May 15, 2025
@jathu jathu added module: build/install Issues related to the cmake and buck2 builds, and to installing ExecuTorch release notes: build Changes related to build, including dependency upgrades, build flags, optimizations, etc. labels May 15, 2025
@jathu jathu marked this pull request as ready for review May 15, 2025 19:32
@jathu jathu force-pushed the jathu/fix-flatccrt branch from 3d78257 to 37721ae Compare May 15, 2025 23:39
@jathu jathu changed the title Fix flatccrt race Fix libflatccrt.a location May 15, 2025
@jathu jathu force-pushed the jathu/fix-flatccrt branch 4 times, most recently from 26e9d75 to fb6ae9e Compare May 16, 2025 01:24
@jathu jathu changed the title Fix libflatccrt.a location [WIP] Fix libflatccrt race May 16, 2025
@jathu jathu marked this pull request as draft May 16, 2025 01:45
@jathu jathu force-pushed the jathu/fix-flatccrt branch 7 times, most recently from 8899a8e to 4b3b4ab Compare May 16, 2025 04:56
@jathu jathu force-pushed the jathu/fix-flatccrt branch 10 times, most recently from ac72cc1 to 4ccdcc6 Compare May 16, 2025 18:11
@jathu jathu force-pushed the jathu/fix-flatccrt branch from 4ccdcc6 to 6a12fa8 Compare May 16, 2025 18:24
@jathu jathu requested a review from kirklandsign May 16, 2025 18:55
@jathu jathu changed the title [WIP] Fix libflatccrt race Fix libflatccrt race May 16, 2025
@jathu jathu marked this pull request as ready for review May 16, 2025 19:08
Comment on lines -23 to -24
bash backends/qualcomm/scripts/build.sh --skip_aarch64 --job_number 2 --release || \
bash backends/qualcomm/scripts/build.sh --skip_aarch64 --job_number 2 --release --no_clean
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see the comment right above — this was the way the race was mitigated before in #7570

Comment on lines -100 to -105
if(CMAKE_BUILD_TYPE MATCHES "Debug")
set(FLATCC_LIB flatccrt_d)
else()
set(FLATCC_LIB flatccrt)
endif()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you confirmed that debug build will also build flatccrt?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it seems to have both libs

@jathu jathu merged commit d18a52d into main May 16, 2025
288 of 291 checks passed
@jathu jathu deleted the jathu/fix-flatccrt branch May 16, 2025 19:51
hinriksnaer pushed a commit to hinriksnaer/executorch that referenced this pull request May 19, 2025
### Summary

Seems like there is a race in building `libflatccrt.a`. This issue has
existed for a while: pytorch#7300.
It was temporarily mitigated in
pytorch#7570 by just reducing the
parallelism.

In this diff I attempt to fix it. This is just my assumption of what is
wrong. Given flatccrt builds a debug version with a `_d` suffix, if the
target isn't depended on (i.e. some target don't use the conditional
target name) then the order of how the lib is built causes a race. So
for now, always use the non-debug version.

Given it's a race, I was never able to repro the issue locally — I can't
guarantee this is the problem. However, it seems my recent changes in
pytorch#10855 has increased the
frequency of the problem in CI.

### Test plan

CI

cc @larryliu0820
@cccclai
Copy link
Contributor

cccclai commented Jun 2, 2025

@jathu, looks like this causing issue to build qnn backend in debug mode. Looks like the script to build qnn sdk only covers release mode though

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ciflow/binaries ciflow/trunk CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. module: build/install Issues related to the cmake and buck2 builds, and to installing ExecuTorch release notes: build Changes related to build, including dependency upgrades, build flags, optimizations, etc.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants