Skip to content

Attempt to fix libc++ actions runner restarter. #120627

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 21, 2025
Merged

Conversation

EricWF
Copy link
Member

@EricWF EricWF commented Dec 19, 2024

It appears that introducing docker containers has broken the restarter
job since additional failure messages appear with the preemption
messages.

This should get jobs restarting on preemption again, but may do so
for jobs that also contain unrelated failures

It appears that introducing docker containers has broken the restarter
job since additional failure messages appear with the preemption
messages.

This should get jobs restarting on preemption again, but may do so
for jobs that also contain unrelated failures
@llvmbot llvmbot added libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. github:workflow labels Dec 19, 2024
@llvmbot
Copy link
Member

llvmbot commented Dec 19, 2024

@llvm/pr-subscribers-github-workflow

@llvm/pr-subscribers-libcxx

Author: Eric (EricWF)

Changes

It appears that introducing docker containers has broken the restarter
job since additional failure messages appear with the preemption
messages.

This should get jobs restarting on preemption again, but may do so
for jobs that also contain unrelated failures


Full diff: https://github.com/llvm/llvm-project/pull/120627.diff

1 Files Affected:

  • (modified) .github/workflows/libcxx-restart-preempted-jobs.yaml (+30-7)
diff --git a/.github/workflows/libcxx-restart-preempted-jobs.yaml b/.github/workflows/libcxx-restart-preempted-jobs.yaml
index 82d84c01c92af2..b27debd0e6fe71 100644
--- a/.github/workflows/libcxx-restart-preempted-jobs.yaml
+++ b/.github/workflows/libcxx-restart-preempted-jobs.yaml
@@ -92,6 +92,12 @@ jobs:
                 check_run_id: check_run_id
               })
 
+              // For temporary debugging purposes to see the structure of the annotations.
+              console.print(annotations);
+
+              has_failed_job = false;
+              saved_failure_message = null;
+
               for (annotation of annotations.data) {
                 if (annotation.annotation_level != 'failure') {
                   continue;
@@ -106,15 +112,32 @@ jobs:
 
                 const failure_match = annotation.message.match(failure_regex);
                 if (failure_match != null) {
-                  // We only want to restart the workflow if all of the failures were due to preemption.
-                  // We don't want to restart the workflow if there were other failures.
-                  core.notice('Choosing not to rerun workflow because we found a non-preemption failure' +
-                    'Failure message: "' + annotation.message + '"');
-                  await create_check_run('skipped', 'Choosing not to rerun workflow because we found a non-preemption failure\n'
-                    + 'Failure message: ' + annotation.message)
-                  return;
+                  has_failed_job = true;
+                  saved_failure_message = annotation.message;
                 }
               }
+              if (has_failed_job and not has_preempted_job) {
+                // We only want to restart the workflow if all of the failures were due to preemption.
+                // We don't want to restart the workflow if there were other failures.
+                //
+                // However, libcxx runners running inside docker containers produce both a preemption message and failure message.
+                //
+                // The desired approach is to ignore failure messages which appear on the same job as a preemption message
+                // (An job is a single run with a specific configuration, ex generic-gcc, gcc-14).
+                //
+                // However, it's unclear that this code achieves the desired approach, and it may ignore all failures
+                // if a preemption message is found at all on any run.
+                //
+                // For now, it's more important to restart preempted workflows than to avoid restarting workflows with
+                // non-preemption failures.
+                //
+                // TODO Figure this out.
+                core.notice('Choosing not to rerun workflow because we found a non-preemption failure' +
+                  'Failure message: "' + saved_failure_message + '"');
+                await create_check_run('skipped', 'Choosing not to rerun workflow because we found a non-preemption failure\n'
+                    + 'Failure message: ' + saved_failure_message)
+                return;
+              }
             }
 
             if (!has_preempted_job) {

@EricWF EricWF requested a review from ldionne December 19, 2024 19:42
@EricWF
Copy link
Member Author

EricWF commented Dec 19, 2024

Note: Due to the difficulty of testing, this hasn't been meaningfully tested.

Copy link
Member

@ldionne ldionne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fact that we can't test that without pushing to main is horrible, but I don't see how we can best make progress on this issue otherwise.

If you push this, please carefully monitor the CI state for a bit cause this can easily cause unintended instability and we'd rather have something broken-but-in-a-state-we-understand over the holidays than something broken-in-new-fun-ways.

@EricWF EricWF merged commit 59850c2 into llvm:main Jan 21, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
github:workflow libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants