Upstream libc++ buildbot restarter. #93582

Merged (2 commits) on May 28, 2024

EricWF (Member) commented on May 28, 2024

I've been running a cron job on my local machine to restart preempted
libc++ CI runs. This is bad and brittle. This PR upstreams a much better
version of the restarter.

It works by matching on check run annotations, looking for a mention
of the machine being shut down.

If there are both preempted jobs and failing jobs, we don't restart
the workflow. Maybe we should change that?
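
The restart rule described above can be sketched as a small pure function. This is only an illustration under stated assumptions, not the workflow's actual code: the name `shouldRestart` and the idea of passing in one flat list of annotations are hypothetical, while the two message patterns are copied from the workflow script in this PR.

```javascript
// Hypothetical helper: decide whether a workflow run should be restarted,
// given the annotations collected from its failed or cancelled check runs.
// The two patterns are the ones the workflow script matches on.
const PREEMPTION = /The runner has received a shutdown signal/;
const FAILURE = /Process completed with exit code 1./;

function shouldRestart(annotations) {
  let sawPreemption = false;
  for (const annotation of annotations) {
    // Only failure-level annotations are considered.
    if (annotation.annotation_level !== 'failure') {
      continue;
    }
    if (PREEMPTION.test(annotation.message)) {
      sawPreemption = true;
    }
    // Any non-preemption failure vetoes the restart.
    if (FAILURE.test(annotation.message)) {
      return false;
    }
  }
  // Restart only if at least one job was preempted.
  return sawPreemption;
}
```

Keeping the decision separate from the API calls would make it easy to change the policy for runs that contain both preempted and failing jobs.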

@EricWF EricWF requested a review from ldionne May 28, 2024 17:14
llvmbot (Member) commented on May 28, 2024

@llvm/pr-subscribers-github-workflow

Author: Eric (EricWF)

Full diff: https://github.com/llvm/llvm-project/pull/93582.diff

1 Files Affected:

  • (added) .github/workflows/restart-preempted-libcxx-jobs.yaml (+108)
diff --git a/.github/workflows/restart-preempted-libcxx-jobs.yaml b/.github/workflows/restart-preempted-libcxx-jobs.yaml
new file mode 100644
index 0000000000000..3da17b9f85544
--- /dev/null
+++ b/.github/workflows/restart-preempted-libcxx-jobs.yaml
@@ -0,0 +1,108 @@
+name: Restart Preempted Libc++ Workflow
+
+# The libc++ builders run on preemptable VMs, which can be shut down at any time.
+# This workflow identifies when a workflow run was canceled due to the VM being preempted,
+# and restarts the workflow run.
+
+# We identify a preempted workflow run by checking the annotations of the check runs in the check suite,
+# which should contain the message "The runner has received a shutdown signal."
+
+# Note: If a workflow run has both preempted jobs and jobs with non-preemption failures, we do not restart it.
+
+on:
+  workflow_run:
+    workflows:
+      - 'Build and Test libc\+\+'
+    types:
+      # workflow_run only supports the requested, in_progress and completed activity types.
+      - completed
+
+permissions:
+  contents: read
+
+jobs:
+  restart:
+    name: "Restart Job"
+    permissions:
+      statuses: read
+      checks: read
+      actions: write
+    runs-on: ubuntu-latest
+    steps:
+      - name: "Restart Job"
+        uses: actions/github-script@60a0d83039c74a4aee543508d2ffcb1c3799cdea #v7.0.1
+        with:
+          script: |
+            const failure_regex = /Process completed with exit code 1./
+            const preemption_regex = /The runner has received a shutdown signal/
+            
+            console.log('Listing check runs for suite')
+            const check_suites = await github.rest.checks.listForSuite({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              check_suite_id: context.payload.workflow_run.check_suite_id
+            })
+
+            const check_run_ids = [];
+            for (const check_run of check_suites.data.check_runs) {
+              console.log('Checking check run: ' + check_run.id);
+              console.log(check_run);
+              if (check_run.status != 'completed') {
+                console.log('Check run was not completed. Skipping.');
+                continue;
+              }
+              if (check_run.conclusion != 'failure' && check_run.conclusion != 'cancelled') {
+                console.log('Check run had conclusion: ' + check_run.conclusion + '. Skipping.');
+                continue;
+              }
+              check_run_ids.push(check_run.id);
+            }
+            
+            let has_preempted_job = false;
+
+            for (const check_run_id of check_run_ids) {
+              console.log('Listing annotations for check run: ' + check_run_id);
+
+              const annotations = await github.rest.checks.listAnnotations({
+                owner: context.repo.owner,
+                repo: context.repo.repo,
+                check_run_id: check_run_id
+              })
+              
+              console.log(annotations);
+                for (const annotation of annotations.data) {
+                if (annotation.annotation_level != 'failure') {
+                  continue;
+                }
+                
+                const preemption_match = annotation.message.match(preemption_regex);
+              
+                if (preemption_match != null) {
+                  console.log('Found preemption message: ' + annotation.message);
+                  has_preempted_job = true;
+                }
+                
+                const failure_match = annotation.message.match(failure_regex);
+                if (failure_match != null) {
+                  // We only want to restart the workflow if all of the failures were due to preemption.
+                  // We don't want to restart the workflow if there were other failures.
+                  console.log('Choosing not to rerun workflow because we found a non-preemption failure');
+                  console.log('Failure message: ' + annotation.message);
+                  return;
+                }
+              }
+            } 
+             
+            if (!has_preempted_job) {
+              console.log('No preempted jobs found. Not restarting workflow.');
+              return;
+            }
+            
+            console.log('Restarting workflow: ' + context.payload.workflow_run.id);
+            await github.rest.actions.reRunWorkflowFailedJobs({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              run_id: context.payload.workflow_run.id
+            })
+            
+        

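One caveat with the script in the diff: GitHub's REST API returns list results in pages (30 items per page by default), and the script reads only the first page of check runs. The sketch below shows one way to collect every page first. It assumes the `github.paginate` helper that `actions/github-script` exposes, and assumes it returns the per-page `check_runs` arrays as one flat list. The function name `listFailedCheckRunIds` and its parameters are hypothetical, and this has not been run against the libc++ workflow.

```javascript
// Hypothetical helper: return the ids of all completed check runs in a suite
// that failed or were cancelled, walking every page of results.
// `github` is the authenticated client that actions/github-script passes in.
async function listFailedCheckRunIds(github, owner, repo, checkSuiteId) {
  // Assumption: paginate() flattens the per-page `check_runs` arrays.
  const checkRuns = await github.paginate(github.rest.checks.listForSuite, {
    owner,
    repo,
    check_suite_id: checkSuiteId,
    per_page: 100, // largest page size the REST API accepts
  });
  return checkRuns
    .filter((run) => run.status === 'completed')
    .filter((run) => run.conclusion === 'failure' || run.conclusion === 'cancelled')
    .map((run) => run.id);
}
```

Inside the workflow step this could be called as `await listFailedCheckRunIds(github, context.repo.owner, context.repo.repo, context.payload.workflow_run.check_suite_id)`.
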
@EricWF EricWF requested review from a team and tstellar May 28, 2024 17:14
EricWF (Member, Author) commented on May 28, 2024

@tstellar Once this lands we can revoke the permissions for the libcxx-buildbot-restarter app.

@EricWF EricWF requested a review from tstellar May 28, 2024 18:15
tstellar (Collaborator) left a comment:

LGTM.

@EricWF EricWF merged commit 067b4cc into llvm:main May 28, 2024
5 checks passed
EricWF (Member, Author) commented on May 29, 2024

I've deleted the restarter app.

vg0204 pushed a commit to vg0204/llvm-project that referenced this pull request May 29, 2024