
[CI] Rework github workflow processing #130317


Merged
merged 5 commits into llvm:main on Mar 11, 2025
Conversation

Keenuts
Contributor

@Keenuts Keenuts commented Mar 7, 2025

Before this patch, the job/workflow name determined the metric name, meaning a change in the workflow definition could break monitoring. This patch adds a map from workflow names to stable metric names, so renaming a workflow no longer changes the metrics it reports.

In addition, it reworks how we track the last processed workflow: the GitHub queries return bogus results when filtering is applied, so we get a single list of workflows, ordered by 'created_at', which mixes completed and running workflows.
We have no guarantee over the order of completion, meaning we cannot stop at the first completed workflow we find (not even per-workflow).

This PR processes the last 1000 workflows, but allows an early stop once the created_at time is older than 8 hours. This means we could miss long-running workflows (>8 hours), and if more than 1000 workflows start before another one completes, we'll miss some.
To detect this kind of behavior, a new metric, "oldest workflow processed", is added; it should at least indicate when the depth is too small.
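The processing loop above can be sketched as follows (illustrative names and record shape, not the PR's actual code): walk the runs newest-first, as the API returns them, and stop once past the depth limit or the age cut-off, recording the oldest run seen for the new metric.

```python
from datetime import datetime, timedelta, timezone

MAX_WORKFLOWS = 1000          # depth cut-off
MAX_AGE = timedelta(hours=8)  # age cut-off

def process_runs(runs, now=None):
    """runs: iterable of dicts with 'created_at' (aware datetime) and
    'status', ordered newest first. Returns (completed_runs, oldest_seen)."""
    now = now or datetime.now(timezone.utc)
    oldest_seen = None  # feeds the "oldest workflow processed" metric
    completed = []
    for i, run in enumerate(runs):
        if i >= MAX_WORKFLOWS or now - run["created_at"] > MAX_AGE:
            break  # early stop: too deep or too old
        oldest_seen = run["created_at"]  # newest-first => last seen is oldest
        if run["status"] == "completed":
            completed.append(run)
    return completed, oldest_seen
```

If `oldest_seen` stays well inside the 8-hour window while the loop hits `MAX_WORKFLOWS`, that is the signal the depth is too small.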

An alternative without an arbitrary cutoff would be to initially parse all workflows, then record the oldest non-completed one we find and always restart from it (moving the lower bound forward as runs complete). But LLVM has forever-queued workflow runs (>1 year old), so this would make us iterate over a very large number of jobs.
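The rejected alternative can be sketched as a bound-advancing pass (a toy model under assumed names; the real API returns richer run objects):

```python
# Advance a persistent lower bound over runs ordered oldest-first: the bound
# can only move past runs that have completed, so a single forever-queued run
# pins it in place and forces re-scanning everything after it on every pass.
def advance_lower_bound(runs_oldest_first, lower_bound_id):
    """runs_oldest_first: list of (run_id, completed) pairs starting at the
    current bound. Returns the new lower bound's run id."""
    new_bound = lower_bound_id
    for run_id, completed in runs_oldest_first:
        if not completed:
            return run_id  # cannot move past an unfinished run
        new_bound = run_id
    return new_bound
```

This is why a stuck, never-completing run makes the approach degrade: the bound never advances past it.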

Before this patch, the job/workflow name impacted the metric name,
meaning a change in the workflow definition could break monitoring.
This patch adds a map to get a stable name on metrics from a workflow
name.

In addition, it reworks a bit how we track the last processed workflow
to simplify the behavior, and work around an API issue which returns
bogus results if a filter is used.

This PR is a first step to bring buildkite metrics monitoring.

Signed-off-by: Nathan Gauër <[email protected]>
@Keenuts Keenuts requested a review from boomanaiden154 March 7, 2025 18:00

github-actions bot commented Mar 7, 2025

✅ With the latest revision this PR passed the Python code formatter.

Contributor

@boomanaiden154 boomanaiden154 left a comment


Some comments, mostly minor. This looks pretty good.

Thanks for taking a stab at this!

@Keenuts
Contributor Author

Keenuts commented Mar 10, 2025

Updated the description. I realized over the weekend that the method was flawed, and ended up doing something even more basic: fetch the last 1000 workflows and extract the metrics from those.
Added a cut-off for workflows older than 8 hours, and also logged the age of the oldest workflow we processed, to monitor in case we one day have more than 1000 workflows within the cut-off window.

Contributor

@boomanaiden154 boomanaiden154 left a comment


One nit, otherwise the new approach makes sense to me.

@boomanaiden154
Contributor

I didn't even consider the case where we hit a completed workflow that shadowed other completed workflows due to them being ordered by start time. That's a good catch.

I'm reasonably hopeful this will fix the weird issues we've been seeing. It looks like we may need to increase the max depth though based on what I saw on the dashboard today.

@Keenuts
Contributor Author

Keenuts commented Mar 11, 2025

I didn't even consider the case where we hit a completed workflow that shadowed other completed workflows due to them being ordered by start time. That's a good catch.

Yes, I realized that very late, so we probably had better-looking metrics than reality because of this.

It looks like we may need to increase the max depth though based on what I saw on the dashboard today.

Yes, it seems yesterday was a pretty busy day, but we might want to double it soon.

@Keenuts Keenuts merged commit 389a705 into llvm:main Mar 11, 2025
10 of 11 checks passed
@Keenuts Keenuts deleted the metrics-refact branch March 11, 2025 13:16