
[CI] Rework github workflow processing #130317


Merged
merged 5 commits into llvm:main on Mar 11, 2025
Conversation

Keenuts
Contributor

@Keenuts Keenuts commented Mar 7, 2025

Before this patch, the job/workflow name determined the metric name, meaning a change in the workflow definition could break monitoring. This patch adds a map from workflow names to stable metric names, so renaming a workflow no longer changes the metrics it reports.

In addition, it reworks how we track the last processed workflow: the GitHub queries return bogus results when filtering is applied, so we get a single list of workflows, ordered by 'created_at', which mixes completed and running workflows.
We have no guarantee over the order of completion, meaning we cannot stop at the first completed workflow we find (not even per-workflow).

This PR processes the last 1000 workflows, but allows an early stop once the created_at time is older than 8 hours. This means we could miss long-running workflows (>8 hours), and if more than 1000 workflows start before another one completes, we'll miss some.
To detect this kind of behavior, a new metric, "oldest workflow processed", is added; it should at least indicate when the depth is too small.
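The processing loop above can be sketched as follows (illustrative names and record shape, not the PR's actual code): walk the runs newest-first, as the API returns them, and stop once past the depth limit or the age cut-off, recording the oldest run seen for the new metric.

```python
from datetime import datetime, timedelta, timezone

MAX_WORKFLOWS = 1000          # depth cut-off
MAX_AGE = timedelta(hours=8)  # age cut-off

def process_runs(runs, now=None):
    """runs: iterable of dicts with 'created_at' (aware datetime) and
    'status', ordered newest first. Returns (completed_runs, oldest_seen)."""
    now = now or datetime.now(timezone.utc)
    oldest_seen = None  # feeds the "oldest workflow processed" metric
    completed = []
    for i, run in enumerate(runs):
        if i >= MAX_WORKFLOWS or now - run["created_at"] > MAX_AGE:
            break  # early stop: too deep or too old
        oldest_seen = run["created_at"]  # newest-first => last seen is oldest
        if run["status"] == "completed":
            completed.append(run)
    return completed, oldest_seen
```

If `oldest_seen` stays well inside the 8-hour window while the loop hits `MAX_WORKFLOWS`, that is the signal the depth is too small.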

An alternative without an arbitrary cutoff would be to initially parse all workflows, then record the oldest non-completed one we find and always restart from it (moving the lower bound forward as runs complete). But LLVM has forever-queued workflow runs (>1 year old), so this would make us iterate over a very large number of jobs.
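The rejected alternative can be sketched as a bound-advancing pass (a toy model under assumed names; the real API returns richer run objects):

```python
# Advance a persistent lower bound over runs ordered oldest-first: the bound
# can only move past runs that have completed, so a single forever-queued run
# pins it in place and forces re-scanning everything after it on every pass.
def advance_lower_bound(runs_oldest_first, lower_bound_id):
    """runs_oldest_first: list of (run_id, completed) pairs starting at the
    current bound. Returns the new lower bound's run id."""
    new_bound = lower_bound_id
    for run_id, completed in runs_oldest_first:
        if not completed:
            return run_id  # cannot move past an unfinished run
        new_bound = run_id
    return new_bound
```

This is why a stuck, never-completing run makes the approach degrade: the bound never advances past it.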

Before this patch, the job/workflow name impacted the metric name,
meaning a change in the workflow definition could break monitoring.
This patch adds a map to get a stable name on metrics from a workflow
name.

In addition, it reworks a bit how we track the last processed workflow
to simplify the behavior, and work around an API issue which returns
bogus results if a filter is used.

This PR is a first step to bring buildkite metrics monitoring.

Signed-off-by: Nathan Gauër <[email protected]>
@Keenuts Keenuts requested a review from boomanaiden154 March 7, 2025 18:00

github-actions bot commented Mar 7, 2025

✅ With the latest revision this PR passed the Python code formatter.

Contributor

@boomanaiden154 boomanaiden154 left a comment


Some comments, mostly minor. This looks pretty good.

Thanks for taking a stab at this!

@Keenuts
Contributor Author

Keenuts commented Mar 10, 2025

Updated the description. I realized over the weekend that the method was flawed, and ended up doing something even more basic: fetch the last 1000 workflows and extract the metrics from those.
Added a cut-off for workflows older than 8 hours, and also logged the age of the oldest workflow we processed, to monitor in case we one day have more than 1000 workflows within the cut-off window.

Contributor

@boomanaiden154 boomanaiden154 left a comment


One nit, otherwise the new approach makes sense to me.

@boomanaiden154
Contributor

I didn't even consider the case where we hit a completed workflow that shadowed other completed workflows due to them being ordered by start time. That's a good catch.

I'm reasonably hopeful this will fix the weird issues we've been seeing. It looks like we may need to increase the max depth though based on what I saw on the dashboard today.

@Keenuts
Contributor Author

Keenuts commented Mar 11, 2025

I didn't even consider the case where we hit a completed workflow that shadowed other completed workflows due to them being ordered by start time. That's a good catch.

Yes, I realized that very late, so we probably had better-looking metrics than reality because of this.

It looks like we may need to increase the max depth though based on what I saw on the dashboard today.

Yes, it seems yesterday was a pretty busy day, but we might want to double it soon.

@Keenuts Keenuts merged commit 389a705 into llvm:main Mar 11, 2025
10 of 11 checks passed
@Keenuts Keenuts deleted the metrics-refact branch March 11, 2025 13:16