-
Notifications
You must be signed in to change notification settings - Fork 14.3k
[CI] Rework github workflow processing #130317
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Before this patch, the job/workflow name impacted the metric name, meaning a change in the workflow definition could break monitoring. This patch adds a map to get a stable name on metrics from a workflow name. In addition, it reworks a bit how we track the last processed workflow to simplify the behavior, and work around an API issue which returns bogus results if a filter is used. This PR is a first step to bring buildkite metrics monitoring. Signed-off-by: Nathan Gauër <[email protected]>
✅ With the latest revision this PR passed the Python code formatter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments, mostly minor. This looks pretty good.
Thanks for taking a stab at this!
Updated the description. realized during the weekend the method was flawed, and ended-up doing something even more basic: fetch the last 1000 workflows, and extract the metrics from those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One nit, otherwise the new approach makes sense to me.
I didn't even consider the case where we hit a completed workflow that shadowed other completed workflows due to them being ordered by start time. That's a good catch. I'm reasonably hopeful this will fix the weird issues we've been seeing. It looks like we may need to increase the max depth though based on what I saw on the dashboard today. |
Yes, I realized that very late.. So we probably had better metrics than reality because of this.
Yes, seems like yesterday was a pretty busy day, but we might want to double it soon. |
Before this patch, the job/workflow name impacted the metric name, meaning a change in the workflow definition could break monitoring. This patch adds a map to get a stable name on metrics from a workflow name.
In addition, it reworks a bit how we track the last processed workflow: the github queries are broken if filtering is applied, meaning we have a list of workflow, ordered by 'created_at', which mixes completed & running workflows.
We have no guarantees over the order of completion, meaning we cannot stop at the first completed job we found (even per-workflow).
This PR processed the last 1000 workflows, but allows an early stop if the created_at time is older than 8 hours. This means we could miss long-running workflows (>8 hours), and if the number of workflows started before another one completes becomes high (>1000), we'll miss it.
To detect this kind of behavior, a new metric is added "oldest workflow processed", which should at least indicate if the depth is too small.
An alternative without arbitrary cut would be to initially parse all workflows, and then record the last non-completed one we find and always start from the last (moving the lower bound as they complete). But LLVM has forever-queued workflows runs (>1 years), hence this would cause us to iterate over a very large number of jobs.