feat: support for flux minicluster #11

Merged: 8 commits merged on Oct 23, 2024

Conversation

vsoch
Member

@vsoch vsoch commented Oct 20, 2024

Flux running in a MiniCluster adds the ability to grow and shrink (autoscaling), which can be paired with an actual cluster autoscaler, or used without one if the MiniCluster is not using all of the available nodes. To support this we need a few things. The first is a heartbeat that runs at a user-specified interval (this should likely also be exposed in the ensemble config), because it is very likely that the triggers are not linked to jobs (for example, "run this when the queue wait time is over X for this job group"). We then need the MiniCluster member, which is exactly the same as the flux member but adds grow/shrink. I think what I want to do is keep the heartbeat disabled unless the setting is found in the ensemble config OR a grow/shrink action is found (in which case we would use a default heartbeat of 60 seconds). The right interval should be tested for different applications.
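To make the heartbeat idea concrete, here is a minimal sketch of a timer that fires a callback at a fixed interval, independent of any job event. The `Heartbeat` class and its method names are hypothetical and are not taken from ensemble-python; the only assumption carried over from the description is that an interval of 0 means "disabled".

```python
import threading


class Heartbeat:
    """Call `callback` every `seconds` until stopped (hypothetical helper)."""

    def __init__(self, seconds, callback):
        self.seconds = seconds
        self.callback = callback
        self._stop = threading.Event()
        self._thread = None

    def start(self):
        # An interval of 0 (or less) means the heartbeat is disabled.
        if self.seconds <= 0:
            return False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return True

    def _run(self):
        # Event.wait returns False on timeout and True once stop() is called,
        # so the loop fires the callback on each timeout and exits on stop.
        while not self._stop.wait(self.seconds):
            self.callback()

    def stop(self):
        self._stop.set()
        if self._thread is not None:
            self._thread.join()
```

A rule such as "grow when the queue wait time is over X" could then be checked inside the callback, since it has no job event to hang off of.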

Update: here is running the example "hello world" with a heartbeat:

image

I have the grpc server functions started and the MiniCluster member in place. Next I need to put those together with the ensemble operator to get grow/shrink actually working.

vsoch added 6 commits October 20, 2024 14:43
Flux running in the MiniCluster contexts adds the cool
features to grow/shrink (autoscaling), which can be
paired with an actual cluster autoscaler (or not) if
the mini cluster is not using all the nodes available.
To support this we need a few things - first a heartbeat,
and one that runs at a user specified increment (and likely
this should also be exposed in the ensemble config) because
it is very likely the case that the triggers are not linked
to jobs (for example, "run this when the queue wait time is
over X for this job group"). We then need the MiniCluster
member, which is exactly the same as flux, but instead has
added the grow/shrink. I think likely what I want to do is
have the heartbeat disabled unless the setting is found in
the ensemble config OR the action to grow/shrink is found
(and we would use a default heartbeat seconds of 60). This
time should obviously be tested for different applications.

Signed-off-by: vsoch <[email protected]>
This updates the heartbeat so it is entirely derived from
the config. This can happen explicitly if the user sets
logging->heartbeat to a non zero value, but it will also
happen if there is a grow or shrink action used. If the user
defines a grow/shrink and sets the heartbeat to 0 it will
still be set to the default, 60, because grow/shrink will
not work as expected without it.

Signed-off-by: vsoch <[email protected]>
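The rule in the commit message above can be sketched as a small function. This is an illustration only: the config layout (a `logging` mapping with a `heartbeat` key, and a `rules` list whose entries carry an `action` name) and the function name `resolve_heartbeat` are assumptions, not the actual ensemble-python schema or API.

```python
DEFAULT_HEARTBEAT_SECONDS = 60
SCALING_ACTIONS = {"grow", "shrink"}


def resolve_heartbeat(config):
    """Return the heartbeat interval in seconds; 0 means disabled.

    Assumed (hypothetical) layout:
      {"logging": {"heartbeat": 30}, "rules": [{"action": "grow"}, ...]}
    """
    heartbeat = (config.get("logging") or {}).get("heartbeat") or 0

    # Grow/shrink will not work as expected without a heartbeat, so a
    # scaling action forces the default when the user left it unset or 0.
    uses_scaling = any(
        rule.get("action") in SCALING_ACTIONS
        for rule in config.get("rules", [])
    )
    if uses_scaling and heartbeat <= 0:
        return DEFAULT_HEARTBEAT_SECONDS
    return max(heartbeat, 0)
```

An explicit non-zero value is kept even when grow/shrink is present, so the default only ever fills in for a missing or zero setting.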
I need to put this into the ensemble operator next to have the request
actually do something, like request the minicluster to scale up or
down. I will also need to have a way to communicate the member name
and namespace. This could either be done via discovery (requiring the
kubernetes API within the ensemble python and the rbac to use it),
or more simply done, just put the member name that is expected in
the same namespace. More ideally there can be a registration step at
the onset that generates a random name and sends it over to the grpc
service to associate.

Signed-off-by: vsoch <[email protected]>
@vsoch vsoch force-pushed the add-support-minicluster-autoscale branch 2 times, most recently from 8c0e9bd to 1a3af82 Compare October 23, 2024 04:03
@vsoch vsoch force-pushed the add-support-minicluster-autoscale branch from 1a3af82 to 754d9f4 Compare October 23, 2024 04:08
@vsoch vsoch mentioned this pull request Oct 23, 2024
@vsoch
Member Author

vsoch commented Oct 23, 2024

grpc for grow / shrink in the flux operator is being hit!

image

This is working via a deployment that has an exposed service. Broker rank 0 in the ensemble member (the flux MiniCluster) is running ensemble-python, which is installed on the fly and given the ensemble.yaml that describes the jobs and rules. Next (the RBAC is already added, but not the client logic), that same grpc container has an RBAC rule that gives it permission to update the MiniCluster. This means (assuming nothing unexpected comes up) I just need to write the interaction with the in-cluster kube config to get and update the MiniCluster spec!

This is so wicked! I'm so excited! But it's my bedtime, so time for that. 🥔
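As a rough sketch of the remaining "get and update the MiniCluster spec" step, the grow/shrink request can be reduced to computing a new size and a patch body. The helper names, the `min_size`/`max_size` bounds, and the assumption that the size lives at `spec.size` are illustrative; the code below does not talk to Kubernetes and is not the client logic from this PR.

```python
def scaled_size(current, delta, min_size=1, max_size=None):
    """Apply a grow (positive) or shrink (negative) delta, clamped to bounds."""
    target = max(current + delta, min_size)
    if max_size is not None:
        target = min(target, max_size)
    return target


def minicluster_patch(current, delta, min_size=1, max_size=None):
    """Build a merge-patch style body (assumes the size field is spec.size)."""
    return {"spec": {"size": scaled_size(current, delta, min_size, max_size)}}
```

The resulting dictionary is the kind of body that could be handed to a Kubernetes client using the in-cluster config, with the RBAC rule granting permission to patch the MiniCluster resource.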

@vsoch vsoch force-pushed the add-support-minicluster-autoscale branch 4 times, most recently from bab8874 to c2e3f4c Compare October 23, 2024 21:42
@vsoch vsoch force-pushed the add-support-minicluster-autoscale branch from c2e3f4c to 1c14f37 Compare October 23, 2024 21:44
@vsoch vsoch merged commit bf08022 into main Oct 23, 2024
2 checks passed
@vsoch vsoch deleted the add-support-minicluster-autoscale branch October 23, 2024 21:51