feat: support for flux minicluster #11
Flux running in a MiniCluster context adds the ability to grow and shrink (autoscaling), which can be paired with an actual cluster autoscaler, or used without one if the MiniCluster is not using all the nodes available. To support this we need a few things. First, a heartbeat that runs at a user-specified interval (likely this should also be exposed in the ensemble config), because it is very likely that the triggers are not linked to job events (for example, "run this when the queue wait time is over X for this job group"). Second, the MiniCluster member, which is exactly the same as the flux member but adds grow/shrink. I think what I want to do is keep the heartbeat disabled unless the setting is found in the ensemble config OR a grow/shrink action is found (in which case we would use a default heartbeat of 60 seconds). This interval should be tested for different applications. Signed-off-by: vsoch <[email protected]>
This updates the heartbeat so that it is derived entirely from the config. It is enabled explicitly if the user sets logging->heartbeat to a non-zero value, and implicitly if a grow or shrink action is used. If the user defines a grow/shrink action and sets the heartbeat to 0, it will still be set to the default of 60 seconds, because grow/shrink will not work as expected without it. Signed-off-by: vsoch <[email protected]>
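The rule described in this commit message can be sketched as a small function. This is a minimal illustration, not the code from this PR: the name `resolve_heartbeat`, the flat `rules` list, and the `action.name` field are assumptions made for the example. Only `logging->heartbeat`, the grow/shrink actions, and the default of 60 seconds come from the text above.

```python
# Sketch only: the config layout (a "rules" list with "action.name")
# is an assumption for illustration, not the real ensemble schema.
DEFAULT_HEARTBEAT_SECONDS = 60
SCALING_ACTIONS = {"grow", "shrink"}


def resolve_heartbeat(config: dict) -> int:
    """Return the heartbeat interval in seconds (0 means disabled)."""
    logging_cfg = config.get("logging") or {}
    heartbeat = int(logging_cfg.get("heartbeat") or 0)

    # Does any rule ask for a grow or shrink action?
    uses_scaling = any(
        (rule.get("action") or {}).get("name") in SCALING_ACTIONS
        for rule in config.get("rules") or []
    )

    # Grow/shrink cannot work without a heartbeat, so force the default
    # even when the user explicitly set the heartbeat to 0.
    if uses_scaling and heartbeat <= 0:
        return DEFAULT_HEARTBEAT_SECONDS
    return max(heartbeat, 0)
```

With this shape, a config that has no heartbeat setting and no scaling rules leaves the heartbeat off, while a user-provided non-zero value is always respected.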
Signed-off-by: vsoch <[email protected]>
Next I need to wire this into the ensemble operator so that the request actually does something, such as asking the MiniCluster to scale up or down. I will also need a way to communicate the member name and namespace. This could be done via discovery (which requires the Kubernetes API within ensemble-python and the RBAC to use it) or, more simply, by assuming the expected member name in the same namespace. Ideally there would be a registration step at the onset that generates a random name and sends it to the grpc service to associate with the member. Signed-off-by: vsoch <[email protected]>
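The registration idea could look roughly like the following. This is a hypothetical sketch: `generate_member_name` and `MemberRegistry` are names invented for illustration, and the in-memory dictionary stands in for whatever the grpc service would actually store.

```python
import uuid


def generate_member_name(prefix: str = "member") -> str:
    """Generate a random member name, e.g. 'member-3f9a1c2e'."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


class MemberRegistry:
    """Hypothetical service-side store mapping member name to namespace."""

    def __init__(self) -> None:
        self._members: dict = {}

    def register(self, name: str, namespace: str) -> None:
        # Refuse duplicates so two members never share an identity.
        if name in self._members:
            raise ValueError(f"member {name!r} is already registered")
        self._members[name] = namespace

    def lookup(self, name: str) -> str:
        """Return the namespace recorded for a registered member."""
        return self._members[name]
```

In this sketch the member would call `generate_member_name()` once at startup and send the result to the service, which then knows which MiniCluster a later grow/shrink request refers to.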
Signed-off-by: vsoch <[email protected]>
Signed-off-by: vsoch <[email protected]>
8c0e9bd to 1a3af82
Signed-off-by: vsoch <[email protected]>
1a3af82 to 754d9f4
The grpc endpoint for grow/shrink in the flux operator is being hit! This works via a deployment with an exposed service. Broker rank 0 in the ensemble member (the flux MiniCluster) runs ensemble-python, which is installed on the fly and given the ensemble.yaml that describes the jobs and rules. The next step (the RBAC is already added, but not the client logic) is for that same grpc container to use its RBAC permission to update the MiniCluster. This means (assuming nothing unexpected comes up) I just need to write the interaction with the in-cluster kube config to get and update the MiniCluster spec! This is so wicked! I'm so excited! But it's my bedtime, so time for that. 🥔
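A rough sketch of what that remaining client logic might look like with the official Kubernetes Python client is below. The helper names are hypothetical, and the CRD coordinates (`flux-framework.org`, `v1alpha2`, `miniclusters`) and the `spec.size` field are assumptions that should be checked against the installed MiniCluster CRD. The `kubernetes` package is imported lazily so that the patch-building part can be used and tested outside a cluster.

```python
from typing import Optional


def build_scale_patch(
    current_size: int,
    delta: int,
    min_size: int = 1,
    max_size: Optional[int] = None,
) -> dict:
    """Build a merge patch that grows (delta > 0) or shrinks (delta < 0)."""
    new_size = max(current_size + delta, min_size)
    if max_size is not None:
        new_size = min(new_size, max_size)
    # Assumption: the MiniCluster size lives under spec.size.
    return {"spec": {"size": new_size}}


def patch_minicluster(
    name: str,
    namespace: str,
    body: dict,
    group: str = "flux-framework.org",  # assumed CRD group
    version: str = "v1alpha2",          # assumed CRD version
    plural: str = "miniclusters",       # assumed CRD plural
):
    """Apply the patch from inside the cluster (needs RBAC to patch)."""
    from kubernetes import client, config  # lazy: only needed in-cluster

    # Uses the service account mounted into the pod.
    config.load_incluster_config()
    api = client.CustomObjectsApi()
    return api.patch_namespaced_custom_object(
        group=group,
        version=version,
        namespace=namespace,
        plural=plural,
        name=name,
        body=body,
    )
```

Clamping in `build_scale_patch` keeps a shrink request from dropping the size below one node and a grow request from exceeding a configured maximum, regardless of what the rule asked for.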
bab8874 to c2e3f4c
Signed-off-by: vsoch <[email protected]>
c2e3f4c to 1c14f37
Update: here is the "hello world" example running with a heartbeat:
I have the grpc server functions and the MiniCluster member started. Next I need to put those together with the ensemble operator to get grow/shrink actually working.