
Change cgroupDriver from systemd to cgroupfs #1947


Merged
deliahu merged 1 commit into master from gpus-not-joining-cluster
Mar 11, 2021

Conversation

vishalbollu
Contributor

GPU nodes are not joining the cluster. AWS changed the AMI for GPUs on March 5th, which coincides with when our GPU tests on AWS started failing.

It looks like the cgroupDriver value is cgroupfs on the CPU-based AMIs but systemd on the GPU AMIs. This commit changes cgroupDriver to cgroupfs.

This appears to be a workaround. More information about the issue and the possible workarounds is available in this comment: eksctl-io/eksctl#3005 (comment).
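
For reference, a kubelet generally fails to start when its cgroup driver differs from the container runtime's, so the mismatch described above can be checked with a small script. This is a minimal sketch, not code from this PR: it assumes Docker's driver is set through `exec-opts` (`native.cgroupdriver=...`) in its daemon config, that the kubelet reads `cgroupDriver` from its config file, and that both fall back to `cgroupfs` when unset. The function names are hypothetical.

```python
# Sketch: compare the cgroup driver configured for Docker and for the kubelet.
# Both configs are passed in as already-parsed dicts (e.g. via json.load).
# Assumption: an unset value means "cgroupfs" for both components.

DEFAULT_DRIVER = "cgroupfs"


def docker_cgroup_driver(daemon_config: dict) -> str:
    """Return the driver set via exec-opts, e.g. 'native.cgroupdriver=systemd'."""
    for opt in daemon_config.get("exec-opts", []):
        key, _, value = opt.partition("=")
        if key == "native.cgroupdriver" and value:
            return value
    return DEFAULT_DRIVER


def kubelet_cgroup_driver(kubelet_config: dict) -> str:
    """Return the kubelet's cgroupDriver field, or the assumed default."""
    return kubelet_config.get("cgroupDriver", DEFAULT_DRIVER)


def drivers_match(daemon_config: dict, kubelet_config: dict) -> bool:
    """True when Docker and the kubelet agree on the cgroup driver."""
    return docker_cgroup_driver(daemon_config) == kubelet_cgroup_driver(kubelet_config)


if __name__ == "__main__":
    # Hypothetical configs mimicking the mismatch described in this PR.
    docker_cfg = {"exec-opts": ["native.cgroupdriver=systemd"]}
    kubelet_cfg = {"cgroupDriver": "cgroupfs"}
    print("drivers match:", drivers_match(docker_cfg, kubelet_cfg))
```

On a node, the two dicts would come from whatever files Docker and the kubelet actually read; the paths differ between AMIs, so they are left out here.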


checklist:

  • run make test and make lint
  • test manually (i.e. build/push all images, restart operator, and re-deploy APIs)
  • update examples
  • update docs and add any new files to summary.md (view in gitbook after merging)
  • cherry-pick into release branches if applicable
  • alert the dev team if the dev environment changed

@deliahu deliahu merged commit 4b7abc2 into master Mar 11, 2021
@deliahu deliahu deleted the gpus-not-joining-cluster branch March 11, 2021 04:04
deliahu pushed a commit that referenced this pull request Mar 11, 2021
vishalbollu added a commit that referenced this pull request Mar 11, 2021
vishalbollu added a commit that referenced this pull request Mar 11, 2021
This reverts commit 4b7abc2.

(cherry picked from commit 5596c88)
@dipen-epi

dipen-epi commented Sep 9, 2021

@vishalbollu why was this change reverted?
We're using Cortex and facing the same issue of GPUs not joining the cluster, probably because we're using an overrideBootstrapCommand in the manager image.
The logs look pretty similar to the issue you linked:

[  OK  ] Started Docker Application Container Engine.
[   79.501444] cloud-init[4890]: Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
[FAILED] Failed to start Kubernetes Kubelet.
See 'systemctl status kubelet.service' for details.
[   79.571532] cloud-init[4890]: Job for kubelet.service failed because a configured resource limit was exceeded. See "systemctl status kubelet.service" and "journalctl -xe" for details.
[   79.573805] cloud-init[4890]: Exited with error on line 437
[   79.574571] cloud-init[4890]: Sep 08 09:29:18 cloud-init[4890]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
[   79.576214] cloud-init[4890]: Sep 08 09:29:18 cloud-init[4890]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[   79.576479] cloud-init[4890]: Sep 08 09:29:18 cloud-init[4890]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
[   79.625532] cloud-init[4890]: Cloud-init v. 19.3-44.amzn2 finished at Wed, 08 Sep 2021 09:29:18 +0000. Datasource DataSourceEc2.  Up 79.62 seconds
[FAILED] Failed to start Execute cloud user/final scripts.

@deliahu
Member

deliahu commented Sep 10, 2021

@dipen-epi I believe this was reverted because the underlying problem was a bug in AWS's AMI, which AWS resolved fairly quickly.


3 participants