Change cgroupDriver from systemd to cgroupfs #1947

vishalbollu · 2021-03-11T01:35:36Z

GPU nodes are not joining the cluster. AWS changed the AMI for GPUs on March 5th, which coincides with when our GPU tests on AWS started failing.

It appears that the cgroupDriver value is cgroupfs for CPU-based AMIs but systemd for GPU AMIs. This commit changes cgroupDriver to cgroupfs.

This appears to be a workaround. More information about the issue and the possible workarounds is available in this comment: eksctl-io/eksctl#3005 (comment).
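For anyone debugging a similar node, here is a minimal sketch of how the mismatch could be checked. It is not part of this PR. The helper name `find_cgroup_driver_mismatch` is hypothetical, and treating a missing `cgroupDriver` key as `cgroupfs` is an assumption about the kubelet default. The inputs are assumed to come from the kubelet config JSON on the node (on the EKS AMI this is usually /etc/kubernetes/kubelet/kubelet-config.json) and from the output of `docker info --format '{{.CgroupDriver}}'`.

```python
import json
from typing import Optional


def find_cgroup_driver_mismatch(kubelet_config_text: str,
                                docker_cgroup_driver: str) -> Optional[str]:
    """Return None when both drivers agree, otherwise a short description.

    kubelet_config_text: contents of the kubelet config JSON file.
    docker_cgroup_driver: driver name reported by Docker, e.g. "systemd".
    """
    kubelet_config = json.loads(kubelet_config_text)
    # Assumption: an unset cgroupDriver is treated as "cgroupfs".
    kubelet_driver = kubelet_config.get("cgroupDriver", "cgroupfs")
    docker_driver = docker_cgroup_driver.strip()
    if kubelet_driver == docker_driver:
        return None
    return f"kubelet uses {kubelet_driver!r} but Docker uses {docker_driver!r}"


if __name__ == "__main__":
    # Hypothetical values resembling the situation described above.
    print(find_cgroup_driver_mismatch('{"cgroupDriver": "systemd"}', "cgroupfs"))
```

If the function returns a message, the kubelet and the container runtime disagree on the cgroup driver, which is the kind of mismatch that can prevent the kubelet from starting.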

checklist:

run make test and make lint
test manually (i.e. build/push all images, restart operator, and re-deploy APIs)
update examples
update docs and add any new files to summary.md (view in gitbook after merging)
cherry-pick into release branches if applicable
alert the dev team if the dev environment changed

dipen-epi · 2021-09-09T07:16:18Z

@vishalbollu why was this change reverted?
We're using Cortex and facing the same issue of GPUs not joining the cluster, probably because we're using an overrideBootstrapCommand in the manager image.
The logs look pretty similar to those in the issue you linked:

[  OK  ] Started Docker Application Container Engine.
[   79.501444] cloud-init[4890]: Created symlink from /etc/systemd/system/multi-user.target.wants/kubelet.service to /etc/systemd/system/kubelet.service.
[FAILED] Failed to start Kubernetes Kubelet.
See 'systemctl status kubelet.service' for details.
[   79.571532] cloud-init[4890]: Job for kubelet.service failed because a configured resource limit was exceeded. See "systemctl status kubelet.service" and "journalctl -xe" for details.
[   79.573805] cloud-init[4890]: Exited with error on line 437
[   79.574571] cloud-init[4890]: Sep 08 09:29:18 cloud-init[4890]: util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/runcmd [1]
[   79.576214] cloud-init[4890]: Sep 08 09:29:18 cloud-init[4890]: cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
[   79.576479] cloud-init[4890]: Sep 08 09:29:18 cloud-init[4890]: util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python2.7/site-packages/cloudinit/config/cc_scripts_user.pyc'>) failed
[   79.625532] cloud-init[4890]: Cloud-init v. 19.3-44.amzn2 finished at Wed, 08 Sep 2021 09:29:18 +0000. Datasource DataSourceEc2.  Up 79.62 seconds
[FAILED] Failed to start Execute cloud user/final scripts.

deliahu · 2021-09-10T16:47:33Z

@dipen-epi I believe this was reverted because the underlying problem was a bug in AWS's AMI, which AWS resolved fairly quickly.

Change cgroupDriver from systemd to cgroupfs

1d891b3

vishalbollu requested review from deliahu, miguelvr and RobertLucian March 11, 2021 01:35

deliahu approved these changes Mar 11, 2021

deliahu merged commit 4b7abc2 into master Mar 11, 2021

deliahu deleted the gpus-not-joining-cluster branch March 11, 2021 04:04

deliahu pushed a commit that referenced this pull request Mar 11, 2021

Change cgroupDriver from systemd to cgroupfs (#1947)

1b3433c

(cherry picked from commit 4b7abc2)

vishalbollu added a commit that referenced this pull request Mar 11, 2021

Revert "Change cgroupDriver from systemd to cgroupfs (#1947)"

5596c88

This reverts commit 4b7abc2.

vishalbollu added a commit that referenced this pull request Mar 11, 2021

Revert "Change cgroupDriver from systemd to cgroupfs (#1947)"

e6ed9de

This reverts commit 4b7abc2. (cherry picked from commit 5596c88)
