Add workaround for rc: -13 #1108
Conversation
I've commonly hit this when configuring prometheus:

```
TASK [prometheus : Get container facts] *************************************************************************************************************************************
Monday 24 June 2024  11:09:37 +0000 (0:00:08.528)       0:01:31.707 ***********
fatal: [will-compute-01]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
fatal: [will-compute-02]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
```

The ControlPersist workaround is documented in these bug reports:

- ansible/ansible#78344
- ansible/ansible#81777

From the comments, it seems like this does not completely resolve the issue, but it does decrease the frequency with which you hit it. The Prometheus tasks seem particularly susceptible as they run on every host.
Worth noting that the other related settings have sensible defaults: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/ssh_connection.html#parameter-control_path
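For reference, this is roughly what the ControlPersist bump looks like when applied via the `ANSIBLE_SSH_ARGS` environment variable; the exact value and the kayobe command below are illustrative assumptions, not the literal change merged here:

```
# Sketch only: keep Ansible's default ssh_args but raise ControlPersist so
# multiplexed SSH connections survive longer between tasks. The 1h value and
# the command below are assumptions for illustration, not the merged change.
export ANSIBLE_SSH_ARGS="-C -o ControlMaster=auto -o ControlPersist=1h"
kayobe overcloud service deploy
```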
How well tested is this change? Is it safer to bring it in with Caracal?
Currently testing with multinode. I will mark this as draft for now, as I still haven't reached the end of a multinode deployment due to various unrelated issues :D
This seems to have made my multinode deployments more reliable - I've completed several runs end to end now.
Still hitting this issue when trying to deploy multinode on:
Is it worth bumping it even more? I am testing with 2h now.
Just hit this in the Groningen Habrok Antelope upgrade |
There is a race condition in Ansible that can result in this failure:

    msg: |-
      MODULE FAILURE
      See stdout/stderr for the exact error
    rc: -13

See ansible/ansible#78344 and ansible/ansible#81777.

In stackhpc/stackhpc-kayobe-config#1108 we applied a workaround to increase the ControlPersist timeout to 1 hour, but this does not always work. Try another workaround of removing the ControlPersist sockets in between Kayobe runs.
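As a rough sketch of that socket-removal workaround, assuming Ansible's default control_path_dir of ~/.ansible/cp (adjust the path if your configuration overrides it):

```
# Sketch only: remove any lingering SSH ControlPersist sockets so the next
# Kayobe run creates fresh multiplexed connections. Assumes Ansible's default
# control_path_dir of ~/.ansible/cp; the command below is illustrative.
rm -f ~/.ansible/cp/*
kayobe overcloud host configure
```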
There is a race condition in Ansible that can result in this failure:

    msg: |-
      MODULE FAILURE
      See stdout/stderr for the exact error
    rc: -13

See ansible/ansible#78344 and ansible/ansible#81777.

In stackhpc/stackhpc-kayobe-config#1108 we applied a workaround to increase the ControlPersist timeout to 1 hour, but this does not always work. Disabling SSH pipelining prevents the issue at the cost of increased Ansible execution time.
I've commonly hit this when configuring prometheus:

```
TASK [prometheus : Get container facts] *************************************************************************************************************************************
Monday 24 June 2024  11:09:37 +0000 (0:00:08.528)       0:01:31.707 ***********
fatal: [will-compute-01]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
fatal: [will-compute-02]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
```

The ControlPersist workaround is documented in these bug reports:

- ansible/ansible#78344
- ansible/ansible#81777

From the comments, it seems like this does not completely resolve the issue, but it does decrease the frequency with which you hit it. The Prometheus tasks seem particularly susceptible as they run on every host.

(cherry picked from commit 699769c)
There is a race condition in Ansible that can result in this failure:

    msg: |-
      MODULE FAILURE
      See stdout/stderr for the exact error
    rc: -13

See ansible/ansible#78344 and ansible/ansible#81777.

In stackhpc/stackhpc-kayobe-config#1108 we applied a workaround to increase the ControlPersist timeout to 1 hour, but this does not always work. Here we use a different workaround of disabling SSH pipelining. This has performance implications for Ansible, but is a reasonable trade-off for reliability. We set the config option as an environment variable rather than in ansible.cfg in Kayobe configuration, to avoid a merge conflict on upgrade.
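A minimal sketch of that pipelining workaround, assuming the `ANSIBLE_PIPELINING` environment variable form described in the commit message; where the variable is exported (for example a shell profile or an environment file) is an assumption for illustration:

```
# Sketch only: disable SSH pipelining for the Ansible runs invoked by Kayobe.
# Each task pays an extra SSH round trip, but the rc: -13 race is avoided.
# Exporting it in the calling shell here is an assumption, not the exact change.
export ANSIBLE_PIPELINING=False
kayobe overcloud service deploy
```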
yoga: Backport Add workaround for rc: -13 (#1108)