Add workaround for rc: -13 #1108

jovial · 2024-06-24T12:11:37Z

I've commonly hit this when configuring prometheus:

```
TASK [prometheus : Get container facts] *************************************************************************************************************************************
Monday 24 June 2024  11:09:37 +0000 (0:00:08.528)       0:01:31.707 ***********
fatal: [will-compute-01]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
fatal: [will-compute-02]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
```

The ControlPersist workaround is documented in these bug reports:

- ansible/ansible#78344
- ansible/ansible#81777

From the comments, it seems that this does not completely resolve the issue, but it does reduce how often you hit it.

The Prometheus tasks seem particularly susceptible as they run on every host.
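For anyone wanting to try a longer ControlPersist timeout without editing `ansible.cfg`, here is a minimal sketch using the `ANSIBLE_SSH_ARGS` environment variable. It is not necessarily how this PR applies the change: the `build_ssh_args` helper is hypothetical, the `1h` value is an assumption, and the other flags simply mirror Ansible's default `ssh_args`.

```shell
# Hypothetical helper: build SSH arguments with a longer ControlPersist timeout.
# The "-C -o ControlMaster=auto" part mirrors Ansible's default ssh_args.
build_ssh_args() {
  timeout="${1:-1h}"  # assumed value; accepts ssh_config time formats such as 60s or 1h
  printf '%s' "-C -o ControlMaster=auto -o ControlPersist=${timeout}"
}

# Export so that subsequent ansible or kayobe commands in this shell pick it up.
export ANSIBLE_SSH_ARGS="$(build_ssh_args 1h)"
echo "$ANSIBLE_SSH_ARGS"
```

Setting `ssh_args` this way replaces the default value rather than appending to it, which is why the default flags are repeated in the helper.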

jovial · 2024-06-24T12:14:52Z

Worth noting that the other related settings have sensible defaults: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/ssh_connection.html#parameter-control_path

Alex-Welsh · 2024-06-24T14:35:00Z

How well tested is this change? Is it safer to bring it in with Caracal?

jovial · 2024-06-24T15:37:21Z

> How well tested is this change? Is it safer to bring it in with Caracal?

Currently testing with multinode. I will mark as draft for now as I still haven't reached the end of a multinode deployment due to various unrelated issues :D

jovial · 2024-06-26T14:24:43Z

This seems to have made my multinode deployments more reliable - I've completed several runs end to end now.

grzegorzkoper · 2024-07-10T11:31:00Z

Still hitting this issue when trying to deploy multinode on:

ansible [core 2.14.11]
  config file = None
  configured module search path = ['/home/cloud-user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/cloud-user/venvs/kayobe/lib64/python3.9/site-packages/ansible
  ansible collection location = /home/cloud-user/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/cloud-user/venvs/kayobe/bin/ansible
  python version = 3.9.18 (main, May 16 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (/home/cloud-user/venvs/kayobe/bin/python3)
  jinja version = 3.1.4
  libyaml = True

Is it worth bumping the timeout even more? I am testing with 2h now.

Alex-Welsh · 2024-07-17T12:11:08Z

Just hit this in the Groningen Habrok Antelope upgrade

There is a race condition in Ansible that can result in this failure:

    msg: |-
      MODULE FAILURE
      See stdout/stderr for the exact error
    rc: -13

See ansible/ansible#78344 and ansible/ansible#81777.

In stackhpc/stackhpc-kayobe-config#1108 we applied a workaround to increase the ControlPersist timeout to 1 hour, but this does not always work.

Try another workaround of removing the ControlPersist sockets in between Kayobe runs.
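A rough sketch of what removing the ControlPersist sockets between runs could look like. The `clean_control_sockets` function is hypothetical, and it assumes the sockets live in Ansible's default control path directory (`~/.ansible/cp`) unless `ANSIBLE_SSH_CONTROL_PATH_DIR` or an explicit argument says otherwise. It relies on the `-maxdepth` and `-delete` options of GNU or BSD `find`.

```shell
# Hypothetical helper: delete leftover SSH control sockets between runs.
# Assumes the default control path directory unless overridden.
clean_control_sockets() {
  cp_dir="${1:-${ANSIBLE_SSH_CONTROL_PATH_DIR:-$HOME/.ansible/cp}}"
  # Nothing to do if the directory has never been created.
  [ -d "$cp_dir" ] || return 0
  # Only socket files (-type s) are removed; other files are left alone.
  find "$cp_dir" -maxdepth 1 -type s -print -delete
}
```

Calling `clean_control_sockets` with no arguments before each Kayobe run would clear the default directory. Deleting a socket does not stop the SSH master process behind it, which should exit on its own once its ControlPersist timeout expires.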

There is a race condition in Ansible that can result in this failure:

    msg: |-
      MODULE FAILURE
      See stdout/stderr for the exact error
    rc: -13

See ansible/ansible#78344 and ansible/ansible#81777.

In stackhpc/stackhpc-kayobe-config#1108 we applied a workaround to increase the ControlPersist timeout to 1 hour, but this does not always work.

Disabling SSH pipelining prevents the issue at the cost of Ansible execution duration.

There is a race condition in Ansible that can result in this failure:

    msg: |-
      MODULE FAILURE
      See stdout/stderr for the exact error
    rc: -13

See ansible/ansible#78344 and ansible/ansible#81777.

In stackhpc/stackhpc-kayobe-config#1108 we applied a workaround to increase the ControlPersist timeout to 1 hour, but this does not always work.

Here we use a different workaround of disabling SSH pipelining. This has performance implications for Ansible, but is a reasonable trade-off for reliability.

We set the config option as an environment variable rather than in ansible.cfg in Kayobe configuration, to avoid a merge conflict on upgrade.
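As an illustration of the environment variable approach described in this commit message, a minimal sketch follows. The commit message does not name the variable, so the use of Ansible's `ANSIBLE_PIPELINING` setting here is an assumption, as is the Kayobe command in the comment.

```shell
# Assumption: pipelining is disabled through Ansible's ANSIBLE_PIPELINING setting.
# Exporting it avoids touching ansible.cfg in the Kayobe configuration.
export ANSIBLE_PIPELINING=False

# Any Kayobe or Ansible command started from this shell inherits the setting, e.g.:
#   kayobe overcloud service deploy
echo "ANSIBLE_PIPELINING=${ANSIBLE_PIPELINING}"
```

Because the variable only affects processes started from the same shell, it needs to be set in whatever environment actually launches the runs, such as a CI job definition or an activation script.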

I've commonly hit this when configuring prometheus:

```
TASK [prometheus : Get container facts] *************************************************************************************************************************************
Monday 24 June 2024 11:09:37 +0000 (0:00:08.528) 0:01:31.707 ***********
fatal: [will-compute-01]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
fatal: [will-compute-02]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
```

The ControlPersist workaround is documented in these bug reports:

- ansible/ansible#78344
- ansible/ansible#81777

From the comments, it seems that this does not completely resolve the issue, but it does reduce how often you hit it.

The Prometheus tasks seem particularly susceptible as they run on every host.

(cherry picked from commit 699769c)

yoga: Backport Add workaround for rc: -13 (#1108)

jovial requested a review from a team as a code owner June 24, 2024 12:11

jovial marked this pull request as draft June 24, 2024 15:37

jovial marked this pull request as ready for review June 26, 2024 14:22

markgoddard approved these changes Jun 26, 2024

jovial merged commit 699769c into stackhpc/2023.1 Jun 27, 2024
12 checks passed

jovial deleted the workaround/2023.1/rc-13 branch June 27, 2024 12:46

markgoddard mentioned this pull request Aug 27, 2024

Workaround: Disable SSH pipelining stackhpc/terraform-kayobe-multinode#68

Merged

markgoddard added a commit that referenced this pull request Sep 4, 2024

Merge pull request #1256 from stackhpc/yoga-workaround-rc-13

e643067

yoga: Backport Add workaround for rc: -13 (#1108)

This was referenced Sep 6, 2024

2023.1: zed merge #1265

Merged

2024.1: 2023.1 merge #1266

Merged
