
Add workaround for rc: -13 #1108

Merged
merged 1 commit into from
Jun 27, 2024
@jovial jovial commented Jun 24, 2024

I've commonly hit this when configuring prometheus:

```
TASK [prometheus : Get container facts] *************************************************************************************************************************************
Monday 24 June 2024  11:09:37 +0000 (0:00:08.528)       0:01:31.707 ***********
fatal: [will-compute-01]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
fatal: [will-compute-02]: FAILED! => changed=false
  module_stderr: ''
  module_stdout: ''
  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13
```

The ControlPersist workaround is documented in these bug reports:

- ansible/ansible#78344
- ansible/ansible#81777

From the comments, it seems that this does not completely resolve the
issue, but it does reduce how often you hit it.

The Prometheus tasks seem particularly susceptible as they run on
every host.
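
As a rough illustration of the ControlPersist workaround, the sketch below raises the SSH master connection lifetime from Ansible's default of 60 seconds to one hour, the value that later commit messages in this thread cite. It is an assumption that the setting is passed through the `ANSIBLE_SSH_ARGS` environment variable; the change itself may set the equivalent option in `ansible.cfg` instead.

```shell
# Assumption: the option is supplied via ANSIBLE_SSH_ARGS for illustration;
# the actual change may configure it in ansible.cfg instead.
# Ansible's default ssh_args are "-C -o ControlMaster=auto -o ControlPersist=60s";
# only the ControlPersist value differs here.
export ANSIBLE_SSH_ARGS="-C -o ControlMaster=auto -o ControlPersist=1h"
```

Overriding `ssh_args` replaces the defaults rather than extending them, which is why the `-C` and `ControlMaster=auto` options are repeated.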
@jovial jovial requested a review from a team as a code owner June 24, 2024 12:11

jovial commented Jun 24, 2024

Worth noting that the other related settings have sensible defaults: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/ssh_connection.html#parameter-control_path

@Alex-Welsh
Member

How well tested is this change? Is it safer to bring it in with Caracal?


jovial commented Jun 24, 2024

> How well tested is this change? Is it safer to bring it in with Caracal?

Currently testing with multinode. I will mark as draft for now as I still haven't reached the end of a multinode deployment due to various unrelated issues :D

@jovial jovial marked this pull request as draft June 24, 2024 15:37
@jovial jovial marked this pull request as ready for review June 26, 2024 14:22

jovial commented Jun 26, 2024

This seems to have made my multinode deployments more reliable - I've completed several runs end to end now.

@jovial jovial merged commit 699769c into stackhpc/2023.1 Jun 27, 2024
12 checks passed
@jovial jovial deleted the workaround/2023.1/rc-13 branch June 27, 2024 12:46

grzegorzkoper commented Jul 10, 2024

Still hitting this issue when trying to deploy multinode on:

ansible [core 2.14.11]
  config file = None
  configured module search path = ['/home/cloud-user/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/cloud-user/venvs/kayobe/lib64/python3.9/site-packages/ansible
  ansible collection location = /home/cloud-user/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/cloud-user/venvs/kayobe/bin/ansible
  python version = 3.9.18 (main, May 16 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)] (/home/cloud-user/venvs/kayobe/bin/python3)
  jinja version = 3.1.4
  libyaml = True

Is it worth bumping it even more? I am testing with 2h now.

@Alex-Welsh
Member

Just hit this in the Groningen Habrok Antelope upgrade

markgoddard added a commit to stackhpc/terraform-kayobe-multinode that referenced this pull request Aug 27, 2024
There is a race condition in Ansible that can result in this failure:

  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13

See ansible/ansible#78344 and
ansible/ansible#81777.

In stackhpc/stackhpc-kayobe-config#1108 we applied
a workaround to increase the ControlPersist timeout to 1 hour, but this
does not always work.

Try another workaround of removing the ControlPersist sockets in between
Kayobe runs.
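
A minimal sketch of what removing the ControlPersist sockets between runs could look like. The helper name `clear_control_sockets` is hypothetical, and the sketch assumes the sockets live in Ansible's default control path directory, `~/.ansible/cp`; the referenced commit may do this differently.

```shell
# Hypothetical helper: remove leftover SSH control sockets before the next run.
# Assumption: sockets live in Ansible's default control path directory,
# ~/.ansible/cp; pass another directory as the first argument if yours differs.
clear_control_sockets() {
    cp_dir="${1:-$HOME/.ansible/cp}"
    # Nothing to do if the directory has not been created yet.
    [ -d "$cp_dir" ] || return 0
    # Delete only socket files (-type s), leaving anything else untouched.
    find "$cp_dir" -type s -exec rm -f {} +
}
```

Calling `clear_control_sockets` with no argument between Kayobe runs forces the next run to open fresh SSH master connections instead of reusing ones that may be about to expire.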
markgoddard added a commit to stackhpc/terraform-kayobe-multinode that referenced this pull request Aug 27, 2024
There is a race condition in Ansible that can result in this failure:

  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13

See ansible/ansible#78344 and
ansible/ansible#81777.

In stackhpc/stackhpc-kayobe-config#1108 we
applied a workaround to increase the ControlPersist timeout to 1 hour,
but this does not always work.

Disabling SSH pipelining prevents the issue at the cost of Ansible
execution duration.
markgoddard pushed a commit that referenced this pull request Aug 27, 2024
(cherry picked from commit 699769c)
markgoddard added a commit to stackhpc/terraform-kayobe-multinode that referenced this pull request Aug 29, 2024
markgoddard added a commit to stackhpc/terraform-kayobe-multinode that referenced this pull request Sep 2, 2024
There is a race condition in Ansible that can result in this failure:

  msg: |-
    MODULE FAILURE
    See stdout/stderr for the exact error
  rc: -13

See ansible/ansible#78344 and
ansible/ansible#81777.

In stackhpc/stackhpc-kayobe-config#1108 we applied
a workaround to increase the ControlPersist timeout to 1 hour, but this
does not always work.

Here we use a different workaround of disabling SSH pipelining. This has
performance implications for Ansible, but is a reasonable trade-off for
reliability.

We set the config option as an environment variable rather than in
ansible.cfg in Kayobe configuration, to avoid a merge conflict on upgrade.
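
For illustration, disabling pipelining through the environment might look like the sketch below. It assumes Ansible's `ANSIBLE_PIPELINING` variable is the one used; the commit message does not name the exact variable.

```shell
# Assumption: ANSIBLE_PIPELINING is the variable in question; the commit
# message only says the option is set through the environment.
# Export it in the shell that launches Kayobe so Ansible inherits it.
export ANSIBLE_PIPELINING=False
```

Because the value comes from the environment, the `ansible.cfg` tracked in Kayobe configuration stays unmodified, which is what avoids the merge conflict on upgrade.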
markgoddard pushed a commit that referenced this pull request Sep 2, 2024
(cherry picked from commit 699769c)
markgoddard added a commit that referenced this pull request Sep 4, 2024
yoga: Backport Add workaround for rc: -13 (#1108)
This was referenced Sep 6, 2024