Commit 112aa6e

bertiethorpe, sjpb and m-bull authored
Support compute node rebuild/reboot via Slurm RebootProgram (#553)
* add rebuild role to appliance and modify groupvars
* improve readability of group_vars
* Define login nodes using an opentofu module (#547)
  * define login nodes using tf module
  * Apply suggestions from code review (Co-authored-by: Matt Anson <[email protected]>)
  * tweak README to explain compute groups
  * try to clarify login/compute groups
* Change docs/ references from Terraform to OpenTofu (#544)
  * change terraform references to opentofu in docs
  * remove wider reference to terraform
  * Update environments/README.md (Co-authored-by: Steve Brasier <[email protected]>)
  * Update environments/common/README.md (Co-authored-by: Steve Brasier <[email protected]>)
* fix instance_id in compute inventory to be target image, not deployed image
* review all roles for compute_init_enable
* fix permissions to /exports/cluster
* make openhpc_config more greppable
* Set ResumeTimeout and ReturnToService overrides in group_vars
* CI tests for reboot via slurm (without rebuild)
* pin Rocky 8 pytools venv version
* refining comments and task names
* rebuild role readme

Co-authored-by: Steve Brasier <[email protected]>
Co-authored-by: Matt Anson <[email protected]>
Co-authored-by: Steve Brasier <[email protected]>
1 parent 7c831c7 commit 112aa6e
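The rebuild/reboot behaviour hinges on Slurm configuration overrides applied through group_vars, as noted in the commit message ("Set ResumeTimeout and ReturnToService overrides in group_vars"). A minimal sketch of what such an override could look like is below; `RebootProgram`, `ResumeTimeout` and `ReturnToService` are standard slurm.conf parameters, but the script path and values shown are illustrative assumptions rather than the settings introduced by this commit:

```yaml
# Hypothetical group_vars snippet - assumes openhpc_config entries are merged
# into slurm.conf by the openhpc role; the path and values are placeholders.
openhpc_config:
  RebootProgram: /opt/slurm-tools/bin/reboot-or-rebuild  # placeholder path to a reboot/rebuild helper
  ResumeTimeout: 300      # allow enough time for an OpenStack rebuild, not just a reboot
  ReturnToService: 2      # let a DOWN node return to service once it re-registers
```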

File tree

20 files changed: +233 −72 lines

.github/workflows/stackhpc.yml
Lines changed: 2 additions & 1 deletion

@@ -178,12 +178,13 @@ jobs:
           ansible-playbook -v ansible/site.yml
           ansible-playbook -v ansible/ci/check_slurm.yml

-      - name: Test reimage of compute nodes and compute-init (via rebuild adhoc)
+      - name: Test compute node reboot and compute-init
        run: |
          . venv/bin/activate
          . environments/.stackhpc/activate
          ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml
          ansible-playbook -v ansible/ci/check_slurm.yml
+         ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml

      - name: Check sacct state survived reimage
        run: |

ansible/.gitignore
Lines changed: 2 additions & 0 deletions

@@ -80,3 +80,5 @@ roles/*
 !roles/slurm_stats/**
 !roles/pytools/
 !roles/pytools/**
+!roles/rebuild/
+!roles/rebuild/**

ansible/adhoc/reboot_via_slurm.yml
Lines changed: 24 additions & 0 deletions

@@ -0,0 +1,24 @@
+# Reboot compute nodes via slurm. Nodes will be rebuilt if `image_id` in inventory is different to the currently-provisioned image.
+# Example:
+#   ansible-playbook -v ansible/adhoc/reboot_via_slurm.yml
+
+- hosts: login
+  run_once: true
+  become: yes
+  gather_facts: no
+  tasks:
+    - name: Submit a Slurm job to reboot compute nodes
+      ansible.builtin.shell: |
+        set -e
+        srun --reboot -N 2 uptime
+      become_user: root
+      register: slurm_result
+      failed_when: slurm_result.rc != 0
+
+    - name: Fetch Slurm controller logs if reboot fails
+      ansible.builtin.shell: |
+        journalctl -u slurmctld --since "10 minutes ago" | tail -n 50
+      become_user: root
+      register: slurm_logs
+      when: slurm_result.rc != 0
+      delegate_to: "{{ groups['control'] | first }}"

ansible/roles/compute_init/README.md
Lines changed: 103 additions & 24 deletions

@@ -1,11 +1,104 @@
-# EXPERIMENTAL: compute-init
-
-Experimental / in-progress functionality to allow compute nodes to rejoin the
-cluster after a reboot.
-
-To enable this add compute nodes (or a subset of them into) the `compute_init`
-group.
-
+# EXPERIMENTAL: compute_init
+
+Experimental functionality to allow compute nodes to rejoin the cluster after
+a reboot without running the `ansible/site.yml` playbook.
+
+To enable this:
+1. Add the `compute` group (or a subset) into the `compute_init` group. This is
+   the default when using cookiecutter to create an environment, via the
+   "everything" template.
+2. Build an image which includes the `compute_init` group. This is the case
+   for StackHPC-built release images.
+3. Enable the required functionalities during boot, by setting the
+   `compute_init_enable` property for a compute group in the
+   OpenTofu `compute` variable to a list which includes "compute", plus the
+   other roles/functionalities required, e.g.:
+
+   ```terraform
+   ...
+   compute = {
+       general = {
+           nodes = ["general-0", "general-1"]
+           compute_init_enable = ["compute", ... ] # see below
+       }
+   }
+   ...
+   ```
+
+## Supported appliance functionalities
+
+The string "compute" must be present in the `compute_init_enable` flag to enable
+this functionality. The table below shows which other appliance functionalities
+are currently supported - use the name in the role column to enable these.
+
+| Playbook | Role (or functionality) | Support |
+| -------------------------|-------------------------|-----------------|
+| hooks/pre.yml | ? | None at present |
+| validate.yml | n/a | Not relevant during boot |
+| bootstrap.yml | (wait for ansible-init) | Not relevant during boot |
+| bootstrap.yml | resolv_conf | Fully supported |
+| bootstrap.yml | etc_hosts | Fully supported |
+| bootstrap.yml | proxy | None at present |
+| bootstrap.yml | (/etc permissions) | None required - use image build |
+| bootstrap.yml | (ssh /home fix) | None required - use image build |
+| bootstrap.yml | (system users) | None required - use image build |
+| bootstrap.yml | systemd | None required - use image build |
+| bootstrap.yml | selinux | None required - use image build |
+| bootstrap.yml | sshd | None at present |
+| bootstrap.yml | dnf_repos | None at present (requirement TBD) |
+| bootstrap.yml | squid | Not relevant for compute nodes |
+| bootstrap.yml | tuned | None |
+| bootstrap.yml | freeipa_server | Not relevant for compute nodes |
+| bootstrap.yml | cockpit | None required - use image build |
+| bootstrap.yml | firewalld | Not relevant for compute nodes |
+| bootstrap.yml | fail2ban | Not relevant for compute nodes |
+| bootstrap.yml | podman | Not relevant for compute nodes |
+| bootstrap.yml | update | Not relevant during boot |
+| bootstrap.yml | reboot | Not relevant for compute nodes |
+| bootstrap.yml | ofed | Not relevant during boot |
+| bootstrap.yml | ansible_init (install) | Not relevant during boot |
+| bootstrap.yml | k3s (install) | Not relevant during boot |
+| hooks/post-bootstrap.yml | ? | None at present |
+| iam.yml | freeipa_client | None at present [1] |
+| iam.yml | freeipa_server | Not relevant for compute nodes |
+| iam.yml | sssd | None at present |
+| filesystems.yml | block_devices | None required - role deprecated |
+| filesystems.yml | nfs | All client functionality |
+| filesystems.yml | manila | All functionality |
+| filesystems.yml | lustre | None at present |
+| extras.yml | basic_users | All functionality [2] |
+| extras.yml | eessi | All functionality [3] |
+| extras.yml | cuda | None required - use image build [4] |
+| extras.yml | persist_hostkeys | Not expected to be required for compute nodes |
+| extras.yml | compute_init (export) | Not relevant for compute nodes |
+| extras.yml | k9s (install) | Not relevant during boot |
+| extras.yml | extra_packages | None at present. Would require dnf_repos |
+| slurm.yml | mysql | Not relevant for compute nodes |
+| slurm.yml | rebuild | Not relevant for compute nodes |
+| slurm.yml | openhpc [5] | All slurmd-related functionality |
+| slurm.yml | (set memory limits) | None at present |
+| slurm.yml | (block ssh) | None at present |
+| portal.yml | (openondemand server) | Not relevant for compute nodes |
+| portal.yml | (openondemand vnc desktop) | None required - use image build |
+| portal.yml | (openondemand jupyter server) | None required - use image build |
+| monitoring.yml | (all monitoring) | None at present [6] |
+| disable-repos.yml | dnf_repos | None at present (requirement TBD) |
+| hooks/post.yml | ? | None at present |
+
+
+Notes:
+1. FreeIPA client functionality would be better provided using a client fork
+   which uses pkinit keys rather than OTP to reenrol nodes.
+2. Assumes home directory already exists on shared storage.
+3. Assumes `cvmfs_config` is the same on control node and all compute nodes.
+4. If the `cuda` role was run during build, nvidia-persistenced is enabled
+   and will start during boot.
+5. `openhpc` does not need to be added to `compute_init_enable`; this is
+   automatically enabled by adding `compute`.
+6. Only node-exporter tasks are relevant, and will be done via k3s in a future release.
+
+
+## Approach
 This works as follows:
 1. During image build, an ansible-init playbook and supporting files
    (e.g. templates, filters, etc) are installed.

@@ -31,21 +124,7 @@ The check in 4b. above is what prevents the compute-init script from trying
 to configure the node before the services on the control node are available
 (which requires running the site.yml playbook).

-The following roles/groups are currently fully functional:
-- `resolv_conf`: all functionality
-- `etc_hosts`: all functionality
-- `nfs`: client functionality only
-- `manila`: all functionality
-- `basic_users`: all functionality, assumes home directory already exists on
-  shared storage
-- `eessi`: all functionality, assumes `cvmfs_config` is the same on control
-  node and all compute nodes.
-- `openhpc`: all functionality
-
-The above may be enabled by setting the compute_init_enable property on the
-tofu compute variable.
-
-# Development/debugging
+## Development/debugging

 To develop/debug changes to the compute script without actually having to build
 a new image:

@@ -83,7 +162,7 @@ reimage the compute node(s) first as in step 2 and/or add additional metadata
 as in step 3.


-# Design notes
+## Design notes
 - Duplicating code in roles into the `compute-init` script is unfortunate, but
   does allow developing this functionality without wider changes to the
   appliance.

ansible/roles/compute_init/tasks/export.yml
Lines changed: 11 additions & 5 deletions

@@ -2,9 +2,9 @@
   file:
     path: /exports/cluster
     state: directory
-    owner: root
+    owner: slurm
     group: root
-    mode: u=rwX,go=
+    mode: u=rX,g=rwX,o=
   run_once: true
   delegate_to: "{{ groups['control'] | first }}"

@@ -23,21 +23,27 @@
   file:
     path: /exports/cluster/hostvars/{{ inventory_hostname }}/
     state: directory
-    mode: u=rwX,go=
-    # TODO: owner,mode,etc
+    owner: slurm
+    group: root
+    mode: u=rX,g=rwX,o=
   delegate_to: "{{ groups['control'] | first }}"

- name: Template out hostvars
  template:
    src: hostvars.yml.j2
    dest: /exports/cluster/hostvars/{{ inventory_hostname }}/hostvars.yml
-    mode: u=rw,go=
+    owner: slurm
+    group: root
+    mode: u=r,g=rw,o=
  delegate_to: "{{ groups['control'] | first }}"

- name: Copy manila share info to /exports/cluster
  copy:
    content: "{{ os_manila_mount_share_info_var | to_nice_yaml }}"
    dest: /exports/cluster/manila_share_info.yml
+   owner: root
+   group: root
+   mode: u=rw,g=r
  run_once: true
  delegate_to: "{{ groups['control'] | first }}"
  when: os_manila_mount_share_info is defined

ansible/roles/rebuild/README.md
Lines changed: 30 additions & 0 deletions

@@ -0,0 +1,30 @@
+rebuild
+=========
+
+Enables the reboot tool from https://github.com/stackhpc/slurm-openstack-tools.git to be run from the control node.
+
+Requirements
+------------
+
+A clouds.yaml file.
+
+Role Variables
+--------------
+
+- `openhpc_rebuild_clouds`: Path to a clouds.yaml file.
+
+
+Example Playbook
+----------------
+
+    - hosts: control
+      become: yes
+      tasks:
+        - import_role:
+            name: rebuild
+
+License
+-------
+
+Apache-2.0
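For reference, the clouds.yaml this role copies into /etc/openstack is a standard OpenStack client configuration file. A minimal sketch is shown below; the cloud name, endpoint and application-credential values are placeholders, not taken from this commit:

```yaml
# Illustrative clouds.yaml only - all values below are placeholders.
clouds:
  openstack:
    auth:
      auth_url: https://keystone.example.com:5000
      application_credential_id: "abc123"
      application_credential_secret: "example-secret"
    region_name: RegionOne
    interface: public
    identity_api_version: 3
    auth_type: v3applicationcredential
```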
Lines changed: 2 additions & 0 deletions

@@ -0,0 +1,2 @@
+---
+openhpc_rebuild_clouds: ~/.config/openstack/clouds.yaml

ansible/roles/rebuild/tasks/main.yml
Lines changed: 21 additions & 0 deletions

@@ -0,0 +1,21 @@
+---
+
+- name: Create /etc/openstack
+  file:
+    path: /etc/openstack
+    state: directory
+    owner: slurm
+    group: root
+    mode: u=rX,g=rwX
+
+- name: Copy out clouds.yaml
+  copy:
+    src: "{{ openhpc_rebuild_clouds }}"
+    dest: /etc/openstack/clouds.yaml
+    owner: slurm
+    group: root
+    mode: u=r,g=rw
+
+- name: Setup slurm tools
+  include_role:
+    name: slurm_tools

ansible/roles/slurm_stats/README.md
Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@ Example Playbook
     - hosts: compute
       tasks:
         - import_role:
-            name: stackhpc.slurm_openstack_tools.slurm-stats
+            name: slurm_stats


 License

ansible/roles/slurm_tools/.travis.yml

Lines changed: 0 additions & 29 deletions
This file was deleted.

ansible/roles/slurm_tools/tasks/main.yml
Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@
   module_defaults:
     ansible.builtin.pip:
       virtualenv: /opt/slurm-tools
-      virtualenv_command: python3 -m venv
+      virtualenv_command: "{{ 'python3.9 -m venv' if ansible_distribution_major_version == '8' else 'python3 -m venv' }}"
      state: latest
  become: true
  become_user: "{{ pytools_user }}"

ansible/slurm.yml
Lines changed: 10 additions & 0 deletions

@@ -9,6 +9,16 @@
     - include_role:
         name: mysql

+- name: Setup slurm-driven rebuild
+  hosts: rebuild:!builder
+  become: yes
+  tags:
+    - rebuild
+    - openhpc
+  tasks:
+    - import_role:
+        name: rebuild
+
 - name: Setup slurm
   hosts: openhpc
   become: yes

environments/.stackhpc/inventory/extra_groups
Lines changed: 3 additions & 4 deletions

@@ -1,10 +1,6 @@
 [basic_users:children]
 cluster

-[rebuild:children]
-control
-compute
-
 [etc_hosts:children]
 cluster

@@ -35,3 +31,6 @@ builder
 [sssd:children]
 # Install sssd into fat image
 builder
+
+[rebuild:children]
+control
Lines changed: 6 additions & 2 deletions

@@ -1,4 +1,8 @@
-cluster_net = "stackhpc-ipv4-geneve"
-cluster_subnet = "stackhpc-ipv4-geneve-subnet"
+cluster_networks = [
+  {
+    network = "stackhpc-ipv4-geneve"
+    subnet = "stackhpc-ipv4-geneve-subnet"
+  }
+]
 control_node_flavor = "general.v1.small"
 other_node_flavor = "general.v1.small"

environments/.stackhpc/tofu/main.tf
Lines changed: 1 addition & 1 deletion

@@ -81,7 +81,7 @@ module "cluster" {
       nodes: ["compute-0", "compute-1"]
       flavor: var.other_node_flavor
       compute_init_enable: ["compute", "etc_hosts", "nfs", "basic_users", "eessi"]
-      # ignore_image_changes: true
+      ignore_image_changes: true
     }
     # Example of how to add another partition:
     # extra: {
