|
2 | 2 |
|
3 | 3 | See the role README.md
|
4 | 4 |
|
5 |
| -# Results/progress |
| 5 | +# CI workflow |
6 | 6 |
|
7 |
| -Without any metadata: |
| 7 | +The compute node rebuild is tested in CI after the tests for rebuilding the |
| 8 | +login and control nodes. The process is as follows: |
8 | 9 |
|
9 |
| - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
10 |
| - ● ansible-init.service |
11 |
| - Loaded: loaded (/etc/systemd/system/ansible-init.service; enabled; preset: disabled) |
12 |
| - Active: activating (start) since Fri 2024-12-13 20:41:16 UTC; 1min 45s ago |
13 |
| - Main PID: 16089 (ansible-init) |
14 |
| - Tasks: 8 (limit: 10912) |
15 |
| - Memory: 99.5M |
16 |
| - CPU: 11.687s |
17 |
| - CGroup: /system.slice/ansible-init.service |
18 |
| - ├─16089 /usr/lib/ansible-init/bin/python /usr/bin/ansible-init |
19 |
| - ├─16273 /usr/lib/ansible-init/bin/python3.9 /usr/lib/ansible-init/bin/ansible-playbook --connection local --inventory 127.0.0.1, /etc/ansible-init/playbooks/1-compute-init.yml |
20 |
| - ├─16350 /usr/lib/ansible-init/bin/python3.9 /usr/lib/ansible-init/bin/ansible-playbook --connection local --inventory 127.0.0.1, /etc/ansible-init/playbooks/1-compute-init.yml |
21 |
| - ├─16361 /bin/sh -c "/usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1734122485.9542894-16350-45936546411977/AnsiballZ_mount.py && sleep 0" |
22 |
| - ├─16362 /usr/bin/python3 /root/.ansible/tmp/ansible-tmp-1734122485.9542894-16350-45936546411977/AnsiballZ_mount.py |
23 |
| - ├─16363 /usr/bin/mount /mnt/cluster |
24 |
| - └─16364 /sbin/mount.nfs 192.168.10.12:/exports/cluster /mnt/cluster -o ro,sync |
| 10 | +1. Compute nodes are reimaged: |
25 | 11 |
|
26 |
| - Dec 13 20:41:24 rl9-compute-0.rl9.invalid ansible-init[16273]: ok: [127.0.0.1] |
27 |
| - Dec 13 20:41:24 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [Report skipping initialization if not compute node] ********************** |
28 |
| - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: skipping: [127.0.0.1] |
29 |
| - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [meta] ******************************************************************** |
30 |
| - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: skipping: [127.0.0.1] |
31 |
| - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [Ensure the mount directory exists] *************************************** |
32 |
| - Dec 13 20:41:25 rl9-compute-0.rl9.invalid python3[16346]: ansible-file Invoked with path=/mnt/cluster state=directory owner=root group=root mode=u=rwX,go= recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access> |
33 |
| - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: changed: [127.0.0.1] |
34 |
| - Dec 13 20:41:25 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [Mount /mnt/cluster] ****************************************************** |
35 |
| - Dec 13 20:41:26 rl9-compute-0.rl9.invalid python3[16362]: ansible-mount Invoked with path=/mnt/cluster src=192.168.10.12:/exports/cluster fstype=nfs opts=ro,sync state=mounted boot=True dump=0 passno=0 backup=False fstab=None |
36 |
| - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
| 12 | + ansible-playbook -v --limit compute ansible/adhoc/rebuild.yml |
37 | 13 |
|
38 |
| -Added metadata via horizon: |
| 14 | +2. Ansible-init runs on the newly reimaged compute nodes.
39 | 15 |
|
40 |
| - compute_groups ["compute"] |
| 16 | +3. Run sinfo and check that the nodes have the expected Slurm state:
41 | 17 |
|
42 |
| - |
43 |
| -OK: |
44 |
| - |
45 |
| - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
46 |
| - ● ansible-init.service |
47 |
| - Loaded: loaded (/etc/systemd/system/ansible-init.service; enabled; preset: disabled) |
48 |
| - Active: active (exited) since Fri 2024-12-13 20:43:31 UTC; 33s ago |
49 |
| - Process: 16089 ExecStart=/usr/bin/ansible-init (code=exited, status=0/SUCCESS) |
50 |
| - Main PID: 16089 (code=exited, status=0/SUCCESS) |
51 |
| - CPU: 13.003s |
52 |
| - |
53 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: ok: [127.0.0.1] => { |
54 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: "msg": "Skipping compute initialization as cannot mount exports/cluster share" |
55 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: } |
56 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: TASK [meta] ******************************************************************** |
57 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: PLAY RECAP ********************************************************************* |
58 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16273]: 127.0.0.1 : ok=4 changed=1 unreachable=0 failed=0 skipped=1 rescued=0 ignored=1 |
59 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16089]: [INFO] executing remote playbooks for stage - post |
60 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16089]: [INFO] writing sentinel file /var/lib/ansible-init.done |
61 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid ansible-init[16089]: [INFO] ansible-init completed successfully |
62 |
| - Dec 13 20:43:31 rl9-compute-0.rl9.invalid systemd[1]: Finished ansible-init.service. |
63 |
| - |
64 |
| -Now run site.yml, then restart ansible-init again: |
65 |
| - |
66 |
| - |
67 |
| - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
68 |
| - ● ansible-init.service |
69 |
| - Loaded: loaded (/etc/systemd/system/ansible-init.service; enabled; preset: disabled) |
70 |
| - Active: active (exited) since Fri 2024-12-13 20:50:10 UTC; 11s ago |
71 |
| - Process: 18921 ExecStart=/usr/bin/ansible-init (code=exited, status=0/SUCCESS) |
72 |
| - Main PID: 18921 (code=exited, status=0/SUCCESS) |
73 |
| - CPU: 8.240s |
74 |
| - |
75 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: TASK [Report skipping initialization if cannot mount nfs] ********************** |
76 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: skipping: [127.0.0.1] |
77 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: TASK [meta] ******************************************************************** |
78 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: skipping: [127.0.0.1] |
79 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: PLAY RECAP ********************************************************************* |
80 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[19110]: 127.0.0.1 : ok=3 changed=1 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0 |
81 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[18921]: [INFO] executing remote playbooks for stage - post |
82 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[18921]: [INFO] writing sentinel file /var/lib/ansible-init.done |
83 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid ansible-init[18921]: [INFO] ansible-init completed successfully |
84 |
| - Dec 13 20:50:10 rl9-compute-0.rl9.invalid systemd[1]: Finished ansible-init.service. |
85 |
| - [root@rl9-compute-0 rocky]# ls /mnt/cluster/host |
86 |
| - hosts hostvars/ |
87 |
| - [root@rl9-compute-0 rocky]# ls /mnt/cluster/hostvars/rl9-compute- |
88 |
| - rl9-compute-0/ rl9-compute-1/ |
89 |
| - [root@rl9-compute-0 rocky]# ls /mnt/cluster/hostvars/rl9-compute- |
90 |
| - rl9-compute-0/ rl9-compute-1/ |
91 |
| - [root@rl9-compute-0 rocky]# ls /mnt/cluster/hostvars/rl9-compute-0/ |
92 |
| - hostvars.yml |
93 |
| - |
94 |
| -This commit - shows that hostvars have loaded: |
95 |
| - |
96 |
| - [root@rl9-compute-0 rocky]# systemctl status ansible-init |
97 |
| - ● ansible-init.service |
98 |
| - Loaded: loaded (/etc/systemd/system/ansible-init.service; enabled; preset: disabled) |
99 |
| - Active: active (exited) since Fri 2024-12-13 21:06:20 UTC; 5s ago |
100 |
| - Process: 27585 ExecStart=/usr/bin/ansible-init (code=exited, status=0/SUCCESS) |
101 |
| - Main PID: 27585 (code=exited, status=0/SUCCESS) |
102 |
| - CPU: 8.161s |
103 |
| - |
104 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: TASK [Demonstrate hostvars have loaded] **************************************** |
105 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: ok: [127.0.0.1] => { |
106 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: "prometheus_version": "2.27.0" |
107 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: } |
108 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: PLAY RECAP ********************************************************************* |
109 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27769]: 127.0.0.1 : ok=5 changed=0 unreachable=0 failed=0 skipped=2 rescued=0 ignored=0 |
110 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27585]: [INFO] executing remote playbooks for stage - post |
111 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27585]: [INFO] writing sentinel file /var/lib/ansible-init.done |
112 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid ansible-init[27585]: [INFO] ansible-init completed successfully |
113 |
| - Dec 13 21:06:20 rl9-compute-0.rl9.invalid systemd[1]: Finished ansible-init.service. |
| 18 | + ansible-playbook -v ansible/ci/check_slurm.yml |
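The state check in step 3 could look roughly like the sketch below. It is not the contents of `ansible/ci/check_slurm.yml`, which this diff does not show. It assumes `sinfo` is asked for one `<node> <state>` line per node (`--noheader --Node --format="%N %T"`) and that `idle`, `mixed` and `allocated` count as expected states. The function names and the set of expected states are hypothetical.

```python
import subprocess

# Assumption: these states count as "expected". The real CI check may use
# different criteria.
EXPECTED_STATES = {"idle", "mixed", "allocated"}


def parse_sinfo(output):
    """Map node name -> state from lines of the form '<node> <state>'."""
    states = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        node, state = fields
        states[node] = state
    return states


def unexpected_nodes(states, expected=EXPECTED_STATES):
    """Return only the nodes whose state is not in the expected set."""
    return {node: state for node, state in states.items() if state not in expected}


def query_sinfo():
    """Run sinfo, printing one line per node: hostname and extended state."""
    result = subprocess.run(
        ["sinfo", "--noheader", "--Node", "--format=%N %T"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def main():
    """Exit with an error message if any node is in an unexpected state."""
    bad = unexpected_nodes(parse_sinfo(query_sinfo()))
    if bad:
        raise SystemExit(f"Nodes in unexpected state: {bad}")
    print("All nodes in expected state")
```

Calling `main()` on a host with Slurm client tools would perform the check. A state carrying a flag suffix such as `down*` is not in the expected set, so it is reported as unexpected.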