
Commit 9299e45

add upgrade docs

1 parent 6f1554c commit 9299e45

File tree

1 file changed: +92 −0 lines

docs/upgrades.md

# Upgrades

This document explains the generic steps required to upgrade a deployment of the Slurm Appliance with upstream changes from StackHPC.
Generally, upstream releases will happen roughly monthly. Releases may contain new functionality and/or updated images.

Any site-specific instructions in [docs/site/README.md](site/README.md) should be reviewed in tandem with this.

This document assumes the deployment repository has:
1. Remotes:
    - `origin` referring to the site-specific remote repository.
    - `stackhpc` referring to the StackHPC repository at https://github.com/stackhpc/ansible-slurm-appliance.git.
2. Branches:
    - `main` - following `origin/main`, the current site-specific code deployed to production.
    - `upstream` - following `stackhpc/main`, i.e. the upstream `main` branch from `stackhpc`.

It also assumes the site has `staging` and `production` environments.
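
If the `stackhpc` remote or the `upstream` branch do not already exist, they can be created along the following lines (a minimal sketch, run from an existing clone of the site repository with `origin` already configured):

    # add the upstream remote, fetch its branches and tags, and create a tracking branch
    git remote add stackhpc https://github.com/stackhpc/ansible-slurm-appliance.git
    git fetch stackhpc --tags
    git checkout -b upstream stackhpc/main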

**NB:** Commands which should be run on the Slurm login node are shown below prefixed `[LOGIN]$`.
All other commands should be run on the Ansible deploy host.

1. Update the `upstream` branch from the `stackhpc` remote, including tags:

        git fetch stackhpc main --tags

1. Identify the latest release from the [Slurm appliance release page](https://github.com/stackhpc/ansible-slurm-appliance/releases). Below this is shown as `vX.Y`, which is the release tag.
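
   With the tags fetched in the previous step, the available release tags can also be listed locally as a quick check (not a substitute for reading the release notes), e.g.:

        git tag --list 'v*' --sort=-version:refname | head -n 5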

1. Ensure your local site branch is up to date and create a new branch from it for the
   site-specific release code:

        git checkout main
        git pull --prune
        git checkout -b update/vX.Y

1. Merge the upstream code into your release branch:

        git merge stackhpc/vX.Y

   It is possible this will introduce merge conflicts; fix these following the usual git
   prompts. Generally merge conflicts should only exist where functionality which was added
   for your site (not in a hook) has subsequently been merged upstream.
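
   If conflicts do occur, a typical resolution flow looks like the following (a sketch only; `<resolved-files>` is a placeholder):

        git diff --name-only --diff-filter=U   # list the conflicted files
        # ... edit those files to resolve the conflicts ...
        git add <resolved-files>
        git merge --continue
        # or, to abandon the merge entirely:
        # git merge --abort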

1. Push this branch and create a PR:

        git push
        # follow instructions
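
   If the new branch does not yet have an upstream set, `git push` will print the exact command required; it is equivalent to something like:

        git push --set-upstream origin update/vX.Y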

1. Review the PR to see if any added/changed functionality requires alteration of
   site-specific configuration. In general changes to existing functionality will aim to be
   backward compatible. Alteration of site-specific configuration will usually only be
   necessary to use new functionality or where functionality has been upstreamed as above.

   Make changes as necessary.
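
   One way to review what has changed upstream is to diff the new release against the release currently deployed (shown here as the placeholder `vA.B`), e.g.:

        git diff --stat vA.B vX.Y
        git log --oneline vA.B..vX.Y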

1. Download the relevant release image(s) using the link from the relevant [Slurm appliance release](https://github.com/stackhpc/ansible-slurm-appliance/releases), e.g.:

        wget https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_3a06571936a0424bb40bc5c672c4ccb1/openhpc-images/openhpc-ofed-RL8-240906-1042-32568dbb

   Note that some releases may not include new images. In this case use the image from the most recent previous release which did include new images.
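
   Depending on the site workflow, the downloaded image may then need to be uploaded to OpenStack before it can be referenced in configuration. A minimal sketch, assuming a qcow2 image and a configured OpenStack CLI:

        # upload the downloaded image under the same name (assumed qcow2 format)
        openstack image create --disk-format qcow2 --container-format bare \
            --file openhpc-ofed-RL8-240906-1042-32568dbb openhpc-ofed-RL8-240906-1042-32568dbb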

1. If required, build an "extra" image with local modifications. See site-specific instructions in [docs/site/README.md](site/README.md).

1. Modify your environments to use this image, test it in your staging cluster, and push commits to the PR created above. See site-specific instructions in [docs/site/README.md](site/README.md).

1. Declare a future outage window to cluster users and create a [Slurm reservation](https://slurm.schedmd.com/scontrol.html#lbAQ) to prevent jobs running during that window, e.g.:

        [LOGIN]$ sudo scontrol create reservation Flags=MAINT ReservationName="upgrade-vX.Y" StartTime=2024-10-16T08:00:00 EndTime=2024-10-16T10:00:00 Nodes=ALL Users=root
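
   The reservation can be checked afterwards with:

        [LOGIN]$ scontrol show reservation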

1. At the outage window, check there are no jobs running:

        [LOGIN]$ squeue

1. Deploy the branch created above to production. See site-specific instructions in [docs/site/README.md](site/README.md).

1. Check Slurm is up:

        [LOGIN]$ sinfo -R

   The `-R` shows the reason for any nodes being down.

1. If the above shows nodes down for having been "unexpectedly rebooted", resume them:

        [LOGIN]$ sudo scontrol update state=RESUME nodename=$HOSTLIST_EXPR

   where the hostlist expression might look like e.g. `general-[0-1]` to reset state for nodes 0 and 1 of the general partition.
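
   To check what a hostlist expression expands to before using it:

        [LOGIN]$ scontrol show hostnames 'general-[0-1]'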

1. Delete the reservation:

        [LOGIN]$ sudo scontrol delete ReservationName="upgrade-vX.Y"

1. Tell users the cluster is available again.
