# Upgrades

This document explains the generic steps required to upgrade a deployment of the Slurm Appliance with upstream changes from StackHPC.
Generally, upstream releases will happen roughly monthly. Releases may contain new functionality and/or updated images.

Any site-specific instructions in [docs/site/README.md](site/README.md) should be reviewed in tandem with this document.

This document assumes the deployment repository has the following (a setup sketch is shown below the list):
1. Remotes:
   - `origin` referring to the site-specific remote repository.
   - `stackhpc` referring to the StackHPC repository at https://github.com/stackhpc/ansible-slurm-appliance.git.
2. Branches:
   - `main` - following `origin/main`, the current site-specific code deployed to production.
   - `upstream` - following `stackhpc/main`, i.e. the upstream `main` branch from `stackhpc`.
3. The following environments:
   - `$PRODUCTION`: a production environment, as defined by e.g. `environments/production/`.
   - `$STAGING`: a staging environment, as defined by e.g. `environments/staging/`.
   - `$SITE_ENV`: a base site-specific environment, as defined by e.g. `environments/mysite/`.

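If these remotes and branches do not exist yet, they can be created along the following lines. This is a sketch only, assuming a fresh clone of the site repository as `origin`; the names are conventions, not requirements:

    # add the upstream StackHPC repository as a second remote
    git remote add stackhpc https://github.com/stackhpc/ansible-slurm-appliance.git
    # fetch upstream branches and release tags
    git fetch stackhpc --tags
    # create a local branch tracking the upstream main branch
    git branch --track upstream stackhpc/main
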
**NB:** Commands which should be run on the Slurm login node are shown below prefixed `[LOGIN]$`.
All other commands should be run on the Ansible deploy host.

1. Update the `upstream` branch from the `stackhpc` remote, including tags:

    git fetch stackhpc main --tags

1. Identify the latest release from the [Slurm appliance release page](https://github.com/stackhpc/ansible-slurm-appliance/releases). Below, this release is shown as `vX.Y`.

1. Ensure your local site branch is up to date and create a new branch from it for the
   site-specific release code:

    git checkout main
    git pull --prune
    git checkout -b update/vX.Y

1. Merge the upstream code into your release branch:

    git merge vX.Y

   It is possible this will introduce merge conflicts; fix these following the usual git
   prompts. Generally merge conflicts should only exist where functionality which was added
   for your site (not in a hook) has subsequently been merged upstream.

1. Push this branch and create a PR:

    git push -u origin update/vX.Y
    # then create a PR from this branch, e.g. using the link shown in the push output

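   Alternatively, as a sketch, the GitHub CLI can open the PR directly; this assumes `gh` is installed and authenticated against the site repository:

    gh pr create --base main --head update/vX.Y
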
1. Review the PR to see if any added/changed functionality requires alteration of
   site-specific configuration. In general, changes to existing functionality aim to be
   backward compatible. Alteration of site-specific configuration will usually only be
   necessary to use new functionality or where functionality has been upstreamed as above.

   Make changes as necessary.

1. Identify the image(s) from the relevant [Slurm appliance release](https://github.com/stackhpc/ansible-slurm-appliance/releases), and download
   each by appending the image name to the download link given in the release, e.g. for an image `openhpc-ofed-RL8-240906-1042-32568dbb`:

    wget https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_3a06571936a0424bb40bc5c672c4ccb1/openhpc-images/openhpc-ofed-RL8-240906-1042-32568dbb

   Note that some releases may not include new images. In this case use the image(s) from the latest previous release which included new images.

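   The downloaded image then needs to be available in the OpenStack project hosting the cluster. A minimal upload sketch, assuming the `openstack` CLI is configured for the target cloud and that the image is in qcow2 format (check the actual format of the downloaded file):

    openstack image create --disk-format qcow2 --container-format bare \
        --file openhpc-ofed-RL8-240906-1042-32568dbb openhpc-ofed-RL8-240906-1042-32568dbb
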
1. If required, build an "extra" image with local modifications; see [docs/image-build.md](./image-build.md).

1. Modify your site-specific environment to use this image, e.g. via `cluster_image_id` in `environments/$SITE_ENV/terraform/variables.tf`.

1. Test this in your staging cluster; a sketch of a typical deploy sequence is shown below.

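   This is a sketch only, assuming the environment layout described above and the usual appliance workflow of activating the environment, applying OpenTofu and then running the `site.yml` playbook; the exact commands are site-dependent (see the main [README.md](../README.md)):

    # activate the staging environment
    . environments/$STAGING/activate
    # reimage/recreate instances with the new image
    cd environments/$STAGING/terraform
    tofu apply
    cd -
    # reconfigure the cluster
    ansible-playbook ansible/site.yml
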
1. Commit changes and push to the PR created above.

1. Declare a future outage window to cluster users. A [Slurm reservation](https://slurm.schedmd.com/scontrol.html#lbAQ) can be
   used to prevent jobs running during that window, e.g.:

    [LOGIN]$ sudo scontrol create reservation Flags=MAINT ReservationName="upgrade-vX.Y" StartTime=2024-10-16T08:00:00 EndTime=2024-10-16T10:00:00 Nodes=ALL Users=root

   Note a reservation cannot be created if currently running jobs may overlap it (based on their job or partition time limits).

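   To confirm the reservation is in place (a standard Slurm query):

    [LOGIN]$ scontrol show reservation
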
1. At the outage window, check there are no jobs running:

    [LOGIN]$ squeue

1. Deploy the branch created above to production, i.e. activate the production environment, run OpenTofu to reimage or
   delete/recreate instances with the new images (depending on how the root disk is defined), and run Ansible's `site.yml`
   playbook to reconfigure the cluster, e.g. as described in the main [README.md](../README.md). A short sketch follows.

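   The same sequence as for staging, now against production (again a sketch; exact commands are site-dependent):

    . environments/$PRODUCTION/activate
    # then run OpenTofu and ansible/site.yml as for staging above
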
1. Check Slurm is up:

    [LOGIN]$ sinfo -R

   The `-R` shows the reason for any nodes being down.

1. If the above shows nodes down for having been "unexpectedly rebooted", resume them:

    [LOGIN]$ sudo scontrol update state=RESUME nodename=$HOSTLIST_EXPR

   where the hostlist expression might look like e.g. `general-[0-1]`, to reset the state of nodes 0 and 1 in the `general` partition.

1. Delete the reservation:

    [LOGIN]$ sudo scontrol delete ReservationName="upgrade-vX.Y"

1. Tell users the cluster is available again.