Add generic upgrade docs #462
Merged

# Upgrades

This document explains the generic steps required to upgrade a deployment of the Slurm Appliance with upstream changes from StackHPC. Upstream releases generally happen roughly monthly and may contain new functionality and/or updated images.

Any site-specific instructions in [docs/site/README.md](site/README.md) should be reviewed in tandem with this document.

This document assumes the deployment repository has:
1. Remotes:
    - `origin`, referring to the site-specific remote repository.
    - `stackhpc`, referring to the StackHPC repository at https://github.com/stackhpc/ansible-slurm-appliance.git.
2. Branches:
    - `main` - tracking `origin/main`, the current site-specific code deployed to production.
    - `upstream` - tracking `stackhpc/main`, i.e. the upstream `main` branch from the `stackhpc` remote.
3. The following environments:
    - `$PRODUCTION`: a production environment, defined by e.g. `environments/production/`.
    - `$STAGING`: a staging environment, defined by e.g. `environments/staging/`.
    - `$SITE_ENV`: a base site-specific environment, defined by e.g. `environments/mysite/`.
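If the `stackhpc` remote and `upstream` branch do not exist yet, the one-time setup can be sketched as below. This is purely illustrative: it substitutes a throwaway local repository for the real remote so the example is self-contained; at a real site the remote URL would be https://github.com/stackhpc/ansible-slurm-appliance.git.

```shell
set -e
# Stand-in for the StackHPC repository (in practice: the GitHub URL above).
tmp=$(mktemp -d)
git init -q -b main "$tmp/stackhpc-repo"
git -C "$tmp/stackhpc-repo" -c user.email=you@site -c user.name=you \
    commit -q --allow-empty -m "upstream initial commit"

# The site deployment repository:
git init -q -b main "$tmp/site-repo"
cd "$tmp/site-repo"

# One-time setup: add the remote and create a tracking 'upstream' branch.
git remote add stackhpc "$tmp/stackhpc-repo"
git fetch -q stackhpc
git branch --track upstream stackhpc/main
git branch -vv    # lists 'upstream' as tracking [stackhpc/main]
```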
**NB:** Commands which should be run on the Slurm login node are shown below prefixed `[LOGIN]$`. All other commands should be run on the Ansible deploy host.

1. Update the `upstream` branch from the `stackhpc` remote, including tags:

        git fetch stackhpc main --tags

1. Identify the latest release from the [Slurm appliance release page](https://github.com/stackhpc/ansible-slurm-appliance/releases). Below, this release is shown as `vX.Y`.

1. Ensure your local site branch is up to date and create a new branch from it for the site-specific release code:

        git checkout main
        git pull --prune
        git checkout -b update/vX.Y

1. Merge the upstream code into your release branch:

        git merge vX.Y

    It is possible this will introduce merge conflicts; fix these following the usual git prompts. Generally, merge conflicts should only exist where functionality which was added for your site (not in a hook) has subsequently been merged upstream.
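    For readers less familiar with merge conflicts, the following self-contained sketch (a throwaway repository and a hypothetical `config.yml`, not part of the appliance) shows what one looks like:

    ```shell
    set -e
    # Build a tiny repository where the same line changed on two branches.
    tmp=$(mktemp -d); cd "$tmp"
    git init -q -b main repo; cd repo
    gc() { git -c user.email=you@site -c user.name=you "$@"; }
    echo "image: v1" > config.yml
    git add config.yml; gc commit -qm "base"
    git checkout -qb upstream-change
    echo "image: v2-upstream" > config.yml
    gc commit -qam "upstream change"
    git checkout -q main
    echo "image: v2-site" > config.yml
    gc commit -qam "site change"
    git merge upstream-change || true   # git reports CONFLICT here
    grep '<<<<<<<' config.yml           # the file now contains conflict markers
    # Resolve by editing config.yml, then: git add config.yml && git commit
    ```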
1. Push this branch and create a PR:

        git push
        # follow instructions

1. Review the PR to see if any added/changed functionality requires alteration of site-specific configuration. In general, changes to existing functionality aim to be backward compatible. Alteration of site-specific configuration will usually only be necessary to use new functionality, or where functionality has been upstreamed as described above.

    Make changes as necessary.

1. Identify image(s) from the relevant [Slurm appliance release](https://github.com/stackhpc/ansible-slurm-appliance/releases), and download using the link on the release plus the image name, e.g. for an image `openhpc-ofed-RL8-240906-1042-32568dbb`:

        wget https://object.arcus.openstack.hpc.cam.ac.uk/swift/v1/AUTH_3a06571936a0424bb40bc5c672c4ccb1/openhpc-images/openhpc-ofed-RL8-240906-1042-32568dbb

    Note that some releases may not include new images. In this case use the image from the latest previous release which included new images.

1. If required, build an "extra" image with local modifications; see [docs/image-build.md](./image-build.md).

1. Modify your site-specific environment to use this image, e.g. via `cluster_image_id` in `environments/$SITE_ENV/terraform/variables.tf`.

1. Test this in your staging cluster.

1. Commit changes and push to the PR created above.

1. Declare a future outage window to cluster users. A [Slurm reservation](https://slurm.schedmd.com/scontrol.html#lbAQ) can be used to prevent jobs running during that window, e.g.:

        [LOGIN]$ sudo scontrol create reservation Flags=MAINT ReservationName="upgrade-vX.Y" StartTime=2024-10-16T08:00:00 EndTime=2024-10-16T10:00:00 Nodes=ALL Users=root

    Note that a reservation cannot be created if it may overlap with currently-running jobs (as defined by job or partition time limits).
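    The `StartTime`/`EndTime` values are ISO 8601 timestamps; if convenient they can be generated rather than typed by hand, e.g. with GNU `date` (an illustrative sketch, not part of the appliance; adjust the window to suit your outage):

    ```shell
    # Compute timestamps for a two-hour window starting tomorrow at 08:00.
    # Requires GNU date for the -d relative-date syntax.
    START=$(date -d 'tomorrow 08:00' '+%Y-%m-%dT%H:%M:%S')
    END=$(date -d 'tomorrow 10:00' '+%Y-%m-%dT%H:%M:%S')
    echo "StartTime=$START EndTime=$END"
    ```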
1. At the outage window, check there are no jobs running:

        [LOGIN]$ squeue

1. Deploy the branch created above to production: activate the production environment, run OpenTofu to reimage or delete/recreate instances with the new images (depending on how the root disk is defined), and run Ansible's `site.yml` playbook to reconfigure the cluster, e.g. as described in the main [README.md](../README.md).

1. Check Slurm is up:

        [LOGIN]$ sinfo -R

    The `-R` flag shows the reason for any nodes being down.

1. If the above shows nodes down due to having been "unexpectedly rebooted", resume them:

        [LOGIN]$ sudo scontrol update state=RESUME nodename=$HOSTLIST_EXPR

    where the hostlist expression might look like e.g. `general-[0-1]`, to reset the state of nodes 0 and 1 of the `general` partition.
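    Slurm expands such hostlist expressions itself (on the login node, `scontrol show hostnames 'general-[0-1]'` will print the individual hostnames). Purely as an illustration of the syntax, a simple numeric range expands as follows (`expand_hostlist` is a hypothetical helper for this sketch, not a Slurm or appliance command):

    ```shell
    # Illustrative expansion of a simple Slurm hostlist range such as
    # general-[0-1]; real hostlists also support comma lists and zero-padding.
    expand_hostlist() {
      local prefix=${1%%\[*}                   # "general-"
      local range=${1#*\[}; range=${range%\]}  # "0-1"
      seq -f "${prefix}%g" "${range%-*}" "${range#*-}"
    }
    expand_hostlist 'general-[0-1]'   # prints general-0 and general-1
    ```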
1. Delete the reservation:

        [LOGIN]$ sudo scontrol delete ReservationName="upgrade-vX.Y"

1. Tell users the cluster is available again.