Test upgrade from latest release to current branch image in CI #576

Merged
bertiethorpe merged 41 commits into main from ci/test-compute-init
Mar 28, 2025

Conversation

sjpb
Collaborator

@sjpb sjpb commented Feb 13, 2025

Tests compute-init via a Slurm-controlled reboot in CI:

  1. Provisions and configures the cluster from the latest release tag, then runs hpctests
  2. Checks out the repo at the current branch
  3. Re-provisions the cluster (ignore_image_changes: true, so only the control and login nodes are reimaged)
  4. Configures the cluster from the current branch
  5. Runs a Slurm job with --reboot, which reimages the compute nodes to match the current branch image
  6. compute-init runs in the background once the compute nodes have been reimaged, so that they rejoin the cluster
  7. Runs hpctests again
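
To make the image state at each stage easier to follow, here is a toy model of steps 3 and 5. It is only a sketch and not the appliance's code: the `Node` class, the function names, the node names and the image labels are all hypothetical, and it assumes `ignore_image_changes` applies to the compute group only, as the description of step 3 implies.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    group: str  # "control", "login" or "compute" (hypothetical group labels)
    image: str


def reprovision(nodes, target_image, ignore_image_changes=("compute",)):
    """Model step 3: groups that ignore image changes keep their old image."""
    for node in nodes:
        if node.group not in ignore_image_changes:
            node.image = target_image


def reboot_job(nodes, target_image):
    """Model step 5: the --reboot job reimages the compute nodes."""
    for node in nodes:
        if node.group == "compute":
            node.image = target_image


def stale(nodes, target_image):
    """Return the names of nodes not yet on the target image."""
    return [node.name for node in nodes if node.image != target_image]


# Step 1: every node starts on the (hypothetical) release image.
cluster = [
    Node("control", "control", "release"),
    Node("login-0", "login", "release"),
    Node("compute-0", "compute", "release"),
    Node("compute-1", "compute", "release"),
]

reprovision(cluster, "branch")
after_reprovision = stale(cluster, "branch")  # compute nodes still on release

reboot_job(cluster, "branch")
after_reboot = stale(cluster, "branch")  # nothing left on the release image
```

In this model only the compute nodes are stale between steps 3 and 5, which is the window the second hpctests run is meant to close.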

@sjpb sjpb force-pushed the ci/test-compute-init branch 2 times, most recently from 32eded3 to e9a38fd Compare February 13, 2025 13:25
@sjpb sjpb force-pushed the ci/test-compute-init branch from e9a38fd to 85eafcc Compare February 13, 2025 13:26
@sjpb sjpb force-pushed the ci/test-compute-init branch from bec0f4b to f0cd48f Compare February 13, 2025 13:50
@sjpb sjpb force-pushed the ci/test-compute-init branch from 6aa5d3d to 8be9087 Compare February 13, 2025 14:02
@sjpb sjpb force-pushed the ci/test-compute-init branch from cace350 to bf1ceed Compare February 13, 2025 14:36
@sjpb sjpb changed the title WIP: Use latest release for initial CI cluster setup Test upgrade from latest release to current branch image in CI Feb 13, 2025
@sjpb sjpb force-pushed the ci/test-compute-init branch from 63c35b7 to d504363 Compare February 14, 2025 09:34
@sjpb
Collaborator Author

sjpb commented Feb 14, 2025

First CI run above - RL8 worked, RL9 didn't. Both compute nodes got rebooted (not rebuilt, as there was no image change), then they froze up: I could ping them but couldn't ssh in. Rescued -1, set a root password, unrescued it, and then it started working (!) but then got OOM-killed on HPL.

Just trying the RL9 one again to see if we got unlucky with the cloud.

@sjpb
Collaborator Author

sjpb commented Feb 14, 2025

Ok it passed the 2nd time!

@sjpb sjpb force-pushed the ci/test-compute-init branch from 1355b5f to 0ac9de5 Compare February 14, 2025 16:45
@sjpb
Collaborator Author

sjpb commented Feb 14, 2025

Cancelled tests after rebase, building image: https://github.com/stackhpc/ansible-slurm-appliance/actions/runs/13333602095

NB: once the image is bumped, this should trigger a rebuild rather than a reimage

@bertiethorpe

This comment was marked as outdated.

@bertiethorpe
Member

bertiethorpe commented Mar 26, 2025

@bertiethorpe bertiethorpe marked this pull request as ready for review March 26, 2025 15:48
@bertiethorpe bertiethorpe requested a review from a team as a code owner March 26, 2025 15:48
Collaborator Author

@sjpb sjpb left a comment

Few comments.

@bertiethorpe bertiethorpe merged commit 0aec76c into main Mar 28, 2025
7 checks passed
@bertiethorpe bertiethorpe deleted the ci/test-compute-init branch March 28, 2025 12:45
2 participants