
Enable build of environment-specific control images #160


Merged: 21 commits into main, May 11, 2022
Conversation

@sjpb (Collaborator) commented on Mar 30, 2022

Enables building environment-specific control images in the same way as for login/compute builds.

Ticket: https://stackhpc.atlassian.net/browse/DEV-695

Note this does NOT move state off-board from the controller, so reimaging the controller will lose all state. However, it does at least provide an image which can be used to create a working controller for an existing cluster (regardless of e.g. upstream package changes).

Note that ansible/site.yml will need to be re-run after imaging a node with this image (a sketch follows the list below), to set:

  • Partition information in slurm.conf - even if compute nodes are up during the image build, they are not in the Packer play, so this information cannot be templated
  • Potentially the Prometheus scrape config, depending on which compute nodes were up during the image build
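
For illustration, a minimal sketch of that re-run, assuming the appliance's usual environment layout (the environment name "myenv" and the activate step are assumptions, not taken from this PR):

    # Re-run the site playbook so slurm.conf partitions and the Prometheus scrape
    # config are re-templated against the real compute nodes (sketch only).
    . environments/myenv/activate
    ansible-playbook ansible/site.yml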

Fixes #133. Replaces #136.

Requires:

Note that with openondemand enabled, building a control image BEFORE deploying the login node fails in the smslabs environment: grafana needs to know openondemand_servername, which requires the private IP (defined as .ansible_host) for the login node. This doesn't occur in CI because a direct deployment of control/login/2x compute is done first, which generates the hosts inventory file with this information.

This limitation could be fixed in a later PR / for other environments by changing the TF into two stages:

  1. Deploy ports to get private IPs and template the hosts file from this info (instead of from instances).
  2. Deploy instances.

Only step 1 would then need running before an image build (sketched below). That should also allow login & compute image builds without the control node actually existing.
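
As a rough sketch of how a two-stage apply could look using plain Terraform CLI targeting (the resource addresses are hypothetical, not this repo's actual TF):

    # Stage 1: create only the port resources, so private IPs exist and the hosts
    # inventory can be templated from them (resource addresses are illustrative).
    terraform apply -target=openstack_networking_port_v2.login -target=openstack_networking_port_v2.compute

    # ... run the image build against the templated inventory ...

    # Stage 2: full apply to create the instances themselves.
    terraform apply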

CI does not currently try the built image.

NB: CI is failing until requirements.yml is updated, but this is waiting on another PR to the openhpc role to merge before bumping the version.

Dev deployment: vglabs-steveb-ansible-rocky85:/home/rocky/slurm-app-control-images

@sjpb mentioned this pull request on Mar 30, 2022
@sjpb (Collaborator, Author) commented on Mar 30, 2022

Currently failing on:

    openstack.control: TASK [stackhpc.openhpc : Template basic slurm.conf] ****************************
    openstack.control: task path: /home/runner/work/ansible-slurm-appliance/ansible-slurm-appliance/ansible/roles/stackhpc.openhpc/tasks/runtime.yml:84
    openstack.control: An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ansible.errors.AnsibleFilterError: Group "ci2064964918_small" contains no hosts in this play - was --limit used?

Didn't see this in dev deployment as that had no group defined!
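
For anyone hitting the same error, a quick way to check is to inspect the environment's inventory and confirm that the group backing each Slurm partition (here ci2064964918_small) actually contains hosts in the play. A hedged example, with an assumed environment name:

    # Print the inventory tree; the partition group must show at least one host,
    # and it must not be excluded by --limit when the play runs.
    ansible-inventory -i environments/myenv/inventory --graph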

@sjpb force-pushed the feature/control-images2 branch from cce41b2 to ad4f769 on April 6, 2022
sjpb added 5 commits April 19, 2022 09:59
Grafana 8.4.7-1 appears to ignore the admin username/password in grafana.ini, which causes

    TASK [grafana-datasources : Ensure datasources exist (via API)]

to fail with:

    Unauthorized to perform action 'POST' on 'http://<control-hostname>:3000/api/user/using/1'
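
A quick, hedged way to check whether the configured admin credentials are actually being accepted by a running Grafana (hostname and password are placeholders):

    # If the grafana.ini admin user/password were applied, this returns the current
    # org as JSON; a 401/Unauthorized here matches the failure above.
    curl -u admin:<password> http://<control-hostname>:3000/api/org
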
Base automatically changed from feature/openhpc_config to main April 21, 2022 08:52
@sjpb changed the title from "Enable build of control images" to "Enable build of environment-specific control images" on Apr 21, 2022
@sjpb (Collaborator, Author) commented on Apr 29, 2022

Think CI ran out of instances, can retry later.

@sjpb requested review from jovial and m-bull on May 10, 2022
@sjpb (Collaborator, Author) commented on May 11, 2022

@m-bull I've added 91e91e5 to this, to skip block_devices during build. The need for that commit only surfaced as part of #173. However, we are already using block devices (from before #173) to manage volumes for /home and stackhpc.openhpc's openhpc_state_save_location, e.g. in AlaSKA - both on the control node. So adding this commit to the PR at least avoids the control image build crashing on a deployed configuration.

@sjpb merged commit 4a97392 into main on May 11, 2022
@sjpb deleted the feature/control-images2 branch on May 11, 2022
Linked issue: Building images for control nodes