Merge caas slurm appliance into slurm appliance #325

Merged
merged 49 commits into from
Nov 24, 2023

sjpb
Collaborator

@sjpb sjpb commented Nov 9, 2023

Replaces https://github.com/stackhpc/caas-slurm-appliance with a .caas environment.

Requires https://github.com/stackhpc/azimuth-caas-operator/pull/75 and azimuth-cloud/ansible-collection-azimuth-ops#202. Note that:

  • ui-meta files are provided in environments/.caas/
  • the cluster playbook is now ansible/site.yml
    Both normal and "fast-volume-type" clusters are supported, by using the appropriate ui-meta file.

The ansible-driven image build functionality from CaaS has not yet been ported to this repo; use the GitHub workflow or the packer configurations directly.

Implementation Notes

  • There can't be an activate script, so:
    • ansible.cfg is provided in the repo root, as expected by caas operator.
    • ANSIBLE_INVENTORY is set in the cluster type template, using a relative path:
      azimuth_caas_stackhpc_slurm_appliance_template:
      ...
        envVars:
          # Normally set through environment's activate script:
          ANSIBLE_INVENTORY: environments/common/inventory,environments/.caas/inventory # NB: Relative to runner project dir
      
      Ansible then defines ansible_inventory_sources, which contains absolute paths, and that is used to derive appliances_environment_root. This needed some changes in site.yml, because the environment var APPLIANCES_ENVIRONMENT_ROOT is not available in caas. Note this environment var can't be removed entirely yet, because {TF,PKR}_VAR_environment_root are derived from it in non-caas environments.
  • The imported Ansible-driven TF is only used for caas at present.
  • Inventory groups are now defined by a symlink to the everything template, and don't need to be created in-memory. This means groups always exist, so the logic changes from if 'grafana' in groups --> if groups['grafana'] | length > 0, which means lots of caas-specific overrides could be removed.
  • common env ondemand config didn’t always take into account whether grafana was deployed, which caas did (see 2b264c7).
  • Conversely I think the caas ondemand config was broken for jupyter - wrong binary path.
  • Contents of the caas repo are now here as follows:
    • caas slurm-infra.yml → pre.yml, site.yml, post.yml
    • caas roles/* → ansible/roles/
    • image_build{,_infra} - not currently ported
    • caas ui_meta → environments/.caas/ui_meta
    • caas ansible.cfg → ansible.cfg
    • caas group_vars → environments/.caas/inventory/group_vars
    • caas assets/* → environments/.caas/assets
    • caas requirements.yml → merged with requirements.yml
  • basic_users and etc_hosts roles are used instead of custom code in the original caas repo.
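
As a rough illustration of two of the notes above (deriving the environment root from ansible_inventory_sources, and the changed group check), here is a minimal sketch. It is not code from this PR: the play, the task names and the debug messages are made up for illustration, and it assumes the last inventory source is the environment's own inventory directory, as in the ANSIBLE_INVENTORY value shown earlier.

```yaml
# Hypothetical sketch only: task names and messages are illustrative and
# are not taken from site.yml in this repo.
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Derive the environment root from the inventory sources
      ansible.builtin.set_fact:
        # ansible_inventory_sources is a list of absolute paths. Assumption:
        # the last entry is <environment root>/inventory, so its parent
        # directory is the environment root.
        appliances_environment_root: "{{ ansible_inventory_sources | last | dirname }}"

    - name: Old-style check, true whenever the group is defined (even if empty)
      ansible.builtin.debug:
        msg: "grafana group is defined"
      when: "'grafana' in groups"

    - name: New-style check, true only when the group contains at least one host
      ansible.builtin.debug:
        msg: "grafana is deployed on {{ groups['grafana'] | length }} host(s)"
      when: "groups['grafana'] | length > 0"
```

The second form only works when the group is always defined, which is what the symlinked everything template is described as guaranteeing; with an undefined group the lookup groups['grafana'] would fail rather than evaluate to false.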

TODO

Merge following from CaaS:

sjpb added 30 commits November 1, 2023 13:57
@sjpb sjpb requested a review from m-bull November 17, 2023 14:28
@sjpb sjpb force-pushed the feat/caas branch 4 times, most recently from 42544aa to 85b2feb Compare November 17, 2023 23:44
@sjpb sjpb marked this pull request as ready for review November 17, 2023 23:45
@sjpb sjpb requested a review from a team as a code owner November 17, 2023 23:45
@sjpb
Collaborator Author

sjpb commented Nov 24, 2023

Tests @ b78f1e4

  • Checked azimuth can still deploy workstation: OK

On azimuth cluster - sb-tf1:

  • Create with hpctests: OK
  • direct ssh: OK
  • OOD portal: OK
  • OOD shell: OK
  • OOD jupyter: OK
  • OOD desktop
  • monitoring: OK
    • job list: OK
    • job details: OK
  • Delete: OK

Tests on stackhpc environment:

  • provision: OK
  • configure: OK
  • hpctests: OK
  • direct ssh: not tested
  • OOD portal: OK
  • OOD shell: OK
  • OOD jupyter: OK
  • OOD desktop: OK
  • monitoring: OK
    • can see jobs
    • can see job details
