Skip to content

Update docs to include operations #422

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Oct 15, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
179 changes: 70 additions & 109 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,36 +2,47 @@

# StackHPC Slurm Appliance

This repository contains playbooks and configuration to define a Slurm-based HPC environment including:
- A Rocky Linux 9 and OpenHPC v3-based Slurm cluster.
- Shared fileystem(s) using NFS (with servers within or external to the cluster).
- Slurm accounting using a MySQL backend.
- A monitoring backend using Prometheus and ElasticSearch.
- Grafana with dashboards for both individual nodes and Slurm jobs.
- Production-ready Slurm defaults for access and memory.
- A Packer-based build pipeline for compute and login node images.

The repository is designed to be forked for a specific use-case/HPC site but can contain multiple environments (e.g. development, staging and production). It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs back upstream to us!

While it is tested on OpenStack it should work on any cloud, except for node rebuild/reimaging features which are currently OpenStack-specific.

## Prerequisites
It is recommended to check the following before starting:
- You have root access on the "ansible deploy host" which will be used to deploy the appliance.
This repository contains playbooks and configuration to define a Slurm-based HPC environment. This includes:
- [Rocky Linux](https://rockylinux.org/)-based hosts.
- [OpenTofu](https://opentofu.org/) configurations to define the cluster's infrastructure-as-code.
- Packages for Slurm and MPI software stacks from [OpenHPC](https://openhpc.community/).
- Shared fileystem(s) using NFS (with in-cluster or external servers) or [CephFS](https://docs.ceph.com/en/latest/cephfs/) via [Openstack Manila](https://wiki.openstack.org/wiki/Manila).
- Slurm accounting using a MySQL database.
- Monitoring integrated with Slurm jobs using Prometheus, ElasticSearch and Grafana.
- A web-based portal from [OpenOndemand](https://openondemand.org/).
- Production-ready default Slurm configurations for access and memory limits.
- [Packer](https://developer.hashicorp.com/packer)-based image build configurations for node images.

The repository is expected to be forked for a specific HPC site but can contain multiple environments for e.g. development, staging and production clusters
sharing a common configuration. It has been designed to be modular and extensible, so if you add features for your HPC site please feel free to submit PRs
back upstream to us!

While it is tested on OpenStack it should work on any cloud with appropriate OpenTofu configuration files.

## Demonstration Deployment

The default configuration in this repository may be used to create a cluster to explore use of the appliance. It provides:
- Persistent state backed by an OpenStack volume.
- NFS-based shared file system backed by another OpenStack volume.

Note that the OpenOndemand portal and its remote apps are not usable with this default configuration.

It requires an OpenStack cloud, and an Ansible "deploy host" with access to that cloud.

Before starting ensure that:
- You have root access on the deploy host.
- You can create instances using a Rocky 9 GenericCloud image (or an image based on that).
- **NB**: In general it is recommended to use the [latest released image](https://github.com/stackhpc/ansible-slurm-appliance/releases) which already contains the required packages. This is built and tested in StackHPC's CI. However the appliance will install the necessary packages if a GenericCloud image is used.
- SSH keys get correctly injected into instances.
- Instances have access to internet (note proxies can be setup through the appliance if necessary).
- DNS works (if not this can be partially worked around but additional configuration will be required).
- You have a SSH keypair defined in OpenStack, with the private key available on the deploy host.
- Created instances have access to internet (note proxies can be setup through the appliance if necessary).
- Created instances have accurate/synchronised time (for VM instances this is usually provided by the hypervisor; if not or for bare metal instances it may be necessary to configure a time service via the appliance).

## Installation on deployment host
### Setup deploy host

Current Operating Systems supported to be deploy hosts:
The following operating systems are supported for the deploy host:

- Rocky Linux 9
- Rocky Linux 8
- Ubuntu 22.04

These instructions assume the deployment host is running Rocky Linux 8:

Expand All @@ -40,115 +51,65 @@ These instructions assume the deployment host is running Rocky Linux 8:
cd ansible-slurm-appliance
./dev/setup-env.sh

## Overview of directory structure

- `environments/`: Contains configurations for both a "common" environment and one or more environments derived from this for your site. These define ansible inventory and may also contain provisioning automation such as Terraform or OpenStack HEAT templates.
- `ansible/`: Contains the ansible playbooks to configure the infrastruture.
- `packer/`: Contains automation to use Packer to build compute nodes for an enviromment - see the README in this directory for further information.
- `dev/`: Contains development tools.

## Environments
You will also need to install [OpenTofu](https://opentofu.org/docs/intro/install/rpm/).

### Overview
### Create a new environment

An environment defines the configuration for a single instantiation of this Slurm appliance. Each environment is a directory in `environments/`, containing:
- Any deployment automation required - e.g. Terraform configuration or HEAT templates.
- An ansible `inventory/` directory.
- An `activate` script which sets environment variables to point to this configuration.
- Optionally, additional playbooks in `/hooks` to run before or after the main tasks.

All environments load the inventory from the `common` environment first, with the environment-specific inventory then overriding parts of this as required.

### Creating a new environment

This repo contains a `cookiecutter` template which can be used to create a new environment from scratch. Run the [installation on deployment host](#Installation-on-deployment-host) instructions above, then in the repo root run:
Use the `cookiecutter` template to create a new environment to hold your configuration. In the repository root run:

. venv/bin/activate
cd environments
cookiecutter skeleton

and follow the prompts to complete the environment name and description.

Alternatively, you could copy an existing environment directory.

Now add deployment automation if required, and then complete the environment-specific inventory as described below.
**NB:** In subsequent sections this new environment is refered to as `$ENV`.

### Environment-specific inventory structure
Now generate secrets for this environment:

The ansible inventory for the environment is in `environments/<environment>/inventory/`. It should generally contain:
- A `hosts` file. This defines the hosts in the appliance. Generally it should be templated out by the deployment automation so it is also a convenient place to define variables which depend on the deployed hosts such as connection variables, IP addresses, ssh proxy arguments etc.
- A `groups` file defining ansible groups, which essentially controls which features of the appliance are enabled and where they are deployed. This repository generally follows a convention where functionality is defined using ansible roles applied to a a group of the same name, e.g. `openhpc` or `grafana`. The meaning and use of each group is described in comments in `environments/common/inventory/groups`. As the groups defined there for the common environment are empty, functionality is disabled by default and must be enabled in a specific environment's `groups` file. Two template examples are provided in `environments/commmon/layouts/` demonstrating a minimal appliance with only the Slurm cluster itself, and an appliance with all functionality.
- Optionally, group variable files in `group_vars/<group_name>/overrides.yml`, where the group names match the functional groups described above. These can be used to override the default configuration for each functionality, which are defined in `environments/common/inventory/group_vars/all/<group_name>.yml` (the use of `all` here is due to ansible's precedence rules).
ansible-playbook ansible/adhoc/generate-passwords.yml

Although most of the inventory uses the group convention described above there are a few special cases:
- The `control`, `login` and `compute` groups are special as they need to contain actual hosts rather than child groups, and so should generally be defined in the templated-out `hosts` file.
- The cluster name must be set on all hosts using `openhpc_cluster_name`. Using an `[all:vars]` section in the `hosts` file is usually convenient.
- `environments/common/inventory/group_vars/all/defaults.yml` contains some variables which are not associated with a specific role/feature. These are unlikely to need changing, but if necessary that could be done using a `environments/<environment>/inventory/group_vars/all/overrides.yml` file.
- The `ansible/adhoc/generate-passwords.yml` playbook sets secrets for all hosts in `environments/<environent>/inventory/group_vars/all/secrets.yml`.
- The Packer-based pipeline for building compute images creates a VM in groups `builder` and `compute`, allowing build-specific properties to be set in `environments/common/inventory/group_vars/builder/defaults.yml` or the equivalent inventory-specific path.
- Each Slurm partition must have:
- An inventory group `<cluster_name>_<partition_name>` defining the hosts it contains - these must be homogenous w.r.t CPU and memory.
- An entry in the `openhpc_slurm_partitions` mapping in `environments/<environment>/inventory/group_vars/openhpc/overrides.yml`.
See the [openhpc role documentation](https://github.com/stackhpc/ansible-role-openhpc#slurmconf) for more options.
- On an OpenStack cloud, rebuilding/reimaging compute nodes from Slurm can be enabled by defining a `rebuild` group containing the relevant compute hosts (e.g. in the generated `hosts` file).
### Define infrastructure configuration

## Creating a Slurm appliance
Create an OpenTofu variables file to define the required infrastructure, e.g.:

NB: This section describes generic instructions - check for any environment-specific instructions in `environments/<environment>/README.md` before starting.
# environments/$ENV/terraform/terraform.tfvars:

1. Activate the environment - this **must be done** before any other commands are run:
cluster_name = "mycluster"
cluster_net = "some_network" # *
cluster_subnet = "some_subnet" # *
key_pair = "my_key" # *
control_node_flavor = "some_flavor_name"
login_nodes = {
login-0: "login_flavor_name"
}
cluster_image_id = "rocky_linux_9_image_uuid"
compute = {
general = {
nodes: ["compute-0", "compute-1"]
flavor: "compute_flavor_name"
}
}

source environments/<environment>/activate
Variables marked `*` refer to OpenStack resources which must already exist. The above is a minimal configuration - for all variables
and descriptions see `environments/$ENV/terraform/terraform.tfvars`.

2. Deploy instances - see environment-specific instructions.
### Deploy appliance

3. Generate passwords:
ansible-playbook ansible/site.yml

ansible-playbook ansible/adhoc/generate-passwords.yml
You can now log in to the cluster using:

This will output a set of passwords in `environments/<environment>/inventory/group_vars/all/secrets.yml`. It is recommended that these are encrpyted and then commited to git using:
ssh rocky@$login_ip

ansible-vault encrypt inventory/group_vars/all/secrets.yml
where the IP of the login node is given in `environments/$ENV/inventory/hosts.yml`

See the [Ansible vault documentation](https://docs.ansible.com/ansible/latest/user_guide/vault.html) for more details.

4. Deploy the appliance:

ansible-playbook ansible/site.yml

or if you have encrypted secrets use:

ansible-playbook ansible/site.yml --ask-vault-password

Tags as defined in the various sub-playbooks defined in `ansible/` may be used to only run part of the `site` tasks.

5. "Utility" playbooks for managing a running appliance are contained in `ansible/adhoc` - run these by activating the environment and using:

ansible-playbook ansible/adhoc/<playbook name>

Currently they include the following (see each playbook for links to documentation):
- `hpctests.yml`: MPI-based cluster tests for latency, bandwidth and floating point performance.
- `rebuild.yml`: Rebuild nodes with existing or new images (NB: this is intended for development not for reimaging nodes on an in-production cluster - see `ansible/roles/rebuild` for that).
- `restart-slurm.yml`: Restart all Slurm daemons in the correct order.
- `update-packages.yml`: Update specified packages on cluster nodes.

## Adding new functionality
Please contact us for specific advice, but in outline this generally involves:
- Adding a role.
- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml`.
- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`.
- Adding new default group vars into `environments/common/inventory/group_vars/all/<rolename>/`.
- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`.
- Updating READMEs.

## Monitoring and logging

Please see the [monitoring-and-logging.README.md](docs/monitoring-and-logging.README.md) for details.

## CI/CD automation

The `.github` directory contains a set of sample workflows which can be used by downstream site-specific configuration repositories to simplify ongoing maintainence tasks. These include:
## Overview of directory structure

- An [upgrade check](.github/workflows/upgrade-check.yml.sample) workflow which automatically checks this upstream stackhpc/ansible-slurm-appliance repo for new releases and proposes a pull request to the downstream site-specific repo when a new release is published.
- `environments/`: See [docs/environments.md](docs/environments.md).
- `ansible/`: Contains the ansible playbooks to configure the infrastruture.
- `packer/`: Contains automation to use Packer to build machine images for an enviromment - see the README in this directory for further information.
- `dev/`: Contains development tools.

- An [image upload](.github/workflows/upload-s3-image.yml.sample) workflow which takes an image name, downloads it from StackHPC's public S3 bucket if available, and uploads it to the target OpenStack cloud.
For further information see the [docs](docs/) directory.
9 changes: 9 additions & 0 deletions docs/adding-functionality.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
# Adding new functionality

Please contact us for specific advice, but this generally involves:
- Adding a role.
- Adding a play calling that role into an existing playbook in `ansible/`, or adding a new playbook there and updating `site.yml`.
- Adding a new (empty) group named after the role into `environments/common/inventory/groups` and a non-empty example group into `environments/common/layouts/everything`.
- Adding new default group vars into `environments/common/inventory/group_vars/all/<rolename>/`.
- Updating the default Packer build variables in `environments/common/inventory/group_vars/builder/defaults.yml`.
- Updating READMEs.
8 changes: 8 additions & 0 deletions docs/ci.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
# CI/CD automation

The `.github` directory contains a set of sample workflows which can be used by downstream site-specific configuration repositories to simplify ongoing maintainence tasks. These include:

- An [upgrade check](.github/workflows/upgrade-check.yml.sample) workflow which automatically checks this upstream stackhpc/ansible-slurm-appliance repo for new releases and proposes a pull request to the downstream site-specific repo when a new release is published.

- An [image upload](.github/workflows/upload-s3-image.yml.sample) workflow which takes an image name, downloads it from StackHPC's public S3 bucket if available, and uploads it to the target OpenStack cloud.

Loading