Allow extending fat images with site-specific groups #403

Merged: 1 commit, Jul 5, 2024
2 changes: 1 addition & 1 deletion .github/workflows/fatimage.yml
@@ -63,7 +63,7 @@ jobs:
. environments/.stackhpc/activate
cd packer/
packer init .
PACKER_LOG=1 packer build -on-error=${{ vars.PACKER_ON_ERROR }} -var-file=$PKR_VAR_environment_root/${{ vars.CI_CLOUD }}.pkrvars.hcl openstack.pkr.hcl
PACKER_LOG=1 packer build -on-error=${{ vars.PACKER_ON_ERROR }} -except=openstack.openhpc-extra -var-file=$PKR_VAR_environment_root/${{ vars.CI_CLOUD }}.pkrvars.hcl openstack.pkr.hcl
env:
PKR_VAR_os_version: ${{ matrix.os_version }}

3 changes: 3 additions & 0 deletions ansible/cleanup.yml
@@ -35,3 +35,6 @@

- name: Run cloud-init cleanup
command: cloud-init clean --logs --seed

- name: Cleanup /tmp
# NB: `shell` rather than `command`, as `command` does not expand the `*` glob
shell: rm -rf /tmp/*
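
Whether the `*` in this task is expanded depends on the module: Ansible's `command` module does not run its arguments through a shell, whereas `shell` does. The sketch below reproduces the difference in plain shell, using quoting to stand in for "no shell expansion". It works in a scratch directory (the name `scratch` is just for illustration) and only removes files it created.

```shell
# Compare a literal glob pattern with an expanded one.
scratch=$(mktemp -d)
touch "$scratch/a" "$scratch/b"

# Quoted pattern: rm receives the literal string ".../*", as it would from a
# module that does not invoke a shell. No such file exists, so nothing is removed.
rm -rf "$scratch/*"
left_after_literal=$(find "$scratch" -type f | wc -l | tr -d ' ')

# Unquoted pattern: the shell expands the glob before rm runs.
rm -rf "$scratch"/*
left_after_glob=$(find "$scratch" -type f | wc -l | tr -d ' ')

echo "literal: $left_after_literal files left, expanded: $left_after_glob files left"
rmdir "$scratch"
```

Because `rm -f` ignores missing operands, the literal case succeeds silently, so a task of this kind can report success without removing anything.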
1 change: 0 additions & 1 deletion environments/.stackhpc/ARCUS.pkrvars.hcl
@@ -1,7 +1,6 @@
flavor = "vm.ska.cpu.general.small"
use_blockstorage_volume = true
volume_size = 15 # GB
volume_size_ofed = 15 # GB
image_disk_format = "qcow2"
networks = ["4b6b2722-ee5b-40ec-8e52-a6610e14cc51"] # portal-internal (DNS broken on ilab-60)
ssh_keypair_name = "slurm-app-ci"
1 change: 0 additions & 1 deletion environments/.stackhpc/LEAFCLOUD.pkrvars.hcl
@@ -1,7 +1,6 @@
flavor = "ec1.large"
use_blockstorage_volume = true
volume_size = 15 # GB
volume_size_ofed = 15 # GB
volume_type = "unencrypted"
image_disk_format = "qcow2"
networks = ["909e49e8-6911-473a-bf88-0495ca63853c"] # slurmapp-ci
93 changes: 66 additions & 27 deletions packer/README.md
@@ -1,47 +1,86 @@
# Packer-based image build

The appliance contains code and configuration to use Packer with the [OpenStack builder](https://www.packer.io/plugins/builders/openstack) to build images.
The appliance contains code and configuration to use [Packer](https://developer.hashicorp.com/packer) with the [OpenStack builder](https://www.packer.io/plugins/builders/openstack) to build images.

The image built is referred to as a "fat" image as it contains binaries for all nodes, but no configuration. Using a "fat" image:
The Packer configuration defined here builds "fat images" which contain binaries for all nodes, but no cluster-specific configuration. Using these:
- Enables the image to be tested in CI before production use.
- Ensures re-deployment of the cluster or deployment of additional nodes can be completed even if packages are changed in upstream repositories (e.g. due to RockyLinux or OpenHPC updates).
- Improves deployment speed by reducing the number of package downloads.

A default fat image is built in StackHPC's CI workflow and made available to clients. However it is possible to build site-specific fat images if required.
By default, a fat image build starts from a RockyLinux GenericCloud image and updates all DNF packages already present.

A fat image build starts from a RockyLinux GenericCloud image and (by default) updates all dnf packages in that image.
The fat images StackHPC builds and tests in CI are available from [GitHub releases](https://github.com/stackhpc/ansible-slurm-appliance/releases). However, with some additional configuration it is also possible to:
1. Build site-specific fat images from scratch.
2. Extend an existing fat image with additional software.

# Build Process
- Ensure the current OpenStack credentials have sufficient authorisation to upload images (this may or may not require the `member` role for an application credential, depending on your OpenStack configuration).
- Create a file `environments/<environment>/builder.pkrvars.hcl` containing at a minimum e.g.:

```hcl
flavor = "general.v1.small" # VM flavor to use for builder VMs
networks = ["26023e3d-bc8e-459c-8def-dbd47ab01756"] # List of network UUIDs to attach the VM to
source_image_name = "Rocky-8.9-GenericCloud" # Name of source image. This must exist in OpenStack and should be a Rocky Linux GenericCloud-based image.
```

This configuration will generate and use an ephemeral SSH key for communicating with the Packer VM. If this is undesirable, set `ssh_keypair_name` to the name of an existing keypair in OpenStack. The private key must be on the host running Packer, and its path can be set using `ssh_private_key_file`.

The network used for the Packer VM must provide outbound internet access but does not need to provide access to resources which the final cluster nodes require (e.g. Slurm control node, network filesystem servers etc.).
# Usage

The steps for building site-specific fat images or extending an existing fat image are the same:

1. Ensure the current OpenStack credentials have sufficient authorisation to upload images (this may or may not require the `member` role for an application credential, depending on your OpenStack configuration).
2. Create a Packer [variable definition file](https://developer.hashicorp.com/packer/docs/templates/hcl_templates/variables#assigning-values-to-input-variables) at e.g. `environments/<environment>/builder.pkrvars.hcl` containing at a minimum e.g.:

For additional options such as non-default private key locations or jumphost configuration see the variable descriptions in `./openstack.pkr.hcl`.
```hcl
flavor = "general.v1.small" # VM flavor to use for builder VMs
networks = ["26023e3d-bc8e-459c-8def-dbd47ab01756"] # List of network UUIDs to attach the VM to
```

- The network used for the Packer VM must provide outbound internet access but does not need to provide access to resources which the final cluster nodes require (e.g. Slurm control node, network filesystem servers etc.).

- For additional options such as non-default private key locations or jumphost configuration see the variable descriptions in `./openstack.pkr.hcl`.

- Activate the venv and the relevant environment.
- For an example of configuration for extending an existing fat image see below.

- Build images using the relevant variable definition file:
3. Activate the venv and the relevant environment.

4. Build images using the relevant variable definition file, e.g.:

cd packer/
PACKER_LOG=1 /usr/bin/packer build -only=openstack.openhpc --on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl

Note that the `-only` flag here restricts the build to the non-OFED fat image "source" (in Packer terminology). Other
source options are:
- `-only=openstack.openhpc-ofed`: Build a fat image including Mellanox OFED.
- `-only=openstack.openhpc-extra`: Build an image which extends an existing fat image. In this case the variable `source_image` or `source_image_name` must also be set in the Packer variables file.

5. The built image will be automatically uploaded to OpenStack with a name prefixed `openhpc-` and including a timestamp and a shortened git hash.
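
As a rough illustration of these steps for the "extend an existing fat image" case, the sketch below writes a variable file and prints (rather than runs) the matching build command. The file name `extra.pkrvars.hcl`, the temporary directory standing in for `environments/<environment>/`, and the group name `foo` are placeholders. The flavor, network and source image values are copied from the examples in this README and will differ per site. The fully qualified source name `openstack.openhpc-extra` follows the form the CI workflow uses with `-except`.

```shell
# Sketch only: nothing here contacts OpenStack or invokes Packer.
var_dir=$(mktemp -d)                    # stand-in for environments/<environment>/
var_file="$var_dir/extra.pkrvars.hcl"   # hypothetical file name

cat > "$var_file" <<'EOF'
flavor   = "general.v1.small"                        # VM flavor for the build VM
networks = ["26023e3d-bc8e-459c-8def-dbd47ab01756"]  # network UUIDs for the build VM
source_image_name = {
  RL9 = "openhpc-ofed-RL9-240619-0949-66c0e540"      # existing fat image to extend
}
groups = {
  openhpc-extra = ["foo"]                            # placeholder site-specific group
}
EOF

# Restrict the build to the openhpc-extra source.
build_cmd="packer build -only=openstack.openhpc-extra -on-error=ask -var-file=$var_file openstack.pkr.hcl"
echo "cd packer/ && PACKER_LOG=1 $build_cmd"
```

Going by the `image_name` template in `openstack.pkr.hcl`, an image built this way with the default `os_version` would presumably be named `openhpc-extra-RL9-<timestamp>-<git hash>`.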

# Build Process

cd packer
PACKER_LOG=1 /usr/bin/packer build -only openstack.openhpc --on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
In summary, Packer creates an OpenStack VM, runs Ansible on that, shuts it down, then creates an image from the root disk.

Note the build VM is added to the `builder` group to differentiate them from "real" nodes - see developer notes below.
Many of the Packer variables defined in `openstack.pkr.hcl` are generic OpenStack builder options which control the definition of the build VM and how to SSH to it to run Ansible. Packer variables can be set in a file at any convenient path. The above
example uses the environment variable `$PKR_VAR_environment_root` (which itself sets the Packer variable
`environment_root`) to automatically select a variable file from the current environment. For site-specific builds,
a path in a "parent" environment is likely to be more appropriate, as builds should not be environment-specific so that they can be tested.
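
The two ways of locating a variable file can be sketched as follows. Packer maps an environment variable `PKR_VAR_<name>` to the Packer variable `<name>`. Everything else here is hypothetical: the repository path, the environment name `production` and the parent environment name `site`.

```shell
# PKR_VAR_<name> in the environment sets the Packer variable <name>.
# All paths and environment names below are placeholders.
repo_root=/path/to/ansible-slurm-appliance
export PKR_VAR_environment_root="$repo_root/environments/production"

# Per-environment variable file, as in the usage example:
env_var_file="$PKR_VAR_environment_root/builder.pkrvars.hcl"

# Shared variable file kept in a hypothetical parent environment "site":
parent_var_file="$(dirname "$PKR_VAR_environment_root")/site/builder.pkrvars.hcl"

echo "-var-file=$env_var_file"
echo "-var-file=$parent_var_file"
```

Keeping the file in the parent environment means every child environment (e.g. staging and production) would build from the same image definition.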

- The built image will be automatically uploaded to OpenStack with a name prefixed `openhpc-` and including a timestamp and a shortened git hash.
What is Slurm Appliance-specific are the details of how Ansible is run:
- The build VM is always added to the `builder` inventory group, which differentiates it from "real" nodes. This allows
variables to be set differently during Packer builds, e.g. to prevent services starting. The defaults for this are in `environments/common/inventory/group_vars/builder/`, which could be extended or overridden for site-specific fat image builds using `builder` groupvars for the relevant environment. Membership of this group also triggers some builder-specific code (e.g. to ensure Packer's SSH
keys are removed from the image).
- The default fat image build also adds the build VM to the "top-level" `compute`, `control` and `login` groups. This ensures
the Ansible specific to all of these types of nodes runs (other inventory groups are constructed from these by the `environments/common/inventory/groups` file, which is not builder-specific).
- Which groups the build VM is added to is controlled by the Packer `groups` variable. This can be redefined for builds using the `openhpc-extra` source to add the build VM into specific groups. E.g. with a Packer variable file:

# Notes for developers
source_image_name = {
RL9 = "openhpc-ofed-RL9-240619-0949-66c0e540"
}
groups = {
openhpc-extra = ["foo"]
}

Packer build VMs are added to both the `builder` group and the other top-level groups (e.g. `control`, `compute`, etc.). The former group allows `environments/common/inventory/group_vars/builder/defaults.yml` to set variables specifically for the Packer builds, e.g. for services which should not be started.
the build VM uses an existing "fat image" (rather than a RockyLinux GenericCloud one) and is added to the `builder` and `foo` groups. This means only code targeting the `builder` and `foo` groups runs. In this way an existing image can be extended with site-specific code, without modifying the part of the image which has already been tested in the StackHPC CI.

Note that hostnames in the Packer VMs are not the same as the equivalent "real" hosts. Therefore variables required inside a Packer VM must be defined as group vars, not hostvars.
- The playbook `ansible/fatimage.yml` is run, which contains only a subset of `ansible/site.yml`. This allows restricting the code
which runs during a build for cases where setting `builder` groupvars is not sufficient (e.g. a role always attempts to configure or start services). This may eventually be removed.
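
The group selection described above can be mimicked in a few lines of shell. This is only a model of the provisioner expression `concat(["builder"], var.groups[source.name])`: the two built-in entries mirror the `groups` default in `openstack.pkr.hcl`, the function name `groups_for_source` is made up for illustration, and its optional second argument stands in for a site override such as `foo`. It simplifies one detail: a real variable file replaces the whole `groups` map rather than adding to it.

```shell
# groups_for_source SOURCE_NAME [SITE_GROUPS]
# Prints the inventory groups a build VM for SOURCE_NAME would join.
groups_for_source() {
    source_name=$1
    site_groups=${2:-}
    case "$source_name" in
        openhpc)      extra="control compute login" ;;
        openhpc-ofed) extra="control compute login ofed" ;;
        *)            extra=$site_groups ;;   # no default entry, e.g. openhpc-extra
    esac
    if [ -z "$extra" ]; then
        echo "no groups defined for source '$source_name'" >&2
        return 1
    fi
    # "builder" is always present, whatever else is configured.
    echo "builder $extra"
}

groups_for_source openhpc             # default fat image
groups_for_source openhpc-extra foo   # extended image, as in the example above
```

The failure branch reflects that the default `groups` map has no `openhpc-extra` key, so that source cannot be built without site configuration. This may be why the CI workflow excludes it with `-except=openstack.openhpc-extra`.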

Ansible may need to proxy to compute nodes. If the Packer build should not use the same proxy to connect to the builder VMs, note that proxy configuration should not be added to the `all` group.
There are some things to be aware of when developing Ansible to run in a Packer build VM:
- Only some tasks make sense. E.g. any services which rely on the network cannot be started, and may not be able to be enabled if the remote service will not be immediately present when an instance is created from the resulting image.
- Nothing should be written to the persistent state directory `appliances_state_dir`, as this is on the root filesystem rather than an OpenStack volume.
- Care should be taken not to leave data on the root filesystem which is not wanted in the final image (e.g. secrets).
- Build VM hostnames are not the same as for equivalent "real" hosts and do not contain `login`, `control` etc. Therefore variables used by the build VM must be defined as groupvars not hostvars.
- Ansible may need to proxy to real compute nodes. If Packer should not use the same proxy to connect to the
build VMs (e.g. build happens on a different network), proxy configuration should not be added to the `all` group.
- Currently two fat image "sources" are defined, with and without OFED. This simplifies CI configuration by allowing the
default source images to be defined in the `openstack.pkr.hcl` definition.
52 changes: 34 additions & 18 deletions packer/openstack.pkr.hcl
@@ -41,24 +41,27 @@ variable "networks" {

variable "os_version" {
type = string
description = "RL8 or RL9"
description = "'RL8' or 'RL9' with default source_image_* mappings"
default = "RL9"
}

# Must supply either fatimage_source_image_name or fatimage_source_image
variable "fatimage_source_image_name" {
# Must supply either source_image_name or source_image
variable "source_image_name" {
type = map(string)
description = "name of source image, keyed from var.os_version"
default = {
RL8: "Rocky-8-GenericCloud-Base-8.9-20231119.0.x86_64.qcow2"
RL9: "Rocky-9-GenericCloud-Base-9.4-20240523.0.x86_64.qcow2"
}
}

variable "fatimage_source_image" {
variable "source_image" {
type = map(string)
default = {
RL8: null
RL9: null
}
description = "UUID of source image, keyed from var.os_version"
}

variable "flavor" {
@@ -130,11 +133,6 @@ variable "volume_size" {
default = null # When not specified use the size of the builder instance root disk
}

variable "volume_size_ofed" {
type = number
default = null # When not specified use the size of the builder instance root disk
}

variable "image_disk_format" {
type = string
default = null # When not specified use the image default
@@ -145,6 +143,16 @@ variable "metadata" {
default = {}
}

variable "groups" {
type = map(list(string))
description = "Additional inventory groups (other than 'builder') to add build VM to, keyed by source name"
default = {
# fat image builds:
openhpc = ["control", "compute", "login"]
openhpc-ofed = ["control", "compute", "login", "ofed"]
}
}

source "openstack" "openhpc" {
# Build VM:
flavor = var.flavor
@@ -154,10 +162,11 @@ source "openstack" "openhpc" {
networks = var.networks
floating_ip_network = var.floating_ip_network
security_groups = var.security_groups
volume_size = var.volume_size

# Input image:
source_image = "${var.fatimage_source_image[var.os_version]}"
source_image_name = "${var.fatimage_source_image_name[var.os_version]}" # NB: must already exist in OpenStack
source_image = "${var.source_image[var.os_version]}"
source_image_name = "${var.source_image_name[var.os_version]}" # NB: must already exist in OpenStack

# SSH:
ssh_username = var.ssh_username
@@ -174,27 +183,34 @@ source "openstack" "openhpc" {
image_name = "${source.name}-${var.os_version}-${local.timestamp}-${substr(local.git_commit, 0, 8)}"
}

# "fat" image builds:
build {

# non-OFED:
# non-OFED fat image:
source "source.openstack.openhpc" {
name = "openhpc"
volume_size = var.volume_size
}

# OFED:
# OFED fat image:
source "source.openstack.openhpc" {
name = "openhpc-ofed"
volume_size = var.volume_size_ofed
}

# Extended site-specific image, built on fat image:
source "source.openstack.openhpc" {
name = "openhpc-extra"
}

provisioner "ansible" {
playbook_file = "${var.repo_root}/ansible/fatimage.yml"
groups = concat(["builder", "control", "compute", "login"], [for g in split("-", "${source.name}"): g if g != "openhpc"])
groups = concat(["builder"], var.groups[source.name])
keep_inventory_file = true # for debugging
use_proxy = false # see https://www.packer.io/docs/provisioners/ansible#troubleshooting
extra_arguments = ["--limit", "builder", "-i", "${var.repo_root}/packer/ansible-inventory.sh", "-vv", "-e", "@${var.repo_root}/packer/openhpc_extravars.yml"]
extra_arguments = [
"--limit", "builder", # prevent running against real nodes, if in inventory!
"-i", "${var.repo_root}/packer/ansible-inventory.sh",
"-vv",
"-e", "@${var.repo_root}/packer/openhpc_extravars.yml", # not overridable by environments
]
}

post-processor "manifest" {