Commit 8623b15

sjpb, MaxBed4d
authored and committed
allow extending fat images with site-specific groups (#403)
1 parent fcf4648 commit 8623b15

6 files changed: +104 -48 lines changed

.github/workflows/fatimage.yml

Lines changed: 1 addition & 1 deletion
Original file line number | Diff line number | Diff line change
@@ -63,7 +63,7 @@ jobs:
6363
. environments/.stackhpc/activate
6464
cd packer/
6565
packer init .
66-
PACKER_LOG=1 packer build -on-error=${{ vars.PACKER_ON_ERROR }} -var-file=$PKR_VAR_environment_root/${{ vars.CI_CLOUD }}.pkrvars.hcl openstack.pkr.hcl
66+
PACKER_LOG=1 packer build -on-error=${{ vars.PACKER_ON_ERROR }} -except=openstack.openhpc-extra -var-file=$PKR_VAR_environment_root/${{ vars.CI_CLOUD }}.pkrvars.hcl openstack.pkr.hcl
6767
env:
6868
PKR_VAR_os_version: ${{ matrix.os_version }}
6969

ansible/cleanup.yml

Lines changed: 3 additions & 0 deletions
@@ -35,3 +35,6 @@
3535

3636
- name: Run cloud-init cleanup
3737
command: cloud-init clean --logs --seed
38+
39+
- name: Cleanup /tmp
40+
shell: rm -rf /tmp/* # shell rather than command: the command module does not expand globs

environments/.stackhpc/ARCUS.pkrvars.hcl

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
11
flavor = "vm.ska.cpu.general.small"
22
use_blockstorage_volume = true
33
volume_size = 15 # GB
4-
volume_size_ofed = 15 # GB
54
image_disk_format = "qcow2"
65
networks = ["4b6b2722-ee5b-40ec-8e52-a6610e14cc51"] # portal-internal (DNS broken on ilab-60)
76
ssh_keypair_name = "slurm-app-ci"

environments/.stackhpc/LEAFCLOUD.pkrvars.hcl

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
11
flavor = "ec1.large"
22
use_blockstorage_volume = true
33
volume_size = 15 # GB
4-
volume_size_ofed = 15 # GB
54
volume_type = "unencrypted"
65
image_disk_format = "qcow2"
76
networks = ["909e49e8-6911-473a-bf88-0495ca63853c"] # slurmapp-ci

packer/README.md

Lines changed: 66 additions & 27 deletions
@@ -1,47 +1,86 @@
11
# Packer-based image build
22

3-
The appliance contains code and configuration to use Packer with the [OpenStack builder](https://www.packer.io/plugins/builders/openstack) to build images.
3+
The appliance contains code and configuration to use [Packer](https://developer.hashicorp.com/packer) with the [OpenStack builder](https://www.packer.io/plugins/builders/openstack) to build images.
44

5-
The image built is referred to as a "fat" image as it contains binaries for all nodes, but no configuration. Using a "fat" image:
5+
The Packer configuration defined here builds "fat images" which contain binaries for all nodes, but no cluster-specific configuration. Using these:
66
- Enables the image to be tested in CI before production use.
77
- Ensures re-deployment of the cluster or deployment of additional nodes can be completed even if packages are changed in upstream repositories (e.g. due to RockyLinux or OpenHPC updates).
88
- Improves deployment speed by reducing the number of package downloads.
99

10-
A default fat image is built in StackHPC's CI workflow and made available to clients. However it is possible to build site-specific fat images if required.
10+
By default, a fat image build starts from a RockyLinux GenericCloud image and updates all DNF packages already present.
1111

12-
A fat image build starts from a RockyLinux GenericCloud image and (by default) updates all dnf packages in that image.
12+
The fat images StackHPC builds and tests in CI are available from [GitHub releases](https://github.com/stackhpc/ansible-slurm-appliance/releases). However, with some additional configuration it is also possible to:
13+
1. Build site-specific fat images from scratch.
14+
2. Extend an existing fat image with additional software.
1315

14-
# Build Process
15-
- Ensure the current OpenStack credentials have sufficient authorisation to upload images (this may or may not require the `member` role for an application credential, depending on your OpenStack configuration).
16-
- Create a file `environments/<environment>/builder.pkrvars.hcl` containing at a minimum e.g.:
17-
18-
```hcl
19-
flavor = "general.v1.small" # VM flavor to use for builder VMs
20-
networks = ["26023e3d-bc8e-459c-8def-dbd47ab01756"] # List of network UUIDs to attach the VM to
21-
source_image_name = "Rocky-8.9-GenericCloud" # Name of source image. This must exist in OpenStack and should be a Rocky Linux GenericCloud-based image.
22-
```
23-
24-
This configuration will generate and use an ephemeral SSH key for communicating with the Packer VM. If this is undesirable, set `ssh_keypair_name` to the name of an existing keypair in OpenStack. The private key must be on the host running Packer, and its path can be set using `ssh_private_key_file`.
2516

26-
The network used for the Packer VM must provide outbound internet access but does not need to provide access to resources which the final cluster nodes require (e.g. Slurm control node, network filesystem servers etc.).
17+
# Usage
18+
19+
The steps for building site-specific fat images or extending an existing fat image are the same:
20+
21+
1. Ensure the current OpenStack credentials have sufficient authorisation to upload images (this may or may not require the `member` role for an application credential, depending on your OpenStack configuration).
22+
2. Create a Packer [variable definition file](https://developer.hashicorp.com/packer/docs/templates/hcl_templates/variables#assigning-values-to-input-variables) at e.g. `environments/<environment>/builder.pkrvars.hcl` containing at a minimum e.g.:
2723

28-
For additional options such as non-default private key locations or jumphost configuration see the variable descriptions in `./openstack.pkr.hcl`.
24+
```hcl
25+
flavor = "general.v1.small" # VM flavor to use for builder VMs
26+
networks = ["26023e3d-bc8e-459c-8def-dbd47ab01756"] # List of network UUIDs to attach the VM to
27+
```
28+
29+
- The network used for the Packer VM must provide outbound internet access but does not need to provide access to resources which the final cluster nodes require (e.g. Slurm control node, network filesystem servers etc.).
30+
31+
- For additional options such as non-default private key locations or jumphost configuration see the variable descriptions in `./openstack.pkr.hcl`.
2932
30-
- Activate the venv and the relevant environment.
33+
- For an example of configuration for extending an existing fat image see below.
3134
32-
- Build images using the relevant variable definition file:
35+
3. Activate the venv and the relevant environment.
36+
37+
4. Build images using the relevant variable definition file, e.g.:
38+
39+
cd packer/
40+
PACKER_LOG=1 /usr/bin/packer build -only=openstack.openhpc --on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
41+
42+
Note that the `-only` flag here restricts the build to the non-OFED fat image "source" (in Packer terminology). Other
43+
source options are:
44+
- `-only=openstack.openhpc-ofed`: Build a fat image including Mellanox OFED
45+
- `-only=openstack.openhpc-extra`: Build an image which extends an existing fat image. In this case the variable `source_image` or `source_image_name` must also be set in the Packer variables file.
46+
47+
5. The built image will be automatically uploaded to OpenStack with a name prefixed `openhpc-` and including a timestamp and a shortened git hash.
48+
49+
# Build Process
3350
34-
cd packer
35-
PACKER_LOG=1 /usr/bin/packer build -only openstack.openhpc --on-error=ask -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
51+
In summary, Packer creates an OpenStack VM, runs Ansible on that, shuts it down, then creates an image from the root disk.
3652
37-
Note the build VM is added to the `builder` group to differentiate them from "real" nodes - see developer notes below.
53+
Many of the Packer variables defined in `openstack.pkr.hcl` control the definition of the build VM and how to SSH to it to run Ansible, which are generic OpenStack builder options. Packer variables can be set in a file at any convenient path; the above
54+
example shows the use of the environment variable `$PKR_VAR_environment_root` (which itself sets the Packer variable
55+
`environment_root`) to automatically select a variable file from the current environment, but for site-specific builds
56+
using a path in a "parent" environment is likely to be more appropriate (as builds should not be environment-specific, to allow testing).
3857
39-
- The built image will be automatically uploaded to OpenStack with a name prefixed `openhpc-` and including a timestamp and a shortened git hash.
58+
What is Slurm Appliance-specific are the details of how Ansible is run:
59+
- The build VM is always added to the `builder` inventory group, which differentiates it from "real" nodes. This allows
60+
variables to be set differently during Packer builds, e.g. to prevent services starting. The defaults for this are in `environments/common/inventory/group_vars/builder/`, which could be extended or overridden for site-specific fat image builds using `builder` groupvars for the relevant environment. It also runs some builder-specific code (e.g. to ensure Packer's SSH
61+
keys are removed from the image).
62+
- The default fat image build also adds the build VM to the "top-level" `compute`, `control` and `login` groups. This ensures
63+
the Ansible specific to all of these types of nodes runs (other inventory groups are constructed from these by the `environments/common/inventory/groups` file - this is not builder-specific).
64+
- Which groups the build VM is added to is controlled by the Packer `groups` variable. This can be redefined for builds using the `openhpc-extra` source to add the build VM into specific groups. E.g. with a Packer variable file:
4065
41-
# Notes for developers
66+
source_image_name = {
67+
RL9 = "openhpc-ofed-RL9-240619-0949-66c0e540"
68+
}
69+
groups = {
70+
openhpc-extra = ["foo"]
71+
}
4272
43-
Packer build VMs are added to both the `builder` group and the other top-level groups (e.g. `control`, `compute`, etc.). The former group allows `environments/common/inventory/group_vars/builder/defaults.yml` to set variables specifically for the Packer builds, e.g. for services which should not be started.
73+
the build VM uses an existing "fat image" (rather than a RockyLinux GenericCloud one) and is added to the `builder` and `foo` groups. This means only code targeting `builder` and `foo` groups runs. In this way an existing image can be extended with site-specific code, without modifying the part of the image which has already been tested in the StackHPC CI.
4474
45-
Note that hostnames in the Packer VMs are not the same as the equivalent "real" hosts. Therefore variables required inside a Packer VM must be defined as group vars, not hostvars.
75+
- The playbook `ansible/fatimage.yml` is run which is only a subset of `ansible/site.yml`. This allows restricting the code
76+
which runs during build for cases where setting `builder` groupvars is not sufficient (e.g. a role always attempts to configure or start services). This may eventually be removed.
4677
47-
Ansible may need to proxy to compute nodes. If the Packer build should not use the same proxy to connect to the builder VMs, note that proxy configuration should not be added to the `all` group.
78+
There are some things to be aware of when developing Ansible to run in a Packer build VM:
79+
- Only some tasks make sense. E.g. any services with a reliance on the network cannot be started, and may not be able to be enabled if the remote service will not be immediately present when an instance is created from the resulting image.
80+
- Nothing should be written to the persistent state directory `appliances_state_dir`, as this is on the root filesystem rather than an OpenStack volume.
81+
- Care should be taken not to leave data on the root filesystem which is not wanted in the final image (e.g. secrets).
82+
- Build VM hostnames are not the same as for equivalent "real" hosts and do not contain `login`, `control` etc. Therefore variables used by the build VM must be defined as groupvars not hostvars.
83+
- Ansible may need to proxy to real compute nodes. If Packer should not use the same proxy to connect to the
84+
build VMs (e.g. build happens on a different network), proxy configuration should not be added to the `all` group.
85+
- Currently two fat image "sources" are defined, with and without OFED. This simplifies CI configuration by allowing the
86+
default source images to be defined in the `openstack.pkr.hcl` definition.
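
As a rough sketch of the `openhpc-extra` workflow described in the README above, the following writes a Packer variables file and prints the matching build command rather than running it. The temporary directory, the file name `extra.pkrvars.hcl` and the group name `foo` are illustrative assumptions; the source image name is the one used in the README example.

```shell
set -eu

# Hypothetical location for the variables file; any convenient path works.
vars_dir="$(mktemp -d)"
vars_file="${vars_dir}/extra.pkrvars.hcl"

# Values copied from the README example; "foo" is a placeholder group name.
cat > "${vars_file}" <<'EOF'
source_image_name = {
  RL9 = "openhpc-ofed-RL9-240619-0949-66c0e540"
}
groups = {
  openhpc-extra = ["foo"]
}
EOF

# Print the command instead of running it, so the sketch needs neither
# Packer nor OpenStack access. Run it from the packer/ directory.
build_cmd="packer build -only=openstack.openhpc-extra -var-file=${vars_file} openstack.pkr.hcl"
echo "${build_cmd}"
```

For a real build the variables file would also need at least `flavor` and `networks`, as in the minimal example in the README.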

packer/openstack.pkr.hcl

Lines changed: 34 additions & 18 deletions
@@ -41,24 +41,27 @@ variable "networks" {
4141

4242
variable "os_version" {
4343
type = string
44-
description = "RL8 or RL9"
44+
description = "'RL8' or 'RL9' with default source_image_* mappings"
45+
default = "RL9"
4546
}
4647

47-
# Must supply either fatimage_source_image_name or fatimage_source_image
48-
variable "fatimage_source_image_name" {
48+
# Must supply either source_image_name or source_image
49+
variable "source_image_name" {
4950
type = map(string)
51+
description = "name of source image, keyed from var.os_version"
5052
default = {
5153
RL8: "Rocky-8-GenericCloud-Base-8.9-20231119.0.x86_64.qcow2"
5254
RL9: "Rocky-9-GenericCloud-Base-9.4-20240523.0.x86_64.qcow2"
5355
}
5456
}
5557

56-
variable "fatimage_source_image" {
58+
variable "source_image" {
5759
type = map(string)
5860
default = {
5961
RL8: null
6062
RL9: null
6163
}
64+
description = "UUID of source image, keyed from var.os_version"
6265
}
6366

6467
variable "flavor" {
@@ -130,11 +133,6 @@ variable "volume_size" {
130133
default = null # When not specified use the size of the builder instance root disk
131134
}
132135

133-
variable "volume_size_ofed" {
134-
type = number
135-
default = null # When not specified use the size of the builder instance root disk
136-
}
137-
138136
variable "image_disk_format" {
139137
type = string
140138
default = null # When not specified use the image default
@@ -145,6 +143,16 @@ variable "metadata" {
145143
default = {}
146144
}
147145

146+
variable "groups" {
147+
type = map(list(string))
148+
description = "Additional inventory groups (other than 'builder') to add build VM to, keyed by source name"
149+
default = {
150+
# fat image builds:
151+
openhpc = ["control", "compute", "login"]
152+
openhpc-ofed = ["control", "compute", "login", "ofed"]
153+
}
154+
}
155+
148156
source "openstack" "openhpc" {
149157
# Build VM:
150158
flavor = var.flavor
@@ -154,10 +162,11 @@ source "openstack" "openhpc" {
154162
networks = var.networks
155163
floating_ip_network = var.floating_ip_network
156164
security_groups = var.security_groups
165+
volume_size = var.volume_size
157166

158167
# Input image:
159-
source_image = "${var.fatimage_source_image[var.os_version]}"
160-
source_image_name = "${var.fatimage_source_image_name[var.os_version]}" # NB: must already exist in OpenStack
168+
source_image = "${var.source_image[var.os_version]}"
169+
source_image_name = "${var.source_image_name[var.os_version]}" # NB: must already exist in OpenStack
161170

162171
# SSH:
163172
ssh_username = var.ssh_username
@@ -174,27 +183,34 @@ source "openstack" "openhpc" {
174183
image_name = "${source.name}-${var.os_version}-${local.timestamp}-${substr(local.git_commit, 0, 8)}"
175184
}
176185

177-
# "fat" image builds:
178186
build {
179187

180-
# non-OFED:
188+
# non-OFED fat image:
181189
source "source.openstack.openhpc" {
182190
name = "openhpc"
183-
volume_size = var.volume_size
184191
}
185192

186-
# OFED:
193+
# OFED fat image:
187194
source "source.openstack.openhpc" {
188195
name = "openhpc-ofed"
189-
volume_size = var.volume_size_ofed
196+
}
197+
198+
# Extended site-specific image, built on fat image:
199+
source "source.openstack.openhpc" {
200+
name = "openhpc-extra"
190201
}
191202

192203
provisioner "ansible" {
193204
playbook_file = "${var.repo_root}/ansible/fatimage.yml"
194-
groups = concat(["builder", "control", "compute", "login"], [for g in split("-", "${source.name}"): g if g != "openhpc"])
205+
groups = concat(["builder"], var.groups[source.name])
195206
keep_inventory_file = true # for debugging
196207
use_proxy = false # see https://www.packer.io/docs/provisioners/ansible#troubleshooting
197-
extra_arguments = ["--limit", "builder", "-i", "${var.repo_root}/packer/ansible-inventory.sh", "-vv", "-e", "@${var.repo_root}/packer/openhpc_extravars.yml"]
208+
extra_arguments = [
209+
"--limit", "builder", # prevent running against real nodes, if in inventory!
210+
"-i", "${var.repo_root}/packer/ansible-inventory.sh",
211+
"-vv",
212+
"-e", "@${var.repo_root}/packer/openhpc_extravars.yml", # not overridable by environments
213+
]
198214
}
199215

200216
post-processor "manifest" {
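
The new `groups` expression, `concat(["builder"], var.groups[source.name])`, can be mimicked in plain shell to see which inventory groups each source ends up in. This is only an illustrative model, not part of the commit: `resolve_groups` and `EXTRA_GROUPS` are hypothetical names, and the failure for an unset `openhpc-extra` entry reflects the assumption that indexing a missing map key is an error in Packer (which would be consistent with CI passing `-except=openstack.openhpc-extra`).

```shell
set -eu

# Site-supplied groups for the openhpc-extra source. EXTRA_GROUPS is a
# hypothetical stand-in for setting `groups` in a Packer variables file.
EXTRA_GROUPS="${EXTRA_GROUPS:-}"

# Print the inventory groups for a given source name: always "builder",
# plus the per-source entries mirrored from the `groups` variable default.
resolve_groups() {
  case "$1" in
    openhpc)      _extra="control compute login" ;;
    openhpc-ofed) _extra="control compute login ofed" ;;
    openhpc-extra)
      # No default exists for this source, so it must be supplied.
      if [ -z "${EXTRA_GROUPS}" ]; then
        echo "no groups set for source '$1'" >&2
        return 1
      fi
      _extra="${EXTRA_GROUPS}" ;;
    *)
      echo "unknown source '$1'" >&2
      return 1 ;;
  esac
  echo "builder ${_extra}"
}

resolve_groups openhpc        # prints: builder control compute login
resolve_groups openhpc-ofed   # prints: builder control compute login ofed
EXTRA_GROUPS="foo"
resolve_groups openhpc-extra  # prints: builder foo
```

Compared with the removed expression, which derived extra groups by splitting the source name on `-`, the map makes the group membership of each source explicit and overridable per site.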
