Add support for OFED #254

Merged on Apr 24, 2024 (28 commits)

Commits
f72377d
add ofed role
sjpb Mar 2, 2023
9cffe0b
fix ofed dependencies install
sjpb Mar 7, 2023
720852f
use mlnxofedinstall as recommended for Rocky8/9 now
sjpb Mar 14, 2024
8114bb1
Merge branch 'main' into ofed
sjpb Mar 20, 2024
ba00f71
add build of OFED image to CI
sjpb Mar 20, 2024
18b8e3f
add build of OFED image to CI
sjpb Mar 20, 2024
6c8e3cc
fix ofed commands
sjpb Mar 20, 2024
d1d0299
default to OFED hpc package selection
sjpb Mar 21, 2024
d200b82
fix OFED packages concatenation on RL9
sjpb Mar 21, 2024
7f257d6
autobuild on ofed branch
sjpb Mar 21, 2024
2a18a2f
always build RL8 and RL9 images
sjpb Mar 21, 2024
d68559d
fix ofed_package_selection templating
sjpb Mar 21, 2024
f4fa9ec
fix ofed_build_packages
sjpb Mar 21, 2024
2901294
avoid OFED install timeouts
sjpb Mar 21, 2024
69d0562
Merge branch 'ofed' of github.com:stackhpc/ansible-slurm-appliance in…
sjpb Mar 26, 2024
f49922b
Merge branch 'main' into ofed
sjpb Mar 26, 2024
0427319
add additional packages for RL8
sjpb Mar 27, 2024
b854dd3
bump leafcloud build size for memory issues
sjpb Mar 27, 2024
81bcf36
fix missing packages for RL9 build
sjpb Mar 28, 2024
430af8a
remove duplication in packer definition and allow for different OFED …
sjpb Mar 28, 2024
8a0ec9b
add leafcloud OFED disk size
sjpb Mar 28, 2024
db81091
workaround OFED/turbovnc install clash
sjpb Mar 28, 2024
e9fe323
output multiple image names
sjpb Apr 4, 2024
84485ed
bump CI to RL8 and RL9 OFED-enabled images
sjpb Apr 4, 2024
3223846
Merge branch 'main' into ofed
sjpb Apr 4, 2024
5b64a7c
Merge branch 'main' into ofed (default to RL9)
sjpb Apr 9, 2024
4b09ba8
Merge branch 'main' into ofed
sjpb Apr 23, 2024
7b1afa0
bump CI images (non-OFED for RL8)
sjpb Apr 23, 2024
17 changes: 9 additions & 8 deletions .github/workflows/fatimage.yml
@@ -61,18 +61,19 @@ jobs:
  . environments/.stackhpc/activate
  cd packer/
  packer init .
- PACKER_LOG=1 packer build -only openstack.openhpc -on-error=${{ vars.PACKER_ON_ERROR }} -var-file=$PKR_VAR_environment_root/${{ vars.CI_CLOUD }}.pkrvars.hcl openstack.pkr.hcl
+ PACKER_LOG=1 packer build -on-error=${{ vars.PACKER_ON_ERROR }} -var-file=$PKR_VAR_environment_root/${{ vars.CI_CLOUD }}.pkrvars.hcl openstack.pkr.hcl
  env:
  PKR_VAR_os_version: ${{ matrix.os_version }}

- - name: Get created image name from manifest
+ - name: Get created image names from manifest
  id: manifest
  run: |
  . venv/bin/activate
- IMAGE_ID=$(jq --raw-output '.builds[-1].artifact_id' packer/packer-manifest.json)
- while ! openstack image show -f value -c name $IMAGE_ID; do
- sleep 30
+ for IMAGE_ID in $(jq --raw-output '.builds[].artifact_id' packer/packer-manifest.json)
+ do
+ while ! openstack image show -f value -c name $IMAGE_ID; do
+ sleep 5
+ done
+ IMAGE_NAME=$(openstack image show -f value -c name $IMAGE_ID)
+ echo $IMAGE_NAME
  done
- IMAGE_NAME=$(openstack image show -f value -c name $IMAGE_ID)
- echo "IMAGE_ID=${IMAGE_ID}" >> "$GITHUB_OUTPUT"
- echo "IMAGE_NAME=${IMAGE_NAME}" >> "$GITHUB_OUTPUT"
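Note on the hunk above: the inner `while` loop polls `openstack image show` with no upper bound, so a missing image would keep the job running until the CI timeout. As an illustration only, a bounded variant could be sketched as below. The `retry` helper and its argument order are hypothetical and not part of this PR.

```shell
# Hypothetical helper (not part of this PR): run a command until it succeeds,
# giving up after a fixed number of attempts.
# usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1  # out of attempts: report failure to the caller
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}
```

For example, `retry 60 5 openstack image show -f value -c name "$IMAGE_ID"` would give up after roughly five minutes instead of polling forever.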
3 changes: 2 additions & 1 deletion ansible/.gitignore
@@ -52,4 +52,5 @@ roles/*
  !roles/image_build/**
  !roles/persist_hostkeys/
  !roles/persist_hostkeys/**
- !roles/requirements.yml
+ !roles/ofed/
+ !roles/ofed/**
8 changes: 8 additions & 0 deletions ansible/bootstrap.yml
@@ -196,3 +196,11 @@
  - name: update facts
  setup:
  when: (sestatus.changed | default(false)) or (sestatus.reboot_required | default(false))
+
+ - hosts: ofed
+ gather_facts: no
+ become: yes
+ tags: ofed
+ tasks:
+ - include_role:
+ name: ofed
21 changes: 21 additions & 0 deletions ansible/roles/ofed/README.md
@@ -0,0 +1,21 @@
# ofed

This role installs Mellanox OFED:
- It checks that the running kernel is the latest installed one, and errors if not.
- Installation uses the `mlnxofedinstall` command, with support for the running kernel
and (by default) without firmware updates.

Because OFED installation takes a long time, this role should generally only be run during image build,
for example by setting:

```
environments/groups/<environment>/groups:
[ofed:children]
builder
```

# Role variables

See `defaults/main.yml`

Note: Ansible facts are required unless `ofed_distro_version` and `ofed_arch` are set explicitly.
28 changes: 28 additions & 0 deletions ansible/roles/ofed/defaults/main.yml
@@ -0,0 +1,28 @@
ofed_version: 24.01-0.3.3.1
ofed_download_url: https://content.mellanox.com/ofed/MLNX_OFED-{{ ofed_version }}/MLNX_OFED_LINUX-{{ ofed_version }}-{{ ofed_distro }}{{ ofed_distro_version }}-{{ ofed_arch }}.tgz
ofed_distro: rhel # NB: not expected to work on other distros due to installation differences
ofed_distro_version: "{{ ansible_distribution_version }}" # e.g. '8.9'
ofed_arch: "{{ ansible_architecture }}"
ofed_tmp_dir: /tmp
ofed_update_firmware: false
ofed_build_packages: # may require additional packages depending on ofed_package_selection
- autoconf
- automake
- gcc
- gcc-gfortran
- kernel-devel-{{ _ofed_loaded_kernel.stdout | trim }}
- kernel-rpm-macros
- libtool
- lsof
- patch
- pciutils
- perl
- rpm-build
- tcl
- tk
ofed_build_rl8_packages:
- gdb-headless
- python36
ofed_package_selection: # list of package selection flags for mlnxofedinstall script
- hpc
- with-nfsrdma
73 changes: 73 additions & 0 deletions ansible/roles/ofed/tasks/install.yml
@@ -0,0 +1,73 @@
- name: Get installed kernels
command: dnf list --installed kernel
register: _ofed_dnf_kernels
changed_when: false

- name: Determine running kernel
command: uname -r # e.g. 4.18.0-513.18.1.el8_9.x86_64
register: _ofed_loaded_kernel
changed_when: false

- name: Check current kernel is newest installed
assert:
that: _ofed_loaded_kernel.stdout == _ofed_dnf_kernels_newest
fail_msg: "Kernel {{ _ofed_loaded_kernel.stdout }} is loaded but newer {{ _ofed_dnf_kernels_newest }} is installed: consider rebooting?"
vars:
_ofed_dnf_kernels_newest: >-
{{ _ofed_dnf_kernels.stdout_lines[1:] | map('regex_replace', '^\w+\.(\w+)\s+(\S+)\s+\S+\s*$', '\2.\1') | community.general.version_sort | last }}
# dnf line format e.g. "kernel.x86_64 4.18.0-513.18.1.el8_9 @baseos "

- name: Enable epel
dnf:
name: epel-release

- name: Check for existing OFED installation
command: ofed_info
changed_when: false
failed_when:
- _ofed_info.rc > 0
- "'No such file or directory' not in _ofed_info.msg"
register: _ofed_info

- name: Install build prerequisites
dnf:
name: "{{ ofed_build_packages + (ofed_build_rl8_packages if ofed_distro_version == '8.9' else []) }}"
when: "'MLNX_OFED_LINUX-' + ofed_version not in _ofed_info.stdout"
# don't want to install a load of prereqs unnecessarily

- name: Download and unpack Mellanox OFED tarball
ansible.builtin.unarchive:
src: "{{ ofed_download_url }}"
dest: "{{ ofed_tmp_dir }}"
remote_src: yes
become: no
when: "'MLNX_OFED_LINUX-' + ofed_version not in _ofed_info.stdout"

# Below from https://docs.nvidia.com/networking/display/mlnxofedv24010331/user+manual
- name: Run OFED install script
command:
cmd: >
./mlnxofedinstall
--add-kernel-support
{% if not ofed_update_firmware %}--without-fw-update{% endif %}
--force
--skip-repo
{% for pkgsel in ofed_package_selection %}
--{{ pkgsel }}
{% endfor %}
chdir: "{{ ofed_tmp_dir }}/MLNX_OFED_LINUX-{{ ofed_version }}-{{ ofed_distro }}{{ ofed_distro_version }}-{{ ofed_arch }}/"
register: _ofed_install
when: "'MLNX_OFED_LINUX-' + ofed_version not in _ofed_info.stdout"
async: "{{ 45 * 60 }}" # wait for up to 45 minutes
poll: 15 # check every 15 seconds

- name: Update initramfs
command:
cmd: dracut -f
when: '"update your initramfs" in _ofed_install.stdout | default("")'
failed_when: false # always shows errors due to deleted modules for inbox RDMA drivers

- name: Load the new driver
command:
cmd: /etc/init.d/openibd restart
when: '"To load the new driver" in _ofed_install.stdout | default("")'
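Note on the `Check current kernel is newest installed` task above: it rewrites each `dnf list --installed kernel` line into `uname -r` form and takes the highest version. A rough shell equivalent is sketched below for experimenting outside Ansible. The function name `newest_kernel` is hypothetical, and GNU `sort -V` is assumed to be a close enough stand-in for `community.general.version_sort`.

```shell
# Hypothetical helper, not part of the role: read `dnf list --installed kernel`
# output on stdin and print the newest kernel in `uname -r` form.
newest_kernel() {
  # 1. Skip the header line.
  # 2. Turn e.g. "kernel.x86_64 4.18.0-513.18.1.el8_9 @baseos"
  #    into "4.18.0-513.18.1.el8_9.x86_64".
  # 3. Version-sort (assumes GNU coreutils) and keep the last, i.e. newest, entry.
  tail -n +2 |
    sed -E 's/^[[:alnum:]_]+\.([[:alnum:]_]+)[[:space:]]+([^[:space:]]+)[[:space:]]+[^[:space:]]+[[:space:]]*$/\2.\1/' |
    sort -V |
    tail -n 1
}
```

A manual version of the role's assertion would then be `[ "$(uname -r)" = "$(dnf list --installed kernel | newest_kernel)" ] || echo "newer kernel installed: consider rebooting"`.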
1 change: 1 addition & 0 deletions ansible/roles/ofed/tasks/main.yml
@@ -0,0 +1 @@
- include_tasks: install.yml
25 changes: 25 additions & 0 deletions ansible/roles/openondemand/tasks/vnc_compute.yml
@@ -10,6 +10,22 @@
  yum:
  name: epel-release

+ - name: Check /etc/init.d
+ ansible.builtin.stat:
+ path: /etc/init.d
+ register: init_d
+
+ - name: Move OFED-installed init scripts
+ # turbovnc installs chkconfig which symlinks /etc/init.d from /etc/rc.d/init.d
+ # but OFED has already created that and installed files in it.
+ # See https://access.redhat.com/solutions/6969215
+ ansible.builtin.command:
+ cmd: mv /etc/init.d /etc/init.d.orig
+ creates: /etc/init.d.orig
+ when:
+ - init_d.stat.exists
+ - not init_d.stat.islnk
+
  - name: Install VNC-related packages
  tags: install
  dnf:
@@ -19,6 +35,15 @@
  - python3.9
  - dbus-x11

+ - name: Replace OFED-installed init scripts
+ ansible.builtin.copy:
+ src: /etc/init.d.orig/ # trailing / to get contents
+ dest: /etc/init.d
+ remote_src: true
+ when:
+ - init_d.stat.exists
+ - not init_d.stat.islnk
+
  - name: Install Xfce desktop
  tags: install
  yum:
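Note on the two OFED-related tasks added in this file: they bracket the VNC package install. The real `/etc/init.d` directory is moved aside before `chkconfig` replaces it with a symlink, and its contents are copied back afterwards. The same sequence is sketched below with hypothetical function names, taking the path as a parameter so it can be tried on a scratch directory. It approximates the tasks rather than reproducing them: the restore step here keys off the `.orig` directory rather than the earlier `stat` result.

```shell
# Hypothetical helpers illustrating the workaround; not part of the role.

# Move a real (non-symlink) directory aside, once.
stash_real_dir() {
  dir=$1
  if [ -d "$dir" ] && [ ! -L "$dir" ] && [ ! -e "$dir.orig" ]; then
    mv "$dir" "$dir.orig"
  fi
}

# Copy the stashed contents back into whatever now lives at the original path
# (for /etc/init.d, the symlink target created by the chkconfig package).
restore_stashed_contents() {
  dir=$1
  if [ -d "$dir.orig" ]; then
    cp -a "$dir.orig/." "$dir/"
  fi
}
```

On a real node the order would be `stash_real_dir /etc/init.d`, then the package install, then `restore_stashed_contents /etc/init.d`.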
3 changes: 2 additions & 1 deletion environments/.stackhpc/LEAFCLOUD.pkrvars.hcl
@@ -1,6 +1,7 @@
- flavor = "en1.xsmall"
+ flavor = "ec1.medium"
  use_blockstorage_volume = true
  volume_size = 12 # GB. Compatible with SMS-lab's general.v1.tiny
+ volume_size_ofed = 15 # GB
  volume_type = "unencrypted"
  image_disk_format = "qcow2"
  networks = ["909e49e8-6911-473a-bf88-0495ca63853c"] # slurmapp-ci
6 changes: 3 additions & 3 deletions environments/.stackhpc/terraform/main.tf
@@ -28,9 +28,9 @@ variable "cluster_image" {
  description = "single image for all cluster nodes, keyed by os_version - a convenience for CI"
  type = map(string)
  default = {
- # https://github.com/stackhpc/ansible-slurm-appliance/pull/382
- RL8: "openhpc-RL8-240327-1050-4812f852"
- RL9: "openhpc-RL9-240327-1026-4812f852"
+ # https://github.com/stackhpc/ansible-slurm-appliance/pull/353
+ RL8: "openhpc-RL8-240423-1002-4b09ba85"
+ RL9: "openhpc-ofed-RL9-240423-1059-4b09ba85"
  }
  }

67 changes: 44 additions & 23 deletions packer/openstack.pkr.hcl
@@ -130,6 +130,11 @@ variable "volume_size" {
  default = null # When not specified use the size of the builder instance root disk
  }

+ variable "volume_size_ofed" {
+ type = number
+ default = null # When not specified use the size of the builder instance root disk
+ }
+
  variable "image_disk_format" {
  type = string
  default = null # When not specified use the image default
@@ -141,39 +146,55 @@ variable "metadata" {
  }

  source "openstack" "openhpc" {
- flavor = "${var.flavor}"
- volume_size = "${var.volume_size}"
- use_blockstorage_volume = "${var.use_blockstorage_volume}"
+ # Build VM:
+ flavor = var.flavor
+ use_blockstorage_volume = var.use_blockstorage_volume
  volume_type = var.volume_type
- image_disk_format = "${var.image_disk_format}"
- metadata = "${var.metadata}"
- networks = "${var.networks}"
- ssh_username = "${var.ssh_username}"
+ metadata = var.metadata
+ networks = var.networks
+ floating_ip_network = var.floating_ip_network
+ security_groups = var.security_groups
+
+ # Input image:
+ source_image = "${var.fatimage_source_image[var.os_version]}"
+ source_image_name = "${var.fatimage_source_image_name[var.os_version]}" # NB: must already exist in OpenStack
+
+ # SSH:
+ ssh_username = var.ssh_username
  ssh_timeout = "20m"
- ssh_private_key_file = "${var.ssh_private_key_file}" # TODO: doc same requirements as for qemu build?
- ssh_keypair_name = "${var.ssh_keypair_name}" # TODO: doc this
- ssh_bastion_host = "${var.ssh_bastion_host}"
- ssh_bastion_username = "${var.ssh_bastion_username}"
- ssh_bastion_private_key_file = "${var.ssh_bastion_private_key_file}"
- security_groups = "${var.security_groups}"
- image_visibility = "${var.image_visibility}"
- }
-
- # The "fat" image build with all binaries:
+ ssh_private_key_file = var.ssh_private_key_file
+ ssh_keypair_name = var.ssh_keypair_name # TODO: doc this
+ ssh_bastion_host = var.ssh_bastion_host
+ ssh_bastion_username = var.ssh_bastion_username
+ ssh_bastion_private_key_file = var.ssh_bastion_private_key_file
+
+ # Output image:
+ image_disk_format = var.image_disk_format
+ image_visibility = var.image_visibility
+ image_name = "${source.name}-${var.os_version}-${local.timestamp}-${substr(local.git_commit, 0, 8)}"
+ }
+
+ # "fat" image builds:
  build {
+
+ # non-OFED:
+ source "source.openstack.openhpc" {
+ name = "openhpc"
+ volume_size = var.volume_size
+ }
+
+ # OFED:
  source "source.openstack.openhpc" {
- floating_ip_network = "${var.floating_ip_network}"
- source_image = "${var.fatimage_source_image[var.os_version]}"
- source_image_name = "${var.fatimage_source_image_name[var.os_version]}" # NB: must already exist in OpenStack
- image_name = "${source.name}-${var.os_version}-${local.timestamp}-${substr(local.git_commit, 0, 8)}" # similar to name from slurm_image_builder
+ name = "openhpc-ofed"
+ volume_size = var.volume_size_ofed
  }

  provisioner "ansible" {
  playbook_file = "${var.repo_root}/ansible/fatimage.yml"
- groups = ["builder", "control", "compute", "login"]
+ groups = concat(["builder", "control", "compute", "login"], [for g in split("-", "${source.name}"): g if g != "openhpc"])
  keep_inventory_file = true # for debugging
  use_proxy = false # see https://www.packer.io/docs/provisioners/ansible#troubleshooting
- extra_arguments = ["--limit", "builder", "-i", "${var.repo_root}/packer/ansible-inventory.sh", "-vv", "-e", "@${var.repo_root}/packer/${source.name}_extravars.yml"]
+ extra_arguments = ["--limit", "builder", "-i", "${var.repo_root}/packer/ansible-inventory.sh", "-vv", "-e", "@${var.repo_root}/packer/openhpc_extravars.yml"]
  }

  post-processor "manifest" {
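Note on the new `groups` expression in the provisioner: it derives extra Ansible groups from the Packer source name, so `openhpc` yields no extra groups and `openhpc-ofed` adds `ofed`. The same split-and-filter logic is sketched in shell below, for illustration only. The function name `extra_groups` is hypothetical.

```shell
# Hypothetical illustration of the HCL expression
#   [for g in split("-", source.name): g if g != "openhpc"]
extra_groups() {
  # Split the source name on "-", drop the "openhpc" part, print one group per line.
  # `|| true` keeps the exit status at 0 when grep filters out every line.
  printf '%s\n' "$1" | tr '-' '\n' | grep -vx 'openhpc' || true
}
```

With this sketch, `extra_groups openhpc-ofed` prints `ofed`, and `extra_groups openhpc` prints nothing.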