Commit f7e7760

Support lustre client (#447)
* WIP: add lustre role
* allow definition of multiple lustre_mounts
* fix lustre build for 2.15.5 release candidate
* simplify lustre defaults
* allow lustre install during build to get kernel version
* allow extending fat images with site-specific groups
* fix packer build so only roles for defined groups run
* enable control of 'extra' build image name
* bump to release lustre
* add lnet configuration
* simplify lustre mount logic
* provide lnet config
* autodetermine lustre interface
* WIP: validation needs fixing for lustre_mounts removal
* add working lnet.conf template
* refactor lustre role for multiple mounts, selectable lnet label
* remove unneeded comments from lustre taskfiles
* fix lustre net type
* fixup opensearch install permissions
* add docs for extra builds
* fix packer volume size definition
* fix missing image name for cuda build
* bump CI image
* update packer README for modified image vars
* move packer docs into docs/
* make packer extra build directly configurable
* tidy packer docs
* fix build error 'Error: Unset variable extra_build_volume_size'
* fix error with null default during volume size lookup
* note lnet protocol limitation
* bump CI image to test
1 parent 6f1554c commit f7e7760

17 files changed: +389 -98 lines changed

.github/workflows/fatimage.yml

Lines changed: 1 addition & 1 deletion
@@ -117,4 +117,4 @@ jobs:
           path: |
             ./image-id.txt
             ./image-name.txt
-          overwrite: true
+          overwrite: true

ansible/.gitignore

Lines changed: 2 additions & 1 deletion
@@ -58,4 +58,5 @@ roles/*
 !roles/squid/**
 !roles/tuned/
 !roles/tuned/**
-
+!roles/lustre/
+!roles/lustre/**

ansible/fatimage.yml

Lines changed: 11 additions & 2 deletions
@@ -25,7 +25,7 @@

 - hosts: builder
   become: yes
-  gather_facts: no
+  gather_facts: yes
   tasks:
     # - import_playbook: iam.yml
     - name: Install FreeIPA client
@@ -44,6 +44,11 @@
         name: stackhpc.os-manila-mount
         tasks_from: install.yml
       when: "'manila' in group_names"
+    - name: Install Lustre packages
+      include_role:
+        name: lustre
+        tasks_from: install.yml
+      when: "'lustre' in group_names"

 - import_playbook: extras.yml

@@ -57,6 +62,7 @@
         name: mysql
         tasks_from: install.yml
       when: "'mysql' in group_names"
+
     - name: OpenHPC
       import_role:
         name: stackhpc.openhpc
@@ -83,18 +89,21 @@
       import_role:
         name: openondemand
         tasks_from: vnc_compute.yml
+
       when: "'openondemand_desktop' in group_names"
+
     - name: Open Ondemand jupyter node
       import_role:
         name: openondemand
         tasks_from: jupyter_compute.yml
-      when: "'openondemand' in group_names"
+      when: "'openondemand_jupyter' in group_names"

     # - import_playbook: monitoring.yml:
     - import_role:
         name: opensearch
         tasks_from: install.yml
       when: "'opensearch' in group_names"
+
     # slurm_stats - nothing to do
     - import_role:
         name: filebeat

ansible/filesystems.yml

Lines changed: 10 additions & 0 deletions
@@ -24,3 +24,13 @@
   tasks:
     - include_role:
         name: stackhpc.os-manila-mount
+
+- name: Setup Lustre clients
+  hosts: lustre
+  become: true
+  tags: lustre
+  tasks:
+    - include_role:
+        name: lustre
+        # NB install is ONLY run in builder
+        tasks_from: configure.yml

ansible/roles/lustre/README.md

Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+# lustre
+
+Install and configure a Lustre client. This builds RPM packages from source.
+
+**NB:** The `install.yml` task file in this role should only be run during image build and is not idempotent. It installs the `kernel-devel` package; if that is not already installed (e.g. from an `ofed` installation), this may require enabling DNF package updates during the build using `update_enable=true`, which will also upgrade the kernel.
+
+**NB:** Currently this only supports RockyLinux 9.
+
+## Role Variables
+
+- `lustre_version`: Optional str. Version of lustre to build, default `2.15.5` which is the first version with EL9 support.
+- `lustre_lnet_label`: Optional str. The "lnet label" part of the host's NID, e.g. `tcp0`. Only the `tcp` protocol type is currently supported. Default `tcp`.
+- `lustre_mgs_nid`: Required str. The NID(s) for the MGS, e.g. `192.168.227.11@tcp1` (separate multiple MGS NIDs using `:`).
+- `lustre_mounts`: Required list. Define Lustre filesystems and mountpoints as a list of dicts with keys:
+    - `fs_name`: Required str. The name of the filesystem to mount.
+    - `mount_point`: Required str. Path to mount filesystem at.
+    - `mount_state`: Optional mount state, as for [ansible.posix.mount](https://docs.ansible.com/ansible/latest/collections/ansible/posix/mount_module.html#parameter-state). Default is `lustre_mount_state`.
+    - `mount_options`: Optional mount options. Default is `lustre_mount_options`.
+- `lustre_mount_state`: Optional default mount state for all mounts, as for [ansible.posix.mount](https://docs.ansible.com/ansible/latest/collections/ansible/posix/mount_module.html#parameter-state). Default is `mounted`.
+- `lustre_mount_options`: Optional default mount options. Default values are systemd defaults from [Lustre client docs](http://wiki.lustre.org/Mounting_a_Lustre_File_System_on_Client_Nodes).
+
+The following variables control the package build and install and should not generally need changing:
+- `lustre_build_packages`: Optional list. Prerequisite packages required to build Lustre. See `defaults/main.yml`.
+- `lustre_build_dir`: Optional str. Path to build lustre at, default `/tmp/lustre-release`.
+- `lustre_configure_opts`: Optional list. Options to `./configure` command. Default builds client rpms supporting Mellanox OFED, without support for GSS keys.
+- `lustre_rpm_globs`: Optional list. Shell glob patterns for rpms to install. Note order is important as the built RPMs are not in a yum repo. Default is just the `kmod-lustre-client` and `lustre-client` packages.
+- `lustre_build_cleanup`: Optional bool. Whether to uninstall prerequisite packages and delete the build directories etc. Default `true`.
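As an illustration of the role variables described in the README above, a variables file for hosts in the `lustre` group might look like the following minimal sketch. The filesystem names, mount points and the choice of where to place the file are assumptions made for illustration only. The MGS NID is the example value from the variable description, and `lustre_lnet_label` is set to `tcp1` on the assumption that the client should join the same LNet network as that NID.

```yaml
# Hypothetical variables for hosts in the `lustre` group; all values are illustrative.
lustre_lnet_label: tcp1                  # assumed to match the network part of the MGS NID
lustre_mgs_nid: 192.168.227.11@tcp1      # example NID taken from the README text
lustre_mounts:
  - fs_name: scratch                     # hypothetical filesystem name
    mount_point: /mnt/lustre/scratch     # hypothetical mount point
  - fs_name: projects                    # hypothetical filesystem name
    mount_point: /mnt/lustre/projects    # hypothetical mount point
    mount_state: present                 # per-mount override of lustre_mount_state
```

Entries that omit `mount_state` or `mount_options` fall back to `lustre_mount_state` and `lustre_mount_options` respectively, as described in the variable list.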
Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+lustre_version: '2.15.5' # https://www.lustre.org/lustre-2-15-5-released/
+lustre_lnet_label: tcp
+#lustre_mgs_nid:
+lustre_mounts: []
+lustre_mount_state: mounted
+lustre_mount_options: 'defaults,_netdev,noauto,x-systemd.automount,x-systemd.requires=lnet.service'
+
+# below variables are for build and should not generally require changes
+lustre_build_packages:
+  - "kernel-devel-{{ ansible_kernel }}"
+  - git
+  - gcc
+  - libtool
+  - python3
+  - python3-devel
+  - openmpi
+  - elfutils-libelf-devel
+  - libmount-devel
+  - libnl3-devel
+  - libyaml-devel
+  - rpm-build
+  - kernel-abi-stablelists
+  - libaio
+  - libaio-devel
+lustre_build_dir: /tmp/lustre-release
+lustre_configure_opts:
+  - --disable-server
+  - --with-linux=/usr/src/kernels/*
+  - --with-o2ib=/usr/src/ofa_kernel/default
+  - --disable-maintainer-mode
+  - --disable-gss-keyring
+  - --enable-mpitests=no
+lustre_rpm_globs: # NB: order is important here, as not installing from a repo
+  - "kmod-lustre-client-{{ lustre_version | split('.') | first }}*" # only take part of the version as -RC versions produce _RC rpms
+  - "lustre-client-{{ lustre_version | split('.') | first }}*"
+lustre_build_cleanup: true
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
+- name: Gather Lustre interface info
+  shell:
+    cmd: |
+      ip r get {{ _lustre_mgs_ip }}
+  changed_when: false
+  register: _lustre_ip_r_mgs
+  vars:
+    _lustre_mgs_ip: "{{ lustre_mgs_nid | split('@') | first }}"
+
+- name: Set facts for Lustre interface
+  set_fact:
+    _lustre_interface: "{{ _lustre_ip_r_mgs_info[4] }}"
+    _lustre_ip: "{{ _lustre_ip_r_mgs_info[6] }}"
+  vars:
+    _lustre_ip_r_mgs_info: "{{ _lustre_ip_r_mgs.stdout_lines.0 | split }}"
+    # first line e.g. "10.167.128.1 via 10.179.0.2 dev eth0 src 10.179.3.149 uid 1000"
+
+- name: Write LNet configuration file
+  template:
+    src: lnet.conf.j2
+    dest: /etc/lnet.conf # exists from package install, expected by lnet service
+    owner: root
+    group: root
+    mode: u=rw,go=r # from package install
+  register: _lnet_conf
+
+- name: Ensure lnet service state
+  systemd:
+    name: lnet
+    state: "{{ 'restarted' if _lnet_conf.changed else 'started' }}"
+
+- name: Ensure mount points exist
+  ansible.builtin.file:
+    path: "{{ item.mount_point }}"
+    state: directory
+  loop: "{{ lustre_mounts }}"
+  when: "(item.mount_state | default(lustre_mount_state)) != 'absent'"
+
+- name: Mount lustre filesystem
+  ansible.posix.mount:
+    fstype: lustre
+    src: "{{ lustre_mgs_nid }}:/{{ item.fs_name }}"
+    path: "{{ item.mount_point }}"
+    state: "{{ (item.mount_state | default(lustre_mount_state)) }}"
+    opts: "{{ item.mount_options | default(lustre_mount_options) }}"
+  loop: "{{ lustre_mounts }}"
+
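The interface detection in the tasks above relies on fixed positions in the `ip r get` output. The following standalone sketch, which is not part of the commit, applies the same `split` and indexing to the sample line quoted in the task comment. The play and the `_sample` and `_fields` variable names are hypothetical.

```yaml
# Standalone sketch: shows which whitespace-separated fields the positional
# indexes select, using the sample line from the task comment.
- hosts: localhost
  gather_facts: false
  vars:
    # Sample `ip r get` output for a destination reached through a gateway
    _sample: "10.167.128.1 via 10.179.0.2 dev eth0 src 10.179.3.149 uid 1000"
    _fields: "{{ _sample | split }}"
  tasks:
    - name: Show the fields used for interface and client IP
      ansible.builtin.debug:
        msg:
          interface: "{{ _fields[4] }}"   # the token following "dev"
          ip: "{{ _fields[6] }}"          # the token following "src"
```

For this sample the two fields are `eth0` and `10.179.3.149`. These positions assume the line contains a `via <gateway>` pair. For a destination on a directly connected network, `ip` typically omits that pair, so the same indexes would select different tokens.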
Lines changed: 70 additions & 0 deletions
@@ -0,0 +1,70 @@
+- name: Install lustre build prerequisites
+  ansible.builtin.dnf:
+    name: "{{ lustre_build_packages }}"
+  register: _lustre_dnf_build_packages
+
+- name: Clone lustre git repo
+  # https://git.whamcloud.com/?p=fs/lustre-release.git;a=summary
+  ansible.builtin.git:
+    repo: git://git.whamcloud.com/fs/lustre-release.git
+    dest: "{{ lustre_build_dir }}"
+    version: "{{ lustre_version }}"
+
+- name: Prepare for lustre configuration
+  ansible.builtin.command:
+    cmd: sh ./autogen.sh
+    chdir: "{{ lustre_build_dir }}"
+
+- name: Configure lustre build
+  ansible.builtin.command:
+    cmd: "./configure {{ lustre_configure_opts | join(' ') }}"
+    chdir: "{{ lustre_build_dir }}"
+
+- name: Build lustre
+  ansible.builtin.command:
+    cmd: make rpms
+    chdir: "{{ lustre_build_dir }}"
+
+- name: Find rpms
+  ansible.builtin.find:
+    paths: "{{ lustre_build_dir }}"
+    patterns: "{{ lustre_rpm_globs }}"
+    use_regex: false
+  register: _lustre_find_rpms
+
+- name: Check rpms found
+  assert:
+    that: _lustre_find_rpms.files | length
+    fail_msg: "No lustre rpms found with lustre_rpm_globs = {{ lustre_rpm_globs }}"
+
+- name: Install lustre rpms
+  ansible.builtin.dnf:
+    name: "{{ _lustre_find_rpms.files | map(attribute='path') }}"
+    disable_gpg_check: yes
+
+- block:
+    - name: Remove lustre build prerequisites
+      # NB Only remove ones this role installed which weren't upgrades
+      ansible.builtin.dnf:
+        name: "{{ _new_pkgs }}"
+        state: absent
+      vars:
+        _installed_pkgs: |
+          {{
+            _lustre_dnf_build_packages.results |
+            select('match', 'Installed:') |
+            map('regex_replace', '^Installed: (.+?)-[0-9].*$', '\1')
+          }}
+        _removed_pkgs: |
+          {{
+            _lustre_dnf_build_packages.results |
+            select('match', 'Removed:') |
+            map('regex_replace', '^Removed: (.+?)-[0-9].*$', '\1')
+          }}
+        _new_pkgs: "{{ _installed_pkgs | difference(_removed_pkgs) }}"
+
+    - name: Delete lustre build dir
+      file:
+        path: "{{ lustre_build_dir }}"
+        state: absent
+  when: lustre_build_cleanup | bool
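The cleanup block above removes only packages that were newly installed rather than upgraded. The sketch below, which is not part of the commit, runs the same filter chain on made-up result strings to show the intent. The play, the variable names and the package versions are hypothetical, and `| list` is added only to make the list conversion explicit.

```yaml
# Standalone sketch of the name-extraction logic; the result strings are
# made-up examples in the "Installed: <name>-<version>" form.
- hosts: localhost
  gather_facts: false
  vars:
    _sample_results:
      - "Installed: libyaml-devel-0.2.5-7.el9.x86_64"
      - "Installed: gcc-11.4.1-3.el9.x86_64"
      - "Removed: gcc-11.3.1-4.el9.x86_64"
    # Names of packages reported as installed
    _installed: >-
      {{
        _sample_results |
        select('match', 'Installed:') |
        map('regex_replace', '^Installed: (.+?)-[0-9].*$', '\1') |
        list
      }}
    # Names of packages reported as removed, i.e. replaced by an upgrade
    _removed: >-
      {{
        _sample_results |
        select('match', 'Removed:') |
        map('regex_replace', '^Removed: (.+?)-[0-9].*$', '\1') |
        list
      }}
  tasks:
    - name: Show packages that were new installs rather than upgrades
      ansible.builtin.debug:
        msg: "{{ _installed | difference(_removed) }}"
```

In this sample `gcc` appears as both installed and removed, so it is treated as an upgrade and kept, and the resulting list should contain only `libyaml-devel`.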
Lines changed: 27 additions & 0 deletions
@@ -0,0 +1,27 @@
+- name: Assert using RockyLinux 9
+  assert:
+    that: ansible_distribution_major_version | int == 9
+    fail_msg: The 'lustre' role requires RockyLinux 9
+
+- name: Check kernel-devel package is installed
+  command: "dnf list --installed kernel-devel-{{ ansible_kernel }}"
+  changed_when: false
+  # NB: we don't check here the kernel will remain the same after reboot etc, see ofed/install.yml
+
+- name: Ensure SELinux in permissive mode
+  assert:
+    that: selinux_state in ['permissive', 'disabled']
+    fail_msg: "SELinux must be permissive for Lustre not '{{ selinux_state }}'; see variable selinux_state"
+
+- name: Ensure lustre_mgs_nid is defined
+  assert:
+    that: lustre_mgs_nid is defined
+    fail_msg: Variable lustre_mgs_nid must be defined
+
+- name: Ensure lustre_mounts entries define filesystem name and mount point
+  assert:
+    that:
+      - item.fs_name is defined
+      - item.mount_point is defined
+    fail_msg: All lustre_mounts entries must specify fs_name and mount_point
+  loop: "{{ lustre_mounts }}"
Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+net:
+    - net type: {{ lustre_lnet_label }}
+      local NI(s):
+        - nid: {{ _lustre_ip }}@{{ lustre_lnet_label }}
+          interfaces:
+              0: {{ _lustre_interface }}
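To make the template above concrete, the following sketch shows what the rendered `/etc/lnet.conf` might look like. It assumes the default `lustre_lnet_label` of `tcp` together with the interface and address from the sample `ip r get` line in the configure tasks (`eth0` and `10.179.3.149`). The indentation shown is an assumption, as the original whitespace is not recoverable from this page.

```yaml
# Hypothetical rendered /etc/lnet.conf, assuming lustre_lnet_label=tcp,
# _lustre_ip=10.179.3.149 and _lustre_interface=eth0
net:
    - net type: tcp
      local NI(s):
        - nid: 10.179.3.149@tcp
          interfaces:
              0: eth0
```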

ansible/validate.yml

Lines changed: 8 additions & 0 deletions
@@ -85,3 +85,11 @@
     - import_role:
         name: freeipa
         tasks_from: validate.yml
+
+- name: Validate lustre configuration
+  hosts: lustre
+  tags: lustre
+  tasks:
+    - import_role:
+        name: lustre
+        tasks_from: validate.yml
