Get custom slurm working in lab #25

Merged 9 commits on Feb 23, 2024
21 changes: 21 additions & 0 deletions environments/lab/hooks/build.yml
@@ -0,0 +1,21 @@
- name: Ensure build directory exists
  file:
    state: directory
    path: "{{ appliances_environment_root }}/slurmbuild/{{ slurm_build_version }}"

- name: Ensure build directory is empty
  shell:
    cmd: "rm -rvf {{ appliances_environment_root }}/slurmbuild/{{ slurm_build_version }}/*"
  register: _empty_build_dir
  changed_when: _empty_build_dir.stdout_lines | length > 0

- name: Build container
  command:
    cmd: >-
      podman --tmpdir=/mnt/image-storage/tmp build
      --build-arg SLURM_PREFIX={{ slurm_build_dir }}
      --build-arg SLURM_SYSCONFDIR={{ openhpc_slurm_conf_path | dirname }}
      . -t slurm-{{ slurm_build_version }}
      --output ./{{ slurm_build_version }}
    chdir: "{{ appliances_environment_root }}/slurmbuild"
# TODO: doesn't look idempotent although it is
58 changes: 41 additions & 17 deletions environments/lab/hooks/pre.yml
@@ -1,26 +1,50 @@
- name: NREL lab fixes - Workaround no internal DNS
  hosts: all
  become: true
  gather_facts: false
  tags: etc_hosts
  tasks:
    - name: Create /etc/hosts for all nodes as DNS doesn't work
      # the interface used as ansible_host is defined by terraform's `access_network` parameter, so this is deterministic for multi-rail hosts
      blockinfile:
        path: /etc/hosts
        create: yes
        state: present
        block: |
          {% for hostname in groups['all'] %}
          {{ hostvars[hostname]['ansible_host'] }} {{ hostname }}
          {% endfor %}

- name: NREL lab fixes - For compute nodes
  hosts: compute
  become: true
  gather_facts: false
  tags: scratch
  tasks:
    - name: Create scratch directory - on local SSD on prod
      file:
        path: /var/scratch
        state: directory

- name: Build custom Slurm
  hosts: localhost
  become: no
  gather_facts: no
  tags: slurm
  tasks:
    - include_tasks: build.yml

- name: Copy custom Slurm to storage
  hosts: control
  become: yes
  gather_facts: no
  tags: slurm
  tasks:
    - name: Ensure shared slurm directory exists
      file:
        state: directory
        path: "{{ slurm_build_dir }}" # NB this will be exported by nfs filesystems.yml
        owner: root
        group: root
        mode: u=rwX,go=rX

    - name: Copy custom slurm
      copy:
        src: "{{ item.src }}"
        dest: "{{ item.dest }}"
        owner: root
        group: root
        mode: u=rwx,go=rx
      loop:
        - src: "{{ slurm_local_build_dir }}/sbin/"
          dest: "{{ openhpc_sbin_dir }}"
        - src: "{{ slurm_local_build_dir }}/lib/"
          dest: "{{ openhpc_lib_dir }}"
        - src: "{{ slurm_local_build_dir }}/bin/"
          dest: "{{ openhpc_bin_dir }}"
      vars:
        slurm_local_build_dir: "{{ appliances_environment_root }}/slurmbuild/{{ slurm_build_version }}"

4 changes: 4 additions & 0 deletions environments/lab/inventory/extra_groups
@@ -1,3 +1,7 @@
 # Don't have os_manila for lab so fake it using NFS
 [nfs:children]
 openhpc
+
+# Don't have working internal DNS
+[etc_hosts:children]
+cluster
@@ -7,12 +7,12 @@ slurm_build_path: /nopt/vtest/slurm
 slurm_build_dir: "{{ slurm_build_path }}/{{ slurm_build_version }}"

 openhpc_sbin_dir: "{{ slurm_build_dir }}/sbin"
-openhpc_lib_dir: "{{ slurm_build_dir }}/slurm"
+openhpc_lib_dir: "{{ slurm_build_dir }}/lib" # TODO: investigating RPATH shows it expects to find /nopt/vtest/slurm/23.11.0/lib/slurm which needs this
 openhpc_bin_dir: "{{ slurm_build_dir }}/bin"
 openhpc_slurm_conf_path: "{{ slurm_build_dir }}/etc/slurm.conf"


 openhpc_slurm_partitions:
   - name: "sm"
-    default: NO
+    default: YES
     maxtime: "1-0" # 1 days 0 hours
66 changes: 66 additions & 0 deletions environments/lab/slurmbuild/Dockerfile
@@ -0,0 +1,66 @@
FROM rockylinux:9 as build-stage

# Dockerfile comments must start their own line; a trailing "#" is parsed as part of the instruction.
# From https://www.schedmd.com/downloads.php
ARG SLURM_VERSION=23.11.0
# Should match directory Slurm is installed at
ARG SLURM_PREFIX=/opt/slurm
# Should match directory slurm.conf will be in
ARG SLURM_SYSCONFDIR=/etc/slurm

RUN set -ex \
    && yum makecache \
    && yum -y update \
    && yum -y install dnf-plugins-core epel-release \
    && yum -y install dnf-plugins-core \
    && yum config-manager --set-enabled crb \
    && yum -y install \
       wget \
       bzip2 \
       perl \
       gcc \
       gcc-c++ \
       git \
       gnupg \
       make \
       munge \
       munge-devel \
       python3-devel \
       python3-pip \
       python3 \
       mariadb-server \
       mariadb-devel \
       psmisc \
       bash-completion \
       vim-enhanced \
       http-parser-devel \
       json-c-devel \
       mpitests-openmpi \
       pmix-devel \
       hwloc \
       hwloc-devel \
       dbus-devel \
    && yum clean all \
    && rm -rf /var/cache/yum

RUN pip3 install Cython nose

RUN set -x \
    && wget https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2 \
    && tar --bzip -x -f slurm*tar.bz2

WORKDIR /slurm-${SLURM_VERSION}

RUN set -x && ./configure \
    --enable-debug \
    --prefix=${SLURM_PREFIX} \
    --without-rpath \
    --sysconfdir=${SLURM_SYSCONFDIR} \
    --with-mysql_config=/usr/bin

RUN set -x && make install

ENTRYPOINT ["/bin/bash"]


FROM scratch as export-stage

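# Should match directory Slurm is installed at. ARG values do not carry across
# stages, so this default is kept in sync with the build stage default.
ARG SLURM_PREFIX=/opt/slurm
# RUN ls ${SLURM_PREFIX}
COPY --from=build-stage ${SLURM_PREFIX}/ .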
11 changes: 11 additions & 0 deletions environments/lab/slurmbuild/README.md
@@ -0,0 +1,11 @@
This uses a podman container to build Slurm, which is then copied out of the container into a version directory.

The following arguments to `./configure` are important:
- `--prefix` must match the path the binaries appear to be at (i.e. from the NFS client side). This is because:
  - The `slurm{ctld,d,dbd}` executables hardcode an RPATH, even when the `--without-rpath` flag is passed to `./configure`.
    This means that unless the path they are executed at matches the build prefix, they can't find `libslurmfull.so` on startup,
    even with entries in `/etc/ld.so.conf.d/`.
  - `PluginDir` defaults to a path based on the build prefix. Although it can be overridden in `slurm.conf`, the `slurmd`s do not appear to get this parameter when running configless, so they fail to start, reporting that the (default) plugin directory doesn't exist.
- `--sysconfdir` must match the path the `slurm.conf` file is at on the nodes. Otherwise `s*` commands running on nodes *without* `slurmd` (i.e. the control node only, for a standard Slurm appliance configuration) cannot find the configuration file unless the `SLURM_CONF` environment variable is set.
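
The path relationships in the list above can be made explicit with a small Python sketch. It is an illustration only: the helper name `slurm_paths` is hypothetical, the version `23.11.0` is assumed from the Dockerfile default, and the plugin directory is assumed to default to `<prefix>/lib/slurm`, as the RPATH note in the group vars suggests.

```python
import posixpath


def slurm_paths(build_path: str, version: str) -> dict:
    """Derive the paths that must agree between ./configure and the nodes.

    Mirrors the Ansible variables in this PR. The helper itself is
    hypothetical and exists only to show how the values relate.
    """
    prefix = posixpath.join(build_path, version)  # slurm_build_dir
    conf_path = posixpath.join(prefix, "etc", "slurm.conf")  # openhpc_slurm_conf_path
    return {
        "prefix": prefix,  # passed as --prefix
        "sysconfdir": posixpath.dirname(conf_path),  # passed as --sysconfdir
        "lib_dir": posixpath.join(prefix, "lib"),  # openhpc_lib_dir
        # Assumption: the default PluginDir is <prefix>/lib/slurm
        "plugin_dir": posixpath.join(prefix, "lib", "slurm"),
    }


paths = slurm_paths("/nopt/vtest/slurm", "23.11.0")
for name, value in paths.items():
    print(f"{name}: {value}")
```

If any of these values differs between build time and the path seen by the NFS clients, the failure modes described above (missing `libslurmfull.so`, missing plugin directory, missing `slurm.conf`) are the likely result.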

Note that a tmpdir is hardcoded to a volume mounted on the lab deploy host, due to its small root filesystem.
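
For reference, the build invocation used by the `Build container` task can be sketched as an argument list in Python. This is a minimal sketch, not the playbook's implementation: the function name `podman_build_argv` and the example values are assumptions, while the flags are the ones that appear in `build.yml`.

```python
import posixpath
import shlex


def podman_build_argv(version, prefix, conf_path, tmpdir="/mnt/image-storage/tmp"):
    """Assemble the same arguments as the 'Build container' task.

    The tmpdir default is the lab-specific path hardcoded in build.yml.
    """
    return [
        "podman", f"--tmpdir={tmpdir}", "build",
        "--build-arg", f"SLURM_PREFIX={prefix}",
        # Equivalent of the `openhpc_slurm_conf_path | dirname` filter
        "--build-arg", f"SLURM_SYSCONFDIR={posixpath.dirname(conf_path)}",
        ".", "-t", f"slurm-{version}",
        # Copies the export stage's files into ./<version>
        "--output", f"./{version}",
    ]


argv = podman_build_argv(
    "23.11.0",  # assumed version, from the Dockerfile default
    "/nopt/vtest/slurm/23.11.0",
    "/nopt/vtest/slurm/23.11.0/etc/slurm.conf",
)
print(shlex.join(argv))
```

Keeping the command as a list means it could be passed to `subprocess.run(argv, cwd=...)` without shell quoting concerns, with `cwd` set to the `slurmbuild` directory to match `chdir` in the task.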
5 changes: 3 additions & 2 deletions environments/nrel/inventory/group_vars/openhpc/overrides.yml
@@ -110,6 +110,7 @@ openhpc_generic_packages:
   - mpitests-openmpi

 # Additional parameters to set in slurm.conf - use yaml format
+openhpc_epilog: '/nopt/slurm/etc/epilog.d/*'
 openhpc_slurmd_spool_dir: /var/spool/slurm/slurmd
 openhpc_config_extra:
   LaunchParameters: use_interactive_step
@@ -135,7 +136,7 @@ openhpc_config_extra:
   # Prolog: '/nopt/slurm/etc/prolog.d/*'
   # PrologFlags: 'X11'
   # X11Parameters: 'local_xauthority'
-  # Epilog: '/nopt/slurm/etc/epilog.d/*'
+  Epilog: '<absent>' # /nopt/slurm/etc/epilog.d/*'
   # PrologEpilogTimeout: 180
   # UnkillableStepTimeout: 180

@@ -178,7 +179,7 @@ openhpc_config_extra:

   # SCHEDULING
   SchedulerType: 'sched/backfill'
-  SelectType: 'select/cons_res'
+  SelectType: 'select/cons_tres'
   SelectTypeParameters: 'CR_Core'
   EnforcePartLimits: 'ALL'
   SchedulerParameters:
2 changes: 1 addition & 1 deletion requirements.yml
@@ -3,7 +3,7 @@ roles:
   - src: stackhpc.nfs
     version: v23.12.1 # Tolerate stale nfs file handles
   - src: https://github.com/stackhpc/ansible-role-openhpc.git
-    version: feat/no-ohpc # https://github.com/stackhpc/ansible-role-openhpc/pull/162
+    version: 5b73b8a # https://github.com/stackhpc/ansible-role-openhpc/pull/163 # TODO: bump on release
     name: stackhpc.openhpc
   - src: https://github.com/stackhpc/ansible-node-exporter.git
     version: stackhpc