Deploy custom slurm in lab #22

Closed
wants to merge 3 commits into from
21 changes: 21 additions & 0 deletions environments/lab/hooks/build.yml
@@ -0,0 +1,21 @@
- name: Ensure build directory exists
  file:
    state: directory
    path: "{{ appliances_environment_root }}/slurmbuild/{{ slurm_build_version }}"

- name: Ensure build directory is empty
  shell:
    cmd: "rm -rvf {{ appliances_environment_root }}/slurmbuild/{{ slurm_build_version }}/*"
  register: _empty_build_dir
  changed_when: _empty_build_dir.stdout_lines | length > 0

- name: Build container
  command:
    cmd: >-
      podman --tmpdir=/mnt/image-storage/tmp build
      --build-arg SLURM_PREFIX={{ slurm_build_dir }}
      --build-arg SLURM_SYSCONFDIR={{ openhpc_slurm_conf_path | dirname }}
      . -t slurm-{{ slurm_build_version }}
      --output ./{{ slurm_build_version }}
    chdir: "{{ appliances_environment_root }}/slurmbuild"
  # TODO: doesn't look idempotent although it is
41 changes: 41 additions & 0 deletions environments/lab/hooks/pre.yml
@@ -2,8 +2,49 @@
  hosts: compute
  become: true
  gather_facts: false
  tags: scratch
  tasks:
    - name: Create scratch directory - on local SSD on prod
      file:
        path: /var/scratch
        state: directory

- name: Build custom Slurm
  hosts: localhost
  become: no
  gather_facts: no
  tags: slurm
  tasks:
    - include_tasks: build.yml

- name: Copy custom Slurm to storage
  hosts: control # doesn't matter, just needs to be one
  become: yes
  gather_facts: no
  tags: slurm
  tasks:
    - name: Ensure shared slurm directory exists
      file:
        state: directory
        path: "{{ slurm_build_dir }}"
        owner: root
        group: root
        mode: u=rwX,go=rX

    - name: Copy custom slurm
      copy:
        src: "{{ item.src }}"
        dest: "{{ item.dest }}"
        owner: root
        group: root
        mode: u=rwx,go=rx
      loop:
        # - src: "{{ slurm_local_build_dir }}/sbin/"
        #   dest: "{{ openhpc_sbin_dir }}"
        - src: "{{ slurm_local_build_dir }}/lib/"
          dest: "{{ openhpc_lib_dir }}"
        # - src: "{{ slurm_local_build_dir }}/bin/"
        #   dest: "{{ openhpc_bin_dir }}"
      vars:
        slurm_local_build_dir: "{{ appliances_environment_root }}/slurmbuild/{{ slurm_build_version }}"

@@ -7,7 +7,7 @@ slurm_build_path: /nopt/vtest/slurm
slurm_build_dir: "{{ slurm_build_path }}/{{ slurm_build_version }}"

openhpc_sbin_dir: "{{ slurm_build_dir }}/sbin"
-openhpc_lib_dir: "{{ slurm_build_dir }}/slurm"
+openhpc_lib_dir: "{{ slurm_build_dir }}/lib" # TODO: inspecting the RPATH shows the binaries expect to find /nopt/vtest/slurm/23.11.0/lib/slurm, which requires this value
openhpc_bin_dir: "{{ slurm_build_dir }}/bin"
openhpc_slurm_conf_path: "{{ slurm_build_dir }}/etc/slurm.conf"

66 changes: 66 additions & 0 deletions environments/lab/slurmbuild/Dockerfile
@@ -0,0 +1,66 @@
FROM rockylinux:9 as build-stage

# From https://www.schedmd.com/downloads.php
ARG SLURM_VERSION=23.11.0
# Should match directory Slurm is installed at
ARG SLURM_PREFIX=/opt/slurm
# Should match directory slurm.conf will be in
ARG SLURM_SYSCONFDIR=/etc/slurm

RUN set -ex \
    && yum makecache \
    && yum -y update \
    && yum -y install dnf-plugins-core epel-release \
    && yum -y install dnf-plugins-core \
    && yum config-manager --set-enabled crb \
    && yum -y install \
       wget \
       bzip2 \
       perl \
       gcc \
       gcc-c++ \
       git \
       gnupg \
       make \
       munge \
       munge-devel \
       python3-devel \
       python3-pip \
       python3 \
       mariadb-server \
       mariadb-devel \
       psmisc \
       bash-completion \
       vim-enhanced \
       http-parser-devel \
       json-c-devel \
       mpitests-openmpi \
       pmix-devel \
       hwloc \
       hwloc-devel \
       dbus-devel \
    && yum clean all \
    && rm -rf /var/cache/yum

RUN pip3 install Cython nose

RUN set -x \
    && wget https://download.schedmd.com/slurm/slurm-${SLURM_VERSION}.tar.bz2 \
    && tar --bzip -x -f slurm*tar.bz2

WORKDIR /slurm-${SLURM_VERSION}

RUN set -x && ./configure \
    --enable-debug \
    --prefix=${SLURM_PREFIX} \
    --without-rpath \
    --sysconfdir=${SLURM_SYSCONFDIR} \
    --with-mysql_config=/usr/bin

RUN set -x && make install

ENTRYPOINT ["/bin/bash"]


FROM scratch as export-stage

# Should match directory Slurm is installed at (same default as in build-stage)
ARG SLURM_PREFIX=/opt/slurm
# RUN ls ${SLURM_PREFIX}
COPY --from=build-stage ${SLURM_PREFIX}/ .
11 changes: 11 additions & 0 deletions environments/lab/slurmbuild/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
This uses a podman container to build Slurm, which is then copied out of the container into a version directory.

The following arguments to `./configure` are important:
- `--prefix` must match the path the binaries appear to be at (i.e. from the NFS client side). This is because:
    - The `slurm{ctld,d,dbd}` executables hardcode an RPATH, even when passing the `--without-rpath` flag to `./configure`.
      This means that unless the path they are executed at matches the build prefix, they can't find `libslurmfull.so` on startup,
      even with entries in `/etc/ld.so.conf.d/`.
    - `PluginDir` defaults to a path based on the build prefix. Although it can be overridden in `slurm.conf`, the `slurmd`s do not appear to get this parameter when running configless, so they fail to start, reporting that the (default) plugin directory doesn't exist.
- `--sysconfdir` must match the directory the `slurm.conf` file is in on the nodes. Otherwise `s*` commands running on nodes *without* `slurmd` (i.e. the control node only, for a standard Slurm appliance configuration) cannot find the configuration file unless the `SLURM_CONF` environment variable is set.

Note that a tmpdir is hardcoded to a volume mounted on the lab deploy host, due to its small root filesystem.
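
The `--prefix` and `--sysconfdir` constraints described above can be sanity-checked before deploying. The following is a minimal Python sketch, not part of this change: the function names `expected_paths` and `find_mismatches` are hypothetical, and it assumes that both `libslurmfull.so` and the default `PluginDir` live under `<prefix>/lib/slurm` (consistent with the RPATH observation noted in the variables file). It only compares path strings; it does not inspect any binaries.

```python
from pathlib import PurePosixPath


def expected_paths(prefix: str, sysconfdir: str) -> dict[str, str]:
    """Return the paths a build configured with --prefix/--sysconfdir is assumed to use."""
    # Assumption: libslurmfull.so and the default PluginDir are both in <prefix>/lib/slurm.
    lib_slurm = PurePosixPath(prefix) / "lib" / "slurm"
    return {
        "plugin_dir": str(lib_slurm),
        "libslurmfull": str(lib_slurm / "libslurmfull.so"),
        "slurm_conf": str(PurePosixPath(sysconfdir) / "slurm.conf"),
    }


def find_mismatches(
    prefix: str,
    sysconfdir: str,
    deployed_plugin_dir: str,
    deployed_conf_path: str,
) -> list[str]:
    """List the deployed paths that differ from what the build expects."""
    expected = expected_paths(prefix, sysconfdir)
    problems = []
    # PurePosixPath normalises trailing slashes before the comparison.
    if str(PurePosixPath(deployed_plugin_dir)) != expected["plugin_dir"]:
        problems.append(
            f"plugin dir is {deployed_plugin_dir}, build expects {expected['plugin_dir']}"
        )
    if str(PurePosixPath(deployed_conf_path)) != expected["slurm_conf"]:
        problems.append(
            f"slurm.conf is at {deployed_conf_path}, build expects {expected['slurm_conf']}"
        )
    return problems


if __name__ == "__main__":
    # Values taken from the lab environment variables in this change.
    prefix = "/nopt/vtest/slurm/23.11.0"
    for problem in find_mismatches(
        prefix,
        sysconfdir=f"{prefix}/etc",
        deployed_plugin_dir=f"{prefix}/lib/slurm",
        deployed_conf_path=f"{prefix}/etc/slurm.conf",
    ):
        print(problem)
```

With the previous `openhpc_lib_dir` value (`{{ slurm_build_dir }}/slurm`), copying the contents of the local `lib/` directory would place the plugins under `<prefix>/slurm/slurm`, which this check would report as a mismatch.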
@@ -178,7 +178,7 @@ openhpc_config_extra:

# SCHEDULING
SchedulerType: 'sched/backfill'
-SelectType: 'select/cons_res'
+SelectType: 'select/cons_tres'
SelectTypeParameters: 'CR_Core'
EnforcePartLimits: 'ALL'
SchedulerParameters: