Use RL9 for caas environment #380

Merged
sjpb merged 5 commits into main from rl9-caas
Mar 22, 2024

sjpb
Collaborator

@sjpb sjpb commented Mar 20, 2024

  • Uses Rocky Linux 9, and hence OpenHPC v3, for the caas environment.
  • Makes the persist_hostkeys role available to enable in any environment.
  • Fixes a bug in the caas environment where the Open OnDemand shell aborted after a patch which reimaged the nodes, because the host keys changed.

Notes:

  • Slurm version is unchanged at 22.05.11.
  • Some of the default OpenHPC packages available via Lmod in caas have changed version between OpenHPC v2 and v3:
    # RL8                       | # RL9
    gnu12/12.3.0                | gnu12/12.2.0
    hwloc/2.7.2                 | hwloc/2.9.0
    libfabric/1.19.0            | libfabric/1.18.0
    openmpi4/4.1.6              | openmpi4/4.1.5
                                > pmix/4.2.6
    ucx/1.15.0                  | ucx/1.14.0
    
    
  • As OpenHPC v3 provides pmix and builds openmpi against it, the srun launcher can now be used again (early OpenHPC v2.x could use it with pmi2; see also "Is PMIX broken in OpenHPC" #190):
    module load gnu12 openmpi4 imb # note pmix does not need to be loaded
    srun --mpi=pmix IMB-MPI1 pingpong
    
    At this time hpctests has not been modified to make use of this.
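
A minimal sketch of how a version comparison like the one in the notes above could be produced from two saved `module --terse spider` listings. The function name `compare_modules` and the idea of saving the listings to files are assumptions for illustration, not part of this PR:

```shell
# Hypothetical helper: report modules whose version differs between two
# listings with one "name/version" entry per line (old listing first).
compare_modules() {
    awk -F/ '
        NR == FNR { old[$1] = $2; next }          # first file: remember old versions
        { seen[$1] = 1 }                          # second file: note every module name
        !($1 in old) { printf "%s: added %s\n", $1, $2; next }
        (old[$1] "") != ($2 "") {                 # force a string (not numeric) comparison
            printf "%s: %s -> %s\n", $1, old[$1], $2
        }
        END { for (m in old) if (!(m in seen)) printf "%s: removed\n", m }
    ' "$1" "$2"
}
```

For example, `compare_modules rl8-modules.txt rl9-modules.txt`, where both file names are placeholders for listings captured on each cluster.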

@sjpb
Collaborator Author

sjpb commented Mar 20, 2024

Checks in Azimuth @ 05c29ce, non-manila cluster using image openhpc-RL9-240313-1057-15f9ab38

  • hpctests: OK

  • syslogs: OK

    [root@rl9-v4-control-0 rocky]# grep -rF "cgroupv2 manager" /var/log/messages 
    [root@rl9-v4-control-0 rocky]# 
    
  • OOD shell: OK

  • Monitoring: OK

  • OOD desktop: OK

  • OOD jupyter: OK
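
The syslog check above passes when grep prints nothing. A minimal sketch of scripting that check so it gives an explicit result; the function name `assert_not_logged` is hypothetical and not part of this repository:

```shell
# Hypothetical helper: succeed only if a fixed string is absent from a log file.
assert_not_logged() {
    pattern=$1
    logfile=$2
    # An unreadable file must not be reported as a clean log.
    if [ ! -r "$logfile" ]; then
        echo "ERROR: cannot read $logfile" >&2
        return 2
    fi
    if grep -qF -- "$pattern" "$logfile"; then
        echo "FAIL: '$pattern' found in $logfile" >&2
        return 1
    fi
    echo "OK: '$pattern' not found in $logfile"
}
```

Usage would be along the lines of `assert_not_logged "cgroupv2 manager" /var/log/messages`, run as root as in the check above.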

@sjpb sjpb marked this pull request as ready for review March 20, 2024 16:14
@sjpb sjpb requested a review from a team as a code owner March 20, 2024 16:14
m-bull
m-bull previously approved these changes Mar 20, 2024
Collaborator

@m-bull m-bull left a comment

LGTM

@sjpb
Collaborator Author

sjpb commented Mar 21, 2024

Tested that an upgrade from RL8 to RL9 worked fine:

  1. At 4ec5332 created RL8 cluster in Azimuth with manila project/home and hpctests ON:
# is RL8:
[azimuth@slurm-v7-login-0 ~]$ cat /etc/redhat-release 
Rocky Linux release 8.9 (Green Obsidian)
# is OHPCv2:
[azimuth@slurm-v7-login-0 ~]$ grep baseurl /etc/yum.repos.d/OpenHPC.repo 
baseurl = http://repos.openhpc.community/OpenHPC/2/CentOS_8
baseurl = http://repos.openhpc.community/OpenHPC/2/updates/CentOS_8
# uses manila:
[azimuth@slurm-v7-login-0 ~]$ findmnt -t ceph -o TARGET,FSTYPE
TARGET   FSTYPE
/home    ceph
/project ceph
# show ohpc modules, ignoring unspecific
[azimuth@slurm-v7-login-0 ~]$ module --terse spider | grep -v '/$'
boost/1.81.0
dimemas/5.4.2
extrae/3.8.3
gnu12/12.3.0
hwloc/2.7.2
imb/2021.3
libfabric/1.19.0
likwid/5.2.2
omb/6.1
openblas/0.3.21
openmpi4/4.1.6
os
papi/6.0.0
pdtoolkit/3.25.1
prun/2.2
scalasca/2.5
scorep/7.1
sionlib/1.7.7
tau/2.31.1
ucx/1.15.0
[azimuth@slurm-v7-login-0 ~]$ module load gnu12 openmpi4
[azimuth@slurm-v7-login-0 ~]$ gcc --version
gcc (GCC) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[azimuth@slurm-v7-login-0 ~]$ mpirun --version
mpirun (Open MPI) 4.1.6
...
  2. Patched it to RL9. Hit the OOD sshkeys problem. Patched to 963f641, which solved that problem. Checks:
# is RL9:
[azimuth@slurm-v7-login-0 ~]$ cat /etc/redhat-release 
Rocky Linux release 9.3 (Blue Onyx)
[azimuth@slurm-v7-login-0 ~]$ srun -N2 cat /etc/redhat-release
Rocky Linux release 9.3 (Blue Onyx)
Rocky Linux release 9.3 (Blue Onyx)

# is OHPC v3:
[azimuth@slurm-v7-login-0 ~]$ grep baseurl /etc/yum.repos.d/OpenHPC.repo 
baseurl = http://repos.openhpc.community/OpenHPC/3/EL_9
baseurl = http://repos.openhpc.community/OpenHPC/3/updates/EL_9

# uses ceph:
[azimuth@slurm-v7-login-0 ~]$ findmnt -t ceph -o TARGET,FSTYPE
TARGET   FSTYPE
/home    ceph
/project ceph

# check modules
[azimuth@slurm-v7-login-0 ~]$ module --terse spider | grep -v '/$'
boost/1.81.0
dimemas/5.4.2
extrae/3.8.3
gnu12/12.2.0
hwloc/2.9.0
imb/2021.3
libfabric/1.18.0
likwid/5.2.2
omb/6.1
openblas/0.3.21
openmpi4/4.1.5
os
papi/6.0.0
pdtoolkit/3.25.1
pmix/4.2.6
prun/2.2
scalasca/2.5
scorep/7.1
sionlib/1.7.7
tau/2.31.1
ucx/1.14.0

[azimuth@slurm-v7-login-0 ~]$ module load gnu12 openmpi4
[azimuth@slurm-v7-login-0 ~]$ gcc --version
gcc (GCC) 12.2.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
[azimuth@slurm-v7-login-0 ~]$ mpirun --version
mpirun (Open MPI) 4.1.5
...
[azimuth@slurm-v7-login-0 ~]$ slurmctld -V
slurm 22.05.11
[azimuth@slurm-v7-login-0 ~]$ slurmd -V
slurm 22.05.11
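
The OpenHPC version check above reads the major version off the `baseurl` lines by eye. A minimal sketch of extracting it in a script, assuming the repo file uses `baseurl = .../OpenHPC/<major>/...` lines as shown in the transcripts; the function name `ohpc_major_version` is hypothetical:

```shell
# Hypothetical helper: print the distinct OpenHPC major version(s) referenced
# by the baseurl lines of a yum repo file.
ohpc_major_version() {
    # Capture the digits immediately after "/OpenHPC/" on each baseurl line,
    # then collapse duplicates so a consistent repo file yields one line.
    sed -n 's|^baseurl *= *.*/OpenHPC/\([0-9][0-9]*\)/.*|\1|p' "$1" | sort -u
}
```

Run as `ohpc_major_version /etc/yum.repos.d/OpenHPC.repo`; more than one line of output would suggest the repo file mixes major versions.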

Also checked that the /home/hpctests/pingpong directory (including xhpl binary) from an RL8 cluster worked when copied onto the RL9 cluster:

  • ldd showed binary linked OK
  • ran without errors
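
A minimal sketch of how the ldd check above could be automated, relying on `ldd` printing `=> not found` for libraries it cannot resolve. The function name `missing_libs` is hypothetical:

```shell
# Hypothetical helper: read `ldd` output on stdin and print the name of each
# shared library that could not be resolved. Empty output means all resolved.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}
```

Usage would be `ldd /path/to/binary | missing_libs`, with the binary path being whichever file was copied from the RL8 cluster.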

@sjpb
Collaborator Author

sjpb commented Mar 21, 2024

Checked that after the upgrade from RL8 to RL9, previously run jobs (and new jobs) are shown in the dashboard.
Checked that OOD desktop, shell and jupyter work.

@sjpb sjpb merged commit 67e1972 into main Mar 22, 2024
@sjpb sjpb deleted the rl9-caas branch March 22, 2024 15:54