Commit f3ff220

allow using native-format overrides for slurm.conf
1 parent 31017d2 commit f3ff220

8 files changed

+150
-397
lines changed

README.md

Lines changed: 61 additions & 89 deletions
Original file line number | Diff line number | Diff line change
@@ -22,7 +22,7 @@ each list element:
2222
* `gpgcheck`: Optional
2323
* `gpgkey`: Optional
2424

25-
`openhpc_slurm_service_enabled`: boolean, whether to enable the appropriate slurm service (slurmd/slurmctld).
25+
`openhpc_slurm_service_enabled`: Optional boolean. Whether to enable the appropriate slurm service (slurmd/slurmctld).
2626

2727
`openhpc_slurm_service_started`: Optional boolean. Whether to start slurm services. If set to false, all services will be stopped. Defaults to `openhpc_slurm_service_enabled`.
2828

@@ -33,23 +33,25 @@ each list element:
3333
`openhpc_packages`: additional OpenHPC packages to install.
3434

3535
`openhpc_enable`:
36-
* `control`: whether to enable control host
37-
* `database`: whether to enable slurmdbd
38-
* `batch`: whether to enable compute nodes
36+
* `control`: whether host should run slurmctld
37+
* `database`: whether host should run slurmdbd
38+
* `batch`: whether host should run slurmd
3939
* `runtime`: whether to enable OpenHPC runtime
4040

4141
`openhpc_slurmdbd_host`: Optional. Where to deploy slurmdbd if you are using this role to deploy slurmdbd, otherwise where an existing slurmdbd is running. This should be the name of a host in your inventory. Set this to `none` to prevent the role from managing slurmdbd. Defaults to `openhpc_slurm_control_host`.
4242

43-
`openhpc_slurm_configless`: Optional, default false. If true then slurm's ["configless" mode](https://slurm.schedmd.com/configless_slurm.html) is used.
43+
Note slurm's ["configless" mode](https://slurm.schedmd.com/configless_slurm.html) is always used.
4444

45-
`openhpc_munge_key`: Optional. Define a munge key to use. If not provided then one is generated but the `openhpc_slurm_control_host` must be in the play.
45+
`openhpc_munge_key`: Required. Define a munge key to use.
4646

47-
`openhpc_login_only_nodes`: Optional. If using "configless" mode specify the name of an ansible group containing nodes which are login-only nodes (i.e. not also control nodes), if required. These nodes will run `slurmd` to contact the control node for config.
47+
`openhpc_login_only_nodes`: Optional. The name of an ansible inventory group containing nodes which are login nodes (i.e. not also control nodes). These nodes must have `openhpc_enable.batch: true` and will run `slurmd` to contact the control node for config.
4848

4949
`openhpc_module_system_install`: Optional, default true. Whether or not to install an environment module system. If true, lmod will be installed. If false, you can either supply your own module system or go without one.
5050

5151
### slurm.conf
5252

53+
`openhpc_cluster_name`: Required, name of the cluster.
54+
5355
`openhpc_slurm_partitions`: Optional. List of one or more slurm partitions, default `[]`. Each partition may contain the following values:
5456
* `groups`: If there are multiple node groups that make up the partition, a list of group objects can be defined here.
5557
Otherwise, `groups` can be omitted and the following attributes can be defined in the partition object:
@@ -64,7 +66,7 @@ each list element:
6466

6567
Note [GresTypes](https://slurm.schedmd.com/slurm.conf.html#OPT_GresTypes) must be set in `openhpc_config` if this is used.
6668

67-
* `default`: Optional. A boolean flag for whether this partion is the default. Valid settings are `YES` and `NO`.
69+
* `default`: Optional. Whether this partition is the default. Valid settings are `YES` and `NO`.
6870
* `maxtime`: Optional. A partition-specific time limit following the format of [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime`. The default value is
6971
given by `openhpc_job_maxtime`. The value should be quoted to avoid Ansible conversions.
7072
* `partition_params`: Optional. Mapping of additional parameters and values for [partition configuration](https://slurm.schedmd.com/slurm.conf.html#SECTION_PARTITION-CONFIGURATION).
@@ -74,52 +76,29 @@ For each group (if used) or partition any nodes in an ansible inventory group `<
7476
- Nodes in a group are assumed to be homogeneous in terms of processor and memory.
7577
- An inventory group may be empty or missing, but if it is not then the play must contain at least one node from it (used to set processor information).
7678

77-
7879
`openhpc_job_maxtime`: Maximum job time limit, default `'60-0'` (60 days). See [slurm.conf](https://slurm.schedmd.com/slurm.conf.html) parameter `MaxTime` for format. The value should be quoted to avoid Ansible conversions.
7980

80-
`openhpc_cluster_name`: name of the cluster.
81+
`openhpc_ram_multiplier`: Optional, default `0.95`. Multiplier used in the calculation: `total_memory * openhpc_ram_multiplier` when setting `RealMemory` for the partition in slurm.conf. Can be overridden on a per-partition basis using `openhpc_slurm_partitions.ram_multiplier`. Has no effect if `openhpc_slurm_partitions.ram_mb` is set.
8182

82-
`openhpc_config`: Optional. Mapping of additional parameters and values for `slurm.conf`. Note these will override any included in `templates/slurm.conf.j2`.
83+
`openhpc_slurm_conf_default`: Optional. Multiline string giving default key=value parameters for `slurm.conf`. This may include jinja templating. See [defaults/main.yml](defaults/main.yml) for details. Values are only included here if either a) this role sets them to non-default values or b) they are parameterised from other role variables. Note any values here may be overridden using `openhpc_slurm_conf_overrides`.
8384

84-
`openhpc_ram_multiplier`: Optional, default `0.95`. Multiplier used in the calculation: `total_memory * openhpc_ram_multiplier` when setting `RealMemory` for the partition in slurm.conf. Can be overriden on a per partition basis using `openhpc_slurm_partitions.ram_multiplier`. Has no effect if `openhpc_slurm_partitions.ram_mb` is set.
85+
`openhpc_slurm_conf_overrides`: Optional. Multiline string giving key=value parameters for `slurm.conf` to override those from `openhpc_slurm_conf_default`. This may include jinja templating. Note keys must be unique so this cannot be used to add e.g. additional `NodeName=...` entries. TODO: Fix this via an additional var.
86+
87+
`openhpc_slurm_conf_template`: Optional. Name/path of template for `slurm.conf`. The default template uses the relevant role variables and this should not usually need changing.
8588

8689
`openhpc_state_save_location`: Optional. Absolute path for Slurm controller state (`slurm.conf` parameter [StateSaveLocation](https://slurm.schedmd.com/slurm.conf.html#OPT_StateSaveLocation))
8790

8891
#### Accounting
8992

90-
By default, no accounting storage is configured. OpenHPC v1.x and un-updated OpenHPC v2.0 clusters support file-based accounting storage which can be selected by setting the role variable `openhpc_slurm_accounting_storage_type` to `accounting_storage/filetxt`<sup id="accounting_storage">[1](#slurm_ver_footnote)</sup>. Accounting for OpenHPC v2.1 and updated OpenHPC v2.0 clusters requires the Slurm database daemon, `slurmdbd` (although job completion may be a limited alternative, see [below](#Job-accounting). To enable accounting:
93+
By default, no accounting storage is configured. To enable accounting:
9194

9295
* Configure a mariadb or mysql server as described in the slurm accounting [documentation](https://slurm.schedmd.com/accounting.html) on one of the nodes in your inventory and set `openhpc_enable.database` to `true` for this node.
93-
* Set `openhpc_slurm_accounting_storage_type` to `accounting_storage/slurmdbd`.
94-
* Configure the variables for `slurmdbd.conf` below.
95-
96-
The role will take care of configuring the following variables for you:
97-
98-
`openhpc_slurm_accounting_storage_host`: Where the accounting storage service is running i.e where slurmdbd running.
99-
100-
`openhpc_slurm_accounting_storage_port`: Which port to use to connect to the accounting storage.
101-
102-
`openhpc_slurm_accounting_storage_user`: Username for authenticating with the accounting storage.
103-
104-
`openhpc_slurm_accounting_storage_pass`: Mungekey or database password to use for authenticating.
105-
106-
For more advanced customisation or to configure another storage type, you might want to modify these values manually.
107-
108-
#### Job accounting
109-
110-
This is largely redundant if you are using the accounting plugin above, but will give you basic
111-
accounting data such as start and end times. By default no job accounting is configured.
112-
113-
`openhpc_slurm_job_comp_type`: Logging mechanism for job accounting. Can be one of
114-
`jobcomp/filetxt`, `jobcomp/none`, `jobcomp/elasticsearch`.
96+
* Set
11597

116-
`openhpc_slurm_job_acct_gather_type`: Mechanism for collecting job accounting data. Can be one
117-
of `jobacct_gather/linux`, `jobacct_gather/cgroup` and `jobacct_gather/none`.
118-
119-
`openhpc_slurm_job_acct_gather_frequency`: Sampling period for job accounting (seconds).
120-
121-
`openhpc_slurm_job_comp_loc`: Location to store the job accounting records. Depends on value of
122-
`openhpc_slurm_job_comp_type`, e.g for `jobcomp/filetxt` represents a path on disk.
98+
        openhpc_slurm_conf_overrides: |
99+
          AccountingStorageType=accounting_storage/slurmdbd
100+
101+
* Configure the variables listed in the `slurmdbd.conf` section below.
123102

124103
### slurmdbd.conf
125104

@@ -136,50 +115,43 @@ You will need to configure these variables if you have set `openhpc_enable.datab
136115

137116
`openhpc_slurmdbd_mysql_username`: Username for authenticating with the database, defaults to `slurm`.
138117

139-
## Example Inventory
140-
141-
And an Ansible inventory as this:
142-
143-
[openhpc_login]
144-
openhpc-login-0 ansible_host=10.60.253.40 ansible_user=centos
145-
146-
[openhpc_compute]
147-
openhpc-compute-0 ansible_host=10.60.253.31 ansible_user=centos
148-
openhpc-compute-1 ansible_host=10.60.253.32 ansible_user=centos
149-
150-
[cluster_login:children]
151-
openhpc_login
152-
153-
[cluster_control:children]
154-
openhpc_login
155-
156-
[cluster_batch:children]
157-
openhpc_compute
158-
159-
## Example Playbooks
160-
161-
To deploy, create a playbook which looks like this:
162-
163-
---
164-
- hosts:
165-
- cluster_login
166-
- cluster_control
167-
- cluster_batch
168-
become: yes
169-
roles:
170-
- role: openhpc
171-
openhpc_enable:
172-
control: "{{ inventory_hostname in groups['cluster_control'] }}"
173-
batch: "{{ inventory_hostname in groups['cluster_batch'] }}"
174-
runtime: true
175-
openhpc_slurm_service_enabled: true
176-
openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
177-
openhpc_slurm_partitions:
178-
- name: "compute"
179-
openhpc_cluster_name: openhpc
180-
openhpc_packages: []
181-
...
182-
183-
---
184-
185-
<b id="slurm_ver_footnote">1</b> Slurm 20.11 removed `accounting_storage/filetxt` as an option. This version of Slurm was introduced in OpenHPC v2.1 but the OpenHPC repos are common to all OpenHPC v2.x releases. [](#accounting_storage)
118+
## Example
119+
120+
With this Ansible inventory:
121+
122+
```ini
123+
[cluster_control]
124+
control-0
125+
126+
[cluster_login]
127+
login-0
128+
129+
[cluster_compute]
130+
compute-0
131+
compute-1
132+
```
133+
134+
The following playbook deploys control, login and compute nodes with a customised `slurm.conf` adding debug logging.
135+
136+
```yaml
137+
- hosts:
138+
  - cluster_login
139+
  - cluster_control
140+
  - cluster_compute
141+
  become: yes
142+
  vars:
143+
    openhpc_enable:
144+
      control: "{{ inventory_hostname in groups['cluster_control'] }}"
145+
      batch: "{{ inventory_hostname in groups['cluster_compute'] + groups['cluster_login'] }}"
146+
      runtime: true
147+
    openhpc_slurm_control_host: "{{ groups['cluster_control'] | first }}"
148+
    openhpc_slurm_partitions:
149+
      - name: "compute"
150+
    openhpc_cluster_name: openhpc
151+
    openhpc_slurm_conf_overrides: |
152+
      SlurmctldDebug=debug
153+
      SlurmdDebug=debug
154+
  tasks:
155+
    - import_role:
156+
        name: openhpc
157+
```
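
The README text above says that `openhpc_slurm_conf_overrides` overrides values from `openhpc_slurm_conf_default` and that keys must be unique. The template change itself is not shown in this diff, so the following Python sketch only illustrates one plausible reading: both strings are parsed into dicts with the same one-liner as the `from_slurm_conf` filter added in this commit, and the overrides are applied on top. `merge_slurm_conf` is a hypothetical helper, not part of the role.

```python
def from_slurm_conf(s):
    """Parse key=value lines into a dict (same one-liner as the filter in this commit)."""
    return dict(line.split('=', 1) for line in s.splitlines())


def merge_slurm_conf(default, overrides):
    """Hypothetical helper: overrides win over defaults, key by key."""
    merged = from_slurm_conf(default)
    merged.update(from_slurm_conf(overrides))
    return merged


# Illustrative values only; the role's real defaults live in defaults/main.yml.
default = "SlurmctldTimeout=300\nReturnToService=2\n"
overrides = "SlurmctldDebug=debug\nSlurmctldTimeout=600\n"

merged = merge_slurm_conf(default, overrides)
rendered = "\n".join(f"{k}={v}" for k, v in merged.items())

# Repeated keys collapse to the last occurrence, which is why extra
# NodeName=... lines cannot be added through a dict-based merge.
duplicates = from_slurm_conf("NodeName=a CPUs=4\nNodeName=b CPUs=8\n")
```

Under these assumptions `merged["SlurmctldTimeout"]` is taken from the overrides, and `duplicates` holds a single `NodeName` entry, which matches the uniqueness caveat in the README.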

defaults/main.yml

Lines changed: 39 additions & 18 deletions
@@ -8,32 +8,17 @@ openhpc_slurm_partitions: []
88
openhpc_cluster_name:
99
openhpc_packages:
1010
- slurm-libpmi-ohpc
11-
openhpc_resume_timeout: 300
12-
openhpc_retry_delay: 10
1311
openhpc_job_maxtime: '60-0' # quote this to avoid ansible converting some formats to seconds, which is interpreted as minutes by Slurm
14-
openhpc_config: "{{ openhpc_extra_config | default({}) }}"
1512
openhpc_gres_template: gres.conf.j2
16-
openhpc_slurm_configless: "{{ 'enable_configless' in openhpc_config.get('SlurmctldParameters', []) }}"
17-
1813
openhpc_state_save_location: /var/spool/slurm
1914

2015
# Accounting
21-
openhpc_slurm_accounting_storage_host: "{{ openhpc_slurmdbd_host }}"
22-
openhpc_slurm_accounting_storage_port: 6819
2316
openhpc_slurm_accounting_storage_type: accounting_storage/none
24-
# NOTE: You only need to set these if using accounting_storage/mysql
25-
openhpc_slurm_accounting_storage_user: slurm
26-
#openhpc_slurm_accounting_storage_pass:
27-
28-
# Job accounting
29-
openhpc_slurm_job_acct_gather_type: jobacct_gather/linux
30-
openhpc_slurm_job_acct_gather_frequency: 30
31-
openhpc_slurm_job_comp_type: jobcomp/none
32-
openhpc_slurm_job_comp_loc: /var/log/slurm_jobacct.log
3317

3418
# slurmdbd configuration
19+
# todo: check we can override openhpc_slurm_accounting_storage_port
3520
openhpc_slurmdbd_host: "{{ openhpc_slurm_control_host }}"
36-
openhpc_slurmdbd_port: "{{ openhpc_slurm_accounting_storage_port }}"
21+
openhpc_slurmdbd_port: 6819
3722
openhpc_slurmdbd_mysql_host: "{{ openhpc_slurm_control_host }}"
3823
openhpc_slurmdbd_mysql_database: slurm_acct_db
3924
#openhpc_slurmdbd_mysql_password:
@@ -95,9 +80,45 @@ ohpc_default_extra_repos:
9580
# Concatenate all repo definitions here
9681
ohpc_repos: "{{ ohpc_openhpc_repos[ansible_distribution_major_version] + ohpc_default_extra_repos[ansible_distribution_major_version] + openhpc_extra_repos }}"
9782

98-
openhpc_munge_key:
83+
#openhpc_munge_key:
9984
openhpc_login_only_nodes: ''
10085
openhpc_module_system_install: true
10186

10287
# Auto detection
10388
openhpc_ram_multiplier: 0.95
89+
90+
ohpc_default_directories:
91+
  - path: "{{ openhpc_state_save_location }}"
92+
    owner: slurm
93+
    group: slurm
94+
    mode: '0755'
95+
    state: directory
96+
openhpc_extra_directories: []
97+
openhpc_directories: "{{ ohpc_default_directories + openhpc_extra_directories }}"
98+
99+
openhpc_slurm_conf_template: slurm.conf.j2
100+
101+
# only include non-default (as constant) or templated values (b/c another part of the role needs it)
102+
openhpc_slurm_conf_default: |
103+
  ClusterName={{ openhpc_cluster_name }}
104+
  SlurmctldHost={{ openhpc_slurm_control_host }}{% if openhpc_slurm_control_host_address is defined %}({{ openhpc_slurm_control_host_address }}){% endif %}
105+
  SlurmUser=slurm
106+
  StateSaveLocation={{ openhpc_state_save_location }}
107+
  SlurmctldTimeout=300
108+
  SelectTypeParameters=CR_Core
109+
  PriorityWeightPartition=1000
110+
  PreemptType=preempt/partition_prio
111+
  PreemptMode=SUSPEND,GANG
112+
  AccountingStorageHost={{ openhpc_slurmdbd_host }}
113+
  AccountingStoragePort={{ openhpc_slurmdbd_port }}
114+
  AccountingStorageUser=slurm
115+
  JobCompType=jobcomp/none
116+
  JobAcctGatherType=jobacct_gather/cgroup # UPDATED!
117+
  SlurmctldSyslogDebug=info
118+
  SlurmdSyslogDebug=info
119+
  SlurmctldParameters=enable_configless # WARNING this must be included
120+
  ReturnToService=2
121+
  PropagateResourceLimitsExcept=MEMLOCK
122+
  Epilog=/etc/slurm/slurm.epilog.clean
123+
124+
openhpc_slurm_conf_overrides: ''

filter_plugins/slurm_conf.py

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -91,11 +91,16 @@ def dict2parameters(d):
9191
    parts = ['%s=%s' % (k, v) for k, v in d.items()]
9292
    return ' '.join(parts)
9393

94+
def from_slurm_conf(s):
95+
    """ Convert a slurm.conf format string into a dict """
96+
    return dict(line.split('=', 1) for line in s.splitlines())
97+
9498
class FilterModule(object):
9599

96100
    def filters(self):
97101
        return {
98102
            'hostlist_expression': hostlist_expression,
99103
            'error': error,
100104
            'dict2parameters': dict2parameters,
105+
            'from_slurm_conf': from_slurm_conf,
101106
        }
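
As written, `from_slurm_conf` raises `ValueError` on a blank line (the split yields a single element) and keeps any trailing `# ...` comment as part of the value, as in the `JobAcctGatherType` default in this commit. The sketch below is a more tolerant variant, not the code in this commit; the name `from_slurm_conf_tolerant` is hypothetical, and it assumes everything after `#` on a line is a comment, so a value that legitimately contains `#` is not handled.

```python
def from_slurm_conf_tolerant(s):
    """Convert a slurm.conf-format string into a dict.

    Sketch only: skips blank and comment-only lines and drops trailing
    '# ...' comments. Assumes '#' never appears inside a value.
    """
    result = {}
    for raw in s.splitlines():
        line = raw.split('#', 1)[0].strip()  # drop comment, trim whitespace
        if not line:
            continue  # blank or comment-only line
        if '=' not in line:
            raise ValueError('expected key=value, got: %r' % raw)
        key, value = line.split('=', 1)
        result[key.strip()] = value.strip()
    return result


text = (
    "ClusterName=openhpc\n"
    "\n"
    "# accounting\n"
    "JobAcctGatherType=jobacct_gather/cgroup # UPDATED!\n"
)
parsed = from_slurm_conf_tolerant(text)
```

Whether comments should be dropped or preserved is a design choice for the role. Dropping them keeps values clean for comparisons elsewhere in the role, at the cost of losing annotations in the rendered file.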

handlers/main.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -64,5 +64,5 @@
6464
delay: 30
6565
when:
6666
- openhpc_slurm_service_started | bool
67-
- openhpc_enable.batch | default(false) | bool
67+
- openhpc_enable.batch | default(false)
6868
# 2nd condition required as notification happens on controller, which isn't necessarily a compute node

tasks/install.yml

Lines changed: 0 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,5 @@
11
---
22

3-
- include_tasks: pre.yml
4-
53
- name: Ensure OpenHPC repos
64
  ansible.builtin.yum_repository:
75
    name: "{{ item.name }}"

tasks/pre.yml

Lines changed: 0 additions & 6 deletions
This file was deleted.
