@@ -42,9 +42,6 @@ In summary, the way this functionality works is as follows:
 registers the node as having finished rebooting. It then launches the actual
 job, which does not do anything.
 
-# TODO: check that this is the LAST thing we do?
-
-
 
 TODO: note terraform parallel limits
 
@@ -86,7 +83,13 @@ The configuration of this is complex and involves:
 compute = {
   general = {
     nodes = ["general-0", "general-1"]
-    ignore_image_changes: true
+    ignore_image_changes = true
+    ...
+  }
+  gpu = {
+    nodes = ["a100-0", "a100-1"]
+    ignore_image_changes = true
+    ...
   }
 }
 ...
@@ -118,14 +121,16 @@ The configuration of this is complex and involves:
 However, production sites will probably be overriding this file anyway to
 customise it.
 
-An example partition definition is:
+An example partition definition, given the two node groups "general" and
+"gpu" shown in Step 2, is:
 
 ```yaml
 openhpc_slurm_partitions:
   ...
   - name: rebuild
     groups:
       - name: general
+      - name: gpu
     default: NO
     maxtime: 30
     partition_params:
@@ -138,15 +143,16 @@ The configuration of this is complex and involves:
 ```
 
 Which has parameters as follows:
-TODO: update me!
 - `name`: Partition name matching `rebuild` role variable `rebuild_partitions`,
   default `rebuild`.
-- `groups`: A list of node group names, matching keys in the OpenTofu `compute`
-  variable (see example configuration above). See discussion below.
+- `groups`: A list of node group names, matching keys in the OpenTofu
+  `compute` variable (see example in step 2 above). Normally every compute
+  node group should be listed here, unless Slurm-controlled rebuild is not
+  required for certain node groups.
 - `default`: Must be set to `NO` so that it is not the default partition.
 - `maxtime`: Maximum time to allow for rebuild jobs, in
   [slurm.conf format](https://slurm.schedmd.com/slurm.conf.html#OPT_MaxTime).
-  The example here is 30 minutes, but see discussion below
+  The example here is 30 minutes, but see discussion below.
 - `partition_params`: A mapping of additional parameters, which must be set
   as follows:
   - `PriorityJobFactor`: Ensures jobs in this partition (i.e. rebuild jobs)
@@ -166,9 +172,12 @@ The configuration of this is complex and involves:
     entire node. This means they do not run on nodes at the same time as
     user jobs running in partitions allowing non-exclusive use.
 
-Note that this partition overlaps with "normal" partitions. If it is
-desirable to roll out changes more gradually, it is possible to create
-multiple "rebuild" partitions, but it is necessary that:
+The value for `maxtime` needs to be sufficient not just for a single node
+to be rebuilt, but also to allow for any batching in either OpenTofu or
+in Nova - see remarks in the [production docs](../production.md).
+
+If it is desirable to roll out changes more gradually, it is possible to
+create multiple "rebuild" partitions, but it is necessary that:
 - The rebuild partitions should not themselves overlap, else nodes may be
   rebuilt more than once.
 - Each rebuild partition should entirely cover one or more "normal"
@@ -179,8 +188,8 @@ The configuration of this is complex and involves:
 - Add the `control` node into the `rebuild` group.
 - Ensure an application credential to use for rebuilding nodes is available
   on the deploy host (default location `~/.config/openstack/clouds.yaml`).
-  If not using that location override `rebuild_clouds`.
-- **TODO:** CONFIGURE rebuild job defaults!
+- If required, override `rebuild_clouds_path` or other variables in the site
+  environment.
 
 7. Run `tofu apply` as usual to apply the new OpenTofu configuration.
 
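A rough sketch of how the `maxtime` guidance in this change might be applied, assuming the four nodes from the `compute` example (`general-0`, `general-1`, `a100-0`, `a100-1`). The batch size of 2 and the 20 minutes per rebuild are hypothetical figures chosen for illustration only; real values depend on the OpenTofu and Nova limits discussed in the production docs. `partition_params` is left out here and still needs to be set as the documentation describes.

```yaml
openhpc_slurm_partitions:
  - name: rebuild
    groups:
      - name: general
      - name: gpu
    default: NO
    # Hypothetical sizing, not measured values:
    #   4 nodes, rebuilt 2 at a time -> 2 batches
    #   2 batches x ~20 minutes per rebuild = 40 minutes
    # Rounded up to leave some margin.
    maxtime: 60
    # partition_params omitted in this sketch; set it as described in the docs.
```

The point of the arithmetic is that the last batch must finish within `maxtime`, so the limit scales with the number of batches rather than with the time for one node.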