
Commit 61742ac

add docs re. parallelism
1 parent ebe9af4 commit 61742ac

3 files changed (+45, −17 lines)


ansible/roles/rebuild/README.md

Lines changed: 3 additions & 3 deletions
@@ -34,9 +34,9 @@ running the `ansible/adhoc/rebuild-via-slurm.yml` playbook:
   send to `/dev/null` by default, as the root user running this has no shared
   directory for job output.

-- `rebuild_job_reboot`: Bool, whether to add the `--reboot` flag to the job
-  to actually trigger a rebuild. Useful for e.g. testing priorities. Default
-  `true`.
+- `rebuild_job_reboot`: Optional. A bool controlling whether to add the
+  `--reboot` flag to the job to actually trigger a rebuild. Useful for e.g.
+  testing partition configurations. Default `true`.

 - `rebuild_job_options`: Optional. A string giving any other options to pass to
   [sbatch](https://slurm.schedmd.com/sbatch.html). Default is empty string.
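
As an illustration of how these two role variables might be used together, a site could override the defaults in an inventory vars file. The sketch below is hypothetical: the file path and the particular `sbatch` option are illustrative, and only the variable names `rebuild_job_reboot` and `rebuild_job_options` come from the README changed above.

```yaml
# Illustrative path: environments/site/inventory/group_vars/all/rebuild.yml
# Submit rebuild jobs without the `--reboot` flag, e.g. while testing the
# rebuild partition configuration, and pass an extra option through to sbatch.
rebuild_job_reboot: false
rebuild_job_options: "--exclude=general-0"
```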

docs/experimental/slurm-controlled-rebuild.md

Lines changed: 23 additions & 14 deletions
@@ -42,9 +42,6 @@ In summary, the way this functionality works is as follows:
   registers the node as having finished rebooting. It then launches the actual
   job, which does not do anything.

-  # TODO: check that this is the LAST thing we do?
-
-

 TODO: note terraform parallel limits

@@ -86,7 +83,13 @@ The configuration of this is complex and involves:
     compute = {
       general = {
         nodes = ["general-0", "general-1"]
-        ignore_image_changes: true
+        ignore_image_changes = true
+        ...
+      }
+      gpu = {
+        nodes = ["a100-0", "a100-1"]
+        ignore_image_changes = true
+        ...
       }
     }
     ...
@@ -118,14 +121,16 @@ The configuration of this is complex and involves:
   However production sites will probably be overriding this file anyway to
   customise it.

-  An example partition definition is:
+  An example partition definition, given the two node groups "general" and
+  "gpu" shown in Step 2, is:

   ```yaml
   openhpc_slurm_partitions:
     ...
     - name: rebuild
       groups:
         - name: general
+        - name: gpu
       default: NO
       maxtime: 30
       partition_params:
@@ -138,15 +143,16 @@ The configuration of this is complex and involves:
   ```

   Which has parameters as follows:
-  TODO: update me!
   - `name`: Partition name matching `rebuild` role variable `rebuild_partitions`,
     default `rebuild`.
-  - `groups`: A list of node group names, matching keys in the OpenTofu `compute`
-    variable (see example configuration above). See discussion below.
+  - `groups`: A list of node group names, matching keys in the OpenTofu
+    `compute` variable (see example in step 2 above). Normally every compute
+    node group should be listed here, unless Slurm-controlled rebuild is not
+    required for certain node groups.
   - `default`: Must be set to `NO` so that it is not the default partition.
   - `maxtime`: Maximum time to allow for rebuild jobs, in
     [slurm.conf format](https://slurm.schedmd.com/slurm.conf.html#OPT_MaxTime).
-    The example here is 30 minutes, but see discussion below
+    The example here is 30 minutes, but see discussion below.
   - `partition_params`: A mapping of additional parameters, which must be set
     as follows:
@@ -166,9 +172,12 @@ The configuration of this is complex and involves:
     entire node. This means they do not run on nodes at the same time as
     user jobs running in partitions allowing non-exclusive use.

-  Note that this partition overlaps with "normal" partitions. If it is
-  desirable to roll out changes more gradually, it is possible to create
-  multiple "rebuild" partitions, but it is necessary that:
+  The value for `maxtime` needs to be sufficient not just for a single node
+  to be rebuilt, but also to allow for any batching in either OpenTofu or
+  in Nova - see remarks in the [production docs](../production.md).
+
+  If it is desirable to roll out changes more gradually, it is possible to
+  create multiple "rebuild" partitions, but it is necessary that:
   - The rebuild partitions should not themselves overlap, else nodes may be
     rebuilt more than once.
   - Each rebuild partition should entirely cover one or more "normal"
@@ -179,8 +188,8 @@ The configuration of this is complex and involves:
   - Add the `control` node into the `rebuild` group.
   - Ensure an application credential to use for rebuilding nodes is available
     on the deploy host (default location `~/.config/openstack/clouds.yaml`).
-    If not using that location override `rebuild_clouds`.
-  - **TODO:** CONFIGURE rebuild job defaults!
+  - If required, override `rebuild_clouds_path` or other variables in the site
+    environment.

 7. Run `tofu apply` as usual to apply the new OpenTofu configuration.
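
To illustrate the "multiple rebuild partitions" approach discussed in the hunk above, a hypothetical layout for the "general" and "gpu" node groups could look like the sketch below. The partition names are illustrative, and `partition_params` is elided here because only part of its required content appears in this diff; it should be set as the documentation describes.

```yaml
openhpc_slurm_partitions:
  ...
  # Two non-overlapping rebuild partitions, so the two node groups can be
  # rebuilt independently; each covers exactly one "normal" partition.
  - name: rebuild-general
    groups:
      - name: general
    default: NO
    maxtime: 30
    partition_params:
      ...
  - name: rebuild-gpu
    groups:
      - name: gpu
    default: NO
    maxtime: 30
    partition_params:
      ...
```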

docs/production.md

Lines changed: 19 additions & 0 deletions
@@ -130,3 +130,22 @@ and referenced from the `site` and `production` environments, e.g.:

 - See the [hpctests docs](../ansible/roles/hpctests/README.md) for advice on
   raising `hpctests_hpl_mem_frac` during tests.
+
+- By default, OpenTofu (and Terraform) [limits](https://opentofu.org/docs/cli/commands/apply/#apply-options)
+  the number of concurrent operations to 10. This means that, for example, only
+  10 ports or 10 instances can be deployed at once. This should be raised by
+  modifying `environments/$ENV/activate` to add a line like:
+
+      export TF_CLI_ARGS_apply="-parallelism=25"
+
+  The value chosen should be the highest value demonstrated to work during
+  testing. Note that any time spent blocked due to this parallelism limit does
+  not count against the (un-overridable) internal OpenTofu timeout of 30 minutes.
+
+- By default, OpenStack Nova also [limits](https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_concurrent_builds)
+  the number of concurrent instance builds to 10. This is per Nova controller,
+  so effectively 10 virtual machine builds per hypervisor. For baremetal nodes
+  the limit is 10 per cloud for OpenStack versions earlier than Caracal; from
+  Caracal onwards it can be raised using [shards](https://specs.openstack.org/openstack/nova-specs/specs/2024.1/implemented/ironic-shards.html).
+  In general it should be possible to raise this value to 50-100 if the cloud
+  is properly tuned, again demonstrated through testing.
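
Where the cloud itself is under the deployer's control, the Nova limit referred to above is the `max_concurrent_builds` option linked in the text. A minimal sketch of raising it is shown below; the value is illustrative, and changing Nova configuration is outside the scope of this appliance.

```ini
# nova.conf on the relevant Nova hosts (illustrative value)
[DEFAULT]
# Raise the limit on concurrent instance builds (default 10).
max_concurrent_builds = 50
```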
