
Commit 61742ac

add docs re. parallelism
1 parent ebe9af4 commit 61742ac

3 files changed (+45, −17 lines)


ansible/roles/rebuild/README.md

Lines changed: 3 additions & 3 deletions
@@ -34,9 +34,9 @@ running the `ansible/adhoc/rebuild-via-slurm.yml` playbook:
   send to `/dev/null` by default, as the root user running this has no shared
   directory for job output.

-- `rebuild_job_reboot`: Bool, whether to add the `--reboot` flag to the job
-  to actually trigger a rebuild. Useful for e.g. testing priorities. Default
-  `true`.
+- `rebuild_job_reboot`: Optional. A bool controlling whether to add the
+  `--reboot` flag to the job to actually trigger a rebuild. Useful for e.g.
+  testing partition configurations. Default `true`.

 - `rebuild_job_options`: Optional. A string giving any other options to pass to
   [sbatch](https://slurm.schedmd.com/sbatch.html). Default is empty string.
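
As an illustration of how these two role variables might be used together, a site could override the defaults in an inventory vars file. The sketch below is hypothetical: the file path and the particular `sbatch` option are illustrative, and only the variable names `rebuild_job_reboot` and `rebuild_job_options` come from the README changed above.

```yaml
# Illustrative path: environments/site/inventory/group_vars/all/rebuild.yml
# Submit rebuild jobs without the `--reboot` flag, e.g. while testing the
# rebuild partition configuration, and pass an extra option through to sbatch.
rebuild_job_reboot: false
rebuild_job_options: "--exclude=general-0"
```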

docs/experimental/slurm-controlled-rebuild.md

Lines changed: 23 additions & 14 deletions
@@ -42,9 +42,6 @@ In summary, the way this functionality works is as follows:
   registers the node as having finished rebooting. It then launches the actual
   job, which does not do anything.

-  # TODO: check that this is the LAST thing we do?
-
-

 TODO: note terraform parallel limits

@@ -86,7 +83,13 @@ The configuration of this is complex and involves:
     compute = {
       general = {
         nodes = ["general-0", "general-1"]
-        ignore_image_changes: true
+        ignore_image_changes = true
+        ...
+      }
+      gpu = {
+        nodes = ["a100-0", "a100-1"]
+        ignore_image_changes = true
+        ...
       }
     }
     ...
@@ -118,14 +121,16 @@ The configuration of this is complex and involves:
   However production sites will probably be overriding this file anyway to
   customise it.

-  An example partition definition is:
+  An example partition definition, given the two node groups "general" and
+  "gpu" shown in Step 2, is:

   ```yaml
   openhpc_slurm_partitions:
     ...
     - name: rebuild
       groups:
         - name: general
+        - name: gpu
       default: NO
       maxtime: 30
       partition_params:
@@ -138,15 +143,16 @@ The configuration of this is complex and involves:
   ```

   Which has parameters as follows:
-  TODO: update me!
   - `name`: Partition name matching `rebuild` role variable `rebuild_partitions`,
     default `rebuild`.
-  - `groups`: A list of node group names, matching keys in the OpenTofu `compute`
-    variable (see example configuration above). See discussion below.
+  - `groups`: A list of node group names, matching keys in the OpenTofu
+    `compute` variable (see example in step 2 above). Normally every compute
+    node group should be listed here, unless Slurm-controlled rebuild is not
+    required for certain node groups.
   - `default`: Must be set to `NO` so that it is not the default partition.
   - `maxtime`: Maximum time to allow for rebuild jobs, in
     [slurm.conf format](https://slurm.schedmd.com/slurm.conf.html#OPT_MaxTime).
-    The example here is 30 minutes, but see discussion below
+    The example here is 30 minutes, but see discussion below.
   - `partition_params`: A mapping of additional parameters, which must be set
     as follows:
@@ -166,9 +172,12 @@ The configuration of this is complex and involves:
     entire node. This means they do not run on nodes at the same time as
     user jobs running in partitions allowing non-exclusive use.

-  Note that this partition overlaps with "normal" partitions. If it is
-  desirable to roll out changes more gradually, it is possible to create
-  multiple "rebuild" partitions, but it is necessary that:
+  The value for `maxtime` needs to be sufficient not just for a single node
+  to be rebuilt, but also to allow for any batching in either OpenTofu or
+  in Nova - see remarks in the [production docs](../production.md).
+
+  If it is desirable to roll out changes more gradually, it is possible to
+  create multiple "rebuild" partitions, but it is necessary that:
   - The rebuild partitions should not themselves overlap, else nodes may be
     rebuilt more than once.
   - Each rebuild partition should entirely cover one or more "normal"
@@ -179,8 +188,8 @@ The configuration of this is complex and involves:
   - Add the `control` node into the `rebuild` group.
   - Ensure an application credential to use for rebuilding nodes is available
     on the deploy host (default location `~/.config/openstack/clouds.yaml`).
-    If not using that location override `rebuild_clouds`.
-  - **TODO:** CONFIGURE rebuild job defaults!
+  - If required, override `rebuild_clouds_path` or other variables in the site
+    environment.

 7. Run `tofu apply` as usual to apply the new OpenTofu configuration.
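
To illustrate the "multiple rebuild partitions" approach discussed in the hunk above, a hypothetical layout for the "general" and "gpu" node groups could look like the sketch below. The partition names are illustrative, and `partition_params` is elided here because only part of its required content appears in this diff; it should be set as the documentation describes.

```yaml
openhpc_slurm_partitions:
  ...
  # Two non-overlapping rebuild partitions, so the two node groups can be
  # rebuilt independently; each covers exactly one "normal" partition.
  - name: rebuild-general
    groups:
      - name: general
    default: NO
    maxtime: 30
    partition_params:
      ...
  - name: rebuild-gpu
    groups:
      - name: gpu
    default: NO
    maxtime: 30
    partition_params:
      ...
```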

docs/production.md

Lines changed: 19 additions & 0 deletions
@@ -130,3 +130,22 @@ and referenced from the `site` and `production` environments, e.g.:

 - See the [hpctests docs](../ansible/roles/hpctests/README.md) for advice on
   raising `hpctests_hpl_mem_frac` during tests.
+
+- By default, OpenTofu (and Terraform) [limits](https://opentofu.org/docs/cli/commands/apply/#apply-options)
+  the number of concurrent operations to 10. This means that, for example, only
+  10 ports or 10 instances can be deployed at once. This should be raised by
+  modifying `environments/$ENV/activate` to add a line like:
+
+      export TF_CLI_ARGS_apply="-parallelism=25"
+
+  The value chosen should be the highest value demonstrated to work during
+  testing. Note that any time spent blocked due to this parallelism limit does
+  not count against the (un-overridable) internal OpenTofu timeout of 30 minutes.
+
+- By default, OpenStack Nova also [limits](https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.max_concurrent_builds)
+  the number of concurrent instance builds to 10. This is per Nova controller,
+  so effectively 10 virtual machine builds per hypervisor. For baremetal nodes
+  the limit is 10 per cloud for OpenStack versions earlier than Caracal; from
+  Caracal onwards it can be raised using [shards](https://specs.openstack.org/openstack/nova-specs/specs/2024.1/implemented/ironic-shards.html).
+  In general it should be possible to raise this value to 50-100 if the cloud
+  is properly tuned, again demonstrated through testing.
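
Where the cloud itself is under the deployer's control, the Nova limit referred to above is the `max_concurrent_builds` option linked in the text. A minimal sketch of raising it is shown below; the value is illustrative, and changing Nova configuration is outside the scope of this appliance.

```ini
# nova.conf on the relevant Nova hosts (illustrative value)
[DEFAULT]
# Raise the limit on concurrent instance builds (default 10).
max_concurrent_builds = 50
```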
