Commit 2f0c2cc

Merge master into werror

2 parents c34e433 + 445287e

File tree

5 files changed: +154 -26 lines


source/gpus_in_openstack.rst

Lines changed: 130 additions & 0 deletions
@@ -458,6 +458,76 @@ Booting the VM:
   $ openstack server add security group nvidia-dls-1 nvidia-dls

Manual VM driver and licence configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vGPU client VMs need to be configured with Nvidia drivers to run GPU workloads.
The host drivers should already be applied to the hypervisor.

GCP hosts compatible client drivers `here
<https://cloud.google.com/compute/docs/gpus/grid-drivers-table>`__.

Find the correct version (when in doubt, use the same version as the host) and
download it to the VM. The exact dependencies will depend on the base image you
are using, but at a minimum you will need GCC installed.
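The host driver version can be checked on the hypervisor, for example (a
sketch, assuming the Nvidia host driver is already installed there):

.. code-block:: bash

   # Run on the hypervisor: report the installed driver version to match.
   nvidia-smi --query-gpu=driver_version --format=csv,noheader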
Ubuntu Jammy example:

.. code-block:: bash

   sudo apt update
   sudo apt install -y make gcc wget
   wget https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU17.1/NVIDIA-Linux-x86_64-550.54.15-grid.run
   sudo sh NVIDIA-Linux-x86_64-550.54.15-grid.run

Check that the ``nvidia-smi`` client is available:

.. code-block:: bash

   nvidia-smi

Generate a token from the licence server, and copy the token file to the
client VM.
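For example, the token can be copied over SSH (a sketch; the user, address and
token filename are placeholders):

.. code-block:: bash

   # Copy the token from wherever it was generated to the client VM.
   scp client_configuration_token_<datetime>.tok <user>@<client-vm-address>: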
On the client, create an Nvidia grid config file from the template:

.. code-block:: bash

   sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

Edit it to set ``FeatureType=1`` and leave the rest of the settings as default.
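The edit can also be scripted, for example (a sketch, assuming the template
ships a ``FeatureType=0`` line):

.. code-block:: bash

   # Switch the licensed feature type to 1 (vGPU).
   sudo sed -i 's/^FeatureType=0/FeatureType=1/' /etc/nvidia/gridd.conf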
Copy the client configuration token into the ``/etc/nvidia/ClientConfigToken``
directory.
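For example, assuming the token was copied to the current working directory:

.. code-block:: bash

   sudo cp client_configuration_token_<datetime>.tok /etc/nvidia/ClientConfigToken/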
Ensure the correct permissions are set:

.. code-block:: bash

   sudo chmod 744 /etc/nvidia/ClientConfigToken/client_configuration_token_<datetime>.tok

Restart the ``nvidia-gridd`` service:

.. code-block:: bash

   sudo systemctl restart nvidia-gridd

Check that the token has been recognised:

.. code-block:: bash

   nvidia-smi -q | grep 'License Status'

If not, an error should appear in the journal:

.. code-block:: bash

   sudo journalctl -xeu nvidia-gridd

A successfully licensed VM can be snapshotted to create an image in Glance that
includes the drivers and licensing token. Alternatively, an image can be
created using Diskimage Builder.
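A minimal snapshot sketch using the OpenStack CLI (the image and server names
here are placeholders):

.. code-block:: bash

   # Snapshot the licensed client VM into a Glance image.
   openstack server image create --name vgpu-ubuntu-jammy <client-vm-name>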

Disk image builder recipe to automatically license VGPU on boot
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -536,6 +606,66 @@ when copying the contents as it can contain invisible characters. It is best to
into your openstack-config repository and vault encrypt it. The ``file`` lookup
plugin can be used to decrypt the file (as shown in the example above).

Testing vGPU VMs
^^^^^^^^^^^^^^^^

vGPU VMs can be validated using the following test workload. The test should
succeed if the VM is correctly licensed and drivers are correctly installed for
both the host and client VM.

Install ``cuda-toolkit`` using the instructions `here
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`__.

Ubuntu Jammy example:

.. code-block:: bash

   wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
   sudo dpkg -i cuda-keyring_1.1-1_all.deb
   sudo apt update -y
   sudo apt install -y cuda-toolkit make

The VM may require a reboot at this point.

Clone the ``cuda-samples`` repo:

.. code-block:: bash

   git clone https://github.com/NVIDIA/cuda-samples.git

Build and run a test workload:

.. code-block:: bash

   cd cuda-samples/Samples/6_Performance/transpose
   make
   ./transpose

Example output:

.. code-block::

   Transpose Starting...

   GPU Device 0: "Ampere" with compute capability 8.0

   > Device 0: "GRID A100D-1-10C MIG 1g.10gb"
   > SM Capability 8.0 detected:
   > [GRID A100D-1-10C MIG 1g.10gb] has 14 MP(s) x 64 (Cores/MP) = 896 (Cores)
   > Compute performance scaling factor = 1.00

   Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

   transpose simple copy       , Throughput = 159.1779 GB/s, Time = 0.04908 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose shared memory copy, Throughput = 152.1922 GB/s, Time = 0.05133 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose naive             , Throughput = 117.2670 GB/s, Time = 0.06662 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose coalesced         , Throughput = 135.0813 GB/s, Time = 0.05784 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose optimized         , Throughput = 145.4326 GB/s, Time = 0.05372 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose coarse-grained    , Throughput = 145.2941 GB/s, Time = 0.05377 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose fine-grained      , Throughput = 150.5703 GB/s, Time = 0.05189 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose diagonal          , Throughput = 117.6831 GB/s, Time = 0.06639 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   Test passed

Changing VGPU device types
^^^^^^^^^^^^^^^^^^^^^^^^^^

source/hardware_inventory_management.rst

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ We can then provision and configure them:
    :substitutions:

    kayobe# kayobe overcloud provision --limit |hypervisor_hostname|
-   kayobe# kayobe overcloud host configure --limit |hypervisor_hostname| --kolla-limit |hypervisor_hostname|
+   kayobe# kayobe overcloud host configure --limit |hypervisor_hostname|
    kayobe# kayobe overcloud service deploy --limit |hypervisor_hostname| --kolla-limit |hypervisor_hostname|

 Replacing a Failing Hypervisor

source/include/ceph_ansible.rst

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ Apply LVM configuration using Kayobe for the replaced device (here on ``storage-
 .. code-block:: console

-    kayobe$ kayobe overcloud host configure -t lvm -kt none -l storage-0 -kl storage-0
+    kayobe$ kayobe overcloud host configure -t lvm -l storage-0

 Before running Ceph-Ansible, also remove vestigial state directory
 from ``/var/lib/ceph/osd`` for the purged OSD

source/operations_and_monitoring.rst

Lines changed: 21 additions & 23 deletions
@@ -4,22 +4,22 @@
 Operations and Monitoring
 =========================

-Access to Kibana
-================
+Access to OpenSearch Dashboards
+===============================

 OpenStack control plane logs are aggregated from all servers by Fluentd and
-stored in ElasticSearch. The control plane logs can be accessed from
-ElasticSearch using Kibana, which is available at the following URL:
-|kibana_url|
+stored in OpenSearch. The control plane logs can be accessed from
+OpenSearch using OpenSearch Dashboards, which is available at the following URL:
+|opensearch_dashboards_url|

-To log in, use the ``kibana`` user. The password is auto-generated by
+To log in, use the ``opensearch`` user. The password is auto-generated by
 Kolla-Ansible and can be extracted from the encrypted passwords file
 (|kolla_passwords|):

 .. code-block:: console
    :substitutions:

-   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^kibana
+   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^opensearch_dashboards

 Access to Grafana
 =================
@@ -293,7 +293,7 @@ Monitoring
 ----------

 * `Back up InfluxDB <https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/>`__
-* `Back up ElasticSearch <https://www.elastic.co/guide/en/elasticsearch/reference/current/backup-cluster-data.html>`__
+* `Back up OpenSearch <https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore/>`__
 * `Back up Prometheus <https://prometheus.io/docs/prometheus/latest/querying/api/#snapshot>`__

 Seed
@@ -309,8 +309,8 @@ Ansible control host
 Control Plane Monitoring
 ========================

-The control plane has been configured to collect logs centrally using the EFK
-stack (Elasticsearch, Fluentd and Kibana).
+The control plane has been configured to collect logs centrally using the FOOD
+stack (Fluentd, OpenSearch and OpenSearch Dashboards).

 Telemetry monitoring of the control plane is performed by Prometheus. Metrics
 are collected by Prometheus exporters, which are either running on all hosts
@@ -508,7 +508,8 @@ Overview
 * Remove the node from maintenance mode in bifrost
 * Bifrost should automatically power on the node via IPMI
 * Check that all docker containers are running
-* Check Kibana for any messages with log level ERROR or equivalent
+* Check OpenSearch Dashboards for any messages with log level ERROR or
+  equivalent

 Controllers
 -----------
@@ -647,28 +648,25 @@ perform the following cleanup procedure regularly:
    fi
 done

-Elasticsearch indexes retention
+OpenSearch indexes retention
 ===============================

-To enable and alter default rotation values for Elasticsearch Curator, edit
+To alter default rotation values for OpenSearch, edit
 ``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``:

 .. code-block:: console

-   # Allow Elasticsearch Curator to apply a retention policy to logs
-   enable_elasticsearch_curator: true
-
-   # Duration after which index is closed
-   elasticsearch_curator_soft_retention_period_days: 90
+   # Duration after which index is closed (default 30)
+   opensearch_soft_retention_period_days: 90

-   # Duration after which index is deleted
-   elasticsearch_curator_hard_retention_period_days: 180
+   # Duration after which index is deleted (default 60)
+   opensearch_hard_retention_period_days: 180

-Reconfigure Elasticsearch with new values:
+Reconfigure OpenSearch with new values:

 .. code-block:: console

-   kayobe overcloud service reconfigure --kolla-tags elasticsearch
+   kayobe overcloud service reconfigure --kolla-tags opensearch

 For more information see the `upstream documentation
-<https://docs.openstack.org/kolla-ansible/latest/reference/logging-and-monitoring/central-logging-guide.html#curator>`__.
+<https://docs.openstack.org/kolla-ansible/latest/reference/logging-and-monitoring/central-logging-guide.html#applying-log-retention-policies>`__.

source/vars.rst

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@
 .. |kayobe_source_url| replace:: https://github.com/acme-openstack/kayobe.git
 .. |kayobe_source_version| replace:: ``acme/yoga``
 .. |keystone_public_url| replace:: https://openstack.acme.example:5000
-.. |kibana_url| replace:: https://openstack.acme.example:5601
+.. |opensearch_dashboards_url| replace:: https://openstack.acme.example:5601
 .. |kolla_passwords| replace:: https://github.com/acme-openstack/kayobe-config/blob/acme/yoga/etc/kayobe/kolla/passwords.yml
 .. |monitoring_host| replace:: ``mon0``
 .. |network_name| replace:: admin-vxlan
