Commit 2f0c2cc

Merge master into werror

2 parents c34e433 + 445287e

File tree

5 files changed: +154 -26 lines


source/gpus_in_openstack.rst

Lines changed: 130 additions & 0 deletions
@@ -458,6 +458,76 @@ Booting the VM:
   $ openstack server add security group nvidia-dls-1 nvidia-dls

Manual VM driver and licence configuration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

vGPU client VMs need to be configured with Nvidia drivers to run GPU workloads.
The host drivers should already be applied to the hypervisor.

GCP hosts compatible client drivers `here
<https://cloud.google.com/compute/docs/gpus/grid-drivers-table>`__.

Find the correct version (when in doubt, use the same version as the host) and
download it to the VM. The exact dependencies will depend on the base image you
are using, but at a minimum you will need GCC installed.
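The host driver version can be checked on the hypervisor, for example (a
sketch, assuming the Nvidia host driver is already installed there):

.. code-block:: bash

   # Run on the hypervisor: report the installed driver version to match.
   nvidia-smi --query-gpu=driver_version --format=csv,noheader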
Ubuntu Jammy example:

.. code-block:: bash

   sudo apt update
   sudo apt install -y make gcc wget
   wget https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU17.1/NVIDIA-Linux-x86_64-550.54.15-grid.run
   sudo sh NVIDIA-Linux-x86_64-550.54.15-grid.run

Check that the ``nvidia-smi`` client is available:

.. code-block:: bash

   nvidia-smi

Generate a token from the licence server, and copy the token file to the
client VM.
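For example, the token can be copied over SSH (a sketch; the user, address and
token filename are placeholders):

.. code-block:: bash

   # Copy the token from wherever it was generated to the client VM.
   scp client_configuration_token_<datetime>.tok <user>@<client-vm-address>: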
On the client, create an Nvidia grid config file from the template:

.. code-block:: bash

   sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf

Edit it to set ``FeatureType=1`` and leave the rest of the settings as default.
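The edit can also be scripted, for example (a sketch, assuming the template
ships a ``FeatureType=0`` line):

.. code-block:: bash

   # Switch the licensed feature type to 1 (vGPU).
   sudo sed -i 's/^FeatureType=0/FeatureType=1/' /etc/nvidia/gridd.conf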
Copy the client configuration token into the ``/etc/nvidia/ClientConfigToken``
directory.
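For example, assuming the token was copied to the current working directory:

.. code-block:: bash

   sudo cp client_configuration_token_<datetime>.tok /etc/nvidia/ClientConfigToken/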
Ensure the correct permissions are set:

.. code-block:: bash

   sudo chmod 744 /etc/nvidia/ClientConfigToken/client_configuration_token_<datetime>.tok

Restart the ``nvidia-gridd`` service:

.. code-block:: bash

   sudo systemctl restart nvidia-gridd

Check that the token has been recognised:

.. code-block:: bash

   nvidia-smi -q | grep 'License Status'

If not, an error should appear in the journal:

.. code-block:: bash

   sudo journalctl -xeu nvidia-gridd

A successfully licensed VM can be snapshotted to create an image in Glance that
includes the drivers and licensing token. Alternatively, an image can be
created using Diskimage Builder.
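A minimal snapshot sketch using the OpenStack CLI (the image and server names
here are placeholders):

.. code-block:: bash

   # Snapshot the licensed client VM into a Glance image.
   openstack server image create --name vgpu-ubuntu-jammy <client-vm-name>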

Disk image builder recipe to automatically license VGPU on boot
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -536,6 +606,66 @@ when copying the contents as it can contain invisible characters. It is best to
into your openstack-config repository and vault encrypt it. The ``file`` lookup
plugin can be used to decrypt the file (as shown in the example above).

Testing vGPU VMs
^^^^^^^^^^^^^^^^

vGPU VMs can be validated using the following test workload. The test should
succeed if the VM is correctly licensed and drivers are correctly installed for
both the host and client VM.

Install ``cuda-toolkit`` using the instructions `here
<https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`__.

Ubuntu Jammy example:

.. code-block:: bash

   wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
   sudo dpkg -i cuda-keyring_1.1-1_all.deb
   sudo apt update -y
   sudo apt install -y cuda-toolkit make

The VM may require a reboot at this point.

Clone the ``cuda-samples`` repo:

.. code-block:: bash

   git clone https://github.com/NVIDIA/cuda-samples.git

Build and run a test workload:

.. code-block:: bash

   cd cuda-samples/Samples/6_Performance/transpose
   make
   ./transpose

Example output:

.. code-block::

   Transpose Starting...

   GPU Device 0: "Ampere" with compute capability 8.0

   > Device 0: "GRID A100D-1-10C MIG 1g.10gb"
   > SM Capability 8.0 detected:
   > [GRID A100D-1-10C MIG 1g.10gb] has 14 MP(s) x 64 (Cores/MP) = 896 (Cores)
   > Compute performance scaling factor = 1.00

   Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

   transpose simple copy       , Throughput = 159.1779 GB/s, Time = 0.04908 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose shared memory copy, Throughput = 152.1922 GB/s, Time = 0.05133 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose naive             , Throughput = 117.2670 GB/s, Time = 0.06662 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose coalesced         , Throughput = 135.0813 GB/s, Time = 0.05784 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose optimized         , Throughput = 145.4326 GB/s, Time = 0.05372 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose coarse-grained    , Throughput = 145.2941 GB/s, Time = 0.05377 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose fine-grained      , Throughput = 150.5703 GB/s, Time = 0.05189 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   transpose diagonal          , Throughput = 117.6831 GB/s, Time = 0.06639 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
   Test passed

Changing VGPU device types
^^^^^^^^^^^^^^^^^^^^^^^^^^

source/hardware_inventory_management.rst

Lines changed: 1 addition & 1 deletion
@@ -76,7 +76,7 @@ We can then provision and configure them:
    :substitutions:

    kayobe# kayobe overcloud provision --limit |hypervisor_hostname|
-   kayobe# kayobe overcloud host configure --limit |hypervisor_hostname| --kolla-limit |hypervisor_hostname|
+   kayobe# kayobe overcloud host configure --limit |hypervisor_hostname|
    kayobe# kayobe overcloud service deploy --limit |hypervisor_hostname| --kolla-limit |hypervisor_hostname|

 Replacing a Failing Hypervisor

source/include/ceph_ansible.rst

Lines changed: 1 addition & 1 deletion
@@ -67,7 +67,7 @@ Apply LVM configuration using Kayobe for the replaced device (here on ``storage-
 .. code-block:: console

-    kayobe$ kayobe overcloud host configure -t lvm -kt none -l storage-0 -kl storage-0
+    kayobe$ kayobe overcloud host configure -t lvm -l storage-0

 Before running Ceph-Ansible, also remove vestigial state directory
 from ``/var/lib/ceph/osd`` for the purged OSD

source/operations_and_monitoring.rst

Lines changed: 21 additions & 23 deletions
@@ -4,22 +4,22 @@
 Operations and Monitoring
 =========================

-Access to Kibana
-================
+Access to OpenSearch Dashboards
+===============================

 OpenStack control plane logs are aggregated from all servers by Fluentd and
-stored in ElasticSearch. The control plane logs can be accessed from
-ElasticSearch using Kibana, which is available at the following URL:
-|kibana_url|
+stored in OpenSearch. The control plane logs can be accessed from
+OpenSearch using OpenSearch Dashboards, which is available at the following URL:
+|opensearch_dashboards_url|

-To log in, use the ``kibana`` user. The password is auto-generated by
+To log in, use the ``opensearch`` user. The password is auto-generated by
 Kolla-Ansible and can be extracted from the encrypted passwords file
 (|kolla_passwords|):

 .. code-block:: console
    :substitutions:

-   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^kibana
+   kayobe# ansible-vault view ${KAYOBE_CONFIG_PATH}/kolla/passwords.yml --vault-password-file |vault_password_file_path| | grep ^opensearch_dashboards

 Access to Grafana
 =================
@@ -293,7 +293,7 @@ Monitoring
 ----------

 * `Back up InfluxDB <https://docs.influxdata.com/influxdb/v1.8/administration/backup_and_restore/>`__
-* `Back up ElasticSearch <https://www.elastic.co/guide/en/elasticsearch/reference/current/backup-cluster-data.html>`__
+* `Back up OpenSearch <https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/snapshots/snapshot-restore/>`__
 * `Back up Prometheus <https://prometheus.io/docs/prometheus/latest/querying/api/#snapshot>`__

 Seed
@@ -309,8 +309,8 @@ Ansible control host
 Control Plane Monitoring
 ========================

-The control plane has been configured to collect logs centrally using the EFK
-stack (Elasticsearch, Fluentd and Kibana).
+The control plane has been configured to collect logs centrally using the FOOD
+stack (Fluentd, OpenSearch and OpenSearch Dashboards).

 Telemetry monitoring of the control plane is performed by Prometheus. Metrics
 are collected by Prometheus exporters, which are either running on all hosts
@@ -508,7 +508,8 @@ Overview
 * Remove the node from maintenance mode in bifrost
 * Bifrost should automatically power on the node via IPMI
 * Check that all docker containers are running
-* Check Kibana for any messages with log level ERROR or equivalent
+* Check OpenSearch Dashboards for any messages with log level ERROR or
+  equivalent

 Controllers
 -----------
@@ -647,28 +648,25 @@ perform the following cleanup procedure regularly:
    fi
 done

-Elasticsearch indexes retention
+OpenSearch indexes retention
 ===============================

-To enable and alter default rotation values for Elasticsearch Curator, edit
+To alter default rotation values for OpenSearch, edit
 ``${KAYOBE_CONFIG_PATH}/kolla/globals.yml``:

 .. code-block:: console

-   # Allow Elasticsearch Curator to apply a retention policy to logs
-   enable_elasticsearch_curator: true
-
-   # Duration after which index is closed
-   elasticsearch_curator_soft_retention_period_days: 90
+   # Duration after which index is closed (default 30)
+   opensearch_soft_retention_period_days: 90

-   # Duration after which index is deleted
-   elasticsearch_curator_hard_retention_period_days: 180
+   # Duration after which index is deleted (default 60)
+   opensearch_hard_retention_period_days: 180

-Reconfigure Elasticsearch with new values:
+Reconfigure OpenSearch with new values:

 .. code-block:: console

-   kayobe overcloud service reconfigure --kolla-tags elasticsearch
+   kayobe overcloud service reconfigure --kolla-tags opensearch

 For more information see the `upstream documentation
-<https://docs.openstack.org/kolla-ansible/latest/reference/logging-and-monitoring/central-logging-guide.html#curator>`__.
+<https://docs.openstack.org/kolla-ansible/latest/reference/logging-and-monitoring/central-logging-guide.html#applying-log-retention-policies>`__.

source/vars.rst

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@
 .. |kayobe_source_url| replace:: https://github.com/acme-openstack/kayobe.git
 .. |kayobe_source_version| replace:: ``acme/yoga``
 .. |keystone_public_url| replace:: https://openstack.acme.example:5000
-.. |kibana_url| replace:: https://openstack.acme.example:5601
+.. |opensearch_dashboards_url| replace:: https://openstack.acme.example:5601
 .. |kolla_passwords| replace:: https://github.com/acme-openstack/kayobe-config/blob/acme/yoga/etc/kayobe/kolla/passwords.yml
 .. |monitoring_host| replace:: ``mon0``
 .. |network_name| replace:: admin-vxlan
