@@ -458,6 +458,76 @@ Booting the VM:
458
458
$ openstack server add security group nvidia-dls-1 nvidia-dls
459
459
460
460
461
+ Manual VM driver and licence configuration
462
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
463
+
464
+ vGPU client VMs need to be configured with Nvidia drivers to run GPU workloads.
465
+ The host drivers should already be installed on the hypervisor.
466
+
467
+ Google Cloud Platform (GCP) hosts compatible client drivers `here
468
+ <https://cloud.google.com/compute/docs/gpus/grid-drivers-table>`__.
469
+
470
+ Find the correct version (when in doubt, use the same version as the host) and
471
+ download it to the VM. The exact dependencies vary with the base image you
472
+ are using, but at a minimum you will need GCC installed.
473
+
474
+ Ubuntu Jammy example:
475
+
476
+ .. code-block:: bash
477
+
478
+    sudo apt update
479
+    sudo apt install -y make gcc wget
480
+    wget https://storage.googleapis.com/nvidia-drivers-us-public/GRID/vGPU17.1/NVIDIA-Linux-x86_64-550.54.15-grid.run
481
+    sudo sh NVIDIA-Linux-x86_64-550.54.15-grid.run
482
+
483
+ Check that the ``nvidia-smi`` client is available:
484
+
485
+ .. code-block:: bash
486
+
487
+    nvidia-smi
488
+
489
+ Generate a token from the licence server, and copy the token file to the client
490
+ VM.
491
+
492
+ On the client, create an Nvidia grid config file from the template:
493
+
494
+ .. code-block:: bash
495
+
496
+    sudo cp /etc/nvidia/gridd.conf.template /etc/nvidia/gridd.conf
497
+
498
+ Edit it to set ``FeatureType=1`` and leave the rest of the settings as default.
499
+
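The ``FeatureType`` edit above can also be scripted. The following is a minimal sketch, not part of the original procedure: ``set_feature_type`` is a hypothetical helper name, and it assumes the ``FeatureType`` entry may be present either active or commented out with ``#``, and that GNU ``sed`` is available (as on Ubuntu).

```shell
# Hypothetical helper (not from the NVIDIA tooling): force FeatureType=1
# in a gridd.conf-style file passed as the first argument.
set_feature_type() {
    conf_file="$1"
    if grep -Eq '^[#[:space:]]*FeatureType=' "$conf_file"; then
        # An entry exists (possibly commented out): rewrite it in place.
        sed -i -E 's/^[#[:space:]]*FeatureType=.*/FeatureType=1/' "$conf_file"
    else
        # No entry found: append one at the end of the file.
        printf 'FeatureType=1\n' >> "$conf_file"
    fi
}
```

Run it from a root shell, for example ``set_feature_type /etc/nvidia/gridd.conf``, since the file is normally only writable by root.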
500
+ Copy the client configuration token into the ``/etc/nvidia/ClientConfigToken``
501
+ directory.
502
+
503
+ Ensure the correct permissions are set:
504
+
505
+ .. code-block:: bash
506
+
507
+    sudo chmod 744 /etc/nvidia/ClientConfigToken/client_configuration_token_<datetime>.tok
508
+
509
+ Restart the ``nvidia-gridd`` service:
510
+
511
+ .. code-block:: bash
512
+
513
+    sudo systemctl restart nvidia-gridd
514
+
515
+ Check that the token has been recognised:
516
+
517
+ .. code-block:: bash
518
+
519
+    nvidia-smi -q | grep 'License Status'
520
+
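For use in scripts, the licence check can be turned into an exit status. This is a rough sketch under assumptions: ``is_licensed`` is a hypothetical helper, and it assumes that ``nvidia-smi -q`` reports a licensed client on a line of the form ``License Status : Licensed (...)``. The exact wording may differ between driver versions, so compare it against the output on your own VM first.

```shell
# Hypothetical helper: read `nvidia-smi -q` text on stdin and succeed only
# if a "License Status" line reports a licensed state.
# Assumption: the value starts with "Licensed" directly after the colon,
# so "Unlicensed" does not match.
is_licensed() {
    grep 'License Status' | grep -q ': Licensed'
}
```

Example usage: ``nvidia-smi -q | is_licensed && echo "vGPU licence acquired"``.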
521
+ If the licence has not been acquired, an error should appear in the journal:
522
+
523
+ .. code-block:: bash
524
+
525
+    sudo journalctl -xeu nvidia-gridd
526
+
527
+ A successfully licensed VM can be snapshotted to create an image in Glance that
528
+ includes the drivers and licensing token. Alternatively, an image can be
529
+ created using Diskimage Builder.
530
+
461
531
Disk image builder recipe to automatically license VGPU on boot
462
532
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
463
533
@@ -536,6 +606,66 @@ when copying the contents as it can contain invisible characters. It is best to
536
606
into your openstack-config repository and vault encrypt it. The ``file`` lookup plugin can be used to decrypt
537
607
the file (as shown in the example above).
538
608
609
+ Testing vGPU VMs
610
+ ^^^^^^^^^^^^^^^^
611
+
612
+ vGPU VMs can be validated using the following test workload. The test should
613
+ succeed if the VM is correctly licensed and drivers are correctly installed for
614
+ both the host and client VM.
615
+
616
+ Install ``cuda-toolkit`` using the instructions `here
617
+ <https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html>`__.
618
+
619
+ Ubuntu Jammy example:
620
+
621
+ .. code-block:: bash
622
+
623
+    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
624
+    sudo dpkg -i cuda-keyring_1.1-1_all.deb
625
+    sudo apt update -y
626
+    sudo apt install -y cuda-toolkit make
627
+
628
+ The VM may require a reboot at this point.
629
+
630
+ Clone the ``cuda-samples`` repo:
631
+
632
+ .. code-block:: bash
633
+
634
+    git clone https://github.com/NVIDIA/cuda-samples.git
635
+
636
+ Build and run a test workload:
637
+
638
+ .. code-block:: bash
639
+
640
+    cd cuda-samples/Samples/6_Performance/transpose
641
+    make
642
+    ./transpose
643
+
644
+ Example output:
645
+
646
+ .. code-block::
647
+
648
+    Transpose Starting...
649
+
650
+    GPU Device 0: "Ampere" with compute capability 8.0
651
+
652
+    > Device 0: "GRID A100D-1-10C MIG 1g.10gb"
653
+    > SM Capability 8.0 detected:
654
+    > [GRID A100D-1-10C MIG 1g.10gb] has 14 MP(s) x 64 (Cores/MP) = 896 (Cores)
655
+    > Compute performance scaling factor = 1.00
656
+
657
+    Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16
658
+
659
+    transpose simple copy       , Throughput = 159.1779 GB/s, Time = 0.04908 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
660
+    transpose shared memory copy, Throughput = 152.1922 GB/s, Time = 0.05133 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
661
+    transpose naive             , Throughput = 117.2670 GB/s, Time = 0.06662 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
662
+    transpose coalesced         , Throughput = 135.0813 GB/s, Time = 0.05784 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
663
+    transpose optimized         , Throughput = 145.4326 GB/s, Time = 0.05372 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
664
+    transpose coarse-grained    , Throughput = 145.2941 GB/s, Time = 0.05377 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
665
+    transpose fine-grained      , Throughput = 150.5703 GB/s, Time = 0.05189 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
666
+    transpose diagonal          , Throughput = 117.6831 GB/s, Time = 0.06639 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
667
+    Test passed
668
+
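When the test workload is run unattended, for example while validating a freshly built image, the result can be checked automatically. The sketch below is illustrative only: ``check_sample`` is a hypothetical helper, and it assumes that a successful run prints the ``Test passed`` line shown in the example output above.

```shell
# Hypothetical helper: run the given command and succeed only if its
# combined stdout/stderr contains the text "Test passed".
check_sample() {
    "$@" 2>&1 | grep -q 'Test passed'
}
```

Example usage from the sample directory: ``check_sample ./transpose && echo "vGPU workload OK"``.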
539
669
Changing VGPU device types
540
670
^^^^^^^^^^^^^^^^^^^^^^^^^^
541
671