Pin nvidia-driver and cuda packages to working packages #496
Merged
19 commits
- c4b2795 move cuda tasks to install (sjpb)
- ad20d20 pin nvidia driver to working version and autodetect os/arch (sjpb)
- 66a2056 make install of cuda packages optional (sjpb)
- c40107d don't run cuda install tasks unless during build (sjpb)
- 63bf06a move doca install before cuda (sjpb)
- fa26080 update cuda docs (sjpb)
- 6a0fcb3 add cuda to extra build test CI (sjpb)
- dcf61da add cuda runtime tasks (sjpb)
- be805cc fix typo in extras playbook (sjpb)
- 1319e72 bump extra build size to 30GB for cuda (sjpb)
- f7dc0d3 pin both cuda package version (sjpb)
- 4d6c44a make cuda idempotent/restartable (sjpb)
- 6e00c1b Merge branch 'main' into feat/cuda-pin (sjpb)
- 4b82cba allow using computed tasks_from for cuda role (sjpb)
- 59e95de fix showing image summary (sjpb)
- cae1ccf rename nvidia driver version var (sjpb)
- 8a921b2 Merge branch 'main' into feat/cuda-pin (sjpb)
- 09ac426 Merge branch 'main' into feat/cuda-pin (sjpb)
- 1f6f227 bump CI image (sjpb)
@@ -1,15 +1,15 @@
 # cuda
 
-Install NVIDIA CUDA. The CUDA binaries are added to the PATH for all users, and the [NVIDIA persistence daemon](https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon) is enabled.
+Install NVIDIA drivers and optionally CUDA packages. CUDA binaries are added to the `$PATH` for all users, and the [NVIDIA persistence daemon](https://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon) is enabled.
 
 ## Prerequisites
 
 Requires OFED to be installed to provide required kernel-* packages.
 
 ## Role Variables
 
-- `cuda_distro`: Optional. Default `rhel8`.
-- `cuda_repo`: Optional. Default `https://developer.download.nvidia.com/compute/cuda/repos/{{ cuda_distro }}/x86_64/cuda-{{ cuda_distro }}.repo`
-- `cuda_driver_stream`: Optional. The default value `default` will, on first use of this role, enable the dkms-flavour `nvidia-driver` DNF module stream with the current highest version number. The `latest-dkms` stream is not enabled, and subsequent runs of the role will *not* change the enabled stream, even if a later version has become available. Changing this value once an `nvidia-driver` stream has been enabled raises an error. If an upgrade of the `nvidia-driver` module is required, the currently-enabled stream and all packages should be manually removed.
+- `cuda_repo_url`: Optional. URL of `.repo` file. Default is upstream for appropriate OS/architecture.
+- `cuda_nvidia_driver_stream`: Optional. Version of `nvidia-driver` stream to enable. This controls whether the open or proprietary drivers are installed and the major version. Changing this once the drivers are installed does not change the version.
 - `cuda_packages`: Optional. Default: `['cuda', 'nvidia-gds']`.
+- `cuda_package_version`: Optional. Default `latest` which will install the latest packages if not installed but won't upgrade already-installed packages. Use `'none'` to skip installing CUDA.
 - `cuda_persistenced_state`: Optional. State of systemd `nvidia-persistenced` service. Values as [ansible.builtin.systemd:state](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_module.html#parameter-state). Default `started`.
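
For illustration, the role variables documented in the updated README could be overridden in an inventory's group vars roughly as follows. This is only a sketch: the driver stream name is an assumed placeholder and is not taken from this PR, so check the streams the `nvidia-driver` module actually offers before using it. The other values repeat the defaults stated in the README.

```yaml
# Hypothetical overrides for the cuda role, e.g. in a group_vars file.
# ASSUMPTION: "560-open" is an illustrative stream name, not a value from this PR.
cuda_nvidia_driver_stream: "560-open"  # selects open vs proprietary driver and the major version
cuda_packages:                         # README default
  - cuda
  - nvidia-gds
cuda_package_version: "latest"         # README default; set to 'none' to skip installing CUDA
cuda_persistenced_state: started       # README default; any ansible.builtin.systemd state value
```

Per the README text, changing `cuda_nvidia_driver_stream` after the drivers are installed does not change the installed version, and `latest` does not upgrade packages that are already present.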
@@ -0,0 +1,5 @@
+- name: Ensure NVIDIA Persistence Daemon state
+  systemd:
+    name: nvidia-persistenced
+    enabled: true
+    state: "{{ cuda_persistenced_state }}"
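
As a rough sketch of how a task file like this might be run on its own, a play could call the role with `ansible.builtin.include_role` and a templated `tasks_from`, which `include_role` evaluates at runtime. The host group `cuda`, the variable `cuda_tasks_file` and the file name `runtime.yml` are assumptions made for illustration. The path of the new task file is not visible in this diff.

```yaml
# Hypothetical play; group name, variable name and task file name are assumptions.
- hosts: cuda
  become: true
  tasks:
    - name: Run only the selected cuda task file
      ansible.builtin.include_role:
        name: cuda
        # Computed at runtime; falls back to an assumed file name.
        tasks_from: "{{ cuda_tasks_file | default('runtime.yml') }}"
      vars:
        # Documented role variable: leave the service enabled but stopped.
        cuda_persistenced_state: stopped
```

With `import_role` the `tasks_from` value has to be resolvable at parse time, which is one reason a computed value is usually paired with `include_role`.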
4 changes: 2 additions & 2 deletions
environments/.stackhpc/terraform/cluster_image.auto.tfvars.json
@@ -1,6 +1,6 @@
 {
     "cluster_image": {
-        "RL8": "openhpc-RL8-241218-1011-5effb3fa",
-        "RL9": "openhpc-RL9-241218-1011-5effb3fa"
+        "RL8": "openhpc-RL8-241218-1705-09ac4268",
+        "RL9": "openhpc-RL9-241218-1705-09ac4268"
     }
 }