
Disable updates and test OFED build in arcus CI #184

Closed
wants to merge 30 commits into from
Commits
18187c1
copy smslabs TF to cookiecutter skeleton
sjpb May 4, 2022
c22083e
use skeleton TF for arcus
sjpb May 4, 2022
79e0f8d
define ports in skeleton TF
sjpb May 4, 2022
a6f87f0
add direct-mode ports for arcus env
sjpb May 4, 2022
c0d016b
copy getfaults.py from smslabs to skeleton
sjpb May 4, 2022
c341256
getfaults takes TF state directory
sjpb May 4, 2022
05ab5ec
make security groups idempotent for TF apply
sjpb May 4, 2022
0326aaf
add arcus env config
sjpb May 4, 2022
8104cc2
add arcus CI workflow
sjpb May 4, 2022
ef5f59b
temporarily disable smslabs CI during arcus CI development
sjpb May 4, 2022
632099c
move arcus to same base image as smslabs
sjpb May 4, 2022
2e66f3d
bugfix reading TF state for provisioning errors
sjpb May 5, 2022
31f3b8f
automate slurm partition definition
sjpb May 5, 2022
fa58cce
fix arcus bastion definition for CI
sjpb May 5, 2022
f9d882c
try to fix ansible transfer mechanism failing after login rebuild
sjpb May 5, 2022
5d75fc7
Merge branch 'main' into ci/arcus-basic
sjpb May 5, 2022
a2301cc
use latest base image openhpc-220504-0904 in arcus
sjpb May 5, 2022
ad714e5
try to fix 'Connection timed out during banner exchange' after login …
sjpb May 9, 2022
494ef9b
default to no package updates in both packer build and direct configu…
sjpb May 11, 2022
e543021
use ofed image in arcus CI
sjpb May 11, 2022
2ea84c6
try to fix 'wait for login' failing login rebuild
sjpb May 11, 2022
d390241
Merge branch 'main' into feature/ofed
sjpb May 12, 2022
ddb5853
wait more for control node after reimage in CI
sjpb May 12, 2022
fa9113b
Merge branch 'main' into feature/ofed - using base_image_name in matr…
sjpb May 16, 2022
7139c6b
remove separate arcus workflow file
sjpb May 16, 2022
ba7bd83
add README for smslabs,arcus CI environments
sjpb May 17, 2022
0f03150
bump smslabs CI to 20GB VMs now building images on 20GB Arcus VM
sjpb May 17, 2022
875b32d
fix source_image_name not being provided to Packer
sjpb May 17, 2022
5a70e31
add instance_id to skeleton TF hosts templating to cope with multiple…
sjpb May 17, 2022
c6b47a0
make instance name unique between matrix jobs for clarity/debugging
sjpb May 17, 2022
53 changes: 26 additions & 27 deletions .github/workflows/stackhpc.yml
@@ -7,15 +7,32 @@ on:
   pull_request:
 jobs:
   openstack:
-    name: openstack-ci-${{ matrix.cloud }}
+    name: openstack-ci-${{ matrix.name }}
     strategy:
       matrix:
-        cloud:
-          - "smslabs" # SMS-Labs OpenStack in stackhpc-ci project
-          - "arcus" # Arcus OpenStack in rcp-cloud-portal-demo project, with RoCE
+        include:
+          # SMS-Labs OpenStack in stackhpc-ci project
+          - cloud: smslabs
+            base_image_name: openhpc-220510-1910.qcow2
+            name: smslabs
+            cluster_name_suffix: ''
+          # Arcus OpenStack in rcp-cloud-portal-demo project, with RoCE
+          - cloud: arcus
+            base_image_name: openhpc-220510-1910.qcow2
+            name: arcus
+            cluster_name_suffix: ''
+          # As above using OFED
+          - cloud: arcus
+            base_image_name: openhpc-220510-1911-ofed.qcow2
+            name: arcus-ofed
+            cluster_name_suffix: ofed
       fail-fast: false # as want clouds to continue independently
-    concurrency: ${{ matrix.cloud }}
+    concurrency: ${{ matrix.name }}
     runs-on: ubuntu-20.04
+    env:
+      TF_VAR_cluster_name: ci${{ github.run_id }}${{ matrix.cluster_name_suffix }}
+      OS_CLOUD: openstack
+      ANSIBLE_FORCE_COLOR: True
     steps:
       - uses: actions/checkout@v2

@@ -60,8 +77,7 @@ jobs:
           cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
           terraform apply -auto-approve
         env:
-          OS_CLOUD: openstack
-          TF_VAR_cluster_name: ci${{ github.run_id }}
+          TF_VAR_base_image_name: ${{ matrix.base_image_name }}

       - name: Get server provisioning failure messages
         id: provision_failure
@@ -70,9 +86,6 @@ jobs:
           . environments/${{ matrix.cloud }}/activate
           cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
           echo "::set-output name=messages::$(../../skeleton/\{\{cookiecutter.environment\}\}/terraform/getfaults.py $PWD)"
-        env:
-          OS_CLOUD: openstack
-          TF_VAR_cluster_name: ci${{ github.run_id }}
         if: always() && steps.provision.outcome == 'failure'

       - name: Delete infrastructure if failed due to lack of hosts
@@ -81,9 +94,6 @@ jobs:
           . environments/${{ matrix.cloud }}/activate
           cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
           terraform destroy -auto-approve
-        env:
-          OS_CLOUD: openstack
-          TF_VAR_cluster_name: ci${{ github.run_id }}
         if: ${{ always() && steps.provision.outcome == 'failure' && contains(steps.provision_failure.outputs.messages, 'not enough hosts available') }}

       - name: Directly configure cluster and build compute, login and control images
@@ -96,9 +106,8 @@ jobs:
           echo test_user_password: "$TEST_USER_PASSWORD" > $APPLIANCES_ENVIRONMENT_ROOT/inventory/group_vars/basic_users/defaults.yml
           ansible-playbook -vv ansible/site.yml
         env:
-          OS_CLOUD: openstack
-          ANSIBLE_FORCE_COLOR: True
           TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}
+          PKR_VAR_source_image_name: ${{ matrix.base_image_name }}

       - name: Confirm Open Ondemand is up (via SOCKS proxy)
         run: |
@@ -138,18 +147,12 @@ jobs:
           . environments/${{ matrix.cloud }}/activate
           ansible all -m wait_for_connection
           ansible-playbook -vv ansible/ci/test_reimage.yml
-        env:
-          OS_CLOUD: openstack
-          ANSIBLE_FORCE_COLOR: True
-
+
       - name: Run MPI-based tests
         run: |
           . venv/bin/activate
           . environments/${{ matrix.cloud }}/activate
           ansible-playbook -vv ansible/adhoc/hpctests.yml
-        env:
-          ANSIBLE_FORCE_COLOR: True
-          OS_CLOUD: openstack

       - name: Delete infrastructure
         run: |
@@ -158,15 +161,11 @@ jobs:
           cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
           terraform destroy -auto-approve
         env:
-          OS_CLOUD: openstack
-          TF_VAR_cluster_name: ci${{ github.run_id }}
+          TF_VAR_base_image_name: ${{ matrix.base_image_name }}
         if: ${{ success() || cancelled() }}

       - name: Delete images
         run: |
           . venv/bin/activate
           . environments/${{ matrix.cloud }}/activate
           ansible-playbook -vv ansible/ci/delete_images.yml
-        env:
-          OS_CLOUD: openstack
-          ANSIBLE_FORCE_COLOR: True
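
The workflow change above derives the Terraform and Packer inputs for each job from its matrix entry. A minimal shell sketch of that mapping follows, assuming only what the diff shows: the cluster name is `ci` plus the run ID plus `cluster_name_suffix`, and the same `base_image_name` feeds both `TF_VAR_base_image_name` and `PKR_VAR_source_image_name`. The function `matrix_env` and the run ID `1234` are hypothetical and exist only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors how the workflow derives per-job variables from a matrix entry.
# matrix_env is a hypothetical helper, not part of the repository.
matrix_env() {
  local run_id="$1" suffix="$2" image="$3"
  printf 'TF_VAR_cluster_name=ci%s%s\n' "$run_id" "$suffix"
  printf 'TF_VAR_base_image_name=%s\n' "$image"
  printf 'PKR_VAR_source_image_name=%s\n' "$image"
}

# The run ID 1234 is made up; suffixes and image names come from the matrix in the diff.
matrix_env 1234 ''   openhpc-220510-1910.qcow2
matrix_env 1234 ofed openhpc-220510-1911-ofed.qcow2
```

Because the two `arcus` entries share a cloud but differ in suffix, their cluster names differ, which is presumably why `concurrency` now keys on `matrix.name` rather than `matrix.cloud`.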
37 changes: 1 addition & 36 deletions ansible/bootstrap.yml
@@ -52,41 +52,6 @@
     - import_role:
         name: fail2ban

-- hosts: update
-  gather_facts: false
-  become: yes
-  tags:
-    - update
-  tasks:
-    - block:
-        - name: Update selected packages
-          yum:
-            name: "{{ update_name }}"
-            state: "{{ update_state }}"
-            exclude: "{{ update_exclude }}"
-            disablerepo: "{{ update_disablerepo }}"
-          async: "{{ 30 * 60 }}" # wait for up to 30 minutes
-          poll: 15 # check every 15 seconds
-          register: updates
-        - debug:
-            var: updates
-        - name: Ensure update log directory on localhost exists
-          file:
-            path: "{{ update_log_path | dirname }}"
-            state: directory
-          become: false
-          delegate_to: localhost
-          run_once: true
-        - name: Log updated packages
-          copy:
-            content: "{{ updates.results | join('\n') }}"
-            dest: "{{ update_log_path }}"
-          delegate_to: localhost
-          become: no
-        - debug:
-            msg: "{{ updates.results | length }} changes to packages - see {{ update_log_path }} for details"
-      when: "update_enable | default('false') | bool"
-
 - hosts:
     - selinux
     - update
@@ -110,4 +75,4 @@
         sleep: 15
     - name: update facts
       setup:
-      when: (sestatus.changed | default(false)) or (sestatus.reboot_required | default(false))
+      when: (sestatus.changed | default(false)) or (sestatus.reboot_required | default(false))
15 changes: 15 additions & 0 deletions environments/arcus/README.md
@@ -0,0 +1,15 @@
+Provides a CI environment on Arcus in the `rcp-cloud-portal-demo` project. Uses "direct"-mode VNICs.
+
+This can be deployed manually on Arcus but requires the following to be defined:
+- Terraform variables `cluster_name` and `base_image_name`, e.g.:
+
+```
+environments/arcus/terraform/terraform.tfvars:
+cluster_name = "ofed"
+base_image_name = "openhpc-220510-1911-ofed.qcow2"
+```
+- Ansible variable `vault_testuser_password` defining the password for `testuser` (used for accessing Open OnDemand), e.g.:
+```
+environments/arcus/inventory/group_vars/all/dev.yml:
+vault_testuser_password: somesecretstring
+```
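
For a manual deployment, the two Terraform variables named in the README can be placed in a `terraform.tfvars` file, which Terraform loads automatically from the working directory. The sketch below writes such a file; `write_tfvars` is a hypothetical helper, and the demo call targets a temporary directory rather than a real environment.

```shell
#!/usr/bin/env bash
# Sketch only: write the two required Terraform variables into terraform.tfvars.
# write_tfvars is a hypothetical helper, not part of the repository.
write_tfvars() {
  local env_dir="$1" cluster_name="$2" base_image_name="$3"
  mkdir -p "${env_dir}/terraform"
  printf 'cluster_name = "%s"\nbase_image_name = "%s"\n' \
    "$cluster_name" "$base_image_name" > "${env_dir}/terraform/terraform.tfvars"
}

# Demo against a throwaway directory, using the values from the README example.
demo_dir="$(mktemp -d)"
write_tfvars "$demo_dir" ofed openhpc-220510-1911-ofed.qcow2
cat "${demo_dir}/terraform/terraform.tfvars"
```

For a real deployment the first argument would be the environment directory, e.g. `environments/arcus`. Exporting `TF_VAR_cluster_name` and `TF_VAR_base_image_name` instead has the same effect, and is what the CI workflow does.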
1 change: 0 additions & 1 deletion environments/arcus/builder.pkrvars.hcl
@@ -1,6 +1,5 @@
 flavor = "vm.alaska.cpu.general.small"
 networks = ["a262aabd-e6bf-4440-a155-13dbc1b5db0e"] # WCDC-iLab-60
-source_image_name = "openhpc-220413-1545.qcow2"
 ssh_keypair_name = "slurm-app-ci"
 security_groups = ["default", "SSH"]
 ssh_bastion_host = "128.232.222.183"
11 changes: 8 additions & 3 deletions environments/arcus/terraform/main.tf
@@ -8,6 +8,11 @@ variable "cluster_name" {
     description = "Name for cluster, used as prefix for resources - set by environment var in CI"
 }

+variable "base_image_name" {
+    type = string
+    description = "Name of base image for all nodes - set by environment var in CI"
+}
+
 module "cluster" {
     source = "../../skeleton/{{cookiecutter.environment}}/terraform/"

@@ -18,18 +23,18 @@ module "cluster" {
     key_pair = "slurm-app-ci"
     control_node = {
         flavor: "vm.alaska.cpu.general.small"
-        image: "openhpc-220504-0904.qcow2"
+        image: var.base_image_name
     }
     login_nodes = {
         login-0: {
             flavor: "vm.alaska.cpu.general.small"
-            image: "openhpc-220504-0904.qcow2"
+            image: var.base_image_name
         }
     }
     compute_types = {
         small: {
             flavor: "vm.alaska.cpu.general.small"
-            image: "openhpc-220504-0904.qcow2"
+            image: var.base_image_name
         }
     }
     compute_nodes = {
@@ -5,7 +5,6 @@
 openhpc_slurm_service_started: false
 nfs_client_mnt_state: present
 openhpc_rebuild_reconfigure: false
-update_enable: true
 block_devices_partition_state: skip
 block_devices_filesystem_state: skip
 block_devices_mount_state: present
@@ -3,16 +3,16 @@ ansible_user=rocky
 openhpc_cluster_name=${cluster_name}

 [control]
-${control.name} ansible_host=${[for n in control.network: n.fixed_ip_v4 if n.access_network][0]} server_networks='${jsonencode({for net in control.network: net.name => [ net.fixed_ip_v4 ] })}'
+${control.name} ansible_host=${[for n in control.network: n.fixed_ip_v4 if n.access_network][0]} instance_id=${control.id} server_networks='${jsonencode({for net in control.network: net.name => [ net.fixed_ip_v4 ] })}'

 [login]
 %{ for login in logins ~}
-${login.name} ansible_host=${[for n in login.network: n.fixed_ip_v4 if n.access_network][0]} server_networks='${jsonencode({for net in login.network: net.name => [ net.fixed_ip_v4 ] })}'
+${login.name} ansible_host=${[for n in login.network: n.fixed_ip_v4 if n.access_network][0]} instance_id=${login.id} server_networks='${jsonencode({for net in login.network: net.name => [ net.fixed_ip_v4 ] })}'
 %{ endfor ~}

 [compute]
 %{ for compute in computes ~}
-${compute.name} ansible_host=${[for n in compute.network: n.fixed_ip_v4 if n.access_network][0]} server_networks='${jsonencode({for net in compute.network: net.name => [ net.fixed_ip_v4 ] })}'
+${compute.name} ansible_host=${[for n in compute.network: n.fixed_ip_v4 if n.access_network][0]} instance_id=${compute.id} server_networks='${jsonencode({for net in compute.network: net.name => [ net.fixed_ip_v4 ] })}'
 %{ endfor ~}

 # Define groups for slurm partitions:
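
With this template change each rendered inventory line carries an `instance_id` host variable alongside `ansible_host`. As a rough illustration of what that makes available, the sketch below pulls the ID back out of a rendered line. The helper `get_instance_id` is hypothetical, and the host name, IP address and ID in the sample line are invented for the example, not taken from a real deployment.

```shell
#!/usr/bin/env bash
# Sketch only: extract the instance_id host variable from a rendered INI inventory line.
# get_instance_id is a hypothetical helper; it reads lines on stdin.
get_instance_id() {
  sed -n 's/.*[[:space:]]instance_id=\([^[:space:]]*\).*/\1/p'
}

# Made-up sample line in the shape the template renders.
sample="ci1234-control ansible_host=10.0.0.5 instance_id=0a1b2c3d server_networks='{\"net\":[\"10.0.0.5\"]}'"
printf '%s\n' "$sample" | get_instance_id
```

Within Ansible itself no parsing is needed, since the value is exposed as the `instance_id` variable of each host.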
@@ -36,7 +36,6 @@ resource "openstack_networking_port_v2" "nonlogin" {
   }
 }

-
 resource "openstack_compute_instance_v2" "control" {

   name = "${var.cluster_name}-control"
15 changes: 15 additions & 0 deletions environments/smslabs/README.md
@@ -0,0 +1,15 @@
+Provides a CI environment on SMS-labs in the `stackhpc-ci` project. Uses "normal"-mode VNICs.
+
+This can be deployed manually on SMS-labs but requires the following to be defined:
+- Terraform variables `cluster_name` and `base_image_name`, e.g.:
+
+```
+environments/smslabs/terraform/terraform.tfvars:
+cluster_name = "ofed"
+base_image_name = "openhpc-220510-1911-ofed.qcow2"
+```
+- Ansible variable `vault_testuser_password` defining the password for `testuser` (used for accessing Open OnDemand), e.g.:
+```
+environments/smslabs/inventory/group_vars/all/dev.yml:
+vault_testuser_password: somesecretstring
+```
3 changes: 1 addition & 2 deletions environments/smslabs/builder.pkrvars.hcl
@@ -1,6 +1,5 @@
-flavor = "general.v1.tiny"
+flavor = "general.v1.small"
 networks = ["c245901d-6b84-4dc4-b02b-eec0fb6122b2"] # stackhpc-ci-geneve
-source_image_name = "openhpc-220413-1545.qcow2"
 ssh_keypair_name = "slurm-app-ci"
 security_groups = ["default", "SSH"]
 ssh_bastion_host = "185.45.78.150"
17 changes: 11 additions & 6 deletions environments/smslabs/terraform/main.tf
@@ -8,6 +8,11 @@ variable "cluster_name" {
     description = "Name for cluster, used as prefix for resources - set by environment var in CI"
 }

+variable "base_image_name" {
+    type = string
+    description = "Name of base image for all nodes - set by environment var in CI"
+}
+
 module "cluster" {
     source = "../../skeleton/{{cookiecutter.environment}}/terraform/"

@@ -20,19 +25,19 @@ module "cluster" {
     ]
     key_pair = "slurm-app-ci"
     control_node = {
-        flavor: "general.v1.tiny"
-        image: "openhpc-220504-0904.qcow2"
+        flavor: "general.v1.small"
+        image: var.base_image_name
     }
     login_nodes = {
         login-0: {
-            flavor: "general.v1.tiny"
-            image: "openhpc-220504-0904.qcow2"
+            flavor: "general.v1.small"
+            image: var.base_image_name
         }
     }
     compute_types = {
         small: {
-            flavor: "general.v1.tiny"
-            image: "openhpc-220504-0904.qcow2"
+            flavor: "general.v1.small"
+            image: var.base_image_name
         }
     }
     compute_nodes = {