
Commit 00950d5

Merge branch 'main' into block_device_ids
2 parents: 8790ce2 + 4352108

94 files changed: +1760 −437 lines


.github/workflows/smslabs.yml

Lines changed: 139 additions & 0 deletions (new file)

```yaml
name: Test on SMS-Labs OpenStack in stackhpc-ci
on:
  push:
    branches:
      - main
  pull_request:
concurrency: stackhpc-ci # openstack project
jobs:
  openstack-example:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2

      - name: Setup ssh
        run: |
          set -x
          mkdir ~/.ssh
          echo "$SSH_KEY" > ~/.ssh/id_rsa
          chmod 0600 ~/.ssh/id_rsa
        env:
          SSH_KEY: ${{ secrets.SSH_KEY }}

      - name: Add bastion's ssh key to known_hosts
        run: cat environments/smslabs/bastion_fingerprint >> ~/.ssh/known_hosts
        shell: bash

      - name: Install ansible etc
        run: dev/setup-env.sh

      - name: Install terraform
        uses: hashicorp/setup-terraform@v1

      - name: Initialise terraform
        run: terraform init
        working-directory: ${{ github.workspace }}/environments/smslabs/terraform

      - name: Write clouds.yaml
        run: |
          mkdir -p ~/.config/openstack/
          echo "$CLOUDS_YAML" > ~/.config/openstack/clouds.yaml
        shell: bash
        env:
          CLOUDS_YAML: ${{ secrets.CLOUDS_YAML }}

      - name: Provision infrastructure
        id: provision
        run: |
          . venv/bin/activate
          . environments/smslabs/activate
          cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
          terraform apply -auto-approve
        env:
          OS_CLOUD: openstack
          TF_VAR_cluster_name: ci${{ github.run_id }}

      - name: Get server provisioning failure messages
        id: provision_failure
        run: |
          . venv/bin/activate
          . environments/smslabs/activate
          cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
          echo "::set-output name=messages::$(./getfaults.py)"
        env:
          OS_CLOUD: openstack
          TF_VAR_cluster_name: ci${{ github.run_id }}
        if: always() && steps.provision.outcome == 'failure'

      - name: Delete infrastructure if failed due to lack of hosts
        run: |
          . venv/bin/activate
          . environments/smslabs/activate
          cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
          terraform destroy -auto-approve
        env:
          OS_CLOUD: openstack
          TF_VAR_cluster_name: ci${{ github.run_id }}
        if: ${{ always() && steps.provision.outcome == 'failure' && contains('not enough hosts available', steps.provision_failure.messages) }}

      - name: Configure infrastructure
        run: |
          . venv/bin/activate
          . environments/smslabs/activate
          ansible all -m wait_for_connection
          ansible-playbook ansible/adhoc/generate-passwords.yml
          ansible-playbook -vv ansible/site.yml
        env:
          ANSIBLE_FORCE_COLOR: True
          TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}

      - name: Run MPI-based tests
        run: |
          . venv/bin/activate
          . environments/smslabs/activate
          ansible-playbook -vv ansible/adhoc/hpctests.yml
        env:
          ANSIBLE_FORCE_COLOR: True
          TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}

      - name: Build control and compute images
        run: |
          . venv/bin/activate
          . environments/smslabs/activate
          cd packer
          PACKER_LOG=1 PACKER_LOG_PATH=build.log packer build -var-file=$PKR_VAR_environment_root/builder.pkrvars.hcl openstack.pkr.hcl
        env:
          OS_CLOUD: openstack
          TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}

      - name: Reimage compute nodes via slurm and check cluster still up
        run: |
          . venv/bin/activate
          . environments/smslabs/activate
          ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/ci/reimage-compute.yml
          ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/hooks/post.yml
        env:
          OS_CLOUD: openstack
          TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}

      - name: Reimage login nodes via openstack and check cluster still up
        run: |
          . venv/bin/activate
          . environments/smslabs/activate
          ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/ci/reimage-login.yml
          ansible-playbook -vv $APPLIANCES_ENVIRONMENT_ROOT/hooks/post.yml
        env:
          OS_CLOUD: openstack
          TEST_USER_PASSWORD: ${{ secrets.TEST_USER_PASSWORD }}

      - name: Delete infrastructure
        run: |
          . venv/bin/activate
          . environments/smslabs/activate
          cd $APPLIANCES_ENVIRONMENT_ROOT/terraform
          terraform destroy -auto-approve
        env:
          OS_CLOUD: openstack
          TF_VAR_cluster_name: ci${{ github.run_id }}
        if: ${{ success() || cancelled() }}
```
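The `provision_failure` step shells out to `./getfaults.py` and publishes its stdout as a step output, which the conditional cleanup step then matches with `contains('not enough hosts available', ...)`. The script itself is not shown in this diff; the following is a hypothetical sketch of the kind of fault-collection logic it might contain, with all names and data shapes assumed rather than taken from the real script:

```python
# Hypothetical sketch of fault collection for failed OpenStack servers.
# Server records are shaped like Nova API / openstacksdk responses; the
# real getfaults.py may work quite differently.

def collect_fault_messages(servers):
    """Return the fault messages of servers that failed to build."""
    messages = []
    for server in servers:
        if server.get("status") == "ERROR":
            # Nova attaches a 'fault' dict with a human-readable message
            fault = server.get("fault") or {}
            message = fault.get("message")
            if message:
                messages.append(message)
    return messages

if __name__ == "__main__":
    servers = [
        {"name": "ci1-compute-0", "status": "ACTIVE"},
        {"name": "ci1-compute-1", "status": "ERROR",
         "fault": {"message": "No valid host was found. There are not enough hosts available."}},
    ]
    print(collect_fault_messages(servers))
```

Printing the joined messages to stdout is what lets the workflow's `contains()` expression decide whether the failure was capacity-related and safe to retry after a destroy.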

.github/workflows/vagrant.yml

Lines changed: 0 additions & 32 deletions
This file was deleted.

README.md

Lines changed: 5 additions & 3 deletions

```diff
@@ -1,7 +1,9 @@
+[![Test on OpenStack via smslabs](https://github.com/stackhpc/ansible-slurm-appliance/actions/workflows/smslabs.yml/badge.svg)](https://github.com/stackhpc/ansible-slurm-appliance/actions/workflows/smslabs.yml)
+
 # StackHPC Slurm Appliance
 
 This repository contains playbooks and configuration to define a Slurm-based HPC environment including:
-- A Centos 8 and OpenHPC v2-based Slurm cluster.
+- A Rocky Linux 8 and OpenHPC v2-based Slurm cluster.
 - Shared fileystem(s) using NFS (with servers within or external to the cluster).
 - Slurm accounting using a MySQL backend.
 - A monitoring backend using Prometheus and ElasticSearch.
@@ -16,15 +18,15 @@ While it is tested on OpenStack it should work on any cloud, except for node reb
 ## Prerequisites
 It is recommended to check the following before starting:
 - You have root access on the "ansible deploy host" which will be used to deploy the appliance.
-- You can create instances using a CentOS 8 GenericCloud image (or an image based on that).
+- You can create instances using a Rocky 8 GenericCloud image (or an image based on that).
 - SSH keys get correctly injected into instances.
 - Instances have access to internet (note proxies can be setup through the appliance if necessary).
 - DNS works (if not this can be partially worked around but additional configuration will be required).
 - Created instances have accurate/synchronised time (for VM instances this is usually provided by the hypervisor; if not or for bare metal instances it may be necessary to configure a time service via the appliance).
 
 ## Installation on deployment host
 
-These instructions assume the deployment host is running Centos 8:
+These instructions assume the deployment host is running Centos/Rocky 8:
 
     sudo yum install -y git python3
     git clone https://github.com/stackhpc/ansible-slurm-appliance
```

ansible/.gitignore

Lines changed: 2 additions & 2 deletions

```diff
@@ -8,8 +8,6 @@ roles/*
 !roles/opendistro/**
 !roles/podman/
 !roles/podman/**
-!roles/kibana/
-!roles/kibana/**
 !roles/grafana-dashboards/
 !roles/grafana-dashboards/**
 !roles/grafana-datasources/
@@ -22,3 +20,5 @@ roles/*
 !roles/block_devices/**
 !roles/basic_users/
 !roles/basic_users/**
+!roles/openondemand/
+!roles/openondemand/**
```

ansible/bootstrap.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -49,6 +49,8 @@
       state: "{{ update_state }}"
       exclude: "{{ update_exclude }}"
       disablerepo: "{{ update_disablerepo }}"
+    async: "{{ 30 * 60 }}" # wait for up to 30 minutes
+    poll: 15 # check every 15 seconds
     register: updates
 - debug:
     var: updates
```

ansible/filter_plugins/utils.py

Lines changed: 1 addition & 13 deletions

```diff
@@ -9,22 +9,10 @@
 from ansible.module_utils.six import string_types
 import os.path
 
-def _get_hostvar(context, var_name, inventory_hostname=None):
-    if inventory_hostname is None:
-        namespace = context
-    else:
-        if inventory_hostname not in context['hostvars']:
-            raise AnsibleFilterError(
-                "Inventory hostname '%s' not in hostvars" % inventory_hostname)
-        namespace = context["hostvars"][inventory_hostname]
-    return namespace.get(var_name)
-
-@jinja2.contextfilter
-def prometheus_node_exporter_targets(context, hosts):
+def prometheus_node_exporter_targets(hosts, env):
     result = []
     per_env = defaultdict(list)
     for host in hosts:
-        env = _get_hostvar(context, "env", host) or "ungrouped"
         per_env[env].append(host)
     for env, hosts in per_env.items():
         target = {
```
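The refactor removes the Jinja2 context lookup entirely: instead of digging each host's `env` out of `hostvars`, the caller now passes the environment name in directly. A minimal standalone sketch of the simplified filter follows; the diff cuts off before the `target` dict body, so the target/label keys below are assumed to follow the Prometheus `file_sd` format and may differ from the real role:

```python
from collections import defaultdict

def prometheus_node_exporter_targets(hosts, env):
    """Group hosts into Prometheus-style scrape target groups labelled by env.

    Sketch based on the diff; target dict layout is an assumption
    (Prometheus file_sd convention), not copied from the repository.
    """
    result = []
    per_env = defaultdict(list)
    for host in hosts:
        per_env[env].append(host)
    for env_name, env_hosts in per_env.items():
        target = {
            # node_exporter listens on port 9100 by default
            "targets": ["%s:9100" % h for h in env_hosts],
            "labels": {"env": env_name},
        }
        result.append(target)
    return result

print(prometheus_node_exporter_targets(["login-0", "compute-0"], "production"))
```

With a single `env` argument the grouping is now trivial, but keeping the `per_env` structure preserves the output shape the templates already consume.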

ansible/iam.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -1,5 +1,7 @@
 - hosts: basic_users
   become: yes
+  tags:
+    - basic_users
   gather_facts: yes
   tasks:
     - import_role:
```

ansible/monitoring.yml

Lines changed: 11 additions & 19 deletions

```diff
@@ -29,25 +29,6 @@
         tasks_from: deploy.yml
       tags: deploy
 
-- name: Setup kibana
-  hosts: kibana
-  tags: kibana
-  tasks:
-    - import_role:
-        name: kibana
-        tasks_from: config.yml
-      tags: config
-
-    - import_role:
-        name: kibana
-        tasks_from: deploy.yml
-      tags: deploy
-
-    - import_role:
-        name: kibana
-        tasks_from: post.yml
-      tags: post
-
 - name: Setup slurm stats
   hosts: slurm_stats
   tags: slurm_stats
@@ -80,6 +61,17 @@
   tasks:
     - import_role: name=cloudalchemy.node_exporter
 
+- name: Deploy OpenOndemand exporter
+  hosts: openondemand
+  become: true
+  tags:
+    - openondemand
+    - openondemand_server
+  tasks:
+    - import_role:
+        name: openondemand
+        tasks_from: exporter.yml
+
 - name: Setup core monitoring software
   hosts: prometheus
   tags: prometheus
```

ansible/portal.yml

Lines changed: 32 additions & 0 deletions (new file)

```yaml
- hosts: openondemand
  tags:
    - openondemand
    - openondemand_server
  become: yes
  gather_facts: yes # TODO
  tasks:
    - import_role:
        name: openondemand
        tasks_from: main.yml

- hosts: openondemand_desktop
  tags:
    - openondemand
    - openondemand_desktop
  become: yes
  gather_facts: yes
  tasks:
    - import_role:
        name: openondemand
        tasks_from: vnc_compute.yml

- hosts: openondemand_jupyter
  tags:
    - openondemand
    - openondemand_jupyter
  become: yes
  gather_facts: yes
  tasks:
    - import_role:
        name: openondemand
        tasks_from: jupyter_compute.yml
```

ansible/roles/basic_users/README.md

Lines changed: 3 additions & 1 deletion

```diff
@@ -16,10 +16,12 @@ Requirements
 Role Variables
 --------------
 
-`basic_users_users`: Required. A list of mappings defining information for each user. In general, mapping keys/values are the parameters to [ansible.builtin.user](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/user_module.html) and default values are as given there. However:
+`basic_users_users`: Required. A list of mappings defining information for each user. In general, mapping keys/values are passed through as parameters to [ansible.builtin.user](https://docs.ansible.com/ansible/latest/collections/ansible/builtin/user_module.html) and default values are as given there. However:
 - `create_home`, `generate_ssh_key` and `ssh_key_comment` are set automatically and should not be overriden.
+- `uid` should be set, so that the UID/GID is consistent across the cluster (which Slurm requires).
 - `shell` may be set if required, but will be overriden with `/sbin/nologin` on `control` nodes to prevent user login.
 - An additional key `public_key` may optionally be specified to define a key to log into the cluster.
+- Any other keys may present for other purposes (i.e. not used by this role).
 
 Dependencies
 ------------
```

ansible/roles/basic_users/filter_plugins/filter_keys.py

Lines changed: 13 additions & 8 deletions

```diff
@@ -2,18 +2,23 @@
 
 import copy
 
+USER_MODULE_PARAMS = ('append authorization comment create_home createhome expires force generate_ssh_key group '
+                      'groups hidden home local login_class move_home name user non_unique password password_expire_min '
+                      'password_expire_max password_lock profile remove role seuser shell skeleton ssh_key_bits '
+                      'ssh_key_comment ssh_key_file ssh_key_passphrase ssh_key_type state system uid update_password').split()
 
 class FilterModule(object):
 
     def filters(self):
         return {
-            'filter_keys': self.filter_keys
+            'filter_user_params': self.filter_user_params
         }
 
-    def filter_keys(self, orig_dict, keys_to_remove):
-        ''' Return a copy of `orig_dict` without the keys in the list `keys_to_remove`'''
-        dict_to_return = copy.deepcopy(orig_dict)
-        for item in keys_to_remove:
-            if item in dict_to_return:
-                del dict_to_return[item]
-        return dict_to_return
+    def filter_user_params(self, d):
+        ''' Return a copy of dict `d` containing only keys which are parameters for the user module'''
+
+        user_dict = copy.deepcopy(d)
+        remove_keys = set(user_dict).difference(USER_MODULE_PARAMS)
+        for key in remove_keys:
+            del user_dict[key]
+        return user_dict
```
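The change inverts the filter from a denylist (`filter_keys(['public_key'])`) to an allowlist of `ansible.builtin.user` parameters, so any extra per-user keys survive in the inventory without breaking the `user` task. The effect can be exercised standalone; this sketch reproduces the filter logic with an abbreviated parameter list (the full list is in the diff above) and shows `public_key` being stripped while real module parameters survive:

```python
import copy

# Abbreviated subset of ansible.builtin.user parameter names, for illustration.
USER_MODULE_PARAMS = ('append comment create_home generate_ssh_key group groups home name '
                      'password shell ssh_key_comment state system uid update_password').split()

def filter_user_params(d):
    """Return a copy of dict `d` containing only keys which are
    parameters for the user module (same logic as the role's filter plugin)."""
    user_dict = copy.deepcopy(d)
    for key in set(user_dict).difference(USER_MODULE_PARAMS):
        del user_dict[key]
    return user_dict

spec = {"name": "alice", "uid": 1005, "public_key": "ssh-rsa AAAA...", "shell": "/bin/bash"}
print(filter_user_params(spec))
```

The `deepcopy` keeps the original user mapping intact, so other tasks in the role can still read the stripped keys (e.g. `public_key`) afterwards.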

ansible/roles/basic_users/tasks/main.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -9,7 +9,7 @@
       - "item.state | default('present') == 'absent'"
 
 - name: Create users and generate public keys
-  user: "{{ basic_users_userdefaults | combine(item) | filter_keys(['public_key']) }}"
+  user: "{{ basic_users_userdefaults | combine(item) | filter_user_params() }}"
   loop: "{{ basic_users_users }}"
   loop_control:
     label: "{{ item.name }} [{{ item.state | default('present') }}]"
```
Lines changed: 2 additions & 3 deletions

```diff
@@ -1,5 +1,4 @@
 ---
 
-# You must define this variable.
-#filebeat_config_path: undefined
-filebeat_podman_user: "{{ ansible_user }}"
+#filebeat_config_path: undefined # REQUIRED. Path to filebeat.yml configuration file template
+filebeat_podman_user: "{{ ansible_user }}" # User that runs the filebeat container
```

0 commit comments