# Vagrant-Example cluster

-Provisions an environment using vagrant
+Provisions an environment using Vagrant - this is also used by GitLab CI.

-# Directory structure
+This README is supplementary to the main README at `<repo_root>/README.md`, so only differences and additional information are noted here. Paths are relative to this environment unless otherwise noted.

-## terraform
+## Pre-requisites
+No additional comments.

-Contains terraform configuration to deploy infrastructure.
+## Installation on deployment host
+See the main README, then additionally install Vagrant and a provider. For CentOS 8, you can install Vagrant + VirtualBox using:

-## inventory
+    sudo dnf install https://releases.hashicorp.com/vagrant/2.2.6/vagrant_2.2.6_x86_64.rpm
+    sudo dnf config-manager --add-repo=https://download.virtualbox.org/virtualbox/rpm/el/virtualbox.repo
+    sudo yum install VirtualBox-6.0

-Ansible inventory for configuring the infrastructure.
+(Note that each Vagrant version only supports a subset of VirtualBox releases.)

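To check which versions are actually installed (so you can confirm they are a supported combination), you can run e.g.:

    vagrant --version
    VBoxManage --version
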
-# Setup
+## Overview of directory structure
+See the main README, plus:
+- The vagrant configuration is contained in the `vagrant/` directory.
+- Scripts are provided in the `<repo_root>/dev/` directory to provision and configure the environment.

-In the repo root, run:
+## Creating a Slurm appliance

-    python3 -m venv venv  # TODO: do we need system-site-packages?
-    . venv/bin/activate
-    pip install --upgrade pip
-    pip install -r requirements.txt
-    ansible-galaxy install -r requirements.yml -p ansible/roles
-    ansible-galaxy collection install -r requirements.yml -p ansible/collections  # don't worry about the collections path warning
-
-# Activating the environment
-
-There is a small environment file that you must `source` which defines environment
-variables that reference the configuration path. This is so that we can locate
-resources relative to the environment directory.
-
-    . environments/vagrant-example/activate
-
-The pattern we use is that all resources referenced in the inventory
-are located in the environment directory containing the inventory that
-references them.
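
As an illustration only (the authoritative contents are in `environments/vagrant-example/activate` itself), the activate script essentially exports variables locating this environment, along the lines of:

    # hypothetical sketch - see the real activate file for its definitive contents
    export APPLIANCES_ENVIRONMENT_ROOT="$(dirname "$(realpath "${BASH_SOURCE[0]}")")"
    echo "Activated environment at $APPLIANCES_ENVIRONMENT_ROOT"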
-
-# Common configuration
-
-Configuration is shared by specifying multiple inventories. We reference the `common`
-inventory from `ansible.cfg`, including it before the environment-specific
-inventory, located at `./inventory`.
-
-Inventories specified later in the list can override values set in the inventories
-that appear earlier. This allows you to override values set by the `common` inventory.
-
-Any variables that would be identical for all environments should be defined in the `common` inventory.
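
As a minimal sketch of this mechanism (the paths here are illustrative - check the repo's actual `ansible.cfg`), the inventory setting looks something like:

    [defaults]
    # the common inventory is listed first, so the environment-specific inventory can override it
    inventory = ../common/inventory,./inventory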
-
-# Passwords
-
-Prior to running any other playbooks, you need to define a set of passwords. You can
-use the `generate-passwords.yml` playbook to automate this process:
-
-```
-cd <repo root>
-ansible-playbook ansible/adhoc/generate-passwords.yml  # can actually be run from anywhere once the environment is activated
-```
-
-This will output a set of passwords to `inventory/group_vars/all/secrets.yml`.
-Placing them in the inventory means that they will be defined for all playbooks.
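
Purely to illustrate the shape of the generated file (these key names are hypothetical, not the ones the playbook actually emits), `secrets.yml` is a flat set of YAML variables:

    example_service_password: "NcwES9k2..."
    example_db_password: "qL7pXw4t..."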
-
-It is recommended to encrypt the contents of this file prior to committing it to git:
-
-```
-ansible-vault encrypt inventory/group_vars/all/secrets.yml
-```
-
-You will then need to provide a password when running the playbooks, e.g.:
-
-```
-ansible-playbook ../ansible/site.yml --tags grafana --ask-vault-password
-```
-
-See the [Ansible vault documentation](https://docs.ansible.com/ansible/latest/user_guide/vault.html) for more details.
-
-
-# Deploy nodes with Terraform
-
-- Modify the keypair in `main.tf` and ensure the required CentOS images are available on OpenStack.
-- Activate the virtualenv and create the instances:
-
-    . venv/bin/activate
-    cd environments/vagrant-example/
-    terraform apply
-
-This creates an Ansible inventory file, `./inventory`.
-
-Note that this Terraform configuration deploys instances onto an existing network - for production use you probably want to create a network for the cluster.
-
-# Create and configure cluster with Ansible
-
-Now run one or more playbooks using:
+To provision and configure the appliance in the same way as the CI does, use:

    cd <repo root>
-    ansible-playbook ansible/site.yml
-
-This provides:
-- Grafana at `http://<login_ip>:3000` - username `grafana`, password as set above
-- Prometheus at `http://<login_ip>:9090`
+    dev/vagrant-provision-example.sh
+    dev/vagrant-example-configure.sh

-NB: if Grafana's yum repos are down you will see `Errors during downloading metadata for repository 'grafana' ...`. You can work around this using:
+To debug failures, activate the venv and the environment, then switch to the vagrant project directory:

-    ssh centos@<login_ip>
-    sudo rm -rf /etc/yum.repos.d/grafana.repo
-    wget https://dl.grafana.com/oss/release/grafana-7.3.1-1.x86_64.rpm
-    sudo yum install grafana-7.3.1-1.x86_64.rpm
-    exit
-    ansible-playbook -i inventory monitoring.yml -e grafana_password=<password> --skip-tags grafana_install
-
-# rebuild.yml
-
-# FIXME: outdated
-
-Enables the compute nodes of a Slurm-based OpenHPC cluster on OpenStack to be reimaged from Slurm.
-
-For full details, including the Slurm commands to use, see the [role's README](https://github.com/stackhpc/ansible_collection_slurm_openstack_tools/blob/main/roles/rebuild/README.md).
-
-Ensure you have `~/.config/openstack/clouds.yaml` defining authentication for a single OpenStack cloud (see the above README to change this location).
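
For reference, a minimal `clouds.yaml` defining a single cloud looks roughly like the following - every name and value here is a placeholder for your site's actual details:

    clouds:
      mycloud:
        auth:
          auth_url: https://openstack.example.org:5000
          username: myuser
          password: mypassword
          project_name: myproject
          user_domain_name: Default
          project_domain_name: Default
        region_name: RegionOne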
-
-Then run:
-
-    ansible-playbook -i inventory rebuild.yml
-
-Note this does not rebuild the nodes, it only deploys the tools to do so.
-
-# test.yml
-
-This runs MPI-based tests on the cluster:
-- `pingpong`: Runs Intel MPI Benchmark's IMB-MPI1 pingpong between a pair of (scheduler-selected) nodes. Reports zero-size message latency and maximum bandwidth.
-- `pingmatrix`: Runs a similar pingpong test but between all pairs of nodes. Reports zero-size message latency & maximum bandwidth.
-- `hpl-solo`: Runs HPL **separately** on all nodes, using 80% of memory, reporting Gflops on each node.
-
-These names can be used as tags to run only a subset of tests. For full details see the [role's README](https://github.com/stackhpc/ansible_collection_slurm_openstack_tools/blob/main/roles/test/README.md).
-
-Note these are intended as post-deployment tests for a cluster to which you have root access - they are **not** intended for use on a system running production jobs:
-- Test directories are created within `openhpc_tests_rootdir` (here `/mnt/nfs/ohcp-tests`), which must be on a shared filesystem (read/write from login/control and compute nodes).
-- Generally, packages are only installed on the control/login node, and `/opt` is exported via NFS to the compute nodes.
-- The exception is the `slurm-libpmi-ohpc` package (required for `srun` with Intel MPI), which is installed on all nodes.
-
-To achieve best performance for HPL, set `openhpc_tests_hpl_NB` in [test.yml](test.yml) to the appropriate HPL blocksize 'NB' for the compute node processor - for Intel CPUs see [here](https://software.intel.com/content/www/us/en/develop/documentation/mkl-linux-developer-guide/top/intel-math-kernel-library-benchmarks/intel-distribution-for-linpack-benchmark/configuring-parameters.html).
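
For example (the value below is purely illustrative - use the NB recommended for your processor in the Intel guidance linked above):

    openhpc_tests_hpl_NB: 192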
-
-Then run:
-
-    ansible-playbook ../ansible/adhoc/test.yml
-
-Results will be reported in the Ansible stdout - the pingmatrix test also writes an HTML results file onto the Ansible host.
-
-Note that you can still use the `test.yml` playbook even if the terraform/ansible in this repo wasn't used to deploy the cluster, as long as it's running OpenHPC v2. Simply create an appropriate `inventory` file, e.g.:
-
-    [all:vars]
-    ansible_user=centos
-
-    [cluster:children]
-    cluster_login
-    cluster_compute
-
-    [cluster_login]
-    slurm-control
-
-    [cluster_compute]
-    cpu-h21a5-u3-svn2
-    cpu-h21a5-u3-svn4
-    ...
-
-And run the `test.yml` playbook as described above. If you want to run tests only on a group from this inventory, rather than an entire partition, you can
-use `--limit`.
-
-Then run the tests using this file, e.g.:
-
-    ansible-playbook ../ansible/test.yml --limit group-in-inventory
-
-# Destroying the cluster
-
-When finished, run:
+    . venv/bin/activate
+    . environments/vagrant-example/activate
+    cd $APPLIANCES_ENVIRONMENT_ROOT/vagrant

-    terraform destroy --auto-approve
+(See the main README for an explanation of environment activation.) Example vagrant commands are:
+
+    vagrant status                # list VMs
+    vagrant ssh <hostname>        # log in to a VM
+    vagrant destroy --parallel    # destroy all VMs in parallel **without confirmation**
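
When debugging, the standard Vagrant workflow commands also apply (these are generic Vagrant commands, not specific to this repo), e.g.:

    vagrant up                    # create/start all VMs
    vagrant provision <hostname>  # re-run provisioning for one VM
    vagrant halt                  # stop all VMs without destroying them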