===================
Nova Compute Ironic
===================

This section describes the deployment of the OpenStack Nova Compute
Ironic service. The Nova Compute Ironic service is used to integrate
OpenStack Ironic into Nova as a 'hypervisor' driver. The end users of Nova
can then deploy and manage baremetal hardware in a similar way to VMs.

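For example, once the service is deployed, a baremetal server can be
provisioned through the usual Nova workflow. The flavor, image and network
names below are hypothetical and will depend on your deployment:

.. code-block:: console

   $ openstack server create --flavor baremetal-general \
       --image ubuntu-22.04 --network provision-net demo-baremetal
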
High Availability (HA)
======================

The OpenStack Nova Compute service is designed to be installed once on every
hypervisor in an OpenStack deployment. In this configuration, it makes little
sense to run additional service instances. Even if you wanted to, it's not
supported by design. This pattern breaks down with the Ironic baremetal
service, which must run on the OpenStack control plane. It is not feasible
to have a 1:1 mapping of Nova Compute Ironic services to baremetal nodes.

The obvious HA solution is to run multiple instances of Nova Compute Ironic
on the control plane, so that if one fails, the others can take over. However,
due to assumptions long baked into the Nova source code, this is not trivial.
The HA feature provided by the Nova Compute Ironic service has proven to be
unstable, and the direction upstream is to switch to an active/passive
solution [1].

However, challenges still exist with the active/passive solution. Since the
Nova Compute Ironic HA feature is 'always on', one must ensure that only a
single instance (per Ironic conductor group) is ever running. It is not
possible to simply put multiple service instances behind HAProxy and use the
active/passive mode.

Such problems are commonly solved with a technology such as Pacemaker, or in
the modern world, with a container orchestration engine such as Kubernetes.
Kolla Ansible provides neither, because in general it doesn't need to. Its
goal is simplicity.

The interim solution is therefore to run a single Nova Compute Ironic
service. If the service goes down, remedial action must be taken before
Ironic nodes can be managed. In many environments the loss of the Ironic
API for short periods is acceptable, providing that it can be easily
resurrected. The purpose of this document is to facilitate that.

.. note::

   The new sharding mode is not covered here and it is assumed that you are
   not using it. See [1] for further information. This document will be
   updated in the future.

Optimal configuration of Nova Compute Ironic
=============================================

Determine the current configuration for the site. How many Nova Compute
Ironic instances are running on the control plane?

.. code-block:: console

   $ openstack compute service list

Typically you will see either three or one. By default the host will be
marked with a postfix, e.g. ``controller1-ironic``. If you find more than
one, you will need to remove the extra instances by completing the
following section.

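For example, on a deployment with three instances, the output may look
similar to the following (IDs and timestamps are illustrative):

.. code-block:: console

   $ openstack compute service list --service nova-compute
   +--------------------------------------+--------------+--------------------+------+---------+-------+----------------------------+
   | ID                                   | Binary       | Host               | Zone | Status  | State | Updated At                 |
   +--------------------------------------+--------------+--------------------+------+---------+-------+----------------------------+
   | 8a4f0c3d-1234-5678-9abc-def012345678 | nova-compute | controller1-ironic | nova | enabled | up    | 2024-01-01T00:00:00.000000 |
   | 1b2c3d4e-1234-5678-9abc-def012345678 | nova-compute | controller2-ironic | nova | enabled | up    | 2024-01-01T00:00:00.000000 |
   | 5f6a7b8c-1234-5678-9abc-def012345678 | nova-compute | controller3-ironic | nova | enabled | up    | 2024-01-01T00:00:00.000000 |
   +--------------------------------------+--------------+--------------------+------+---------+-------+----------------------------+
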
Moving from multiple Nova Compute Ironic instances to a single instance
------------------------------------------------------------------------

1. Decide where the single instance should run. This should normally be
   one of the three OpenStack control plane hosts. By convention, pick
   the first one, unless you can think of a good reason not to. Once you
   have chosen, set the following variable in ``etc/kayobe/nova.yml``.
   Here we have picked ``controller1``.

   .. code-block:: yaml

      kolla_nova_compute_ironic_host: controller1

2. Ensure that you have organised a maintenance window, during which
   there will be no Ironic operations. You will be breaking the Ironic
   API.

3. Perform a database backup.

   .. code-block:: console

      $ kayobe overcloud database backup -vvv

   Check the output of the command, and locate the backup files.

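   Kolla Ansible stores MariaDB backups in the ``mariadb_backup`` Docker
   volume. Assuming the default Docker data directory, the files can be
   found as follows (the exact path is an assumption and may differ):

   .. code-block:: console

      $ ssh controller1 sudo ls /var/lib/docker/volumes/mariadb_backup/_data
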
4. Identify baremetal nodes associated with the Nova Compute Ironic instances
   that will be removed. You don't need to do anything with these
   specifically; it's just for reference later. For example:

   .. code-block:: console

      $ openstack baremetal node list --long -c "Instance Info" | grep controller3-ironic | wc -l
      61
      $ openstack baremetal node list --long -c "Instance Info" | grep controller2-ironic | wc -l
      35
      $ openstack baremetal node list --long -c "Instance Info" | grep controller1-ironic | wc -l
      55

5. Disable the redundant Nova Compute Ironic services:

   .. code-block:: console

      $ openstack compute service set controller3-ironic nova-compute --disable
      $ openstack compute service set controller2-ironic nova-compute --disable

6. Delete the redundant Nova Compute Ironic services. You will need the
   service ID. For example, where ``foo`` matches the host of the service
   to be removed:

   .. code-block:: console

      $ ID=$(openstack compute service list | grep foo | awk '{print $2}')
      $ openstack compute service delete --os-compute-api-version 2.53 $ID

   In older releases, you may hit a bug where the service can't be deleted
   if it is not managing any instances. In this case just move on and leave
   the service disabled, e.g.

   .. code-block:: console

      $ openstack compute service delete --os-compute-api-version 2.53 c993b57e-f60c-4652-8328-5fb0e17c99c0
      Failed to delete compute service with ID 'c993b57e-f60c-4652-8328-5fb0e17c99c0': HttpException: 500: Server Error for url:
      https://acme.pl-2.internal.hpc.is:8774/v2.1/os-services/c993b57e-f60c-4652-8328-5fb0e17c99c0, Unexpected API Error.
      Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.

7. Remove the Docker containers for the redundant Nova Compute Ironic
   services:

   .. code-block:: console

      $ ssh controller2 sudo docker rm -f nova_compute_ironic
      $ ssh controller3 sudo docker rm -f nova_compute_ironic

8. Ensure that all Ironic nodes are using the single remaining Nova Compute
   Ironic instance. Note that baremetal nodes in use by compute instances
   will not automatically fail over to the remaining Nova Compute Ironic
   service. Here, the active service is running on ``controller1``:

   .. code-block:: console

      $ ssh controller1
      $ sudo docker exec -it mariadb mysql -u nova -p$(sudo grep 'mysql+pymysql://nova:' /etc/kolla/nova-api/nova.conf | awk -F'[:,@]' '{print $3}')
      MariaDB [(none)]> use nova;

   Proceed with caution. It is good practice to update one record first:

   .. code-block:: console

      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller3-ironic' limit 1;
      Query OK, 1 row affected (0.002 sec)
      Rows matched: 1  Changed: 1  Warnings: 0

   At this stage you should go back to step 4 and check that the numbers have
   changed as expected. When you are happy, update the remaining records for
   all services which have been removed:

   .. code-block:: console

      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller3-ironic';
      Query OK, 59 rows affected (0.009 sec)
      Rows matched: 59  Changed: 59  Warnings: 0
      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller2-ironic';
      Query OK, 35 rows affected (0.003 sec)
      Rows matched: 35  Changed: 35  Warnings: 0

9. Repeat step 4. Verify that all Ironic nodes are using the single remaining
   Nova Compute Ironic instance, as shown in the example below.

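   For example, given the counts from step 4 above, the output might now
   look like this (numbers are illustrative and should sum to the original
   totals):

   .. code-block:: console

      $ openstack baremetal node list --long -c "Instance Info" | grep controller3-ironic | wc -l
      0
      $ openstack baremetal node list --long -c "Instance Info" | grep controller2-ironic | wc -l
      0
      $ openstack baremetal node list --long -c "Instance Info" | grep controller1-ironic | wc -l
      151
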
Making it easy to re-deploy Nova Compute Ironic
-----------------------------------------------

In the previous section we saw that at any given time, a baremetal node is
associated with a single Nova Compute Ironic instance. At this stage, assuming
that you have diligently followed the instructions, you are in the situation
where all Ironic baremetal nodes are managed by a single Nova Compute Ironic
instance. If this service goes down, you will not be able to manage *any*
baremetal nodes.

By default, the single remaining Nova Compute Ironic instance will be named
after the host on which it is deployed. The host name is passed to the Nova
Compute Ironic instance via the ``host`` field in the ``[DEFAULT]`` section
of the ``nova.conf`` file.

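For example, the relevant snippet of the generated ``nova.conf`` might look
like this (the value shown is illustrative):

.. code-block:: ini

   [DEFAULT]
   host = controller1-ironic
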
If you wish to re-deploy this instance, for example because the original host
was permanently mangled in the World Server Throwing Championship [2], you
must ensure that the new instance has the same name as the old one. Simply
setting ``kolla_nova_compute_ironic_host`` to another controller and
re-deploying the service is not enough; the new instance will be named after
the new host.

To work around this you should set the ``host`` field in ``nova.conf`` to a
constant, such that the new Nova Compute Ironic instance comes up with the
same name as the one it replaces.

For example, if the original instance resides on ``controller1``, then set the
following in ``etc/kayobe/nova.yml``:

.. code-block:: yaml

   kolla_nova_compute_ironic_static_host_name: controller1-ironic

Note that an ``-ironic`` postfix is added to the hostname. This comes from
a convention in Kolla Ansible. It is worth making this change ahead of time,
even if you don't need to immediately re-deploy the service.

It is also possible to use an arbitrary ``host`` name, but you will need
to edit the database again. That is an optional exercise left for the reader.
See [1] for further details.

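As a rough sketch, assuming a hypothetical arbitrary name of
``ironic-compute``, the records would need updating in the same way as
step 8 above:

.. code-block:: console

   MariaDB [nova]> update instances set host='ironic-compute' where deleted=0 and host='controller1-ironic';
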
.. note::

   There is a bug when overriding the host name in Kolla Ansible, where it
   is currently assumed that it will be set to the actual hostname plus an
   ``-ironic`` postfix. The service will come up correctly, but Kolla Ansible
   will not detect it. See here:
   https://bugs.launchpad.net/kolla-ansible/+bug/2056571

Re-deploying Nova Compute Ironic
--------------------------------

The decision to re-deploy Nova Compute Ironic to another host should only be
taken if there is a strong reason to do so. The objective is to minimise
the chance of the old instance starting up alongside the new one. If the
original host has been re-imaged or physically replaced, there is no risk.
However, if the original host has been taken down for non-destructive
maintenance, it is better to avoid re-deploying the service if the end users
can tolerate the wait. If you are forced to re-deploy the service, knowing
that the original instance may start when the host comes back online, you
must plan accordingly. For example, by booting the original host in
maintenance mode and removing the old service before it can start, or by
stopping the new instance before the original one comes back up, and then
reverting the config to move the service back to the original host.

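For example, if the original host was ``controller1`` and it can be reached
before the container starts, the stale service can be removed with:

.. code-block:: console

   $ ssh controller1 sudo docker rm -f nova_compute_ironic
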
There are essentially two scenarios for re-deploying Nova Compute Ironic.
These are described in the following sub-sections.

Current host is accessible
~~~~~~~~~~~~~~~~~~~~~~~~~~

Adjust the ``kolla_nova_compute_ironic_host`` variable to point to the
new host, e.g.

.. code-block:: diff

   -kolla_nova_compute_ironic_host: controller1
   +kolla_nova_compute_ironic_host: controller2

Remove the old container:

.. code-block:: console

   $ ssh controller1 sudo docker rm -f nova_compute_ironic

Deploy the new service:

.. code-block:: console

   $ kayobe overcloud service deploy -kl controller2 -l controller2 -kt nova

Verify that the new service appears as 'up' and 'enabled':

.. code-block:: console

   $ openstack compute service list

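Note that if you have set ``kolla_nova_compute_ironic_static_host_name`` as
recommended above, the service will still be listed under its original name,
e.g. ``controller1-ironic``, even though it now runs on ``controller2``
(output illustrative):

.. code-block:: console

   $ openstack compute service list --service nova-compute
   +--------------------------------------+--------------+--------------------+------+---------+-------+----------------------------+
   | ID                                   | Binary       | Host               | Zone | Status  | State | Updated At                 |
   +--------------------------------------+--------------+--------------------+------+---------+-------+----------------------------+
   | 8a4f0c3d-1234-5678-9abc-def012345678 | nova-compute | controller1-ironic | nova | enabled | up    | 2024-01-01T00:00:00.000000 |
   +--------------------------------------+--------------+--------------------+------+---------+-------+----------------------------+
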
Current host is not accessible
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case you will need to remove the inaccessible host from the inventory.
For example, in ``etc/kayobe/inventory/hosts``, remove ``controller1`` from
the ``controllers`` group.

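For example, in ``etc/kayobe/inventory/hosts`` (group membership is
illustrative):

.. code-block:: diff

    [controllers]
   -controller1
    controller2
    controller3
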
Adjust the ``kolla_nova_compute_ironic_host`` variable to point to the
new host, e.g.

.. code-block:: diff

   -kolla_nova_compute_ironic_host: controller1
   +kolla_nova_compute_ironic_host: controller2

Deploy the new service:

.. code-block:: console

   $ kayobe overcloud service reconfigure -kl controller2 -l controller2 -kt nova

Verify that the new service appears as 'up' and 'enabled':

.. code-block:: console

   $ openstack compute service list

.. note::

   It is important to stop the original service from starting up again. It is
   up to you to prevent this.

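   How to do this depends on the failure mode. As one hypothetical example,
   if the failed host can be reached before its containers start, the stale
   container can be stopped and prevented from restarting:

   .. code-block:: console

      $ ssh controller1 sudo docker update --restart no nova_compute_ironic
      $ ssh controller1 sudo docker stop nova_compute_ironic
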
.. note::

   Once merged, the work on 'Kayobe reliability' may allow this step to run
   without modifying the inventory to remove the broken host.

[1] https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/ironic-shards.html#migrate-from-peer-list-to-shard-key

[2] https://www.cloudfest.com/world-server-throwing-championship