
Commit 15cee99

Merge pull request #977 from stackhpc/bugfix/nova-compute-ironic-failover
Nova Compute Ironic failover procedure
2 parents 08b2f47 + ee1aa83 commit 15cee99

4 files changed: +327 −3 lines changed

doc/source/operations/index.rst

Lines changed: 4 additions & 3 deletions

@@ -7,9 +7,10 @@ This guide is for operators of the StackHPC Kayobe configuration project.
 .. toctree::
    :maxdepth: 1

-   upgrading
-   rabbitmq
-   octavia
    hotfix-playbook
+   nova-compute-ironic
+   octavia
+   rabbitmq
    secret-rotation
    tempest
+   upgrading

doc/source/operations/nova-compute-ironic.rst

Lines changed: 307 additions & 0 deletions (new file)

===================
Nova Compute Ironic
===================

This section describes the deployment of the OpenStack Nova Compute
Ironic service. The Nova Compute Ironic service integrates OpenStack
Ironic into Nova as a 'hypervisor' driver, allowing end users of Nova
to deploy and manage baremetal hardware in a similar way to VMs.
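
For example, once Ironic nodes are enrolled and exposed through a baremetal
flavor, an end user can provision a node with the usual Nova workflow. The
flavor, image and network names below are hypothetical:

.. code-block:: console

   $ openstack server create --flavor baremetal-general \
       --image ubuntu-22.04 --network workload-net my-baremetal-server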

High Availability (HA)
======================

The OpenStack Nova Compute service is designed to be installed once on every
hypervisor in an OpenStack deployment. In this configuration, it makes little
sense to run additional service instances, and doing so is unsupported by
design. This pattern breaks down with the Ironic baremetal service, which
must run on the OpenStack control plane. It is not feasible to have a 1:1
mapping of Nova Compute Ironic services to baremetal nodes.

The obvious HA solution is to run multiple instances of Nova Compute Ironic
on the control plane, so that if one fails, the others can take over. However,
due to assumptions long baked into the Nova source code, this is not trivial.
The HA feature provided by the Nova Compute Ironic service has proven to be
unstable, and the direction upstream is to switch to an active/passive
solution [1].

However, challenges still exist with the active/passive solution. Since the
Nova Compute Ironic HA feature is 'always on', one must ensure that only a
single instance (per Ironic conductor group) is ever running. It is not
possible to simply put multiple service instances behind HAProxy and use the
active/passive mode.
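
A quick way to confirm that the single-instance invariant holds is to count
the registered Nova Compute Ironic services. This sketch assumes the default
``-ironic`` host suffix and a single conductor group:

.. code-block:: console

   $ openstack compute service list --service nova-compute -f value -c Host | grep -c ironic
   1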

Such problems are commonly solved with a technology such as Pacemaker, or in
the modern world, with a container orchestration engine such as Kubernetes.
Kolla Ansible provides neither, because in general it doesn't need to: its
goal is simplicity.

The interim solution is therefore to run a single Nova Compute Ironic
service. If the service goes down, remedial action must be taken before
Ironic nodes can be managed. In many environments the loss of the Ironic
API for short periods is acceptable, provided that it can easily be
resurrected. The purpose of this document is to facilitate that.

.. note::

   The new sharding mode is not covered here, and it is assumed that you are
   not using it. See [1] for further information. This document will be
   updated in the future.

Optimal configuration of Nova Compute Ironic
============================================

Determine the current configuration for the site. How many Nova Compute
Ironic instances are running on the control plane?

.. code-block:: console

   $ openstack compute service list

Typically you will see either three or one. By default, the host will be
marked with an ``-ironic`` suffix, e.g. ``controller1-ironic``. If you find
more than one instance, you will need to remove all but one by completing
the following section.
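
For illustration, the output on a three-controller system prior to
consolidation might look like this (hosts and the other service rows are
illustrative):

.. code-block:: console

   $ openstack compute service list -c Binary -c Host -c Status -c State
   +----------------+--------------------+---------+-------+
   | Binary         | Host               | Status  | State |
   +----------------+--------------------+---------+-------+
   | nova-scheduler | controller1        | enabled | up    |
   | ...            | ...                | ...     | ...   |
   | nova-compute   | controller1-ironic | enabled | up    |
   | nova-compute   | controller2-ironic | enabled | up    |
   | nova-compute   | controller3-ironic | enabled | up    |
   +----------------+--------------------+---------+-------+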

Moving from multiple Nova Compute Ironic instances to a single instance
------------------------------------------------------------------------

1. Decide where the single instance should run. This should normally be
   one of the three OpenStack control plane hosts. By convention, pick
   the first one, unless you have a good reason not to. Once you have
   chosen, set the following variable in ``etc/kayobe/nova.yml``.
   Here we have picked ``controller1``.

   .. code-block:: yaml

      kolla_nova_compute_ironic_host: controller1

2. Ensure that you have organised a maintenance window, during which
   there will be no Ironic operations. You will be breaking the Ironic
   API.

3. Perform a database backup:

   .. code-block:: console

      $ kayobe overcloud database backup -vvv

   Check the output of the command, and locate the backup files.

4. Identify the baremetal nodes associated with the Nova Compute Ironic
   instances that will be removed. You don't need to do anything with these;
   they are just for reference later. For example:

   .. code-block:: console

      $ openstack baremetal node list --long -c "Instance Info" | grep controller3-ironic | wc -l
      61
      $ openstack baremetal node list --long -c "Instance Info" | grep controller2-ironic | wc -l
      35
      $ openstack baremetal node list --long -c "Instance Info" | grep controller1-ironic | wc -l
      55
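
   The same check can be scripted as a loop (a sketch; adjust the hostnames
   to match your site):

   .. code-block:: console

      $ for host in controller1 controller2 controller3; do
      >   echo -n "${host}-ironic: "
      >   openstack baremetal node list --long -c "Instance Info" -f value | grep -c "${host}-ironic"
      > done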

5. Disable the redundant Nova Compute Ironic services:

   .. code-block:: console

      $ openstack compute service set controller3-ironic nova-compute --disable
      $ openstack compute service set controller2-ironic nova-compute --disable

6. Delete the redundant Nova Compute Ironic services. You will need the
   service IDs. For example, for the service on ``controller3``:

   .. code-block:: console

      $ ID=$(openstack compute service list | grep controller3-ironic | awk '{print $2}')
      $ openstack compute service delete --os-compute-api-version 2.53 $ID

   In older releases, you may hit a bug where the service can't be deleted if
   it is not managing any instances. In this case, just move on and leave the
   service disabled. E.g.

   .. code-block:: console

      $ openstack compute service delete --os-compute-api-version 2.53 c993b57e-f60c-4652-8328-5fb0e17c99c0
      Failed to delete compute service with ID 'c993b57e-f60c-4652-8328-5fb0e17c99c0': HttpException: 500: Server Error for url:
      https://acme.pl-2.internal.hpc.is:8774/v2.1/os-services/c993b57e-f60c-4652-8328-5fb0e17c99c0, Unexpected API Error.
      Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.

7. Remove the Docker containers for the redundant Nova Compute Ironic
   services:

   .. code-block:: console

      $ ssh controller2 sudo docker rm -f nova_compute_ironic
      $ ssh controller3 sudo docker rm -f nova_compute_ironic
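
   To confirm that the containers are gone, an optional check:

   .. code-block:: console

      $ ssh controller2 sudo docker ps -a | grep nova_compute_ironic
      $ ssh controller3 sudo docker ps -a | grep nova_compute_ironic

   Both commands should produce no output.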

8. Ensure that all Ironic nodes are using the single remaining Nova Compute
   Ironic instance. Baremetal nodes in use by compute instances will not fail
   over to the remaining Nova Compute Ironic service automatically; the
   ``host`` field of each instance must be updated in the Nova database.
   Here, the active service is running on ``controller1``:

   .. code-block:: console

      $ ssh controller1
      $ sudo docker exec -it mariadb mysql -u nova -p$(sudo grep 'mysql+pymysql://nova:' /etc/kolla/nova-api/nova.conf | awk -F'[:,@]' '{print $3}')
      MariaDB [(none)]> use nova;

   Proceed with caution. It is good practice to update one record first:

   .. code-block:: console

      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller3-ironic' limit 1;
      Query OK, 1 row affected (0.002 sec)
      Rows matched: 1 Changed: 1 Warnings: 0

   At this stage you should go back to step 4 and check that the numbers have
   changed as expected. When you are happy, update the remaining records for
   all services which have been removed:

   .. code-block:: console

      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller3-ironic';
      Query OK, 59 rows affected (0.009 sec)
      Rows matched: 59 Changed: 59 Warnings: 0
      MariaDB [nova]> update instances set host='controller1-ironic' where deleted=0 and host='controller2-ironic';
      Query OK, 35 rows affected (0.003 sec)
      Rows matched: 35 Changed: 35 Warnings: 0
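
   As a final sanity check from the same MariaDB session, the following query
   (a sketch) should show no remaining non-deleted instances on the hosts of
   the removed services:

   .. code-block:: console

      MariaDB [nova]> select host, count(*) from instances where deleted=0 group by host;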

9. Repeat step 4. Verify that all Ironic nodes are using the single remaining
   Nova Compute Ironic instance.

Making it easy to re-deploy Nova Compute Ironic
-----------------------------------------------

In the previous section we saw that at any given time, a baremetal node is
associated with a single Nova Compute Ironic instance. At this stage, assuming
that you have diligently followed the instructions, you are in the situation
where all Ironic baremetal nodes are managed by a single Nova Compute Ironic
instance. If this service goes down, you will not be able to manage *any*
baremetal nodes.

By default, the single remaining Nova Compute Ironic instance will be named
after the host on which it is deployed. The host name is passed to the Nova
Compute Ironic instance via the ``[DEFAULT]`` section of the ``nova.conf``
file, using the ``host`` field.

If you wish to re-deploy this instance, for example because the original host
was permanently mangled in the World Server Throwing Championship [2], you
must ensure that the new instance has the same name as the old one. Simply
setting ``kolla_nova_compute_ironic_host`` to another controller and
re-deploying the service is not enough; the new instance will be named after
the new host.

To work around this, you should set the ``host`` field in ``nova.conf`` to a
constant, such that the new Nova Compute Ironic instance comes up with the
same name as the one it replaces.

For example, if the original instance resides on ``controller1``, then set the
following in ``etc/kayobe/nova.yml``:

.. code-block:: yaml

   kolla_nova_compute_ironic_static_host_name: controller1-ironic

Note that an ``-ironic`` suffix is appended to the host name. This follows a
convention in Kolla Ansible. It is worth making this change ahead of time,
even if you don't need to re-deploy the service immediately.
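
Applying this change ahead of time might look like the following (a sketch,
mirroring the re-deployment commands used later in this document):

.. code-block:: console

   $ kayobe overcloud service reconfigure -kl controller1 -l controller1 -kt nova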

It is also possible to use an arbitrary ``host`` name, but you will need
to edit the database again. That is an optional exercise left for the reader.
See [1] for further details.

.. note::

   There is a bug when overriding the host name in Kolla Ansible, where it
   is currently assumed that it will be set to the actual hostname plus an
   ``-ironic`` suffix. The service will come up correctly, but Kolla Ansible
   will not detect it. See here:
   https://bugs.launchpad.net/kolla-ansible/+bug/2056571

Re-deploying Nova Compute Ironic
--------------------------------

The decision to re-deploy Nova Compute Ironic to another host should only be
taken if there is a strong reason to do so. The objective is to minimise the
chance of the old instance starting up alongside the new one. If the original
host has been re-imaged or physically replaced, there is no risk. However, if
the original host has been taken down for non-destructive maintenance, it is
better to avoid re-deploying the service if the end users can tolerate the
wait. If you are forced to re-deploy the service, knowing that the original
instance may start when the host comes back online, you must plan accordingly:
for example, by booting the original host in maintenance mode and removing the
old service before it can start, or by stopping the new instance before the
original one comes back up and then reverting the config to move it to the
new host.
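
For example, stopping the new instance before the original host rejoins might
look like this (hosts are illustrative):

.. code-block:: console

   $ ssh controller2 sudo docker stop nova_compute_ironic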

There are essentially two scenarios for re-deploying Nova Compute Ironic.
These are described in the following sub-sections.

Current host is accessible
~~~~~~~~~~~~~~~~~~~~~~~~~~

Adjust the ``kolla_nova_compute_ironic_host`` variable to point to the
new host, e.g.

.. code-block:: diff

   -kolla_nova_compute_ironic_host: controller1
   +kolla_nova_compute_ironic_host: controller2

Remove the old container:

.. code-block:: console

   $ ssh controller1 sudo docker rm -f nova_compute_ironic

Deploy the new service:

.. code-block:: console

   $ kayobe overcloud service deploy -kl controller2 -l controller2 -kt nova

Verify that the new service appears as 'up' and 'enabled':

.. code-block:: console

   $ openstack compute service list

Current host is not accessible
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this case you will need to remove the inaccessible host from the inventory.
For example, in ``etc/kayobe/inventory/hosts``, remove ``controller1`` from
the ``controllers`` group, as shown in the sketch below.
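
The edit might look like this (group contents are illustrative):

.. code-block:: diff

    [controllers]
   -controller1
    controller2
    controller3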

Adjust the ``kolla_nova_compute_ironic_host`` variable to point to the
new host, e.g.

.. code-block:: diff

   -kolla_nova_compute_ironic_host: controller1
   +kolla_nova_compute_ironic_host: controller2

Deploy the new service:

.. code-block:: console

   $ kayobe overcloud service reconfigure -kl controller2 -l controller2 -kt nova

Verify that the new service appears as 'up' and 'enabled':

.. code-block:: console

   $ openstack compute service list

.. note::

   It is important to stop the original service from starting up again. It is
   up to you to prevent this.

.. note::

   Once merged, the work on 'Kayobe reliability' may allow this step to run
   without modifying the inventory to remove the broken host.

[1] https://specs.openstack.org/openstack/nova-specs/specs/2024.1/approved/ironic-shards.html#migrate-from-peer-list-to-shard-key

[2] https://www.cloudfest.com/world-server-throwing-championship

Lines changed: 4 additions & 0 deletions (new file)

{% if kolla_enable_ironic|bool and kolla_nova_compute_ironic_host is not none %}
[DEFAULT]
host = {{ kolla_nova_compute_ironic_static_host_name | mandatory('You must set a static host name to help with service failover. See the operations documentation, Ironic section.') }}
{% endif %}
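
With the example values above (and assuming
``kolla_nova_compute_ironic_static_host_name: controller1-ironic``), this
template renders the following drop-in for the Nova Compute Ironic service's
nova.conf:

[DEFAULT]
host = controller1-ironic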

Lines changed: 12 additions & 0 deletions (new file)

---
fixes:
  - |
    Adds basic support and a document explaining how to migrate to a single
    nova-compute-ironic instance, and how to re-deploy the instance to another
    machine in the event of failure. See the operations / nova-compute-ironic
    doc for further details.
upgrade:
  - |
    Ensure that your deployment has only one nova-compute-ironic service
    running per conductor group. See the operations / nova-compute-ironic doc
    for further details.
