Commit 47fd7e3

Add testing guide for down cells
This adds a testing guide for creating a down cell environment with a basic single-node devstack setup.

Change-Id: I8c021129a4df914f56193cca9ff136390a7240c3
1 parent 8856009 commit 47fd7e3

3 files changed: +241 -0 lines changed


doc/source/contributor/index.rst

Lines changed: 2 additions & 0 deletions
@@ -78,6 +78,8 @@ be Python code. All new code needs to be validated somehow.
 
 * :doc:`/contributor/testing/zero-downtime-upgrade`
 
+* :doc:`/contributor/testing/down-cell`
+
 The Nova API
 ============

doc/source/contributor/testing/down-cell.rst

Lines changed: 238 additions & 0 deletions
@@ -0,0 +1,238 @@
==================
Testing Down Cells
==================

This document describes how to recreate a down-cell scenario in a single-node
devstack environment. This can be useful for testing the reliability of the
controller services when a cell in the deployment is down.

Setup
=====

DevStack config
---------------

This guide is based on a devstack install from the Train release using
an Ubuntu Bionic 18.04 VM with 8 VCPU, 8 GB RAM and 200 GB of disk following
the `All-In-One Single Machine`_ guide.

The following minimal local.conf was used:

.. code-block:: ini

  [[local|localrc]]
  # Define passwords
  OS_PASSWORD=openstack1
  SERVICE_TOKEN=$OS_PASSWORD
  ADMIN_PASSWORD=$OS_PASSWORD
  MYSQL_PASSWORD=$OS_PASSWORD
  RABBIT_PASSWORD=$OS_PASSWORD
  SERVICE_PASSWORD=$OS_PASSWORD
  # Logging config
  LOGFILE=$DEST/logs/stack.sh.log
  LOGDAYS=2
  # Disable non-essential services
  disable_service horizon tempest

.. _All-In-One Single Machine: https://docs.openstack.org/devstack/latest/guides/single-machine.html

Populate cell1
--------------

Create a test server first so there is something in cell1:

.. code-block:: console

  $ source openrc admin admin
  $ IMAGE=$(openstack image list -f value -c ID)
  $ openstack server create --wait --flavor m1.tiny --image $IMAGE cell1-server

Take down cell1
===============

Break the connection to the cell1 database by changing the
``database_connection`` URL in the ``cell_mappings`` table of the API
database (``nova_api`` in devstack), in this case with an invalid host IP:

.. code-block:: console

  mysql> select database_connection from cell_mappings where name='cell1';
  +-------------------------------------------------------------------+
  | database_connection                                               |
  +-------------------------------------------------------------------+
  | mysql+pymysql://root:openstack1@127.0.0.1/nova_cell1?charset=utf8 |
  +-------------------------------------------------------------------+
  1 row in set (0.00 sec)

  mysql> update cell_mappings set database_connection='mysql+pymysql://root:openstack1@192.0.0.1/nova_cell1?charset=utf8' where name='cell1';
  Query OK, 1 row affected (0.01 sec)
  Rows matched: 1  Changed: 1  Warnings: 0

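The broken URL can also be derived from the original one in a script instead of being typed by hand. The following is a minimal sketch only: ``swap_db_host`` is a hypothetical helper (not part of nova or devstack), the sample URL assumes the devstack password from ``local.conf`` and a local database host, and the helper assumes the ``scheme://user:password@host/database`` shape.

```shell
#!/bin/sh
# Hypothetical helper (not part of nova or devstack): replace the host part
# of a connection URL of the form scheme://user:password@host/database.
swap_db_host() {
    url="$1"
    new_host="$2"
    # Rewrite only the first "@host/" occurrence; credentials and the
    # database name are left untouched.
    printf '%s\n' "$url" | sed -E "s#@[^/]+/#@${new_host}/#"
}

# Assumed sample value; read the real one from cell_mappings first.
original='mysql+pymysql://root:openstack1@127.0.0.1/nova_cell1?charset=utf8'
swap_db_host "$original" 192.0.0.1
```

Keeping the original URL in a shell variable also makes it straightforward to restore cell1 after testing by running the same ``update`` statement with the original value.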
Update controller services
==========================

Prepare the controller services for the down cell. See
:ref:`Handling cell failures <handling-cell-failures>` for details.

Modify nova.conf
----------------

Configure the API to avoid long timeouts and slow start times due to
`bug 1815697`_ by modifying ``/etc/nova/nova.conf``:

.. code-block:: ini

  [database]
  ...
  max_retries = 1
  retry_interval = 1

  [upgrade_levels]
  ...
  # N-1 from the Train release; any value other than "auto" works
  compute = stein

.. _bug 1815697: https://bugs.launchpad.net/nova/+bug/1815697

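After editing, it can help to confirm that each value landed in the intended section. The following is a minimal sketch, assuming a plain ``key = value`` layout with no leading indentation and no ``=`` inside values; ``ini_get`` is a hypothetical helper, not a nova or devstack tool.

```shell
#!/bin/sh
# Hypothetical helper: print the value of KEY in [SECTION] of an INI file.
# Usage: ini_get FILE SECTION KEY
ini_get() {
    awk -F ' *= *' -v section="[$2]" -v key="$3" '
        /^\[/ { in_section = ($0 == section) }   # track the current section
        in_section && $1 == key { print $2; exit }
    ' "$1"
}

# Example against a throwaway file holding the [database] settings.
conf=$(mktemp)
printf '[database]\nmax_retries = 1\nretry_interval = 1\n' > "$conf"
ini_get "$conf" database max_retries
rm -f "$conf"
```

Against a real deployment the call would look like ``ini_get /etc/nova/nova.conf upgrade_levels compute``.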
Restart services
----------------

.. note:: It is useful to tail the n-api service logs in another screen to
   watch for errors / warnings in the logs due to down cells:

   .. code-block:: console

      $ sudo journalctl -f -a -u devstack@n-api.service

Restart controller services to flush the cell cache.

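A minimal sketch of the restart step follows. The unit names are an assumption based on devstack's usual ``devstack@<short-name>.service`` systemd naming for the API, scheduler and super conductor; adjust the list to match the units actually present on the host. ``restart_controller_units`` is a hypothetical wrapper that only prints the command unless ``DRY_RUN=0`` is set.

```shell
#!/bin/sh
# Assumed unit names: devstack normally runs each service as a systemd unit
# called devstack@<short-name>.service. Adjust the list to your deployment.
CONTROLLER_UNITS="devstack@n-api.service devstack@n-sch.service devstack@n-super-cond.service"

restart_controller_units() {
    # Print the command by default; set DRY_RUN=0 to actually restart.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "sudo systemctl restart $CONTROLLER_UNITS"
    else
        # Intentionally unquoted so each unit becomes its own argument.
        sudo systemctl restart $CONTROLLER_UNITS
    fi
}

restart_controller_units
```

Printing the command first gives a chance to check the unit list against ``systemctl list-units 'devstack@*'`` before restarting anything.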
Test cases
==========

1. Try to create a server which should fail and go to cell0.

   .. code-block:: console

      $ openstack server create --wait --flavor m1.tiny --image $IMAGE cell0-server

   You can expect to see errors like this in the n-api logs:

   .. code-block:: console

      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context [None req-fdaff415-48b9-44a7-b4c3-015214e80b90 None None] Error gathering result from cell 4f495a21-294a-4051-9a3d-8b34a250bbb4: DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on u'192.0.0.1' ([Errno 101] ENETUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context Traceback (most recent call last):
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/opt/stack/nova/nova/context.py", line 441, in gather_result
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     result = fn(cctxt, *args, **kwargs)
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 211, in wrapper
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     with reader_mode.using(context):
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     return self.gen.next()
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1061, in _transaction_scope
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     context=context) as resource:
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     return self.gen.next()
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 659, in _session
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     bind=self.connection, mode=self.mode)
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 418, in _create_session
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     self._start()
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 510, in _start
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     engine_args, maker_args)
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 534, in _setup_for_connection
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     sql_connection=sql_connection, **engine_kwargs)
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/debtcollector/renames.py", line 43, in decorator
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     return wrapped(*args, **kwargs)
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 201, in create_engine
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     test_conn = _test_connection(engine, max_retries, retry_interval)
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 387, in _test_connection
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context     six.reraise(type(de_ref), de_ref)
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context   File "<string>", line 3, in reraise
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on u'192.0.0.1' ([Errno 101] ENETUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)
      Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context
      Apr 04 20:48:22 train devstack@n-api.service[10884]: WARNING nova.objects.service [None req-1cf4bf5c-2f74-4be0-a18d-51ff81df57dd admin admin] Failed to get minimum service version for cell 4f495a21-294a-4051-9a3d-8b34a250bbb4

2. List servers with the 2.69 microversion for down cells.

   .. note:: Requires python-openstackclient >= 3.18.0 for v2.69 support.

   The server in cell1 (which is down) will show up with status UNKNOWN:

   .. code-block:: console

      $ openstack --os-compute-api-version 2.69 server list
      +--------------------------------------+--------------+---------+----------+--------------------------+--------+
      | ID                                   | Name         | Status  | Networks | Image                    | Flavor |
      +--------------------------------------+--------------+---------+----------+--------------------------+--------+
      | 8e90f1f0-e8dd-4783-8bb3-ec8d594e60f1 |              | UNKNOWN |          |                          |        |
      | afd45d84-2bd7-4e49-9dff-93359f742bc1 | cell0-server | ERROR   |          | cirros-0.4.0-x86_64-disk |        |
      +--------------------------------------+--------------+---------+----------+--------------------------+--------+

3. Using v2.1 the UNKNOWN server is filtered out by default due to
   :oslo.config:option:`api.list_records_by_skipping_down_cells`:

   .. code-block:: console

      $ openstack --os-compute-api-version 2.1 server list
      +--------------------------------------+--------------+--------+----------+--------------------------+---------+
      | ID                                   | Name         | Status | Networks | Image                    | Flavor  |
      +--------------------------------------+--------------+--------+----------+--------------------------+---------+
      | afd45d84-2bd7-4e49-9dff-93359f742bc1 | cell0-server | ERROR  |          | cirros-0.4.0-x86_64-disk | m1.tiny |
      +--------------------------------------+--------------+--------+----------+--------------------------+---------+

4. Configure nova-api with ``list_records_by_skipping_down_cells=False``:

   .. code-block:: ini

      [api]
      list_records_by_skipping_down_cells = False

5. Restart nova-api; listing servers should now fail:

   .. code-block:: console

      $ sudo systemctl restart devstack@n-api.service
      $ openstack --os-compute-api-version 2.1 server list
      Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
      <class 'nova.exception.NovaException'> (HTTP 500) (Request-ID: req-e2264d67-5b6c-4f17-ae3d-16c7562f1b69)

6. Try listing compute services with a down cell.

   The services from the down cell are skipped:

   .. code-block:: console

      $ openstack --os-compute-api-version 2.1 compute service list
      +----+------------------+-------+----------+---------+-------+----------------------------+
      | ID | Binary           | Host  | Zone     | Status  | State | Updated At                 |
      +----+------------------+-------+----------+---------+-------+----------------------------+
      | 2  | nova-scheduler   | train | internal | enabled | up    | 2019-04-04T21:12:47.000000 |
      | 6  | nova-consoleauth | train | internal | enabled | up    | 2019-04-04T21:12:38.000000 |
      | 7  | nova-conductor   | train | internal | enabled | up    | 2019-04-04T21:12:47.000000 |
      +----+------------------+-------+----------+---------+-------+----------------------------+

   With 2.69 the nova-compute service from cell1 is shown with status UNKNOWN:

   .. code-block:: console

      $ openstack --os-compute-api-version 2.69 compute service list
      +--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+
      | ID                                   | Binary           | Host  | Zone     | Status  | State | Updated At                 |
      +--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+
      | f68a96d9-d994-4122-a8f9-1b0f68ed69c2 | nova-scheduler   | train | internal | enabled | up    | 2019-04-04T21:13:47.000000 |
      | 70cd668a-6d60-4a9a-ad83-f863920d4c44 | nova-consoleauth | train | internal | enabled | up    | 2019-04-04T21:13:38.000000 |
      | ca88f023-1de4-49e0-90b0-581e16bebaed | nova-conductor   | train | internal | enabled | up    | 2019-04-04T21:13:47.000000 |
      |                                      | nova-compute     | train |          | UNKNOWN |       |                            |
      +--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+

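For scripted checks of the server listing in step 2, the records that belong to a down cell can be picked out by their UNKNOWN status. The following is a minimal sketch; ``unknown_ids`` is a hypothetical helper, and it assumes input with one whitespace-separated ``ID STATUS`` pair per line.

```shell
#!/bin/sh
# Hypothetical helper: read "ID STATUS" pairs on stdin and print the IDs
# whose status is UNKNOWN, i.e. records that live in a down cell.
unknown_ids() {
    awk '$2 == "UNKNOWN" { print $1 }'
}

# Sample input using the two server IDs from the 2.69 listing in step 2.
printf '%s\n' \
    '8e90f1f0-e8dd-4783-8bb3-ec8d594e60f1 UNKNOWN' \
    'afd45d84-2bd7-4e49-9dff-93359f742bc1 ERROR' | unknown_ids
```

On a live system the input could come from ``openstack --os-compute-api-version 2.69 server list -f value -c ID -c Status``, which emits exactly those two columns.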
Future
======

This guide could be expanded to cover deployments with multiple non-cell0
cells, where one cell is down while another is available, and to walk through
scenarios where the down cell is marked as disabled to take it out of
scheduling consideration.

doc/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -215,6 +215,7 @@ looking parts of our architecture. These are collected below.
    contributor/testing/libvirt-numa
    contributor/testing/serial-console
    contributor/testing/zero-downtime-upgrade
+   contributor/testing/down-cell
    contributor/how-to-get-involved
    contributor/process
    contributor/project-scope
