|
| 1 | +================== |
| 2 | +Testing Down Cells |
| 3 | +================== |
| 4 | + |
| 5 | +This document describes how to recreate a down-cell scenario in a single-node |
| 6 | +devstack environment. This can be useful for testing the reliability of the |
| 7 | +controller services when a cell in the deployment is down. |
| 8 | + |
| 9 | + |
| 10 | +Setup |
| 11 | +===== |
| 12 | + |
| 13 | +DevStack config |
| 14 | +--------------- |
| 15 | + |
| 16 | +This guide is based on a devstack install from the Train release using |
| 17 | +an Ubuntu Bionic 18.04 VM with 8 VCPU, 8 GB RAM and 200 GB of disk following |
| 18 | +the `All-In-One Single Machine`_ guide. |
| 19 | + |
| 20 | +The following minimal local.conf was used: |
| 21 | + |
| 22 | +.. code-block:: ini |
| 23 | +
|
| 24 | + [[local|localrc]] |
| 25 | + # Define passwords |
| 26 | + OS_PASSWORD=openstack1 |
| 27 | + SERVICE_TOKEN=$OS_PASSWORD |
| 28 | + ADMIN_PASSWORD=$OS_PASSWORD |
| 29 | + MYSQL_PASSWORD=$OS_PASSWORD |
| 30 | + RABBIT_PASSWORD=$OS_PASSWORD |
| 31 | + SERVICE_PASSWORD=$OS_PASSWORD |
| 32 | + # Logging config |
| 33 | + LOGFILE=$DEST/logs/stack.sh.log |
| 34 | + LOGDAYS=2 |
| 35 | + # Disable non-essential services |
| 36 | + disable_service horizon tempest |
| 37 | +
|
| 38 | +.. _All-In-One Single Machine: https://docs.openstack.org/devstack/latest/guides/single-machine.html |
| 39 | + |
| 40 | +Populate cell1 |
| 41 | +-------------- |
| 42 | + |
| 43 | +Create a test server first so there is something in cell1: |
| 44 | + |
| 45 | +.. code-block:: console |
| 46 | +
|
| 47 | + $ source openrc admin admin |
| 48 | + $ IMAGE=$(openstack image list -f value -c ID) |
| 49 | + $ openstack server create --wait --flavor m1.tiny --image $IMAGE cell1-server |
| 50 | +
|
| 51 | +
|
| 52 | +Take down cell1 |
| 53 | +=============== |
| 54 | + |
| 55 | +Break the connection to the cell1 database by changing the |
| 56 | +``database_connection`` URL, in this case with an invalid host IP: |
| 57 | + |
| 58 | +.. code-block:: console |
| 59 | +
|
| 60 | + mysql> select database_connection from cell_mappings where name='cell1'; |
| 61 | + +-------------------------------------------------------------------+ |
| 62 | + | database_connection | |
| 63 | + +-------------------------------------------------------------------+ |
| 64 | + | mysql+pymysql://root:openstack1@127.0.0.1/nova_cell1?charset=utf8 | |
| 65 | + +-------------------------------------------------------------------+ |
| 66 | + 1 row in set (0.00 sec) |
| 67 | +
|
| 68 | + mysql> update cell_mappings set database_connection='mysql+pymysql://root:openstack1@192.0.0.1/nova_cell1?charset=utf8' where name='cell1'; |
| 69 | + Query OK, 1 row affected (0.01 sec) |
| 70 | + Rows matched: 1 Changed: 1 Warnings: 0 |
| 71 | +
|
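A minimal shell sketch of how a broken URL like the one above could be derived from the working one, assuming the host is the text between ``@`` and the next ``/``. The ``break_db_url`` helper and the example URL are hypothetical and not part of nova or devstack; 192.0.0.1 is the unreachable address that appears in the n-api errors later in this guide.

```shell
# Hypothetical helper: swap the host part of a database connection URL
# (the text between "@" and the next "/") for another address.
break_db_url() {
    # $1 = original URL, $2 = replacement host
    printf '%s\n' "$1" | sed -E "s#@[^/]+/#@$2/#"
}

# Example values only; substitute the URL returned by the select statement.
orig='mysql+pymysql://root:openstack1@127.0.0.1/nova_cell1?charset=utf8'
break_db_url "$orig" 192.0.0.1
```

Keeping the original value around makes it easy to point the cell mapping back at the working database once testing is done.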
| 72 | +
|
| 73 | +Update controller services |
| 74 | +========================== |
| 75 | + |
| 76 | +Prepare the controller services for the down cell. See |
| 77 | +:ref:`Handling cell failures <handling-cell-failures>` for details. |
| 78 | + |
| 79 | +Modify nova.conf |
| 80 | +---------------- |
| 81 | + |
| 82 | +Configure the API to avoid long timeouts and slow start times due to |
| 83 | +`bug 1815697`_ by modifying ``/etc/nova/nova.conf``: |
| 84 | + |
| 85 | +.. code-block:: ini |
| 86 | +
|
| 87 | + [database] |
| 88 | + ... |
| 89 | + max_retries = 1 |
| 90 | + retry_interval = 1 |
| 91 | +
|
| 92 | + [upgrade_levels] |
| 93 | + ... |
| 94 | + compute = stein # N-1 from train release, just something other than "auto" |
| 95 | +
|
| 96 | +.. _bug 1815697: https://bugs.launchpad.net/nova/+bug/1815697 |
| 97 | + |
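To double-check the edits before restarting anything, a small shell sketch like the following could print the value of one option from one section of an INI-style file. ``ini_get`` is a hypothetical helper, not a devstack or nova tool, and it assumes simple ``key = value`` lines with the section header alone on its line.

```shell
# Hypothetical helper: print the value of option $3 in section [$2] of file $1.
ini_get() {
    awk -v section="$2" -v option="$3" '
        # A section header switches matching on or off.
        /^\[/ { in_section = ($0 == "[" section "]"); next }
        in_section {
            # Split "key = value" on the first "=" and trim blanks.
            pos = index($0, "=")
            if (pos == 0) next
            key = substr($0, 1, pos - 1)
            val = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == option) print val
        }
    ' "$1"
}

# Example (left commented out because it needs the real file):
# ini_get /etc/nova/nova.conf database max_retries
```

The value is printed verbatim, including any trailing text on the same line, so it also shows exactly what the service will read.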
| 98 | +Restart services |
| 99 | +---------------- |
| 100 | + |
| 101 | +.. note:: It is useful to tail the n-api service logs in another screen to |
| 102 | + watch for errors and warnings caused by the down cell: |
| 103 | + |
| 104 | + .. code-block:: console |
| 105 | +
|
| 106 | + $ sudo journalctl -f -a -u devstack@n-api.service |
| 107 | +
|
| 108 | +Restart controller services to flush the cell cache: |
| 109 | + |
| 110 | +.. code-block:: console |
| 111 | +
|
| 112 | + |
| 113 | +
|
| 114 | +
|
| 115 | +Test cases |
| 116 | +========== |
| 117 | + |
| 118 | +1. Try to create a server. The build should fail and the server should go to cell0. |
| 119 | + |
| 120 | + .. code-block:: console |
| 121 | +
|
| 122 | + $ openstack server create --wait --flavor m1.tiny --image $IMAGE cell0-server |
| 123 | +
|
| 124 | + You can expect to see errors like this in the n-api logs: |
| 125 | + |
| 126 | + .. code-block:: console |
| 127 | +
|
| 128 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context [None req-fdaff415-48b9-44a7-b4c3-015214e80b90 None None] Error gathering result from cell 4f495a21-294a-4051-9a3d-8b34a250bbb4: DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on u'192.0.0.1' ([Errno 101] ENETUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8) |
| 129 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context Traceback (most recent call last): |
| 130 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/opt/stack/nova/nova/context.py", line 441, in gather_result |
| 131 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context result = fn(cctxt, *args, **kwargs) |
| 132 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 211, in wrapper |
| 133 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context with reader_mode.using(context): |
| 134 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ |
| 135 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context return self.gen.next() |
| 136 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 1061, in _transaction_scope |
| 137 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context context=context) as resource: |
| 138 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ |
| 139 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context return self.gen.next() |
| 140 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 659, in _session |
| 141 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context bind=self.connection, mode=self.mode) |
| 142 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 418, in _create_session |
| 143 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context self._start() |
| 144 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 510, in _start |
| 145 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context engine_args, maker_args) |
| 146 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/enginefacade.py", line 534, in _setup_for_connection |
| 147 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context sql_connection=sql_connection, **engine_kwargs) |
| 148 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/debtcollector/renames.py", line 43, in decorator |
| 149 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context return wrapped(*args, **kwargs) |
| 150 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 201, in create_engine |
| 151 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context test_conn = _test_connection(engine, max_retries, retry_interval) |
| 152 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 387, in _test_connection |
| 153 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context six.reraise(type(de_ref), de_ref) |
| 154 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context File "<string>", line 3, in reraise |
| 155 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context DBConnectionError: (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on u'192.0.0.1' ([Errno 101] ENETUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8) |
| 156 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: ERROR nova.context |
| 157 | + Apr 04 20:48:22 train devstack@n-api.service[10884]: WARNING nova.objects.service [None req-1cf4bf5c-2f74-4be0-a18d-51ff81df57dd admin admin] Failed to get minimum service version for cell 4f495a21-294a-4051-9a3d-8b34a250bbb4 |
| 158 | +
|
| 159 | +2. List servers with the 2.69 microversion for down cells. |
| 160 | + |
| 161 | + .. note:: Requires python-openstackclient >= 3.18.0 for v2.69 support. |
| 162 | + |
| 163 | + The server in cell1 (which is down) will show up with status UNKNOWN: |
| 164 | + |
| 165 | + .. code-block:: console |
| 166 | +
|
| 167 | + $ openstack --os-compute-api-version 2.69 server list |
| 168 | + +--------------------------------------+--------------+---------+----------+--------------------------+--------+ |
| 169 | + | ID                                   | Name         | Status  | Networks | Image                    | Flavor | |
| 170 | + +--------------------------------------+--------------+---------+----------+--------------------------+--------+ |
| 171 | + | 8e90f1f0-e8dd-4783-8bb3-ec8d594e60f1 |              | UNKNOWN |          |                          |        | |
| 172 | + | afd45d84-2bd7-4e49-9dff-93359f742bc1 | cell0-server | ERROR   |          | cirros-0.4.0-x86_64-disk |        | |
| 173 | + +--------------------------------------+--------------+---------+----------+--------------------------+--------+ |
| 174 | +
|
| 175 | +3. Using v2.1 the UNKNOWN server is filtered out by default due to |
| 176 | + :oslo.config:option:`api.list_records_by_skipping_down_cells`: |
| 177 | + |
| 178 | + .. code-block:: console |
| 179 | +
|
| 180 | + $ openstack --os-compute-api-version 2.1 server list |
| 181 | + +--------------------------------------+--------------+--------+----------+--------------------------+---------+ |
| 182 | + | ID                                   | Name         | Status | Networks | Image                    | Flavor  | |
| 183 | + +--------------------------------------+--------------+--------+----------+--------------------------+---------+ |
| 184 | + | afd45d84-2bd7-4e49-9dff-93359f742bc1 | cell0-server | ERROR  |          | cirros-0.4.0-x86_64-disk | m1.tiny | |
| 185 | + +--------------------------------------+--------------+--------+----------+--------------------------+---------+ |
| 186 | +
|
| 187 | +4. Configure nova-api with ``list_records_by_skipping_down_cells=False``: |
| 188 | + |
| 189 | + .. code-block:: ini |
| 190 | +
|
| 191 | + [api] |
| 192 | + list_records_by_skipping_down_cells = False |
| 193 | +
|
| 194 | +5. Restart nova-api; listing servers should then fail: |
| 195 | + |
| 196 | + .. code-block:: console |
| 197 | +
|
| 198 | + $ sudo systemctl restart devstack@n-api.service |
| 199 | + $ openstack --os-compute-api-version 2.1 server list |
| 200 | + Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible. |
| 201 | + <class 'nova.exception.NovaException'> (HTTP 500) (Request-ID: req-e2264d67-5b6c-4f17-ae3d-16c7562f1b69) |
| 202 | +
|
| 203 | +6. Try listing compute services with a down cell. |
| 204 | + |
| 205 | + The services from the down cell are skipped: |
| 206 | + |
| 207 | + .. code-block:: console |
| 208 | +
|
| 209 | + $ openstack --os-compute-api-version 2.1 compute service list |
| 210 | + +----+------------------+-------+----------+---------+-------+----------------------------+ |
| 211 | + | ID | Binary           | Host  | Zone     | Status  | State | Updated At                 | |
| 212 | + +----+------------------+-------+----------+---------+-------+----------------------------+ |
| 213 | + |  2 | nova-scheduler   | train | internal | enabled | up    | 2019-04-04T21:12:47.000000 | |
| 214 | + |  6 | nova-consoleauth | train | internal | enabled | up    | 2019-04-04T21:12:38.000000 | |
| 215 | + |  7 | nova-conductor   | train | internal | enabled | up    | 2019-04-04T21:12:47.000000 | |
| 216 | + +----+------------------+-------+----------+---------+-------+----------------------------+ |
| 217 | +
|
| 218 | + With 2.69 the nova-compute service from cell1 is shown with status UNKNOWN: |
| 219 | + |
| 220 | + .. code-block:: console |
| 221 | +
|
| 222 | + $ openstack --os-compute-api-version 2.69 compute service list |
| 223 | + +--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+ |
| 224 | + | ID                                   | Binary           | Host  | Zone     | Status  | State | Updated At                 | |
| 225 | + +--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+ |
| 226 | + | f68a96d9-d994-4122-a8f9-1b0f68ed69c2 | nova-scheduler   | train | internal | enabled | up    | 2019-04-04T21:13:47.000000 | |
| 227 | + | 70cd668a-6d60-4a9a-ad83-f863920d4c44 | nova-consoleauth | train | internal | enabled | up    | 2019-04-04T21:13:38.000000 | |
| 228 | + | ca88f023-1de4-49e0-90b0-581e16bebaed | nova-conductor   | train | internal | enabled | up    | 2019-04-04T21:13:47.000000 | |
| 229 | + |                                      | nova-compute     | train |          | UNKNOWN |       |                            | |
| 230 | + +--------------------------------------+------------------+-------+----------+---------+-------+----------------------------+ |
| 231 | +
|
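The listing checks in the test cases above can also be scripted. This is a rough sketch, assuming the client is run with its plain-text output options (``-f value -c ID -c Status``), which print one record per line with the status in the last column. ``count_status`` is a hypothetical helper, not part of python-openstackclient.

```shell
# Hypothetical helper: count input lines whose last field equals $1.
count_status() {
    awk -v want="$1" '$NF == want { n++ } END { print n + 0 }'
}

# Example (needs a running deployment, so it is left commented out):
# openstack --os-compute-api-version 2.69 server list -f value -c ID -c Status \
#     | count_status UNKNOWN
```

With only the cell1 server created earlier, a count of 1 at microversion 2.69 would be consistent with the listing shown in step 2, and a count of 0 would be expected at 2.1 while down cells are being skipped.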
| 232 | +
|
| 233 | +Future |
| 234 | +====== |
| 235 | + |
| 236 | +This guide could be expanded to cover multiple non-cell0 cells, where one |
| 237 | +cell is down while the other is available, and to walk through scenarios where |
| 238 | +the down cell is marked as disabled to take it out of scheduling consideration. |