Commit 11cb42f

mriedem authored and cdent committed
Restore RT.old_resources if ComputeNode.save() fails
When starting nova-compute for the first time with a new node, the ResourceTracker will create a new ComputeNode record in _init_compute_node but without all of the fields set on the ComputeNode, for example "free_disk_gb". Later _update_usage_from_instances will set some fields on the ComputeNode record (even if there are no instances on the node, why - I don't know) like free_disk_gb. This will make the eventual call from _update() to _resource_change() update the value in the old_resources dict and return True, and then _update() will try to persist those ComputeNode changes to the database. If that update fails, for example due to a DBConnectionError, the value in old_resources will still reflect the current version of the node in memory rather than what is actually in the database.

Note that this failure does not result in the compute service failing to start because ComputeManager._update_available_resource_for_node traps the Exception and just logs it.

A subsequent trip through the RT._update() method - because of the update_available_resource periodic task - will call _resource_change, but because old_resources matches the current state of the node, it returns False and the RT does not attempt to persist the changes to the DB. _update() will then go on to call _update_to_placement which will create the resource provider in placement along with its inventory, making it potentially a candidate for scheduling.

This can be a problem later in the scheduler because the HostState._update_from_compute_node method may skip setting fields on the HostState object if free_disk_gb is not set in the ComputeNode record - which can then break filters and weighers later in the scheduling process (see bug 1834691 and bug 1834694).

The fix proposed here is simple: if the ComputeNode.save() in RT._update() fails, restore the previous value in old_resources so that the subsequent run through _resource_change will compare the correct state of the object and retry the update.

An alternative to this would be killing the compute service on startup if there is a DB error, but that could have unintended side effects, especially if the DB error is transient and can be fixed on the next try.

Obviously the scheduler code needs to be more robust also, but those improvements are left for separate changes related to the other bugs mentioned above.

Also, ComputeNode.update_from_virt_driver could be updated to set free_disk_gb if possible to work around the tight coupling in the HostState._update_from_compute_node code, but that's also sort of a whack-a-mole type change best made separately.

Change-Id: Id3c847be32d8a1037722d08bf52e4b88dc5adc97
Closes-Bug: #1834712
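The failure mode described in the commit message can be reproduced with a small stdlib-only model. This is only a sketch: FlakyNode, ToyTracker and run are hypothetical stand-ins invented for illustration, not nova classes, and the cache comparison is far simpler than the real _resource_change. It shows that without the restore step the second pass sees no change and never retries the save, while with the restore step the second pass persists the record.

```python
import copy


class FlakyNode:
    """Hypothetical stand-in for a ComputeNode record (not the nova class)."""

    def __init__(self, name):
        self.name = name
        self.fields = {}
        self.fail_next_save = False
        self.saved = None  # what the pretend database currently holds

    def save(self):
        if self.fail_next_save:
            self.fail_next_save = False
            raise ConnectionError("db error")
        self.saved = dict(self.fields)


class ToyTracker:
    """Hypothetical, heavily simplified model of the tracker's cache logic."""

    def __init__(self, restore_on_failure):
        self.old_resources = {}
        self.restore_on_failure = restore_on_failure

    def _resource_change(self, node):
        # Refresh the cache and return True when the node differs from it.
        if self.old_resources.get(node.name) != node.fields:
            self.old_resources[node.name] = copy.deepcopy(node.fields)
            return True
        return False

    def update(self, node):
        previous = self.old_resources.get(node.name)
        if self._resource_change(node):
            try:
                node.save()
            except Exception:
                if self.restore_on_failure:
                    # Undo the cache refresh so the next pass retries.
                    self.old_resources[node.name] = previous
                raise


def run(restore_on_failure):
    tracker = ToyTracker(restore_on_failure)
    node = FlakyNode("node1")
    tracker.old_resources["node1"] = {}   # record created without the field
    node.fields["free_disk_gb"] = 100     # field set later, in memory only
    node.fail_next_save = True
    try:
        tracker.update(node)              # first pass: save fails
    except ConnectionError:
        pass                              # the caller traps and logs
    tracker.update(node)                  # second pass: periodic task
    return node.saved
```

Calling run(False) models the behaviour before this change and leaves the pretend database without the record's fields; run(True) models the behaviour after it.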
1 parent b7c98be commit 11cb42f

2 files changed: +40 -1 lines changed

nova/compute/resource_tracker.py

Lines changed: 13 additions & 1 deletion
@@ -26,6 +26,7 @@
 import os_traits
 from oslo_log import log as logging
 from oslo_serialization import jsonutils
+from oslo_utils import excutils
 import retrying
 
 from nova.compute import claims
@@ -1052,12 +1053,23 @@ def _update_to_placement(self, context, compute_node, startup):
 
     def _update(self, context, compute_node, startup=False):
         """Update partial stats locally and populate them to Scheduler."""
+        # _resource_change will update self.old_resources if it detects changes
+        # but we want to restore those if compute_node.save() fails.
+        nodename = compute_node.hypervisor_hostname
+        old_compute = self.old_resources[nodename]
         if self._resource_change(compute_node):
             # If the compute_node's resource changed, update to DB.
             # NOTE(jianghuaw): Once we completely move to use get_inventory()
             # for all resource provider's inv data. We can remove this check.
             # At the moment we still need this check and save compute_node.
-            compute_node.save()
+            try:
+                compute_node.save()
+            except Exception:
+                # Restore the previous state in self.old_resources so that on
+                # the next trip through here _resource_change does not have
+                # stale data to compare.
+                with excutils.save_and_reraise_exception(logger=LOG):
+                    self.old_resources[nodename] = old_compute
 
         self._update_to_placement(context, compute_node, startup)

nova/tests/unit/compute/test_resource_tracker.py

Lines changed: 27 additions & 0 deletions
@@ -1655,6 +1655,33 @@ def test_sync_compute_service_disabled_trait_service_not_found(
         self.assertIn('Unable to find services table record for nova-compute',
                       mock_log_error.call_args[0][0])
 
+    def test_update_compute_node_save_fails_restores_old_resources(self):
+        """Tests the scenario that compute_node.save() fails and the
+        old_resources value for the node is restored to its previous value
+        before calling _resource_change updated it.
+        """
+        self._setup_rt()
+        orig_compute = _COMPUTE_NODE_FIXTURES[0].obj_clone()
+        # Pretend the ComputeNode was just created in the DB but not yet saved
+        # with the free_disk_gb field.
+        delattr(orig_compute, 'free_disk_gb')
+        nodename = orig_compute.hypervisor_hostname
+        self.rt.old_resources[nodename] = orig_compute
+        # Now have an updated compute node with free_disk_gb set which should
+        # make _resource_change modify old_resources and return True.
+        updated_compute = _COMPUTE_NODE_FIXTURES[0].obj_clone()
+        ctxt = context.get_admin_context()
+        # Mock ComputeNode.save() to trigger some failure (realistically this
+        # could be a DBConnectionError).
+        with mock.patch.object(updated_compute, 'save',
+                               side_effect=test.TestingException('db error')):
+            self.assertRaises(test.TestingException,
+                              self.rt._update,
+                              ctxt, updated_compute, startup=True)
+        # Make sure that the old_resources entry for the node has not changed
+        # from the original.
+        self.assertTrue(self.rt._resource_change(updated_compute))
+
     def test_copy_resources_no_update_allocation_ratios(self):
         """Tests that a ComputeNode object's allocation ratio fields are
         not set if the configured allocation ratio values are default None.
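
The new unit test uses mock.patch.object with a side_effect to force save() to raise, then asserts on the state left behind. The same technique can be sketched with the standard library alone. Everything named here (Record, Cache, RollbackTest, run_tests) is hypothetical and exists only for illustration; it is not nova's test harness, and RuntimeError stands in for the database error.

```python
import unittest
from unittest import mock


class Record:
    """Hypothetical object whose save() we want to fail on demand."""

    def save(self):
        return "saved"


class Cache:
    """Hypothetical holder that rolls back its entry if save() fails."""

    def __init__(self):
        self.entries = {"node1": "old"}

    def update(self, record):
        previous = self.entries["node1"]
        self.entries["node1"] = "new"
        try:
            record.save()
        except Exception:
            # Put the earlier value back, then let the error propagate.
            self.entries["node1"] = previous
            raise


class RollbackTest(unittest.TestCase):
    def test_save_failure_restores_entry(self):
        cache = Cache()
        record = Record()
        # Patch save() on this one instance so that calling it raises.
        with mock.patch.object(record, "save",
                               side_effect=RuntimeError("db error")):
            self.assertRaises(RuntimeError, cache.update, record)
        # The entry must be back to its value from before the failed update.
        self.assertEqual("old", cache.entries["node1"])


def run_tests():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(RollbackTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Patching the instance rather than the class keeps the failure scoped to the one object under test, which mirrors how the test in the diff patches updated_compute directly.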
