
Fix for bugs 1839560 and 1839674 #2

Merged

priteau merged 3 commits into stackhpc/rocky from bug-1839560 on Sep 27, 2019

Conversation

priteau (Member) commented Sep 26, 2019

No description provided.

priteau requested a review from markgoddard on September 26, 2019 14:51

markgoddard left a comment

Looks fine; we'll need to install it in the nova-conductor image in addition to nova-compute.

priteau (Member, Author) commented Sep 26, 2019

We probably want https://review.opendev.org/#/c/676280/ as well.

If ComputeNode.create() fails, the update_available_resource
periodic will not try to create it again, because the node will
already be mapped in the compute_nodes dict and _init_compute_node
will return early. However, trying to save changes to that
ComputeNode object later will fail, because there is no id on the
object since we failed to create it in the DB.

This simply reverses the logic such that we only map the
compute node if we successfully created it.
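
For illustration only, here is a minimal, self-contained sketch of that ordering change; the class and method names are simplified stand-ins, not the actual ResourceTracker code:

```python
class FakeComputeNode:
    """Stand-in for objects.ComputeNode; create() may raise on DB errors."""

    def __init__(self, nodename):
        self.nodename = nodename
        self.id = None

    def create(self):
        # In the real code this can fail (e.g. with a duplicate-entry DB
        # error); when it does, the node must not be left in compute_nodes.
        self.id = 1


class MiniResourceTracker:
    """Stand-in for the resource tracker, showing only the mapping order."""

    def __init__(self):
        self.compute_nodes = {}

    def _init_compute_node(self, nodename):
        if nodename in self.compute_nodes:
            return  # already tracked; nothing to do

        cn = FakeComputeNode(nodename)
        # Old order: map first, then cn.create(). A failed create() left an
        # id-less node mapped, so later saves broke and the periodic task
        # never retried the create.
        # New order: create first, map only on success.
        cn.create()
        self.compute_nodes[nodename] = cn
```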

Some existing _init_compute_node testing had to be changed
since it relied on the order of when the ComputeNode object
is created and put into the compute_nodes dict in order
to pass the object along to some much lower-level PCI
tracker code, which was arguably mocking too deep for a unit
test. That is changed to avoid the low-level mocking and
assertions and just assert that _setup_pci_tracker is called
as expected.
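
A rough, self-contained illustration of that shallower testing style (the classes here are hypothetical stand-ins, not the real ResourceTracker tests):

```python
import unittest
from unittest import mock


class Tracker:
    """Toy stand-in for the resource tracker."""

    def init_node(self, nodename):
        # The real code creates the ComputeNode, then delegates PCI setup.
        self._setup_pci_tracker(nodename)

    def _setup_pci_tracker(self, nodename):
        pass  # the real method wires up the PCI tracker


class TestTracker(unittest.TestCase):
    def test_init_node_sets_up_pci_tracker(self):
        tracker = Tracker()
        # Assert the delegation happened instead of mocking deep into the
        # PCI tracker internals.
        with mock.patch.object(tracker, '_setup_pci_tracker') as m:
            tracker.init_node('node-1')
        m.assert_called_once_with('node-1')
```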

Conflicts:
      nova/compute/resource_tracker.py
      nova/tests/unit/compute/test_resource_tracker.py

NOTE(mriedem): The resource_tracker.py conflict is due to
not having I14a310b20bd9892e7b34464e6baad49bf5928ece in Rocky.
The test conflicts are due to not having change
I37e8ad5b14262d801702411c2c87e73550adda70 in Rocky.

Change-Id: I9fa1d509a3de405d6246fb8670612c65c10cc93b
Closes-Bug: #1839674
(cherry picked from commit f578146)
(cherry picked from commit 648770b)
(cherry picked from commit 35273a8)
This adds a functional test which recreates bug 1839560
where the driver reports a node, then no longer reports
it so the compute manager deletes it, and then the driver
reports it again later (this can be common with ironic
nodes as they undergo maintenance). The issue is that since
Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a in Rocky, the
ironic node uuid is re-used for the compute node uuid but
there is a unique constraint on the compute node uuid so
when trying to create the compute node once the ironic node
is available again, the compute node create fails with a
duplicate entry error due to the duplicate uuid. To recreate
this in the functional test, a new fake virt driver is added
which provides a predictable uuid per node like the ironic
driver. The test also shows that archiving the database is
a way to work around the bug until it's properly fixed.
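
As a sketch of the "predictable uuid per node" idea (illustrative only; the actual fake driver lives in the nova functional test tree):

```python
import uuid


class PredictableUUIDDriver:
    """Toy driver that, like the ironic driver, reports a stable UUID per
    node, so a node that disappears and later comes back reuses its UUID."""

    def get_available_nodes(self):
        return ['node-1', 'node-2']

    def get_available_resource(self, nodename):
        # Derive the UUID deterministically from the node name; reusing it
        # after the compute node record was deleted is what triggers the
        # duplicate-uuid error described in bug 1839560.
        node_uuid = str(uuid.uuid5(uuid.NAMESPACE_DNS, nodename))
        return {'hypervisor_hostname': nodename, 'uuid': node_uuid}
```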

NOTE(mriedem): Since change Idaed39629095f86d24a54334c699a26c218c6593
is not in Rocky the PlacementFixture still comes from nova_fixtures.

Change-Id: If822509e906d5094f13a8700b2b9ed3c40580431
Related-Bug: #1839560
(cherry picked from commit 89dd74a)
(cherry picked from commit e7109d4)
(cherry picked from commit ecd1e04)
There is a unique index on the compute_nodes.uuid column which
means we can't have more than one compute_nodes record in the
same DB with the same UUID even if one is soft deleted because
the deleted column is not part of that unique index constraint.

This is a problem with ironic nodes, where the node is 1:1 with
the compute node record. When a node is undergoing maintenance,
the driver doesn't return it from get_available_nodes(), so the
ComputeManager.update_available_resource periodic task (soft)
deletes the compute node record. But when the node is no longer
under maintenance in ironic and the driver reports it again, the
ResourceTracker._init_compute_node code fails to create the
ComputeNode record because of the duplicate uuid.

This change handles the DBDuplicateEntry error in compute_node_create
by finding the soft-deleted compute node with the same uuid and
simply updating it to no longer be (soft) deleted.
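
A self-contained toy sketch of that revive-on-duplicate approach (the dict-based "DB" and helper names are hypothetical; the real change lives in nova's DB API compute_node_create):

```python
class DBDuplicateEntry(Exception):
    """Stand-in for oslo.db's DBDuplicateEntry."""


def _insert(db, values):
    # The unique index covers only uuid, so even a soft-deleted row collides.
    if values['uuid'] in db:
        raise DBDuplicateEntry(values['uuid'])
    db[values['uuid']] = dict(values, deleted=False)
    return db[values['uuid']]


def compute_node_create(db, values):
    try:
        return _insert(db, values)
    except DBDuplicateEntry:
        existing = db[values['uuid']]
        if not existing['deleted']:
            raise  # a live record with this uuid is a genuine conflict
        # Revive the soft-deleted record instead of failing the create.
        existing.update(values, deleted=False)
        return existing
```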

Closes-Bug: #1839560

Change-Id: Iafba419fe86446ffe636721f523fb619f8f787b3
(cherry picked from commit 8b00726)
(cherry picked from commit 1b02166)
(cherry picked from commit 9ce9484)
priteau requested a review from markgoddard on September 27, 2019 09:59
priteau changed the title from "Fix for bug 1839560" to "Fix for bugs 1839560 and 1839674" on Sep 27, 2019
priteau self-assigned this on Sep 27, 2019
priteau added the "bug" (Something isn't working) label on Sep 27, 2019
priteau merged commit 22eb001 into stackhpc/rocky on Sep 27, 2019
priteau deleted the bug-1839560 branch on September 27, 2019 10:41
jovial pushed a commit that referenced this pull request Oct 22, 2019
This addresses review comments from the following changes:

  I61a3e8902a891bac36911812e4e7c080570e3850

  I48e6db9693e470b177bf4c75211d8b883c768433

  Ic70d2bb781b6a844849a5cf2fe4d271b5a81093d

  I5a956513f3485074023e027430cc52ee7a3f92e4

  Ica6152ccb97dce805969d964d6ed032bfe22a33f

Part of blueprint bandwidth-resource-provider

Change-Id: Idffaa6d206cda3f507e6be095356537f22302ad7
github-actions bot pushed a commit that referenced this pull request Jan 25, 2023
In py310, unittest.mock does not allow mocking the same function twice, as
the second mocking will fail to autospec the Mock object created by the
first mocking.

This patch manually fixes the double mocking.

Fixed cases:
1) one of the mocks was totally unnecessary, so it was removed
2) the second mock specialized the behavior of the first, generic mock.
   In this case the second mock is replaced with the configuration of
   the first mock
3) a test case with two test steps mocked the same function for each
   step, with overlapping mocks. Here the overlap was removed so that
   the two mocks exist independently

The get_connection injection in the libvirt functional test needed a
further tweak (yeah, I know it has many already) to act like a single
mock (basically case #2) instead of a temporary re-mocking. Still, the
global nature of the get_connection mocking warrants the special
set / reset logic there.
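
For example, case 2) above roughly amounts to the following pattern (the helper function here is made up for illustration):

```python
import unittest
from unittest import mock


def helper():
    """Stand-in for a production function that used to be mocked twice."""
    return 0


class TestHelper(unittest.TestCase):
    # One autospec'd mock, configured inside the test, instead of a second
    # mock.patch stacked on top of it (which py310 rejects because autospec
    # would then be taken from the first Mock rather than the real function).
    @mock.patch(f'{__name__}.helper', autospec=True)
    def test_specialized_behaviour(self, mock_helper):
        mock_helper.return_value = 7
        self.assertEqual(7, helper())


if __name__ == '__main__':
    unittest.main()
```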

Conflicts:
    nova/tests/unit/api/openstack/compute/test_limits.py
    nova/tests/unit/virt/libvirt/volume/test_lightos.py
    nova/tests/unit/virt/test_block_device.py

Change-Id: I3998d0d49583806ac1c3ae64f1b1fe343cefd20d
(cherry picked from commit f8cf050)
(cherry picked from commit b40bd1b)
github-actions bot pushed a commit that referenced this pull request Jan 25, 2023
In py310, unittest.mock does not allow mocking the same function twice, as
the second mocking will fail to autospec the Mock object created by the
first mocking.

This patch manually fixes the double mocking.

Fixed cases:
1) one of the mocks was totally unnecessary, so it was removed
2) the second mock specialized the behavior of the first, generic mock.
   In this case the second mock is replaced with the configuration of
   the first mock
3) a test case with two test steps mocked the same function for each
   step, with overlapping mocks. Here the overlap was removed so that
   the two mocks exist independently

The get_connection injection in the libvirt functional test needed a
further tweak (yeah, I know it has many already) to act like a single
mock (basically case #2) instead of a temporary re-mocking. Still, the
global nature of the get_connection mocking warrants the special
set / reset logic there.

Conflicts:
    nova/tests/functional/regressions/test_bug_1781286.py
    nova/tests/unit/api/openstack/compute/test_shelve.py

Change-Id: I3998d0d49583806ac1c3ae64f1b1fe343cefd20d
(cherry picked from commit f8cf050)
github-actions bot pushed a commit that referenced this pull request Dec 18, 2023
In py310, unittest.mock does not allow mocking the same function twice, as
the second mocking will fail to autospec the Mock object created by the
first mocking.

This patch manually fixes the double mocking.

Fixed cases:
1) one of the mocks was totally unnecessary, so it was removed
2) the second mock specialized the behavior of the first, generic mock.
   In this case the second mock is replaced with the configuration of
   the first mock
3) a test case with two test steps mocked the same function for each
   step, with overlapping mocks. Here the overlap was removed so that
   the two mocks exist independently

The get_connection injection in the libvirt functional test needed a
further tweak (yeah, I know it has many already) to act like a single
mock (basically case #2) instead of a temporary re-mocking. Still, the
global nature of the get_connection mocking warrants the special
set / reset logic there.

Change-Id: I3998d0d49583806ac1c3ae64f1b1fe343cefd20d
github-actions bot pushed a commit that referenced this pull request Dec 18, 2023
The PciPassthroughFilter logic checks each InstancePCIRequest
individually against the available PCI pools of a given host and given
boot request. So it is possible that the scheduler accepts a host that
has a single PCI device available even if two devices are requested for
a single instance via two separate PCI aliases. Then the PCI claim on
the compute detects this, but it does not stop the boot; it just logs
an ERROR. This results in the instance being booted without any PCI
device.

This patch does two things:
1) changes the PCI claim to fail with an exception and trigger a
   re-schedule instead of just logging an ERROR.
2) changes PciDeviceStats.support_requests, which is called during
   scheduling, so that it does not just filter pools for individual
   requests but also consumes each request from the pools within the
   scope of a single boot request.

The fix in #2) would not be enough on its own, as two parallel scheduling
requests could race for a single device on the same host. #1) is the
ultimate place where we consume devices under a compute-global lock, so
we need the fix there too.
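
A toy sketch of point 2) (the pool and request shapes are simplified stand-ins for PciDeviceStats internals):

```python
def supports_requests(pools, requests):
    """pools: alias -> free device count; requests: list of (alias, count)."""
    remaining = dict(pools)
    for alias, wanted in requests:
        if remaining.get(alias, 0) < wanted:
            return False
        # Consume within the scope of this one boot request, so a second
        # request for the same alias cannot count the same device again.
        remaining[alias] -= wanted
    return True


# A host with a single 'gpu' device no longer passes for two 'gpu' requests.
assert supports_requests({'gpu': 1}, [('gpu', 1), ('gpu', 1)]) is False
assert supports_requests({'gpu': 2}, [('gpu', 1), ('gpu', 1)]) is True
```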

Closes-Bug: #1986838
Change-Id: Iea477be57ae4e95dfc03acc9368f31d4be895343
github-actions bot pushed a commit that referenced this pull request Dec 18, 2023
Each PCI device and each PF is a separate RP in Placement and the
scheduler allocates them specifically, so the PCI filtering and claiming
also need to handle these devices individually. Nova used to pool PCI
devices together if they had the same device_spec, device type and numa
node. This is now changed so that only VFs from the same parent PF are
pooled together. Fortunately nova already handled consuming devices for
a single InstancePCIRequest from multiple PCI pools, so this change does
not affect the device consumption code path.

The test_live_migrate_server_with_neutron test needed to be changed.
Originally this test used a compute with the following config:
* PF 81.00.0
** VFs 81.00.[1-4]
* PF 81.01.0
** VFs 81.01.[1-4]
* PF 82.00.0

And booted a VM that needed one VF and one PF. This request has two
widely different solutions:
1) allocate the VF from under 81.00 and therefore consume 81.00.0 and
   allocate the 82.00.0 PF
   This was what the test asserted to happen.
2) allocate the VF from under 81.00 and therefore consume 81.00.0 and
   allocate the 81.00.0 PF and therefore consume all the VFs under it
   This results in a different number of free devices than #1)

AFAIK nova does not have any implemented preference for consuming PFs
without VFs. The test just worked by chance (some internal device and
pool ordering made it that way). However, when the PCI pools are split,
nova started choosing solution #2), making the test fail. As both
solutions are equally good from nova's scheduling contract perspective,
I don't consider this a behavior change. Therefore the test is updated
so as not to create a situation where two different scheduling solutions
are possible.
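
A simplified sketch of the pooling change (field names are stand-ins, not the actual device_spec or pool keys):

```python
from collections import defaultdict


def pool_key(dev):
    key = (dev['vendor_id'], dev['product_id'], dev['numa_node'], dev['dev_type'])
    if dev['dev_type'] == 'VF':
        key += (dev['parent_addr'],)  # split VF pools per parent PF
    return key


devices = [
    {'vendor_id': 'v', 'product_id': 'p', 'numa_node': 0, 'dev_type': 'VF',
     'parent_addr': '0000:81:00.0'},
    {'vendor_id': 'v', 'product_id': 'p', 'numa_node': 0, 'dev_type': 'VF',
     'parent_addr': '0000:81:01.0'},
]

pools = defaultdict(list)
for dev in devices:
    pools[pool_key(dev)].append(dev)

# Previously these two VFs shared one pool; now each parent PF gets its own.
assert len(pools) == 2
```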

blueprint: pci-device-tracking-in-placement
Change-Id: I4b67cca3807fbda9e9b07b220a28e331def57624
Labels: bug (Something isn't working)
3 participants