
[ironic] Don't remove instance info twice in destroy #1


Merged
merged 1 commit into stackhpc/rocky from bug/1815799-rocky on May 17, 2019

Conversation

markgoddard

During teardown, at the end of the ironic virt driver's destroy method,
we call _cleanup_deploy, which since
https://review.opendev.org/#/c/563722/ removes instance_info from the
ironic node. Given that we're destroying the node, the instance_info
will have been cleared from the node anyway, so we don't need to do
this.

Further, it can cause tear down to fail if automated cleaning is
enabled in ironic. This happens because ironic prevents updates to nodes
that are undergoing a state transition, as is the case during cleaning.
The virt driver retries this request, by default every 2 seconds for up
to 60 attempts. Typically this is not long enough for cleaning to
complete, so tear down fails with the following error:

  Conflict: Node a00696d5-32ba-475e-9528-59bf11cffea6 can not be updated
  while a state transition is in progress. (HTTP 409)

This change skips the instance info update in _cleanup_deploy in the
case of tear down.

Change-Id: Iea337f73c231db2cb9d9f639b92475daaede6793
Closes-Bug: #1815799
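
Below is a minimal sketch of the idea, using simplified, hypothetical
helper names rather than the real nova driver code: the instance_info
removal becomes optional, and destroy opts out of it during tear down.

    # Minimal sketch (hypothetical helper names; not the exact nova code).

    def _remove_instance_info_from_node(node_uuid):
        # Stand-in for the ironic API PATCH that clears instance_info.
        # This is the call that returns HTTP 409 while the node is cleaning.
        print(f"PATCH /v1/nodes/{node_uuid}: remove instance_info")

    def _cleanup_deploy(node_uuid, remove_instance_info=True):
        # VIF unplugging and other cleanup would happen here as before.
        if remove_instance_info:
            _remove_instance_info_from_node(node_uuid)

    def destroy(node_uuid):
        # Tear down: ironic clears instance_info itself when the node is
        # unprovisioned, so skip the PATCH and avoid a 409 during cleaning.
        _cleanup_deploy(node_uuid, remove_instance_info=False)

    destroy("a00696d5-32ba-475e-9528-59bf11cffea6")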

@markgoddard markgoddard self-assigned this May 17, 2019
@markgoddard
Author

Upstream review here: https://review.opendev.org/#/c/659809

Seems to address the issue, so merging.

@markgoddard markgoddard merged commit 826e11d into stackhpc/rocky May 17, 2019
@markgoddard markgoddard deleted the bug/1815799-rocky branch May 17, 2019 15:49
github-actions bot pushed a commit that referenced this pull request Dec 18, 2023
The PciPassthroughFilter logic checks each InstancePCIRequest
individually against the available PCI pools of a given host and boot
request. So it is possible that the scheduler accepts a host that has
only a single PCI device available even when two devices are requested
for a single instance via two separate PCI aliases. The PCI claim on
the compute then detects this, but it does not stop the boot; it just
logs an ERROR. This results in the instance booting without any PCI
device.

This patch does two things:
1) changes the PCI claim to fail with an exception and trigger a
   re-schedule instead of just logging an ERROR.
2) changes PciDeviceStats.support_requests, which is called during
   scheduling, to not just filter pools for individual requests but
   also consume the requests from the pools within the scope of a
   single boot request.

The fix in #2) alone would not be enough, as two parallel scheduling
requests could race for a single device on the same host. #1) is the
ultimate place where we consume devices under a compute-global lock,
so we need the fix there too.

Closes-Bug: #1986838
Change-Id: Iea477be57ae4e95dfc03acc9368f31d4be895343
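
Below is a rough sketch of the consumption idea from 2), using plain
dicts instead of the real PciDeviceStats pool objects (all names and
fields are illustrative only):

    # Sketch only: simplified pools, not the real nova data structures.

    def _pools_for(request, pools):
        # Return the pools matching the request's spec (simplified: all pools).
        return pools

    def support_requests(requests, pools):
        # Check whether all PCI requests of one boot fit, consuming as we go.
        # Work on a copy so a failed check does not change the host state.
        remaining = [dict(pool) for pool in pools]
        for request in requests:
            needed = request["count"]
            for pool in _pools_for(request, remaining):
                take = min(needed, pool["count"])
                pool["count"] -= take  # consume within this boot request
                needed -= take
                if needed == 0:
                    break
            if needed > 0:
                return False
        return True

    # One free device on the host, but two aliases each asking for one device.
    pools = [{"count": 1}]
    requests = [{"count": 1}, {"count": 1}]
    print(support_requests(requests, pools))  # False; a per-request check would pass
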
github-actions bot pushed a commit that referenced this pull request Dec 18, 2023
Each PCI device and each PF is a separate RP in Placement and the
scheduler allocates them specifically, so the PCI filtering and
claiming also need to handle these devices individually. Nova used to
pool PCI devices together if they had the same device_spec, device
type, and NUMA node. This is now changed so that only VFs from the
same parent PF are pooled together. Fortunately nova already handled
consuming devices for a single InstancePCIRequest from multiple PCI
pools, so this change does not affect the device consumption code
path.
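
Below is a rough sketch of the pooling change, with illustrative device
fields rather than the real nova ones: adding the parent PF address to
the pool key splits VFs of different PFs into separate pools.

    # Sketch only: simplified device dicts, not nova's PciDevice objects.
    from collections import defaultdict

    def pool_key(dev):
        # Previously devices were pooled by product/type/NUMA node only; the
        # parent PF address is now part of the key, so only VFs under the
        # same PF share a pool.
        return (dev["product_id"], dev["dev_type"], dev["numa_node"],
                dev.get("parent_addr"))

    devices = [
        {"product_id": "1515", "dev_type": "VF", "numa_node": 0,
         "parent_addr": "0000:81:00.0", "addr": "0000:81:00.1"},
        {"product_id": "1515", "dev_type": "VF", "numa_node": 0,
         "parent_addr": "0000:81:00.0", "addr": "0000:81:00.2"},
        {"product_id": "1515", "dev_type": "VF", "numa_node": 0,
         "parent_addr": "0000:81:01.0", "addr": "0000:81:01.1"},
    ]

    pools = defaultdict(list)
    for dev in devices:
        pools[pool_key(dev)].append(dev["addr"])

    for key, addrs in pools.items():
        print(key, addrs)  # two pools now, split by parent PF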

The test_live_migrate_server_with_neutron test needed to be changed.
Originally this test used a compute with the following config:
* PF 81.00.0
** VFs 81.00.[1-4]
* PF 81.01.0
** VFs 81.01.[1-4]
* PF 82.00.0

And booted a VM that needed one VF and one PF. This request has two
widely different solutions:
1) allocate the VF from under 81.00 and therefore consume 81.00.0, and
   allocate the 82.00.0 PF.
   This is what the test asserted would happen.
2) allocate the VF from under 81.00 and therefore consume 81.00.0, and
   allocate the 81.01.0 PF and therefore consume all the VFs under it.
   This results in a different number of free devices than #1).

AFAIK nova does not have any implemented preference for consuming PFs
without VFs. The test just worked by chance (some internal device and
pool ordering made it that way). However, when the PCI pools are
split, nova started choosing solution #2), making the test fail. As
both solutions are equally good from nova's scheduling contract
perspective, I don't consider this a behavior change. Therefore the
test is updated not to create a situation where two different
scheduling solutions are possible.

blueprint: pci-device-tracking-in-placement
Change-Id: I4b67cca3807fbda9e9b07b220a28e331def57624