.. SPDX-License-Identifier: GPL-2.0

Hibernating Guest VMs
=====================

Background
----------
Linux supports the ability to hibernate itself in order to save power.
Hibernation is sometimes called suspend-to-disk, as it writes a memory
image to disk and puts the hardware into the lowest possible power
state. Upon resume from hibernation, the hardware is restarted and the
memory image is restored from disk so that it can resume execution
where it left off. See the "Hibernation" section of
Documentation/admin-guide/pm/sleep-states.rst.

Hibernation is usually done on devices with a single user, such as a
personal laptop. For example, the laptop goes into hibernation when
the cover is closed, and resumes when the cover is opened again.
Hibernation and resume happen on the same hardware, and Linux kernel
code orchestrating the hibernation steps assumes that the hardware
configuration is not changed while in the hibernated state.

Hibernation can be initiated within Linux by writing "disk" to
/sys/power/state or by invoking the reboot system call with the
appropriate arguments. This functionality may be wrapped by user space
commands such as "systemctl hibernate" that are run directly from a
command line or in response to events such as the laptop lid closing.
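
For illustration, a minimal user-space program that self-initiates
hibernation might look like the following sketch. It must run as root,
and the "try sysfs first, then fall back to the reboot system call"
structure is just one possible approach, not a recommended tool::

  /* Illustrative sketch only: request hibernation from user space. */
  #include <stdio.h>
  #include <sys/reboot.h>

  static int hibernate_via_sysfs(void)
  {
          FILE *f = fopen("/sys/power/state", "w");
          int ret;

          if (!f)
                  return -1;
          /* Writing "disk" requests hibernation (suspend-to-disk). */
          ret = fputs("disk", f);
          if (fclose(f) != 0 || ret < 0)
                  return -1;
          return 0;
  }

  int main(void)
  {
          if (hibernate_via_sysfs() == 0)
                  return 0;

          /*
           * Fall back to the reboot() system call; RB_SW_SUSPEND is the
           * glibc name for LINUX_REBOOT_CMD_SW_SUSPEND.
           */
          if (reboot(RB_SW_SUSPEND) < 0) {
                  perror("hibernate");
                  return 1;
          }
          return 0;
  }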

Considerations for Guest VM Hibernation
---------------------------------------
Linux guests on Hyper-V can also be hibernated, in which case the
hardware is the virtual hardware provided by Hyper-V to the guest VM.
Only the targeted guest VM is hibernated, while other guest VMs and
the underlying Hyper-V host continue to run normally. While the
underlying Windows Hyper-V and physical hardware on which it is
running might also be hibernated using hibernation functionality in
the Windows host, host hibernation and its impact on guest VMs is not
in scope for this documentation.

Resuming a hibernated guest VM can be more challenging than with
physical hardware because VMs make it very easy to change the hardware
configuration between hibernation and resume. Even when the resume
is done on the same VM that hibernated, the memory size might be
changed, or virtual NICs or SCSI controllers might be added or
removed. Virtual PCI devices assigned to the VM might be added or
removed. Most such changes cause the resume steps to fail, though
adding a new virtual NIC, SCSI controller, or vPCI device should work.

Additional complexity can ensue because the disks of the hibernated VM
can be moved to another newly created VM that otherwise has the same
virtual hardware configuration. While it is desirable for resume from
hibernation to succeed after such a move, there are challenges. See
details on this scenario and its limitations in the "Resuming on a
Different VM" section below.

Hyper-V also provides ways to move a VM from one Hyper-V host to
another. Hyper-V tries to ensure processor model and Hyper-V version
compatibility using VM Configuration Versions, and prevents moves to
a host that isn't compatible. Linux adapts to host and processor
differences by detecting them at boot time, but such detection is not
done when resuming execution in the hibernation image. If a VM is
hibernated on one host, then resumed on a host with a different processor
model or Hyper-V version, settings recorded in the hibernation image
may not match the new host. Because Linux does not detect such
mismatches when resuming the hibernation image, undefined behavior
and failures could result.

Enabling Guest VM Hibernation
-----------------------------
Hibernation of a Hyper-V guest VM is disabled by default because
hibernation is incompatible with memory hot-add, as provided by the
Hyper-V balloon driver. If hot-add is used and the VM hibernates, it
hibernates with more memory than it started with. But when the VM
resumes from hibernation, Hyper-V gives the VM only the originally
assigned memory, and the memory size mismatch causes resume to fail.

To enable a Hyper-V VM for hibernation, the Hyper-V administrator must
enable the ACPI virtual S4 sleep state in the ACPI configuration that
Hyper-V provides to the guest VM. Such enablement is accomplished by
modifying a WMI property of the VM, the steps for which are outside
the scope of this documentation but are available on the web.
Enablement is treated as the indicator that the administrator
prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V
balloon driver in Linux disables hot-add. Enablement is indicated if
/sys/power/disk contains "platform" as an option. The enablement is
also visible in /sys/bus/vmbus/hibernation. See function
hv_is_hibernation_supported().
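
For illustration, enablement can also be checked from user space with
nothing more than the sysfs files named above. A minimal sketch,
assuming only those file locations::

  /* Illustrative sketch only: report whether hibernation is enabled. */
  #include <stdio.h>
  #include <string.h>

  static int file_contains(const char *path, const char *word)
  {
          char buf[256];
          size_t n;
          FILE *f = fopen(path, "r");

          if (!f)
                  return 0;
          n = fread(buf, 1, sizeof(buf) - 1, f);
          fclose(f);
          buf[n] = '\0';
          return strstr(buf, word) != NULL;
  }

  int main(void)
  {
          /* "platform" in /sys/power/disk indicates the virtual S4 state. */
          if (file_contains("/sys/power/disk", "platform"))
                  puts("hibernation appears to be enabled for this VM");
          else
                  puts("hibernation appears to be disabled for this VM");

          /*
           * /sys/bus/vmbus/hibernation reports the same information; see
           * hv_is_hibernation_supported() in the kernel.
           */
          return 0;
  }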

Linux supports ACPI sleep states on x86, but not on arm64. So Linux
guest VM hibernation is not available on Hyper-V for arm64.

Initiating Guest VM Hibernation
-------------------------------
Guest VMs can self-initiate hibernation using the standard Linux
methods of writing "disk" to /sys/power/state or the reboot system
call. As an additional layer, Linux guests on Hyper-V support the
"Shutdown" integration service, via which a Hyper-V administrator can
tell a Linux VM to hibernate using a command outside the VM. The
command generates a request to the Hyper-V shutdown driver in Linux,
which sends the uevent "EVENT=hibernate". See kernel functions
shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule
must be provided in the VM that handles this event and initiates
hibernation.
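
For example, a rule along the following lines can be installed in the
guest (illustrative only; the exact match keys and the path to
systemctl vary by distribution)::

  SUBSYSTEM=="vmbus", ACTION=="change", ENV{EVENT}=="hibernate", RUN+="/usr/bin/systemctl hibernate"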

Handling VMBus Devices During Hibernation & Resume
--------------------------------------------------
The VMBus bus driver, and the individual VMBus device drivers,
implement suspend and resume functions that are called as part of the
Linux orchestration of hibernation and of resuming from hibernation.
The overall approach is to leave in place the data structures for the
primary VMBus channels and their associated Linux devices, such as
SCSI controllers and others, so that they are captured in the
hibernation image. This approach allows any state associated with the
device to be persisted across the hibernation/resume. When the VM
resumes, the devices are re-offered by Hyper-V and are connected to
the data structures that already exist in the resumed hibernation
image.

VMBus devices are identified by class and instance GUID. (See section
"VMBus device creation/deletion" in
Documentation/virt/hyperv/vmbus.rst.) Upon resume from hibernation,
the resume functions expect that the devices offered by Hyper-V have
the same class/instance GUIDs as the devices present at the time of
hibernation. Having the same class/instance GUIDs allows the offered
devices to be matched to the primary VMBus channel data structures in
the memory of the now resumed hibernation image. If any devices are
offered that don't match primary VMBus channel data structures that
already exist, they are processed normally as newly added devices. If
primary VMBus channels that exist in the resumed hibernation image are
not matched with a device offered in the resumed VM, the resume
sequence waits for 10 seconds, then proceeds. But the unmatched device
is likely to cause errors in the resumed VM.
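
For reference, the class and instance GUIDs of the VMBus devices
currently present in a VM can be inspected from user space. The sketch
below walks /sys/bus/vmbus/devices and prints each device's class_id
and device_id attributes; the attribute names reflect the current
vmbus sysfs layout and the program is illustrative only::

  /* Illustrative sketch only: print VMBus class and instance GUIDs. */
  #include <dirent.h>
  #include <stdio.h>

  static void print_attr(const char *dev, const char *attr)
  {
          char path[512], line[128];
          FILE *f;

          snprintf(path, sizeof(path), "/sys/bus/vmbus/devices/%s/%s",
                   dev, attr);
          f = fopen(path, "r");
          if (!f)
                  return;
          if (fgets(line, sizeof(line), f))
                  printf("  %s: %s", attr, line);
          fclose(f);
  }

  int main(void)
  {
          DIR *d = opendir("/sys/bus/vmbus/devices");
          struct dirent *e;

          if (!d) {
                  perror("/sys/bus/vmbus/devices");
                  return 1;
          }
          while ((e = readdir(d)) != NULL) {
                  if (e->d_name[0] == '.')
                          continue;
                  printf("%s\n", e->d_name);
                  print_attr(e->d_name, "class_id");   /* class GUID */
                  print_attr(e->d_name, "device_id");  /* instance GUID */
          }
          closedir(d);
          return 0;
  }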

When resuming existing primary VMBus channels, the newly offered
relids might be different because relids can change on each VM boot,
even if the VM configuration hasn't changed. The VMBus bus driver
resume function matches the class/instance GUIDs, and updates the
relids in case they have changed.

VMBus sub-channels are not persisted in the hibernation image. Each
VMBus device driver's suspend function must close any sub-channels
prior to hibernation. Closing a sub-channel causes Hyper-V to send a
RESCIND_CHANNELOFFER message, which Linux processes by freeing the
channel data structures so that all vestiges of the sub-channel are
removed. By contrast, primary channels are marked closed and their
ring buffers are freed, but Hyper-V does not send a rescind message,
so the channel data structure continues to exist. Upon resume, the
device driver's resume function re-allocates the ring buffer and
re-opens the existing channel. It then communicates with Hyper-V to
re-create sub-channels from scratch.

The Linux ends of Hyper-V sockets are forced closed at the time of
hibernation. The guest cannot force the host end of the socket to
close, but any host-side actions on that end will produce an error.

VMBus devices use the same suspend function for the "freeze" and the
"poweroff" phases, and the same resume function for the "thaw" and
"restore" phases. See the "Entering Hibernation" section of
Documentation/driver-api/pm/devices.rst for the sequencing of the
phases.
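
Expressed in generic driver-model terms, this pairing of phases
amounts to reusing one suspend-style callback and one resume-style
callback, roughly as sketched below with the generic struct
dev_pm_ops. The sketch only illustrates the phase mapping; it is not
the actual VMBus implementation, which routes these phases to the
per-driver suspend and resume functions described above::

  /* Illustrative sketch only: one callback pair shared across phases. */
  #include <linux/device.h>
  #include <linux/pm.h>

  static int example_suspend(struct device *dev)
  {
          /* "freeze" and "poweroff": close sub-channels and leave the
           * primary channel closed but with its data structures intact.
           */
          return 0;
  }

  static int example_resume(struct device *dev)
  {
          /* "thaw" and "restore": re-open the primary channel and
           * re-create sub-channels from scratch.
           */
          return 0;
  }

  static const struct dev_pm_ops example_pm_ops = {
          .freeze   = example_suspend,
          .thaw     = example_resume,
          .poweroff = example_suspend,
          .restore  = example_resume,
  };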

Detailed Hibernation Sequence
-----------------------------
1. The Linux power management (PM) subsystem prepares for
   hibernation by freezing user space processes and allocating
   memory to hold the hibernation image.
2. As part of the "freeze" phase, Linux PM calls the "suspend"
   function for each VMBus device in turn. As described above, this
   function removes sub-channels, and leaves the primary channel in
   a closed state.
3. Linux PM calls the "suspend" function for the VMBus bus, which
   closes any Hyper-V socket channels and unloads the top-level
   VMBus connection with the Hyper-V host.
4. Linux PM disables non-boot CPUs, creates the hibernation image in
   the previously allocated memory, then re-enables non-boot CPUs.
   The hibernation image contains the memory data structures for the
   closed primary channels, but no sub-channels.
5. As part of the "thaw" phase, Linux PM calls the "resume" function
   for the VMBus bus, which re-establishes the top-level VMBus
   connection and requests that Hyper-V re-offer the VMBus devices.
   As offers are received for the primary channels, the relids are
   updated as previously described.
6. Linux PM calls the "resume" function for each VMBus device. Each
   device re-opens its primary channel, and communicates with Hyper-V
   to re-establish sub-channels if appropriate. The sub-channels
   are re-created as new channels since they were previously removed
   entirely in Step 2.
7. With VMBus devices now working again, Linux PM writes the
   hibernation image from memory to disk.
8. Linux PM repeats Steps 2 and 3 above as part of the "poweroff"
   phase. VMBus channels are closed and the top-level VMBus
   connection is unloaded.
9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state
   S4. Hibernation is now complete.

Detailed Resume Sequence
------------------------
1. The guest VM boots into a fresh Linux OS instance. During boot,
   the top-level VMBus connection is established, and synthetic
   devices are enabled. This happens via the normal paths that don't
   involve hibernation.
2. Linux PM hibernation code examines swap space to find and read
   the hibernation image into memory. If there is no hibernation
   image, then this boot becomes a normal boot.
3. If this is a resume from hibernation, the "freeze" phase is used
   to shut down VMBus devices and unload the top-level VMBus
   connection in the running fresh OS instance, just like Steps 2
   and 3 in the hibernation sequence.
4. Linux PM disables non-boot CPUs, and transfers control to the
   read-in hibernation image. In the now-running hibernation image,
   non-boot CPUs are restarted.
5. As part of the "resume" phase, Linux PM repeats Steps 5 and 6
   from the hibernation sequence. The top-level VMBus connection is
   re-established, and offers are received and matched to primary
   channels in the image. Relids are updated. VMBus device resume
   functions re-open primary channels and re-create sub-channels.
6. Linux PM exits the hibernation resume sequence and the VM is now
   running normally from the hibernation image.

Key-Value Pair (KVP) Pseudo-Device Anomalies
--------------------------------------------
The VMBus KVP device behaves differently from other pseudo-devices
offered by Hyper-V. When the KVP primary channel is closed, Hyper-V
sends a rescind message, which causes all vestiges of the device to be
removed. But Hyper-V then re-offers the device, causing it to be newly
re-created. The removal and re-creation occur during the "freeze"
phase of hibernation, so the hibernation image contains the re-created
KVP device. Similar behavior occurs during the "freeze" phase of the
resume sequence while still in the fresh OS instance. But in both
cases, the top-level VMBus connection is subsequently unloaded, which
causes the device to be discarded on the Hyper-V side. So no harm is
done and everything still works.

Virtual PCI devices
-------------------
Virtual PCI devices are physical PCI devices that are mapped directly
into the VM's physical address space so the VM can interact directly
with the hardware. vPCI devices include those accessed via what Hyper-V
calls "Discrete Device Assignment" (DDA), as well as SR-IOV NIC
Virtual Functions (VF) devices. See Documentation/virt/hyperv/vpci.rst.

Hyper-V DDA devices are offered to guest VMs after the top-level VMBus
connection is established, just like VMBus synthetic devices. They are
statically assigned to the VM, and their instance GUIDs don't change
unless the Hyper-V administrator makes changes to the configuration.
DDA devices are represented in Linux as virtual PCI devices that have
a VMBus identity as well as a PCI identity. Consequently, Linux guest
hibernation first handles DDA devices as VMBus devices in order to
manage the VMBus channel. But then they are also handled as PCI
devices using the hibernation functions implemented by their native
PCI driver.

SR-IOV NIC VFs also have a VMBus identity as well as a PCI
identity, and overall are processed similarly to DDA devices. A
difference is that VFs are not offered to the VM during initial boot
of the VM. Instead, the VMBus synthetic NIC driver first starts
operating and communicates to Hyper-V that it is prepared to accept a
VF, and then the VF offer is made. However, the VMBus connection
might later be unloaded and then re-established without the VM being
rebooted, as happens in Steps 3 and 5 in the Detailed Hibernation
Sequence above and in the Detailed Resume Sequence. In such a case,
the VFs likely became part of the VM during initial boot, so when the
VMBus connection is re-established, the VFs are offered on the
re-established connection without intervention by the synthetic NIC driver.

UIO Devices
-----------
A VMBus device can be exposed to user space using the Hyper-V UIO
driver (uio_hv_generic.c) so that a user space driver can control and
operate the device. However, the VMBus UIO driver does not support the
suspend and resume operations needed for hibernation. If a VMBus
device is configured to use the UIO driver, hibernating the VM fails
and Linux continues to run normally. The most common use of the Hyper-V
UIO driver is for DPDK networking, but there are other uses as well.
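
As an illustration, an administrator or tooling can check for this
situation before attempting hibernation by looking at which devices
are bound to the uio_hv_generic driver. The sketch below simply lists
any such devices; the path and the name-matching heuristic are
assumptions based on the standard driver-model sysfs layout::

  /* Illustrative sketch only: list devices bound to uio_hv_generic. */
  #include <dirent.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
          const char *path = "/sys/bus/vmbus/drivers/uio_hv_generic";
          struct dirent *e;
          int found = 0;
          DIR *d = opendir(path);

          if (!d) {
                  /* Driver not loaded, so no device can be bound to it. */
                  puts("uio_hv_generic not present; no UIO-bound devices");
                  return 0;
          }
          while ((e = readdir(d)) != NULL) {
                  /* Bound devices appear as links named by instance GUID. */
                  if (e->d_type == DT_LNK && strchr(e->d_name, '-')) {
                          printf("UIO-bound VMBus device: %s\n", e->d_name);
                          found = 1;
                  }
          }
          closedir(d);
          if (found)
                  puts("hibernation will fail while these devices are bound");
          return found;
  }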

Resuming on a Different VM
--------------------------
This scenario occurs in the Azure public cloud, where a hibernated
customer VM exists only as saved configuration and disks -- the VM no
longer exists on any Hyper-V host. When the customer VM is resumed, a
new Hyper-V VM with identical configuration is created, likely on a
different Hyper-V host. That new Hyper-V VM becomes the resumed
customer VM, and the steps the Linux kernel takes to resume from the
hibernation image must work in that new VM.

While the disks and their contents are preserved from the original VM,
the Hyper-V-provided VMBus instance GUIDs of the disk controllers and
other synthetic devices would typically be different. The difference
would cause the resume from hibernation to fail, so several things are
done to solve this problem:

* For VMBus synthetic devices that support only a single instance,
  Hyper-V always assigns the same instance GUIDs. For example, the
  Hyper-V mouse, the shutdown pseudo-device, the time sync
  pseudo-device, etc., always have the same instance GUID, both for
  local Hyper-V installs and in the Azure cloud.

* VMBus synthetic SCSI controllers may have multiple instances in a
  VM, and in the general case instance GUIDs vary from VM to VM.
  However, Azure VMs always have exactly two synthetic SCSI
  controllers, and Azure code overrides the normal Hyper-V behavior
  so these controllers are always assigned the same two instance
  GUIDs. Consequently, when a customer VM is resumed on a newly
  created VM, the instance GUIDs match. But this guarantee does not
  hold for local Hyper-V installs.

* Similarly, VMBus synthetic NICs may have multiple instances in a
  VM, and the instance GUIDs vary from VM to VM. Again, Azure code
  overrides the normal Hyper-V behavior so that the instance GUID
  of a synthetic NIC in a customer VM does not change, even if the
  customer VM is deallocated or hibernated, and then re-constituted
  on a newly created VM. As with SCSI controllers, this behavior
  does not hold for local Hyper-V installs.

* vPCI devices do not have the same instance GUIDs when resuming
  from hibernation on a newly created VM. Consequently, Azure does
  not support hibernation for VMs that have DDA devices such as
  NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the
  VF from the VM before it hibernates so that the hibernation image
  does not contain a VF device. When the VM is resumed, it
  instantiates a new VF, rather than trying to match against a VF
  that is present in the hibernation image. Because Azure must
  remove any VFs before initiating hibernation, Azure VM
  hibernation must be initiated externally from the Azure Portal or
  Azure CLI, which in turn uses the Shutdown integration service to
  tell Linux to do the hibernation. If hibernation is self-initiated
  within the Azure VM, VFs remain in the hibernation image, and are
  not resumed properly.

In summary, Azure takes special actions to remove VFs and to ensure
that VMBus device instance GUIDs match on a new/different VM, allowing
hibernation to work for most general-purpose Azure VM sizes. While
similar special actions could be taken when resuming on a different VM
on a local Hyper-V install, orchestrating such actions is not provided
out-of-the-box by local Hyper-V and so requires custom scripting.