
Commit 2e03358

mhklinux authored and liuw committed
Documentation: hyperv: Add overview of guest VM hibernation
Add documentation on how hibernation works in a guest VM on Hyper-V.
Describe how VMBus devices and the VMBus itself are hibernated and
resumed, along with various limitations.

Signed-off-by: Michael Kelley <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Wei Liu <[email protected]>
Message-ID: <[email protected]>
1 parent f285d99 commit 2e03358

File tree: 2 files changed, +337 -0 lines changed
Documentation/virt/hyperv/hibernation.rst
Lines changed: 336 additions & 0 deletions
@@ -0,0 +1,336 @@

.. SPDX-License-Identifier: GPL-2.0

Hibernating Guest VMs
=====================

Background
----------
Linux supports the ability to hibernate itself in order to save power.
Hibernation is sometimes called suspend-to-disk, as it writes a memory
image to disk and puts the hardware into the lowest possible power
state. Upon resume from hibernation, the hardware is restarted and the
memory image is restored from disk so that it can resume execution
where it left off. See the "Hibernation" section of
Documentation/admin-guide/pm/sleep-states.rst.

Hibernation is usually done on devices with a single user, such as a
personal laptop. For example, the laptop goes into hibernation when
the cover is closed, and resumes when the cover is opened again.
Hibernation and resume happen on the same hardware, and Linux kernel
code orchestrating the hibernation steps assumes that the hardware
configuration is not changed while in the hibernated state.

Hibernation can be initiated within Linux by writing "disk" to
/sys/power/state or by invoking the reboot system call with the
appropriate arguments. This functionality may be wrapped by user space
commands such as "systemctl hibernate" that are run directly from a
command line or in response to events such as the laptop lid closing.
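
For example, a minimal user space program initiating hibernation via
the reboot system call might look like the following sketch; it must
run with the CAP_SYS_BOOT capability::

    /*
     * Sketch only: request hibernation via reboot(2); equivalent to
     * writing "disk" to /sys/power/state.
     */
    #include <stdio.h>
    #include <sys/reboot.h>

    int main(void)
    {
            if (reboot(RB_SW_SUSPEND) < 0) {
                    perror("reboot(RB_SW_SUSPEND)");
                    return 1;
            }
            return 0;   /* reached again after resume from hibernation */
    }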

Considerations for Guest VM Hibernation
---------------------------------------
Linux guests on Hyper-V can also be hibernated, in which case the
hardware is the virtual hardware provided by Hyper-V to the guest VM.
Only the targeted guest VM is hibernated, while other guest VMs and
the underlying Hyper-V host continue to run normally. While the
underlying Windows Hyper-V and physical hardware on which it is
running might also be hibernated using hibernation functionality in
the Windows host, host hibernation and its impact on guest VMs is not
in scope for this documentation.

Resuming a hibernated guest VM can be more challenging than with
physical hardware because VMs make it very easy to change the hardware
configuration between the hibernation and resume. Even when the resume
is done on the same VM that hibernated, the memory size might be
changed, or virtual NICs or SCSI controllers might be added or
removed. Virtual PCI devices assigned to the VM might be added or
removed. Most such changes cause the resume steps to fail, though
adding a new virtual NIC, SCSI controller, or vPCI device should work.

Additional complexity can ensue because the disks of the hibernated VM
can be moved to another newly created VM that otherwise has the same
virtual hardware configuration. While it is desirable for resume from
hibernation to succeed after such a move, there are challenges. See
details on this scenario and its limitations in the "Resuming on a
Different VM" section below.

Hyper-V also provides ways to move a VM from one Hyper-V host to
another. Hyper-V tries to ensure processor model and Hyper-V version
compatibility using VM Configuration Versions, and prevents moves to
a host that isn't compatible. Linux adapts to host and processor
differences by detecting them at boot time, but such detection is not
done when resuming execution in the hibernation image. If a VM is
hibernated on one host, then resumed on a host with a different processor
model or Hyper-V version, settings recorded in the hibernation image
may not match the new host. Because Linux does not detect such
mismatches when resuming the hibernation image, undefined behavior
and failures could result.

Enabling Guest VM Hibernation
-----------------------------
Hibernation of a Hyper-V guest VM is disabled by default because
hibernation is incompatible with memory hot-add, as provided by the
Hyper-V balloon driver. If hot-add is used and the VM hibernates, it
hibernates with more memory than it started with. But when the VM
resumes from hibernation, Hyper-V gives the VM only the originally
assigned memory, and the memory size mismatch causes resume to fail.

To enable a Hyper-V VM for hibernation, the Hyper-V administrator must
enable the ACPI virtual S4 sleep state in the ACPI configuration that
Hyper-V provides to the guest VM. Such enablement is accomplished by
modifying a WMI property of the VM, the steps for which are outside
the scope of this documentation but are available on the web.
Enablement is treated as the indicator that the administrator
prioritizes Linux hibernation in the VM over hot-add, so the Hyper-V
balloon driver in Linux disables hot-add. Enablement is indicated if
the contents of /sys/power/disk contain "platform" as an option. The
enablement is also visible in /sys/bus/vmbus/hibernation. See function
hv_is_hibernation_supported().
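
On x86, that check is essentially the following sketch (simplified;
the actual function lives in the architecture-specific Hyper-V setup
code)::

    #include <linux/acpi.h>

    /*
     * Sketch: hibernation is reported as supported only when running
     * as a guest (not the root partition) and the host has enabled
     * the ACPI S4 sleep state for the VM.
     */
    bool hv_is_hibernation_supported(void)
    {
            return !hv_root_partition &&
                   acpi_sleep_state_supported(ACPI_STATE_S4);
    }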

Linux supports ACPI sleep states on x86, but not on arm64. So Linux
guest VM hibernation is not available on Hyper-V for arm64.

Initiating Guest VM Hibernation
-------------------------------
Guest VMs can self-initiate hibernation using the standard Linux
methods of writing "disk" to /sys/power/state or the reboot system
call. As an additional layer, Linux guests on Hyper-V support the
"Shutdown" integration service, via which a Hyper-V administrator can
tell a Linux VM to hibernate using a command outside the VM. The
command generates a request to the Hyper-V shutdown driver in Linux,
which sends the uevent "EVENT=hibernate". See kernel functions
shutdown_onchannelcallback() and send_hibernate_uevent(). A udev rule
must be provided in the VM that handles this event and initiates
hibernation.
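
The exact rule is distro-specific, but an illustrative rule might look
like the following; the rule file name and RUN command are assumptions,
not kernel requirements::

    # /etc/udev/rules.d/99-hyperv-hibernate.rules (illustrative)
    SUBSYSTEM=="vmbus", ACTION=="change", DRIVER=="hv_utils", \
      ENV{EVENT}=="hibernate", RUN+="/usr/bin/systemctl hibernate"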

Handling VMBus Devices During Hibernation & Resume
--------------------------------------------------
The VMBus bus driver, and the individual VMBus device drivers,
implement suspend and resume functions that are called as part of the
Linux orchestration of hibernation and of resuming from hibernation.
The overall approach is to leave in place the data structures for the
primary VMBus channels and their associated Linux devices, such as
SCSI controllers and others, so that they are captured in the
hibernation image. This approach allows any state associated with the
device to be persisted across the hibernation/resume. When the VM
resumes, the devices are re-offered by Hyper-V and are connected to
the data structures that already exist in the resumed hibernation
image.

VMBus devices are identified by class and instance GUID. (See section
"VMBus device creation/deletion" in
Documentation/virt/hyperv/vmbus.rst.) Upon resume from hibernation,
the resume functions expect that the devices offered by Hyper-V have
the same class/instance GUIDs as the devices present at the time of
hibernation. Having the same class/instance GUIDs allows the offered
devices to be matched to the primary VMBus channel data structures in
the memory of the now resumed hibernation image. If any devices are
offered that don't match primary VMBus channel data structures that
already exist, they are processed normally as newly added devices. If
primary VMBus channels that exist in the resumed hibernation image are
not matched with a device offered in the resumed VM, the resume
sequence waits for 10 seconds, then proceeds. But the unmatched device
is likely to cause errors in the resumed VM.

When resuming existing primary VMBus channels, the newly offered
relids might be different because relids can change on each VM boot,
even if the VM configuration hasn't changed. The VMBus bus driver
resume function matches the class/instance GUIDs, and updates the
relids in case they have changed.
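
Conceptually, the matching and relid update resemble this simplified
fragment; it is not the exact kernel code, though the field names come
from the VMBus channel offer structures::

    /*
     * Simplified sketch: match a re-offered channel against a primary
     * channel that persisted in the hibernation image, then adopt the
     * newly assigned relid.
     */
    static bool offer_matches_channel(
                    const struct vmbus_channel_offer_channel *offer,
                    const struct vmbus_channel *channel)
    {
            return guid_equal(&offer->offer.if_type,
                              &channel->offermsg.offer.if_type) &&
                   guid_equal(&offer->offer.if_instance,
                              &channel->offermsg.offer.if_instance);
    }

    /* On a match: channel->offermsg.child_relid = offer->child_relid; */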

VMBus sub-channels are not persisted in the hibernation image. Each
VMBus device driver's suspend function must close any sub-channels
prior to hibernation. Closing a sub-channel causes Hyper-V to send a
RESCIND_CHANNELOFFER message, which Linux processes by freeing the
channel data structures so that all vestiges of the sub-channel are
removed. By contrast, primary channels are marked closed and their
ring buffers are freed, but Hyper-V does not send a rescind message,
so the channel data structure continues to exist. Upon resume, the
device driver's resume function re-allocates the ring buffer and
re-opens the existing channel. It then communicates with Hyper-V to
re-open sub-channels from scratch.
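
The following sketch shows this pattern for a hypothetical VMBus
device driver; the example_* names are placeholders, not a real
driver, and vmbus_close() on the primary channel also closes any
sub-channels::

    #include <linux/hyperv.h>

    #define EXAMPLE_RING_SIZE (64 * 1024)        /* hypothetical size */

    struct example_device;                       /* hypothetical */
    void example_drain_outstanding_io(struct example_device *edev);
    void example_onchannelcallback(void *context);
    int example_request_subchannels(struct example_device *edev);

    /*
     * "freeze"/"poweroff": quiesce I/O, then close the primary channel.
     * Sub-channels are closed too and are then rescinded by Hyper-V,
     * so no trace of them remains in the hibernation image.
     */
    static int example_suspend(struct hv_device *hv_dev)
    {
            struct example_device *edev = hv_get_drvdata(hv_dev);

            example_drain_outstanding_io(edev);   /* driver specific */
            vmbus_close(hv_dev->channel);
            return 0;
    }

    /*
     * "thaw"/"restore": re-allocate ring buffers and re-open the
     * primary channel that persisted in the image, then ask Hyper-V
     * to create sub-channels again from scratch.
     */
    static int example_resume(struct hv_device *hv_dev)
    {
            struct example_device *edev = hv_get_drvdata(hv_dev);
            int ret;

            ret = vmbus_open(hv_dev->channel, EXAMPLE_RING_SIZE,
                             EXAMPLE_RING_SIZE, NULL, 0,
                             example_onchannelcallback, hv_dev);
            if (ret)
                    return ret;

            return example_request_subchannels(edev); /* driver specific */
    }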

The Linux ends of Hyper-V sockets are forced closed at the time of
hibernation. The guest can't force the host end of the socket to
close, but any subsequent host-side actions on that end will produce
an error.

VMBus devices use the same suspend function for the "freeze" and the
"poweroff" phases, and the same resume function for the "thaw" and
"restore" phases. See the "Entering Hibernation" section of
Documentation/driver-api/pm/devices.rst for the sequencing of the
phases.
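
In driver code, this means a single pair of callbacks is wired into
the .suspend and .resume members of struct hv_driver. Continuing the
hypothetical example above (id_table, probe, and remove are
placeholders)::

    static struct hv_driver example_drv = {
            .name     = "example",
            .id_table = example_id_table,
            .probe    = example_probe,
            .remove   = example_remove,
            /* Same callbacks serve freeze/poweroff and thaw/restore. */
            .suspend  = example_suspend,
            .resume   = example_resume,
    };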

Detailed Hibernation Sequence
-----------------------------
1. The Linux power management (PM) subsystem prepares for
   hibernation by freezing user space processes and allocating
   memory to hold the hibernation image.
2. As part of the "freeze" phase, Linux PM calls the "suspend"
   function for each VMBus device in turn. As described above, this
   function removes sub-channels, and leaves the primary channel in
   a closed state.
3. Linux PM calls the "suspend" function for the VMBus bus, which
   closes any Hyper-V socket channels and unloads the top-level
   VMBus connection with the Hyper-V host.
4. Linux PM disables non-boot CPUs, creates the hibernation image in
   the previously allocated memory, then re-enables non-boot CPUs.
   The hibernation image contains the memory data structures for the
   closed primary channels, but no sub-channels.
5. As part of the "thaw" phase, Linux PM calls the "resume" function
   for the VMBus bus, which re-establishes the top-level VMBus
   connection and requests that Hyper-V re-offer the VMBus devices.
   As offers are received for the primary channels, the relids are
   updated as previously described.
6. Linux PM calls the "resume" function for each VMBus device. Each
   device re-opens its primary channel, and communicates with Hyper-V
   to re-establish sub-channels if appropriate. The sub-channels
   are re-created as new channels since they were previously removed
   entirely in Step 2.
7. With VMBus devices now working again, Linux PM writes the
   hibernation image from memory to disk.
8. Linux PM repeats Steps 2 and 3 above as part of the "poweroff"
   phase. VMBus channels are closed and the top-level VMBus
   connection is unloaded.
9. Linux PM disables non-boot CPUs, and then enters ACPI sleep state
   S4. Hibernation is now complete.

Detailed Resume Sequence
------------------------
1. The guest VM boots into a fresh Linux OS instance. During boot,
   the top-level VMBus connection is established, and synthetic
   devices are enabled. This happens via the normal paths that don't
   involve hibernation.
2. Linux PM hibernation code checks swap space to find and read
   the hibernation image into memory. If there is no hibernation
   image, then this boot becomes a normal boot.
3. If this is a resume from hibernation, the "freeze" phase is used
   to shut down VMBus devices and unload the top-level VMBus
   connection in the running fresh OS instance, just like Steps 2
   and 3 in the hibernation sequence.
4. Linux PM disables non-boot CPUs, and transfers control to the
   read-in hibernation image. In the now-running hibernation image,
   non-boot CPUs are restarted.
5. As part of the "resume" phase, Linux PM repeats Steps 5 and 6
   from the hibernation sequence. The top-level VMBus connection is
   re-established, and offers are received and matched to primary
   channels in the image. Relids are updated. VMBus device resume
   functions re-open primary channels and re-create sub-channels.
6. Linux PM exits the hibernation resume sequence and the VM is now
   running normally from the hibernation image.

Key-Value Pair (KVP) Pseudo-Device Anomalies
--------------------------------------------
The VMBus KVP device behaves differently from other pseudo-devices
offered by Hyper-V. When the KVP primary channel is closed, Hyper-V
sends a rescind message, which causes all vestiges of the device to be
removed. But Hyper-V then re-offers the device, causing it to be newly
re-created. The removal and re-creation occur during the "freeze"
phase of hibernation, so the hibernation image contains the re-created
KVP device. Similar behavior occurs during the "freeze" phase of the
resume sequence while still in the fresh OS instance. But in both
cases, the top-level VMBus connection is subsequently unloaded, which
causes the device to be discarded on the Hyper-V side. So no harm is
done and everything still works.

Virtual PCI devices
-------------------
Virtual PCI devices are physical PCI devices that are mapped directly
into the VM's physical address space so the VM can interact directly
with the hardware. vPCI devices include those accessed via what Hyper-V
calls "Discrete Device Assignment" (DDA), as well as SR-IOV NIC
Virtual Functions (VF) devices. See Documentation/virt/hyperv/vpci.rst.

Hyper-V DDA devices are offered to guest VMs after the top-level VMBus
connection is established, just like VMBus synthetic devices. They are
statically assigned to the VM, and their instance GUIDs don't change
unless the Hyper-V administrator makes changes to the configuration.
DDA devices are represented in Linux as virtual PCI devices that have
a VMBus identity as well as a PCI identity. Consequently, Linux guest
hibernation first handles DDA devices as VMBus devices in order to
manage the VMBus channel. But then they are also handled as PCI
devices using the hibernation functions implemented by their native
PCI driver.

SR-IOV NIC VFs also have a VMBus identity as well as a PCI
identity, and overall are processed similarly to DDA devices. A
difference is that VFs are not offered to the VM during initial boot
of the VM. Instead, the VMBus synthetic NIC driver first starts
operating and communicates to Hyper-V that it is prepared to accept a
VF, and then the VF offer is made. However, the VMBus connection
might later be unloaded and then re-established without the VM being
rebooted, as happens in Steps 3 and 5 in the Detailed Hibernation
Sequence above and in the Detailed Resume Sequence. In such a case,
the VFs likely became part of the VM during initial boot, so when the
VMBus connection is re-established, the VFs are offered on the
re-established connection without intervention by the synthetic NIC driver.

UIO Devices
-----------
A VMBus device can be exposed to user space using the Hyper-V UIO
driver (uio_hv_generic.c) so that a user space driver can control and
operate the device. However, the VMBus UIO driver does not support the
suspend and resume operations needed for hibernation. If a VMBus
device is configured to use the UIO driver, hibernating the VM fails
and Linux continues to run normally. The most common use of the Hyper-V
UIO driver is for DPDK networking, but there are other uses as well.

Resuming on a Different VM
--------------------------
This scenario occurs in the Azure public cloud, where a hibernated
customer VM exists only as saved configuration and disks -- the VM no
longer exists on any Hyper-V host. When the customer VM is resumed, a
new Hyper-V VM with identical configuration is created, likely on a
different Hyper-V host. That new Hyper-V VM becomes the resumed
customer VM, and the steps the Linux kernel takes to resume from the
hibernation image must work in that new VM.

While the disks and their contents are preserved from the original VM,
the Hyper-V-provided VMBus instance GUIDs of the disk controllers and
other synthetic devices would typically be different. The difference
would cause the resume from hibernation to fail, so several things are
done to solve this problem:

* For VMBus synthetic devices that support only a single instance,
  Hyper-V always assigns the same instance GUIDs. For example, the
  Hyper-V mouse, the shutdown pseudo-device, the time sync
  pseudo-device, etc., always have the same instance GUID, both for
  local Hyper-V installs as well as in the Azure cloud.

* VMBus synthetic SCSI controllers may have multiple instances in a
  VM, and in the general case instance GUIDs vary from VM to VM.
  However, Azure VMs always have exactly two synthetic SCSI
  controllers, and Azure code overrides the normal Hyper-V behavior
  so these controllers are always assigned the same two instance
  GUIDs. Consequently, when a customer VM is resumed on a newly
  created VM, the instance GUIDs match. But this guarantee does not
  hold for local Hyper-V installs.

* Similarly, VMBus synthetic NICs may have multiple instances in a
  VM, and the instance GUIDs vary from VM to VM. Again, Azure code
  overrides the normal Hyper-V behavior so that the instance GUID
  of a synthetic NIC in a customer VM does not change, even if the
  customer VM is deallocated or hibernated, and then re-constituted
  on a newly created VM. As with SCSI controllers, this behavior
  does not hold for local Hyper-V installs.

* vPCI devices do not have the same instance GUIDs when resuming
  from hibernation on a newly created VM. Consequently, Azure does
  not support hibernation for VMs that have DDA devices such as
  NVMe controllers or GPUs. For SR-IOV NIC VFs, Azure removes the
  VF from the VM before it hibernates so that the hibernation image
  does not contain a VF device. When the VM is resumed it
  instantiates a new VF, rather than trying to match against a VF
  that is present in the hibernation image. Because Azure must
  remove any VFs before initiating hibernation, Azure VM
  hibernation must be initiated externally from the Azure Portal or
  Azure CLI, which in turn uses the Shutdown integration service to
  tell Linux to do the hibernation. If hibernation is self-initiated
  within the Azure VM, VFs remain in the hibernation image, and are
  not resumed properly.

In summary, Azure takes special actions to remove VFs and to ensure
that VMBus device instance GUIDs match on a new/different VM, allowing
hibernation to work for most general-purpose Azure VM sizes. While
similar special actions could be taken when resuming on a different VM
on a local Hyper-V install, orchestrating such actions is not provided
out-of-the-box by local Hyper-V and so requires custom scripting.

Documentation/virt/hyperv/index.rst

Lines changed: 1 addition & 0 deletions
@@ -11,4 +11,5 @@ Hyper-V Enlightenments
    vmbus
    clocks
    vpci
+   hibernation
    coco
