Commit 8e0aa6d

maheshsal authored and ozbenh committed
fadump: Add documentation for firmware-assisted dump.
Documentation for firmware-assisted dump. This document is based on the original documentation written for phyp assisted dump by Linas Vepstas and Manish Ahuja, with a few changes to reflect the current implementation.

Signed-off-by: Mahesh Salgaonkar <[email protected]>
Signed-off-by: Benjamin Herrenschmidt <[email protected]>
1 parent e55d7f7 commit 8e0aa6d

1 file changed: 270 additions, 0 deletions
@@ -0,0 +1,270 @@
1+
2+
Firmware-Assisted Dump
3+
------------------------
4+
July 2011
5+
6+
The goal of firmware-assisted dump is to enable the dump of
7+
a crashed system, and to do so from a fully-reset system, and
8+
to minimize the total elapsed time until the system is back
9+
in production use.
10+
11+
- Firmware-assisted dump (fadump) infrastructure is intended to replace
12+
the existing phyp assisted dump.
13+
- Fadump uses the same firmware interfaces and memory reservation model
14+
as phyp assisted dump.
15+
- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore
16+
in the ELF format in the same way as kdump. This helps us reuse the
17+
kdump infrastructure for dump capture and filtering.
18+
- Unlike phyp dump, userspace tools do not need to refer to any sysfs
19+
interface while reading /proc/vmcore.
20+
- Unlike phyp dump, fadump allows the user to release all the memory reserved
21+
for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem.
22+
- Once enabled through a kernel boot parameter, fadump can be
23+
started/stopped through /sys/kernel/fadump_registered interface (see
24+
sysfs files section below) and can be easily integrated with kdump
25+
service start/stop init scripts.
26+
27+
Compared with kdump or other strategies, firmware-assisted
28+
dump offers several strong, practical advantages:
29+
30+
-- Unlike kdump, the system has been reset, and loaded
31+
with a fresh copy of the kernel. In particular,
32+
PCI and I/O devices have been reinitialized and are
33+
in a clean, consistent state.
34+
-- Once the dump is copied out, the memory that held the dump
35+
is immediately available to the running kernel. Therefore,
36+
unlike kdump, fadump doesn't need a 2nd reboot to return
37+
the system to the production configuration.
38+
39+
The above can only be accomplished by coordination with,
40+
and assistance from the Power firmware. The procedure is
41+
as follows:
42+
43+
-- The first kernel registers the sections of memory with the
44+
Power firmware for dump preservation during OS initialization.
45+
These registered sections of memory are reserved by the first
46+
kernel during early boot.
47+
48+
-- When a system crashes, the Power firmware will save
49+
the low memory (boot memory, sized at the larger of 5% of system RAM
50+
or 256MB) of RAM to the previously registered region. It will
51+
also save system registers and hardware PTEs.
52+
53+
NOTE: The term 'boot memory' means the size of the low memory chunk
54+
that is required for a kernel to boot successfully when
55+
booted with restricted memory. By default, the boot memory
56+
size will be the larger of 5% of system RAM or 256MB.
57+
Alternatively, the user can specify the boot memory size
58+
through the boot parameter 'fadump_reserve_mem=', which will
59+
override the default calculated size. Use this option
60+
if the default boot memory size is not sufficient for the second
61+
kernel to boot successfully.
62+
63+
-- After the low memory (boot memory) area has been saved, the
64+
firmware will reset PCI and other hardware state. It will
65+
*not* clear the RAM. It will then launch the bootloader, as
66+
normal.
67+
68+
-- The freshly booted kernel will notice that there is a new
69+
node (ibm,dump-kernel) in the device tree, indicating that
70+
there is crash data available from a previous boot. During
71+
early boot, the OS will reserve the rest of the memory above
72+
the boot memory size, effectively booting with restricted memory
73+
size. This will make sure that the second kernel will not
74+
touch any of the dump memory area.
75+
76+
-- User-space tools will read /proc/vmcore to obtain the contents
77+
of memory, which holds the previously crashed kernel's dump in ELF
78+
format. The userspace tools may copy this info to disk, or
79+
network, NAS, SAN, iSCSI, etc. as desired.
80+
81+
-- Once the userspace tool is done saving the dump, it will echo
82+
'1' to /sys/kernel/fadump_release_mem to release the reserved
83+
memory back to general use, except the memory required for
84+
the next firmware-assisted dump registration.
85+
86+
e.g.
87+
# echo 1 > /sys/kernel/fadump_release_mem
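
The last two steps of the procedure above can be sketched as a small
shell function. This is an illustration only, not the actual kdump
script: the function name save_and_release, its arguments and the use
of a plain cp are assumptions, and real tools usually filter the dump
rather than copy it raw. The paths default to the files named in this
document and can be overridden for testing.

```shell
#!/bin/sh
# Hypothetical helper: copy the dump out, then release the reserved memory.
#   $1 = destination file
#   $2 = dump source (default: /proc/vmcore)
#   $3 = release control file (default: /sys/kernel/fadump_release_mem)
save_and_release() {
    dest=$1
    vmcore=${2:-/proc/vmcore}
    release=${3:-/sys/kernel/fadump_release_mem}

    # The release file exists only when a dump is active.
    [ -e "$release" ] || { echo "no active fadump" >&2; return 1; }

    # Release the memory only if the copy succeeded.
    cp "$vmcore" "$dest" || return 1
    echo 1 > "$release"
}
```

Releasing only after a successful copy matters because the dump
contents are no longer preserved once the memory is handed back.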
88+
89+
Please note that the firmware-assisted dump feature
90+
is only available on Power6 and later systems with recent
91+
firmware versions.
92+
93+
Implementation details:
94+
----------------------
95+
96+
During boot, a check is made to see if firmware supports
97+
this feature on that particular machine. If it does, then
98+
we check to see if an active dump is waiting for us. If so,
99+
all RAM except the boot memory is reserved during
100+
early boot (see Fig. 2). This area is released once the
101+
user-land scripts (e.g. kdump scripts) have finished
102+
collecting the dump. If there is dump data, then the
103+
/sys/kernel/fadump_release_mem file is created, and the reserved
104+
memory is held.
105+
106+
If there is no waiting dump data, then only the memory required
107+
to hold CPU state, HPTE region, boot memory dump and elfcore
108+
header is reserved at the top of memory (see Fig. 1). This area
109+
is *not* released: this region will be kept permanently reserved,
110+
so that it can act as a receptacle for a copy of the boot memory
111+
content in addition to CPU state and HPTE region, in case a
112+
crash does occur.
113+
114+
o Memory Reservation during first kernel
115+
116+
  Low memory                                        Top of memory
  0      boot memory size                                       |
  |           |                       |<--Reserved dump area -->|
  V           V                       |   Permanent Reservation V
  +-----------+----------/ /----------+---+----+-----------+----+
  |           |                       |CPU|HPTE|  DUMP     |ELF |
  +-----------+----------/ /----------+---+----+-----------+----+
        |                                           ^
        |                                           |
        \                                           /
         -------------------------------------------
          Boot memory content gets transferred to
          reserved area by firmware at the time of
          crash
                           Fig. 1
131+
132+
o Memory Reservation during second kernel after crash
133+
134+
  Low memory                                        Top of memory
  0      boot memory size                                       |
  |           |<------------- Reserved dump area ----------- -->|
  V           V                                                 V
  +-----------+----------/ /----------+---+----+-----------+----+
  |           |                       |CPU|HPTE|  DUMP     |ELF |
  +-----------+----------/ /----------+---+----+-----------+----+
        |                                              |
        V                                              V
   Used by second                                /proc/vmcore
   kernel to boot
                           Fig. 2
146+
147+
Currently the dump will be copied from /proc/vmcore to a
148+
new file upon user intervention. The dump data available through
149+
/proc/vmcore will be in ELF format. Hence the existing kdump
150+
infrastructure (kdump scripts) to save the dump works fine with
151+
minor modifications.
152+
153+
The tools to examine the dump will be the same as the ones
154+
used for kdump.
155+
156+
How to enable firmware-assisted dump (fadump):
157+
-------------------------------------
158+
159+
1. Set config option CONFIG_FA_DUMP=y and build kernel.
160+
2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option.
161+
3. Optionally, the user can also set the 'fadump_reserve_mem=' kernel cmdline
162+
option to specify the size of the memory to reserve for boot memory dump
163+
preservation.
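
The default that 'fadump_reserve_mem=' overrides follows the rule
stated earlier (the larger of 5% of system RAM or 256MB). A minimal
sketch of that rule, assuming sizes in whole megabytes and a
hypothetical function name, is shown here. The kernel's own
calculation may round or align the result differently.

```shell
#!/bin/sh
# Hypothetical helper: default boot memory size in MB for a given
# amount of system RAM in MB (larger of 5% of RAM or 256MB).
# Integer arithmetic truncates any fractional megabyte.
default_boot_mem_mb() {
    ram_mb=$1
    five_pct=$((ram_mb * 5 / 100))
    if [ "$five_pct" -gt 256 ]; then
        echo "$five_pct"
    else
        echo 256
    fi
}

default_boot_mem_mb 2048     # 5% is about 102MB, so the 256MB floor applies
default_boot_mem_mb 16384    # 5% is about 819MB, which exceeds the floor
```

If the second kernel fails to boot with the size this yields, a larger
value can be passed through 'fadump_reserve_mem='.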
164+
165+
NOTE: If firmware-assisted dump fails to reserve memory, then it will
166+
fall back to the existing kdump mechanism if the 'crashkernel=' option
167+
is set on the kernel cmdline.
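
A script can check whether the running kernel was booted with
'fadump=on' by inspecting the kernel command line. The following is a
rough sketch. The function name is hypothetical, and it only tests for
the option on the command line, not whether the reservation actually
succeeded.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line contains the
# exact word 'fadump=on'. With no argument, /proc/cmdline is read.
cmdline_has_fadump_on() {
    if [ $# -gt 0 ]; then
        cmdline=$1
    else
        cmdline=$(cat /proc/cmdline)
    fi
    # Pad with spaces so that only whole words match.
    case " $cmdline " in
        *" fadump=on "*) return 0 ;;
    esac
    return 1
}
```

Whether fadump is really enabled is better read from
/sys/kernel/fadump_enabled, described in the next section.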
168+
169+
Sysfs/debugfs files:
170+
------------
171+
172+
The firmware-assisted dump feature uses the sysfs file system to hold
173+
the control files and a debugfs file to display the reserved memory regions.
174+
175+
Here is the list of files under kernel sysfs:
176+
177+
/sys/kernel/fadump_enabled
178+
179+
This is used to display the fadump status.
180+
0 = fadump is disabled
181+
1 = fadump is enabled
182+
183+
This interface can be used by kdump init scripts to identify if
184+
fadump is enabled in the kernel and act accordingly.
185+
186+
/sys/kernel/fadump_registered
187+
188+
This is used to display the fadump registration status as well
189+
as to control (start/stop) the fadump registration.
190+
0 = fadump is not registered.
191+
1 = fadump is registered and ready to handle system crash.
192+
193+
To register fadump, echo 1 > /sys/kernel/fadump_registered; to
194+
un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered.
195+
Once fadump is un-registered, a system crash will not
196+
be handled and vmcore will not be captured. This interface can be
197+
easily integrated with kdump service start/stop.
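
As an illustration of such an integration, a service script could wrap
the two files roughly as follows. This is a sketch under assumptions:
the function name fadump_service and its optional directory argument
(used here only so the logic can be exercised outside /sys) are
hypothetical, and a real init script would add logging and the kdump
fallback.

```shell
#!/bin/sh
# Hypothetical init-script helper.
#   $1 = 'start' or 'stop'
#   $2 = directory holding the control files (default: /sys/kernel)
fadump_service() {
    action=$1
    sysdir=${2:-/sys/kernel}

    # Act only when the kernel reports fadump as enabled.
    [ "$(cat "$sysdir/fadump_enabled" 2>/dev/null)" = "1" ] || return 1

    case $action in
        start) echo 1 > "$sysdir/fadump_registered" ;;
        stop)  echo 0 > "$sysdir/fadump_registered" ;;
        *)     echo "usage: fadump_service start|stop [dir]" >&2
               return 2 ;;
    esac
}
```

A non-zero return when fadump is disabled lets the caller fall back to
its normal kdump handling.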
198+
199+
/sys/kernel/fadump_release_mem
200+
201+
This file is available only when fadump is active during
202+
second kernel. This is used to release the reserved memory
203+
regions that are held for saving the crash dump. To release the
204+
reserved memory echo 1 to it:
205+
206+
echo 1 > /sys/kernel/fadump_release_mem
207+
208+
After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region
209+
file will change to reflect the new memory reservations.
210+
211+
The existing userspace tools (kdump infrastructure) can be easily
212+
enhanced to use this interface to release the memory reserved for
213+
dump and continue without a 2nd reboot.
214+
215+
Here is the list of files under powerpc debugfs:
216+
(Assuming debugfs is mounted on /sys/kernel/debug directory.)
217+
218+
/sys/kernel/debug/powerpc/fadump_region
219+
220+
This file shows the reserved memory regions if fadump is
221+
enabled; otherwise this file is empty. The output format
222+
is:
223+
<region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>
224+
225+
e.g.
226+
Contents when fadump is registered during first kernel
227+
228+
# cat /sys/kernel/debug/powerpc/fadump_region
229+
CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0
230+
HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0
231+
DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0
232+
233+
Contents when fadump is active during second kernel
234+
235+
# cat /sys/kernel/debug/powerpc/fadump_region
236+
CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020
237+
HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000
238+
DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000
239+
: [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000
240+
241+
NOTE: Please refer to Documentation/filesystems/debugfs.txt on
242+
how to mount the debugfs filesystem.
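
The output format above is simple enough to post-process in a script.
As a sketch, the following function sums the <reserved-size> fields of
whatever is fed to it on standard input. The function name is
hypothetical, and it assumes each line has the exact layout shown in
the examples.

```shell
#!/bin/sh
# Hypothetical helper: read fadump_region output on standard input and
# print the total of the <reserved-size> fields in bytes (decimal).
fadump_reserved_total() {
    total=0
    # The size is the hex word between "]" and "bytes,".
    for size in $(sed -n 's/.*\] *\(0x[0-9a-fA-F]*\) bytes,.*/\1/p'); do
        total=$((total + $size))
    done
    echo "$total"
}
```

Typical use would be
fadump_reserved_total < /sys/kernel/debug/powerpc/fadump_region.
For the first-kernel example above, the three sizes add up to
0x10041020, that is 268701728 bytes.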
243+
244+
245+
TODO:
246+
-----
247+
o Need to come up with a better approach to find out a more
248+
accurate boot memory size that is required for a kernel to
249+
boot successfully when booted with restricted memory.
250+
o The fadump implementation introduces a fadump crash info structure
251+
in the scratch area before the ELF core header. The idea of introducing
252+
this structure is to pass some important crash info data to the second
253+
kernel, which will help the second kernel populate the ELF core header with
254+
correct data before it gets exported through /proc/vmcore. The current
255+
design implementation does not address the possibility of introducing
256+
additional fields (in future) to this structure without affecting
257+
compatibility. Need to come up with a better approach to address this.
258+
The possible approaches are:
259+
1. Introduce a version field for version tracking, and bump up the version
260+
whenever a new field is added to the structure in future. The version
261+
field can be used to find out what fields are valid for the current
262+
version of the structure.
263+
2. Reserve an area of predefined size (say PAGE_SIZE) for this
264+
structure and keep the unused area reserved (initialized to zero)
265+
for future field additions.
266+
The advantage of approach 1 over 2 is that we don't need to reserve extra space.
267+
---
268+
Author: Mahesh Salgaonkar <[email protected]>
269+
This document is based on the original documentation written for phyp
270+
assisted dump by Linas Vepstas and Manish Ahuja.
