|
| 1 | + |
| 2 | + Firmware-Assisted Dump |
| 3 | + ------------------------ |
| 4 | + July 2011 |
| 5 | + |
| 6 | +The goal of firmware-assisted dump is to enable the dump of |
| 7 | +a crashed system, and to do so from a fully-reset system, and |
| 8 | +to minimize the total elapsed time until the system is back |
| 9 | +in production use. |
| 10 | + |
| 11 | +- Firmware assisted dump (fadump) infrastructure is intended to replace |
| 12 | + the existing phyp assisted dump. |
| 13 | +- Fadump uses the same firmware interfaces and memory reservation model |
| 14 | + as phyp assisted dump. |
| 15 | +- Unlike phyp dump, fadump exports the memory dump through /proc/vmcore |
| 16 | + in the ELF format in the same way as kdump. This helps us reuse the |
| 17 | + kdump infrastructure for dump capture and filtering. |
| 18 | +- Unlike phyp dump, userspace tools do not need to refer to any sysfs |
| 19 | + interface while reading /proc/vmcore. |
| 20 | +- Unlike phyp dump, fadump allows the user to release all the memory reserved |
| 21 | + for dump with a single operation: echo 1 > /sys/kernel/fadump_release_mem. |
| 22 | +- Once enabled through the kernel boot parameter, fadump can be |
| 23 | + started/stopped through /sys/kernel/fadump_registered interface (see |
| 24 | + sysfs files section below) and can be easily integrated with kdump |
| 25 | + service start/stop init scripts. |
| 26 | + |
| 27 | +Compared with kdump or other strategies, firmware-assisted |
| 28 | +dump offers several strong, practical advantages: |
| 29 | + |
| 30 | +-- Unlike kdump, the system has been reset, and loaded |
| 31 | + with a fresh copy of the kernel. In particular, |
| 32 | + PCI and I/O devices have been reinitialized and are |
| 33 | + in a clean, consistent state. |
| 34 | +-- Once the dump is copied out, the memory that held the dump |
| 35 | + is immediately available to the running kernel. Therefore, |
| 36 | + unlike kdump, fadump does not need a second reboot to return |
| 37 | + the system to its production configuration. |
| 38 | + |
| 39 | +The above can only be accomplished by coordination with, |
| 40 | +and assistance from, the Power firmware. The procedure is |
| 41 | +as follows: |
| 42 | + |
| 43 | +-- The first kernel registers the sections of memory with the |
| 44 | + Power firmware for dump preservation during OS initialization. |
| 45 | + These registered sections of memory are reserved by the first |
| 46 | + kernel during early boot. |
| 47 | + |
| 48 | +-- When a system crashes, the Power firmware will save |
| 49 | + the low memory (boot memory, sized at the larger of 5% of system |
| 50 | + RAM or 256MB) to the previously registered region. It will |
| 51 | + also save system registers and hardware PTEs. |
| 52 | + |
| 53 | + NOTE: The term 'boot memory' means the size of the low memory chunk |
| 54 | + that is required for a kernel to boot successfully when |
| 55 | + booted with restricted memory. By default, the boot memory |
| 56 | + size will be the larger of 5% of system RAM or 256MB. |
| 57 | + Alternatively, the user can specify the boot memory size |
| 58 | + through the boot parameter 'fadump_reserve_mem=', which will |
| 59 | + override the default calculated size. Use this option |
| 60 | + if the default boot memory size is not sufficient for the second |
| 61 | + kernel to boot successfully. |
| 62 | + |
| 63 | +-- After the low memory (boot memory) area has been saved, the |
| 64 | + firmware will reset PCI and other hardware state. It will |
| 65 | + *not* clear the RAM. It will then launch the bootloader, as |
| 66 | + normal. |
| 67 | + |
| 68 | +-- The freshly booted kernel will notice that there is a new |
| 69 | + node (ibm,dump-kernel) in the device tree, indicating that |
| 70 | + there is crash data available from a previous boot. During |
| 71 | + early boot, the OS will reserve the rest of the memory above |
| 72 | + the boot memory size, effectively booting with a restricted |
| 73 | + memory size. This ensures that the second kernel will not |
| 74 | + touch any of the dump memory area. |
| 75 | + |
| 76 | +-- User-space tools will read /proc/vmcore to obtain the contents |
| 77 | + of memory, which holds the previously crashed kernel's dump in ELF |
| 78 | + format. The userspace tools may copy this info to disk or to |
| 79 | + network storage (NAS, SAN, iSCSI, etc.) as desired. |
| 80 | + |
| 81 | +-- Once the userspace tool is done saving the dump, it will echo |
| 82 | + '1' to /sys/kernel/fadump_release_mem to release the reserved |
| 83 | + memory back to general use, except the memory required for |
| 84 | + the next firmware-assisted dump registration. |
| 85 | + |
| 86 | + e.g. |
| 87 | + # echo 1 > /sys/kernel/fadump_release_mem |
| 88 | + |
| 89 | +Please note that the firmware-assisted dump feature |
| 90 | +is only available on Power6 and above systems with recent |
| 91 | +firmware versions. |
| 92 | + |
| 93 | +Implementation details: |
| 94 | +----------------------- |
| 95 | + |
| 96 | +During boot, a check is made to see if firmware supports |
| 97 | +this feature on that particular machine. If it does, then |
| 98 | +we check to see if an active dump is waiting for us. If yes, |
| 99 | +then all RAM above the boot memory size is reserved during |
| 100 | +early boot (see Fig. 2). This area is released once the |
| 101 | +user-land scripts (e.g. kdump scripts) have finished |
| 102 | +collecting the dump. If there is dump data, then the |
| 103 | +/sys/kernel/fadump_release_mem file is created, and the reserved |
| 104 | +memory is held. |
| 105 | + |
| 106 | +If there is no waiting dump data, then only the memory required |
| 107 | +to hold CPU state, HPTE region, boot memory dump and elfcore |
| 108 | +header is reserved at the top of memory (see Fig. 1). This area |
| 109 | +is *not* released: this region will be kept permanently reserved, |
| 110 | +so that it can act as a receptacle for a copy of the boot memory |
| 111 | +content in addition to CPU state and HPTE region, in case a |
| 112 | +crash does occur. |
| 113 | + |
| 114 | +  o Memory Reservation during first kernel |
| 115 | + |
| 116 | +  Low memory                                        Top of memory |
| 117 | +  0      boot memory size                                       | |
| 118 | +  |           |                       |<--Reserved dump area -->| |
| 119 | +  V           V                       |   Permanent Reservation V |
| 120 | +  +-----------+----------/ /----------+---+----+-----------+----+ |
| 121 | +  |           |                       |CPU|HPTE|  DUMP     |ELF | |
| 122 | +  +-----------+----------/ /----------+---+----+-----------+----+ |
| 123 | +        |                                           ^ |
| 124 | +        |                                           | |
| 125 | +        \                                           / |
| 126 | +         ------------------------------------------- |
| 127 | +          Boot memory content gets transferred to |
| 128 | +          reserved area by firmware at the time of |
| 129 | +          crash |
| 130 | +                   Fig. 1 |
| 131 | + |
| 132 | +  o Memory Reservation during second kernel after crash |
| 133 | + |
| 134 | +  Low memory                                        Top of memory |
| 135 | +  0      boot memory size                                       | |
| 136 | +  |           |<------------- Reserved dump area ----------- -->| |
| 137 | +  V           V                                                 V |
| 138 | +  +-----------+----------/ /----------+---+----+-----------+----+ |
| 139 | +  |           |                       |CPU|HPTE|  DUMP     |ELF | |
| 140 | +  +-----------+----------/ /----------+---+----+-----------+----+ |
| 141 | +        |                                                    | |
| 142 | +        V                                                    V |
| 143 | +   Used by second                                    /proc/vmcore |
| 144 | +   kernel to boot |
| 145 | +                   Fig. 2 |
| 146 | + |
| 147 | +Currently the dump will be copied from /proc/vmcore to a |
| 148 | +new file upon user intervention. The dump data available through |
| 149 | +/proc/vmcore will be in ELF format. Hence the existing kdump |
| 150 | +infrastructure (kdump scripts) to save the dump works fine with |
| 151 | +minor modifications. |
| 152 | + |
| 153 | +The tools to examine the dump will be the same as the ones |
| 154 | +used for kdump. |
| 155 | + |
| 156 | +How to enable firmware-assisted dump (fadump): |
| 157 | +---------------------------------------------- |
| 158 | + |
| 159 | +1. Set config option CONFIG_FA_DUMP=y and build the kernel. |
| 160 | +2. Boot into the Linux kernel with the 'fadump=on' kernel cmdline option. |
| 161 | +3. Optionally, the user can also set the 'fadump_reserve_mem=' kernel cmdline |
| 162 | + option to specify the size of the memory to reserve for boot memory dump |
| 163 | + preservation. |
| 164 | + |
| 165 | +NOTE: If firmware-assisted dump fails to reserve memory, then it will |
| 166 | + fall back to the existing kdump mechanism if the 'crashkernel=' option |
| 167 | + is set on the kernel cmdline. |
| 168 | + |
| 169 | +Sysfs/debugfs files: |
| 170 | +-------------------- |
| 171 | + |
| 172 | +The firmware-assisted dump feature uses the sysfs file system to hold |
| 173 | +the control files and a debugfs file to display the reserved memory regions. |
| 174 | + |
| 175 | +Here is the list of files under kernel sysfs: |
| 176 | + |
| 177 | + /sys/kernel/fadump_enabled |
| 178 | + |
| 179 | + This is used to display the fadump status. |
| 180 | + 0 = fadump is disabled |
| 181 | + 1 = fadump is enabled |
| 182 | + |
| 183 | + This interface can be used by kdump init scripts to identify if |
| 184 | + fadump is enabled in the kernel and act accordingly. |
| 185 | + |
| 186 | + /sys/kernel/fadump_registered |
| 187 | + |
| 188 | + This is used to display the fadump registration status as well |
| 189 | + as to control (start/stop) the fadump registration. |
| 190 | + 0 = fadump is not registered. |
| 191 | + 1 = fadump is registered and ready to handle system crash. |
| 192 | + |
| 193 | + To register fadump, echo 1 > /sys/kernel/fadump_registered; to |
| 194 | + un-register and stop fadump, echo 0 > /sys/kernel/fadump_registered. |
| 195 | + Once fadump is un-registered, a system crash will not be handled |
| 196 | + and no vmcore will be captured. This interface can be |
| 197 | + easily integrated with kdump service start/stop. |
| 198 | + |
| 199 | + /sys/kernel/fadump_release_mem |
| 200 | + |
| 201 | + This file is available only when fadump is active during the |
| 202 | + second kernel. It is used to release the reserved memory |
| 203 | + regions that are held for saving the crash dump. To release the |
| 204 | + reserved memory, echo 1 to it: |
| 205 | + |
| 206 | + echo 1 > /sys/kernel/fadump_release_mem |
| 207 | + |
| 208 | + After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region |
| 209 | + file will change to reflect the new memory reservations. |
| 210 | + |
| 211 | + The existing userspace tools (kdump infrastructure) can be easily |
| 212 | + enhanced to use this interface to release the memory reserved for |
| 213 | + dump and continue without a second reboot. |
| 214 | + |
| 215 | +Here is the list of files under powerpc debugfs: |
| 216 | +(Assuming debugfs is mounted on /sys/kernel/debug directory.) |
| 217 | + |
| 218 | + /sys/kernel/debug/powerpc/fadump_region |
| 219 | + |
| 220 | + This file shows the reserved memory regions if fadump is |
| 221 | + enabled; otherwise this file is empty. The output format |
| 222 | + is: |
| 223 | + <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size> |
| 224 | + |
| 225 | + e.g. |
| 226 | + Contents when fadump is registered during first kernel |
| 227 | + |
| 228 | + # cat /sys/kernel/debug/powerpc/fadump_region |
| 229 | + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0 |
| 230 | + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0 |
| 231 | + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0 |
| 232 | + |
| 233 | + Contents when fadump is active during second kernel |
| 234 | + |
| 235 | + # cat /sys/kernel/debug/powerpc/fadump_region |
| 236 | + CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020 |
| 237 | + HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000 |
| 238 | + DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000 |
| 239 | + : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000 |
| 240 | + |
| 241 | +NOTE: Please refer to Documentation/filesystems/debugfs.txt on |
| 242 | + how to mount the debugfs filesystem. |
| 243 | + |
| 244 | + |
| 245 | +TODO: |
| 246 | +----- |
| 247 | + o Need to come up with a better approach to determine a more |
| 248 | + accurate boot memory size that is required for a kernel to |
| 249 | + boot successfully when booted with restricted memory. |
| 250 | + o The fadump implementation introduces a fadump crash info structure |
| 251 | + in the scratch area before the ELF core header. The idea of introducing |
| 252 | + this structure is to pass some important crash info data to the second |
| 253 | + kernel, which will help the second kernel populate the ELF core header with |
| 254 | + correct data before it gets exported through /proc/vmcore. The current |
| 255 | + design does not address the possibility of introducing |
| 256 | + additional fields (in future) to this structure without affecting |
| 257 | + compatibility. Need to come up with a better approach to address this. |
| 258 | + The possible approaches are: |
| 259 | + 1. Introduce a version field for version tracking, bump up the version |
| 260 | + whenever a new field is added to the structure in future. The version |
| 261 | + field can be used to find out what fields are valid for the current |
| 262 | + version of the structure. |
| 263 | + 2. Reserve an area of predefined size (say PAGE_SIZE) for this |
| 264 | + structure and keep the unused area reserved (initialized to zero) |
| 265 | + for future field additions. |
| 266 | + The advantage of approach 1 over 2 is we don't need to reserve extra space. |
| 267 | +--- |
| 268 | +Author: Mahesh Salgaonkar < [email protected]> |
| 269 | +This document is based on the original documentation written for phyp |
| 270 | +assisted dump by Linas Vepstas and Manish Ahuja. |
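The document states that the default boot memory size is the larger of 5% of
system RAM or 256MB. The following shell sketch illustrates that rule only. The
function name fadump_default_boot_mem_mb is hypothetical (it is not part of the
kernel or of any kdump tooling), and the use of truncating integer division is
an assumption; the kernel's actual rounding is not described in the document.

```shell
#!/bin/sh
# Hypothetical helper: estimate the default fadump boot memory size in MB,
# given the total system RAM in MB as the first argument.
fadump_default_boot_mem_mb() {
    total_mb=$1
    # 5% of RAM; truncating integer division is an assumption of this sketch.
    five_pct=$(( total_mb * 5 / 100 ))
    # The default is whichever is larger: 5% of RAM or 256MB.
    if [ "$five_pct" -gt 256 ]; then
        echo "$five_pct"
    else
        echo 256
    fi
}
```

For example, "fadump_default_boot_mem_mb 2048" prints 256, because 5% of 2048MB
is below the 256MB floor. If the value computed this way is too small for the
second kernel to boot, the document's advice is to override it with the
'fadump_reserve_mem=' boot parameter.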
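The fadump_region output format described in the document
(<region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size>) lends
itself to simple scripting. The sketch below sums the "Dumped:" values read
from standard input, which could be one way for a script to tell the first
kernel (nothing dumped yet) from the second kernel (dump data present). The
function name fadump_total_dumped is hypothetical, and the sketch assumes the
line format stays exactly as shown in the document's examples.

```shell
#!/bin/sh
# Hypothetical helper: read fadump_region-style lines on stdin and print
# the sum of all "Dumped:" sizes as a decimal number of bytes.
fadump_total_dumped() {
    total=0
    while IFS= read -r line; do
        case $line in
            *"Dumped: 0x"*)
                dumped=${line##*Dumped: }          # keep the text after "Dumped: "
                dumped=${dumped%%[!0-9a-fA-Fx]*}   # drop anything after the hex value
                total=$(( total + $dumped ))       # shell arithmetic accepts 0x constants
                ;;
        esac
    done
    echo "$total"
}
```

Assuming debugfs is mounted on /sys/kernel/debug, it could be used as:

    # fadump_total_dumped < /sys/kernel/debug/powerpc/fadump_region

With the first-kernel example contents shown in the document, every region
reports "Dumped: 0x0", so the total is 0.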