|
| 1 | +============== |
| 2 | +BPF Design Q&A |
| 3 | +============== |
| 4 | + |
| 5 | +BPF extensibility and applicability to networking, tracing, security |
| 6 | +in the Linux kernel and several user space implementations of the BPF |
| 7 | +virtual machine have led to a number of misunderstandings about what BPF |
| 8 | +actually is. This short Q&A is an attempt to address them and outline |
| 9 | +the direction in which BPF is heading long term. |
| 10 | + |
| 11 | +.. contents:: |
| 12 | + :local: |
| 13 | + :depth: 3 |
| 14 | + |
| 15 | +Questions and Answers |
| 16 | +===================== |
| 17 | + |
| 18 | +Q: Is BPF a generic instruction set similar to x64 and arm64? |
| 19 | +------------------------------------------------------------- |
| 20 | +A: NO. |
| 21 | + |
| 22 | +Q: Is BPF a generic virtual machine? |
| 23 | +------------------------------------- |
| 24 | +A: NO. |
| 25 | + |
| 26 | +BPF is a generic instruction set *with* a C calling convention. |
| 27 | +--------------------------------------------------------------- |
| 28 | + |
| 29 | +Q: Why was C calling convention chosen? |
| 30 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 31 | + |
| 32 | +A: Because BPF programs are designed to run in the Linux kernel, |
| 33 | +which is written in C. Hence BPF defines an instruction set compatible |
| 34 | +with the two most used architectures, x64 and arm64 (and takes into |
| 35 | +consideration important quirks of other architectures), and |
| 36 | +defines a calling convention that is compatible with the C calling |
| 37 | +convention of the Linux kernel on those architectures. |
| 38 | + |
| 39 | +Q: Can multiple return values be supported in the future? |
| 40 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 41 | +A: NO. BPF allows only register R0 to be used as the return value. |
| 42 | + |
| 43 | +Q: Can more than 5 function arguments be supported in the future? |
| 44 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 45 | +A: NO. The BPF calling convention only allows registers R1-R5 to be used |
| 46 | +as arguments. BPF is not a standalone instruction set |
| 47 | +(unlike the x64 ISA, which allows msft, cdecl and other conventions). |
| 48 | + |
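A common way to live with the five-argument limit is to bundle extra values into a structure and pass a single pointer to it. The sketch below is plain C that only illustrates the shape of such a call; `struct extra_args` and `sum_six()` are hypothetical names made up for this example, not kernel or helper APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bundle for values that do not fit into R1-R5. */
struct extra_args {
	uint64_t f;
	uint64_t g;
};

/*
 * At most five parameters: under the BPF calling convention a..d would
 * travel in R1-R4 and the pointer in R5. The result comes back in R0.
 */
static uint64_t sum_six(uint64_t a, uint64_t b, uint64_t c, uint64_t d,
			const struct extra_args *rest)
{
	assert(rest != NULL);
	return a + b + c + d + rest->f + rest->g;
}
```

The caller fills a `struct extra_args` on its own stack and passes its address as the fifth argument, so six logical values still fit the convention.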
| 49 | +Q: Can BPF programs access instruction pointer or return address? |
| 50 | +----------------------------------------------------------------- |
| 51 | +A: NO. |
| 52 | + |
| 53 | +Q: Can BPF programs access stack pointer? |
| 54 | +------------------------------------------ |
| 55 | +A: NO. |
| 56 | + |
| 57 | +Only the frame pointer (register R10) is accessible. |
| 58 | +From the compiler's point of view it is necessary to have a stack pointer. |
| 59 | +For example, LLVM defines register R11 as the stack pointer in its |
| 60 | +BPF backend, but it makes sure that generated code never uses it. |
| 61 | + |
| 62 | +Q: Does C-calling convention diminish possible use cases? |
| 63 | +----------------------------------------------------------- |
| 64 | +A: YES. |
| 65 | + |
| 66 | +The BPF design forces the addition of major functionality in the form |
| 67 | +of kernel helper functions and kernel objects like BPF maps, with |
| 68 | +seamless interoperability between them. It lets the kernel call into |
| 69 | +BPF programs and programs call kernel helpers with zero overhead, |
| 70 | +as if all of them were native C code. That is particularly the case |
| 71 | +for JITed BPF programs, which are indistinguishable from |
| 72 | +native kernel C code. |
| 73 | + |
| 74 | +Q: Does it mean that 'innovative' extensions to BPF code are disallowed? |
| 75 | +------------------------------------------------------------------------ |
| 76 | +A: Soft yes. |
| 77 | + |
| 78 | +At least for now, until the BPF core has support for |
| 79 | +bpf-to-bpf calls, indirect calls, loops, global variables, |
| 80 | +jump tables, read-only sections and all other normal constructs |
| 81 | +that C code can produce. |
| 82 | + |
| 83 | +Q: Can loops be supported in a safe way? |
| 84 | +---------------------------------------- |
| 85 | +A: It's not clear yet. |
| 86 | + |
| 87 | +BPF developers are trying to find a way to |
| 88 | +support bounded loops where the verifier can guarantee that |
| 89 | +the program terminates in fewer than 4096 instructions. |
| 90 | + |
| 91 | +Instruction level questions |
| 92 | +--------------------------- |
| 93 | + |
| 94 | +Q: LD_ABS and LD_IND instructions vs C code |
| 95 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 96 | + |
| 97 | +Q: How come LD_ABS and LD_IND instructions are present in BPF whereas |
| 98 | +C code cannot express them and has to use builtin intrinsics? |
| 99 | + |
| 100 | +A: This is an artifact of compatibility with classic BPF. Modern |
| 101 | +networking code in BPF performs better without them. |
| 102 | +See 'direct packet access'. |
| 103 | + |
| 104 | +Q: BPF instructions mapping not one-to-one to native CPU |
| 105 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 106 | +Q: It seems not all BPF instructions map one-to-one to native CPU |
| 107 | +instructions. For example, why are BPF_JNE and other compare-and-jump |
| 108 | +instructions not CPU-like? |
| 108 | + |
| 109 | +A: This was necessary to avoid introducing flags into the ISA, which are |
| 110 | +impossible to make generic and efficient across CPU architectures. |
| 111 | + |
| 112 | +Q: Why doesn't BPF_DIV instruction map to x64 div? |
| 113 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 114 | +A: Because picking a one-to-one relationship to x64 would have made |
| 115 | +it more complicated to support on arm64 and other archs. Also it |
| 116 | +needs a div-by-zero runtime check. |
| 117 | + |
| 118 | +Q: Why is there no BPF_SDIV for signed divide operation? |
| 119 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 120 | +A: Because it would be rarely used. LLVM errors out in such a case and |
| 121 | +prints a suggestion to use unsigned divide instead. |
| 122 | + |
| 123 | +Q: Why does BPF have implicit prologue and epilogue? |
| 124 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 125 | +A: Because architectures like sparc have register windows, and in general |
| 126 | +there are enough subtle differences between architectures that a naive |
| 127 | +'store return address into stack' won't work. Another reason is that BPF |
| 128 | +has to be safe from division by zero (and the legacy exception path |
| 129 | +of the LD_ABS insn). Those instructions need to invoke the epilogue and |
| 130 | +return implicitly. |
| 131 | + |
| 132 | +Q: Why were BPF_JLT and BPF_JLE instructions not introduced in the beginning? |
| 133 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 134 | +A: Because classic BPF didn't have them and BPF authors felt that a compiler |
| 135 | +workaround would be acceptable. It turned out that programs lose performance |
| 136 | +due to the lack of these compare instructions, so they were added. |
| 137 | +These two instructions are a perfect example of what kind of new BPF |
| 138 | +instructions are acceptable and can be added in the future. |
| 139 | +These two already had equivalent instructions in native CPUs. |
| 140 | +New instructions that don't have a one-to-one mapping to HW instructions |
| 141 | +will not be accepted. |
| 142 | + |
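To make the compiler workaround mentioned above concrete: without BPF_JLT, a test such as `x < imm` has to be expressed with a "greater than" jump and swapped operands, which means the immediate must first be moved into a register. The following plain C model is only a sketch of that rewrite; `jgt_taken()` and `lt_without_jlt()` are hypothetical names, and real code generation happens inside the compiler backend rather than in C like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the unsigned "jump if greater than" condition (BPF_JGT). */
static bool jgt_taken(uint64_t dst, uint64_t src)
{
	return dst > src;
}

/*
 * x < imm without BPF_JLT: move imm into a scratch register, then test
 * "scratch > x" with the operands swapped.
 */
static bool lt_without_jlt(uint64_t x, uint64_t imm)
{
	uint64_t scratch = imm;	/* the extra move the workaround costs */

	return jgt_taken(scratch, x);
}

/* Usage: the rewritten test agrees with a direct "less than". */
static void lt_example(void)
{
	assert(lt_without_jlt(3, 7) == (3 < 7));
	assert(lt_without_jlt(7, 7) == (7 < 7));
	assert(lt_without_jlt(9, 7) == (9 < 7));
}
```

The extra register move is where the lost performance comes from; with BPF_JLT the comparison against the immediate can be emitted directly.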
| 143 | +Q: BPF 32-bit subregister requirements |
| 144 | +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 145 | +Q: BPF 32-bit subregisters have a requirement to zero the upper 32 bits of |
| 146 | +BPF registers, which makes BPF an inefficient virtual machine for 32-bit |
| 147 | +CPU architectures and 32-bit HW accelerators. Can true 32-bit registers |
| 148 | +be added to BPF in the future? |
| 149 | + |
| 150 | +A: NO. The first step to improve performance on 32-bit archs is to teach |
| 151 | +LLVM to generate code that uses 32-bit subregisters. The second step |
| 152 | +is to teach the verifier to mark operations where zeroing the upper bits |
| 153 | +is unnecessary. JITs can then take advantage of those markings and |
| 154 | +drastically reduce the size of generated code and improve performance. |
| 155 | + |
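The zero-extension requirement discussed above can be modelled in a few lines of plain C. This is a sketch of the semantics only, assuming a 64-bit register held in a `uint64_t`; `alu32_add()` is a hypothetical name and not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of a 32-bit ALU add on a 64-bit BPF register: the sum is
 * computed in 32 bits and the upper 32 bits of the result are zero.
 */
static uint64_t alu32_add(uint64_t dst, uint64_t src)
{
	uint32_t lo = (uint32_t)dst + (uint32_t)src;	/* wraps modulo 2^32 */

	return (uint64_t)lo;				/* upper half cleared */
}

/* Usage: stale upper bits never survive a 32-bit operation. */
static void alu32_example(void)
{
	assert(alu32_add(0xdeadbeef00000001ULL, 2) == 3);
	assert(alu32_add(0xffffffffULL, 1) == 0);
}
```

On a 32-bit CPU the explicit clearing of the upper half is extra work for every such operation, which is why verifier markings that show where it can be skipped are worthwhile for JITs.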
| 156 | +Q: Does BPF have a stable ABI? |
| 157 | +------------------------------ |
| 158 | +A: YES. BPF instructions, arguments to BPF programs, the set of helper |
| 159 | +functions and their arguments, and recognized return codes are all part |
| 160 | +of the ABI. However, when tracing programs use the bpf_probe_read() helper |
| 161 | +to walk kernel internal data structures and are compiled with kernel |
| 162 | +internal headers, these accesses can and will break with newer |
| 163 | +kernels. The union bpf_attr -> kern_version is checked at load time |
| 164 | +to prevent accidentally loading kprobe-based BPF programs written |
| 165 | +for a different kernel. Networking programs don't do the kern_version check. |
| 166 | + |
| 167 | +Q: How much stack space does a BPF program use? |
| 168 | +----------------------------------------------- |
| 169 | +A: Currently all program types are limited to 512 bytes of stack |
| 170 | +space, but the verifier computes the actual amount of stack used, |
| 171 | +and both the interpreter and most JITed code consume only that amount. |
| 172 | + |
| 173 | +Q: Can BPF be offloaded to HW? |
| 174 | +------------------------------ |
| 175 | +A: YES. BPF HW offload is supported by the NFP driver. |
| 176 | + |
| 177 | +Q: Does classic BPF interpreter still exist? |
| 178 | +-------------------------------------------- |
| 179 | +A: NO. Classic BPF programs are converted into extended BPF instructions. |
| 180 | + |
| 181 | +Q: Can BPF call arbitrary kernel functions? |
| 182 | +------------------------------------------- |
| 183 | +A: NO. BPF programs can only call a set of helper functions which |
| 184 | +is defined for every program type. |
| 185 | + |
| 186 | +Q: Can BPF overwrite arbitrary kernel memory? |
| 187 | +--------------------------------------------- |
| 188 | +A: NO. |
| 189 | + |
| 190 | +Tracing BPF programs can *read* arbitrary memory with bpf_probe_read() |
| 191 | +and bpf_probe_read_str() helpers. Networking programs cannot read |
| 192 | +arbitrary memory, since they don't have access to these helpers. |
| 193 | +Programs can never read or write arbitrary memory directly. |
| 194 | + |
| 195 | +Q: Can BPF overwrite arbitrary user memory? |
| 196 | +------------------------------------------- |
| 197 | +A: Sort-of. |
| 198 | + |
| 199 | +Tracing BPF programs can overwrite the user memory |
| 200 | +of the current task with bpf_probe_write_user(). Every time such a |
| 201 | +program is loaded the kernel will print a warning message, so |
| 202 | +this helper is only useful for experiments and prototypes. |
| 203 | +Tracing BPF programs are root only. |
| 204 | + |
| 205 | +Q: bpf_trace_printk() helper warning |
| 206 | +------------------------------------ |
| 207 | +Q: When the bpf_trace_printk() helper is used the kernel prints a nasty |
| 208 | +warning message. Why is that? |
| 209 | + |
| 210 | +A: This is done to nudge program authors into better interfaces when |
| 211 | +programs need to pass data to user space. For example, |
| 212 | +bpf_perf_event_output() can be used to efficiently stream data via the |
| 213 | +perf ring buffer. BPF maps can be used for asynchronous data sharing between |
| 214 | +kernel and user space. bpf_trace_printk() should only be used for debugging. |
| 215 | + |
| 216 | +Q: New functionality via kernel modules? |
| 217 | +---------------------------------------- |
| 218 | +Q: Can BPF functionality such as new program or map types, new |
| 219 | +helpers, etc. be added from kernel module code? |
| 220 | + |
| 221 | +A: NO. |