@@ -793,7 +793,7 @@ Some core changes of the new internal format:
bpf_exit

After the call the registers R1-R5 contain junk values and cannot be read.
- In the future an eBPF verifier can be used to validate internal BPF programs.
+ An in-kernel eBPF verifier is used to validate internal BPF programs.

Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
@@ -1017,7 +1017,7 @@ At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If verifier sees an insn that does R2=R1, then R2 has now type
PTR_TO_CTX as well and can be used on the right hand side of expression.
- If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=UNKNOWN_VALUE,
+ If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
since addition of two valid pointers makes invalid pointer.
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
sure that kernel addresses don't leak to unprivileged users)
@@ -1039,7 +1039,7 @@ is a correct program. If there was R1 instead of R6, it would have
been rejected.

load/store instructions are allowed only with registers of valid types, which
- are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. They are bounds and alignment checked.
+ are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
For example:
bpf_mov R1 = 1
bpf_mov R2 = 2
@@ -1058,7 +1058,7 @@ intends to load a word from address R6 + 8 and store it into R0
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
- If R6=FRAME_PTR, then access should be aligned and be within
+ If R6=PTR_TO_STACK, then access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
so it will fail verification, since it's out of bounds.

@@ -1069,7 +1069,7 @@ For example:
bpf_ld R0 = *(u32 *)(R10 - 4)
bpf_exit
is invalid program.
- Though R10 is correct read-only register and has type FRAME_PTR
+ Though R10 is correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.

Pointer register spill/fill is tracked as well, since four (R6-R9)
@@ -1094,6 +1094,71 @@ all use cases.

See details of eBPF verifier in kernel/bpf/verifier.c

+ Register value tracking
+ -----------------------
+ In order to determine the safety of an eBPF program, the verifier must track
+ the range of possible values in each register and also in each stack slot.
+ This is done with 'struct bpf_reg_state', defined in include/linux/
+ bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
+ register state has a type, which is either NOT_INIT (the register has not been
+ written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
+ pointer type. The types of pointers describe their base, as follows:
+     PTR_TO_CTX          Pointer to bpf_context.
+     CONST_PTR_TO_MAP    Pointer to struct bpf_map. "Const" because arithmetic
+                         on these pointers is forbidden.
+     PTR_TO_MAP_VALUE    Pointer to the value stored in a map element.
+     PTR_TO_MAP_VALUE_OR_NULL
+                         Either a pointer to a map value, or NULL; map accesses
+                         (see section 'eBPF maps', below) return this type,
+                         which becomes a PTR_TO_MAP_VALUE when checked != NULL.
+                         Arithmetic on these pointers is forbidden.
+     PTR_TO_STACK        Frame pointer.
+     PTR_TO_PACKET       skb->data.
+     PTR_TO_PACKET_END   skb->data + headlen; arithmetic forbidden.
+ However, a pointer may be offset from this base (as a result of pointer
+ arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
+ offset'. The former is used when an exactly-known value (e.g. an immediate
+ operand) is added to a pointer, while the latter is used for values which are
+ not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
+ the range of possible values in the register.
+ The verifier's knowledge about the variable offset consists of:
+ * minimum and maximum values as unsigned
+ * minimum and maximum values as signed
+ * knowledge of the values of individual bits, in the form of a 'tnum': a u64
+ 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
+ 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
+ mask and value; no bit should ever be 1 in both. For example, if a byte is read
+ into a register from memory, the register's top 56 bits are known zero, while
+ the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
+ then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
+ 0x1ff), because of potential carries.
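The tnum example above can be replayed with a small model. The following is a minimal Python sketch, not kernel code: the class `Tnum` and the helpers `tnum_const`, `tnum_or` and `tnum_add` are illustrative names chosen here, and the arithmetic is assumed to mirror the usual (value; mask) rules described in the text.

```python
from dataclasses import dataclass

U64 = (1 << 64) - 1  # model a 64-bit register


@dataclass(frozen=True)
class Tnum:
    value: int  # bits known to be 1
    mask: int   # bits whose value is unknown

    def __str__(self):
        # Same (value; mask) notation as the text above.
        return f"({self.value:#x}; {self.mask:#x})"


def tnum_const(c):
    """A fully known constant: no unknown bits."""
    return Tnum(c & U64, 0)


def tnum_or(a, b):
    """A bit known to be 1 in either input is known to be 1 in the result."""
    value = a.value | b.value
    mask = (a.mask | b.mask) & ~value & U64
    return Tnum(value, mask)


def tnum_add(a, b):
    """Every bit that a carry might reach becomes unknown."""
    sm = (a.mask + b.mask) & U64
    sv = (a.value + b.value) & U64
    sigma = (sm + sv) & U64
    chi = sigma ^ sv            # bits that differ once unknown bits are set
    mu = chi | a.mask | b.mask  # all bits that are unknown in the sum
    return Tnum(sv & ~mu & U64, mu)


byte_load = Tnum(0x0, 0xFF)                      # u8 load: low 8 bits unknown
after_or = tnum_or(byte_load, tnum_const(0x40))  # bit 6 becomes known 1
after_add = tnum_add(after_or, tnum_const(1))    # carries may reach bit 8
print(byte_load, after_or, after_add)
```

Note that after the OR, bit 6 moves from the mask to the value, so the mask loses exactly that bit; after the add, bit 6 is unknown again because a carry out of the low bits could clear it.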
+ Besides arithmetic, the register state can also be updated by conditional
+ branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
+ it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
+ branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
+ BPF_JSGE) would instead update the signed minimum/maximum values. Information
+ from the signed and unsigned bounds can be combined; for instance if a value is
+ first tested < 8 and then tested s> 4, the verifier will conclude that the value
+ is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
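The branch refinement described above can be sketched as follows. This is a simplified Python model under stated assumptions, not the verifier's implementation: `Bounds`, `sync`, `branch_ugt`, `branch_ult` and `branch_sgt` are hypothetical names, only the four min/max fields are modelled (no tnum), and comparisons against the extreme values of the range are not handled.

```python
from dataclasses import dataclass, replace

U64_MAX = (1 << 64) - 1
S64_MIN = -(1 << 63)
S64_MAX = (1 << 63) - 1


@dataclass
class Bounds:
    umin: int = 0
    umax: int = U64_MAX
    smin: int = S64_MIN
    smax: int = S64_MAX

    def sync(self):
        """Share information between the signed and unsigned views."""
        # Unsigned range below 2^63: sign bit is clear, both views agree.
        if self.umax <= S64_MAX:
            self.smin = max(self.smin, self.umin)
            self.smax = min(self.smax, self.umax)
        # Signed range known non-negative: again both views agree.
        if self.smin >= 0:
            self.umin = max(self.umin, self.smin)
            self.umax = min(self.umax, self.smax)
        return self


def branch_ugt(reg, imm):
    """'if reg > imm' (unsigned): returns (true branch, false branch)."""
    return (replace(reg, umin=max(reg.umin, imm + 1)).sync(),
            replace(reg, umax=min(reg.umax, imm)).sync())


def branch_ult(reg, imm):
    """'if reg < imm' (unsigned): returns (true branch, false branch)."""
    return (replace(reg, umax=min(reg.umax, imm - 1)).sync(),
            replace(reg, umin=max(reg.umin, imm)).sync())


def branch_sgt(reg, imm):
    """'if reg s> imm' (signed): returns (true branch, false branch)."""
    return (replace(reg, smin=max(reg.smin, imm + 1)).sync(),
            replace(reg, smax=min(reg.smax, imm)).sync())


gt8, le8 = branch_ugt(Bounds(), 8)  # the '> 8' example
lt8, _ = branch_ult(Bounds(), 8)    # first tested < 8 ...
both, _ = branch_sgt(lt8, 4)        # ... then tested s> 4
print(gt8.umin, le8.umax, both)
```

In this model the unsigned test pins the value below 2^63, which is what lets the later signed test tighten the unsigned minimum as well.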
+ PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
+ pointers sharing that same variable offset. This is important for packet range
+ checks: after adding some variable to a packet pointer, if you then copy it to
+ another register and (say) add a constant 4, both registers will share the same
+ 'id' but one will have a fixed offset of +4. Then if it is bounds-checked and
+ found to be less than a PTR_TO_PACKET_END, the other register is now known to
+ have a safe range of at least 4 bytes. See 'Direct packet access', below, for
+ more on PTR_TO_PACKET ranges.
+ The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
+ the pointer returned from a map lookup. This means that when one copy is
+ checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
+ As well as range-checking, the tracked information is also used for enforcing
+ alignment of pointer accesses. For instance, on most systems the packet pointer
+ is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
+ over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
+ pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
+ bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
+ that pointer are safe.
+
Direct packet access
--------------------
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
@@ -1121,7 +1186,7 @@ it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14)
which is zero bytes.

More complex packet access may look like:
- R0=imm1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
+ R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
7: r4 = *(u8 *)(r3 +12)
8: r4 *= 14
@@ -1135,26 +1200,31 @@ More complex packet access may look like:
16: r2 += 8
17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
18: if r2 > r1 goto pc+2
- R0=inv56 R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv52 R5=pkt(id=0,off=14,r=14) R10=fp
+ R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
19: r1 = *(u8 *)(r3 +4)
The state of the register R3 is R3=pkt(id=2,off=0,r=8)
id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some
offset within a packet and since the program author did
'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8).
- The verifier only allows 'add' operation on packet registers. Any other
- operation will set the register state to 'unknown_value' and it won't be
+ The verifier only allows 'add'/'sub' operations on packet registers. Any other
+ operation will set the register state to 'SCALAR_VALUE' and it won't be
available for direct packet access.
Operation 'r3 += rX' may overflow and become less than original skb->data,
- therefore the verifier has to prevent that. So it tracks the number of
- upper zero bits in all 'uknown_value' registers, so when it sees
- 'r3 += rX' instruction and rX is more than 16-bit value, it will error as:
- "cannot add integer value with N upper zero bits to ptr_to_packet"
+ therefore the verifier has to prevent that. So when it sees 'r3 += rX'
+ instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
+ against skb->data_end will not give us 'range' information, so attempts to read
+ through the pointer will give "invalid access to packet" error.
Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is
- R4=inv56 which means that upper 56 bits on the register are guaranteed
- to be zero. After insn 'r4 *= 14' the state becomes R4=inv52, since
- multiplying 8-bit value by constant 14 will keep upper 52 bits as zero.
- Similarly 'r2 >>= 48' will make R2=inv48, since the shift is not sign
- extending. This logic is implemented in evaluate_reg_alu() function.
+ R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
+ of the register are guaranteed to be zero, and nothing is known about the lower
+ 8 bits. After insn 'r4 *= 14' the state becomes
+ R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
+ value by constant 14 will keep upper 52 bits as zero, also the least significant
+ bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make
+ R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
+ extending. This logic is implemented in adjust_reg_min_max_vals() function,
+ which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
+ versa) and adjust_scalar_min_max_vals() for operations on two scalars.

The end result is that bpf program author can access packet directly
using normal C code as:
@@ -1214,6 +1284,22 @@ The map is defined by:
. key size in bytes
. value size in bytes

+ Pruning
+ -------
+ The verifier does not actually walk all possible paths through the program. For
+ each new branch to analyse, the verifier looks at all the states it's previously
+ been in when at this instruction. If any of them contain the current state as a
+ subset, the branch is 'pruned' - that is, the fact that the previous state was
+ accepted implies the current state would be as well. For instance, if in the
+ previous state, r1 held a packet-pointer, and in the current state, r1 holds a
+ packet-pointer with a range as long or longer and at least as strict an
+ alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
+ have been used by any path from that point, so any value in r2 (including
+ another NOT_INIT) is safe. The implementation is in the function regsafe().
+ Pruning considers not only the registers but also the stack (and any spilled
+ registers it may hold). They must all be safe for the branch to be pruned.
+ This is implemented in states_equal().
+
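The subset test behind pruning can be illustrated with a toy model. This is only a Python sketch under strong simplifying assumptions: `Reg`, `reg_safe` and `can_prune` are hypothetical names loosely inspired by regsafe() and states_equal(), and alignment, ids, signed bounds, tnums and stack slots are all left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    kind: str           # "NOT_INIT", "SCALAR_VALUE" or "PTR_TO_PACKET"
    umin: int = 0       # unsigned bounds, used for SCALAR_VALUE
    umax: int = 0
    pkt_range: int = 0  # verified readable bytes, used for PTR_TO_PACKET


def reg_safe(old, cur):
    """True if everything the old state allowed is still fine for cur."""
    if old.kind == "NOT_INIT":
        # The old path never read this register, so any contents are fine.
        return True
    if old.kind != cur.kind:
        return False
    if old.kind == "SCALAR_VALUE":
        # The current range must lie inside the previously accepted range.
        return old.umin <= cur.umin and cur.umax <= old.umax
    if old.kind == "PTR_TO_PACKET":
        # The current pointer needs at least as much verified range.
        return cur.pkt_range >= old.pkt_range
    return old == cur


def can_prune(old_regs, cur_regs):
    """Prune only if every register is safe (the stack is ignored here)."""
    return all(reg_safe(o, c) for o, c in zip(old_regs, cur_regs))


old = [Reg("PTR_TO_PACKET", pkt_range=8), Reg("NOT_INIT")]
cur = [Reg("PTR_TO_PACKET", pkt_range=14), Reg("SCALAR_VALUE", umax=255)]
print(can_prune(old, cur), can_prune(cur, old))
```

The check is deliberately one-directional: a state accepted with a packet range of 8 covers a later state with a range of 14, but not the other way round.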
Understanding eBPF verifier messages
------------------------------------
