DTrace based on BPF Implementation Plan

Introduction

This document provides details on the implementation of DTrace on top of existing Linux kernel tracing facilities. It is meant as a guideline for project planning, and provides (to the extent possible) the order of implementation of various components, features, and capabilities of DTrace on BPF.

The document is organized by high level features, providing more detailed implementation information for each of these features. Generally, the implementation will follow this break-up of functionality, but it is certainly anticipated that some deviations will occur. One concrete example would be the string datatype. While the implementation of values of this datatype can be isolated to specific parts of the DTrace on Linux source code, support for this datatype is present in many different areas. Therefore, the implementation of the string datatype can either be done by providing all functionality as a single deliverable, or it can be done by implementing parts of the string datatype support for features that are planned as a specific deliverable, thereby providing a partial string datatype implementation. This document will be updated as these practical decisions about deliverables are made.

User Visible Features

Built-in variables

There are two approaches possible for implementing built-in variables. We can either generate code directly from the compiler whenever a built-in variable is encountered (which makes it impossible to cache the values unless we make the generated code even more fancy), or we can implement a get_bvar() function to which the compiler will generate calls whenever a built-in variable value is needed. That function could provide for caching of values. The latter approach is currently preferred.

arg0 .. arg9: DONE in release 2.0.0-1.2
Notes: Raw probe argument dctx.argv[0] .. dctx.argv[9]
args
Notes: Mapped (possibly translated) arguments
caller
Notes: Function that called the one the probe fired in
curcpu: DONE in release 2.0.0-1.2
Notes: CPU on which the probe fired. Use BPF helper: bpf_get_smp_processor_id()
curthread: DONE in release 2.0.0-1.2
Notes: Task in which the probe fired. Use BPF helper: bpf_get_current_task()
epid : DONE in release 2.0.0-1.2
Notes: Enabled probe ID. dctx->mst→epid (resolved at program load time)
errno
execname
Notes: Executable name of the current task. Use BPF helper: bpf_get_current_comm()
gid: DONE in release 2.0.0-1.2
Notes: Group ID of the current task. Use BPF helper: bpf_get_current_uid_gid()
id : TARGET in release 2.0.0-1.3
Notes: Probe ID. dctx->mst→prid (resolved at program load time)
ipl
Notes: Interrupt level
pid : DONE in release 2.0.0-1.2
Notes: Process ID of current task. Use BPF helper: bpf_get_current_pid_tgid()
ppid : TARGET in release 2.0.0-1.3
Notes: Parent process ID of current task BPF code to read task->real_parent→pid
probefunc
Notes: Probe description: function name. The plan is to provide a BPF map that maps EPID to probe ID and probe description names
probemod
Notes: Probe description: module name. The plan is to provide a BPF map that maps EPID to probe ID and probe description names
probename
Notes: Probe description: probe name. The plan is to provide a BPF map that maps EPID to probe ID and probe description names
probeprov
Notes: Probe description: provider name. The plan is to provide a BPF map that maps EPID to probe ID and probe description names
stackdepth
tid : DONE in release 2.0.0-1.2
Notes: Use BPF helper: bpf_get_current_pid_tgid()
timestamp : DONE in release 2.0.0-1.2
Notes: Use BPF helper: bpf_ktime_get_ns()
ucaller
Notes: Userspace function calling function where probe fires
uid : DONE in release 2.0.0-1.2
Notes: Use BPF helper: bpf_get_current_uid_gid()
uregs
ustackdepth
vtimestamp
walltimestamp

Clause-local variables

Initial value: DONE in release 2.0.0-1.0
Notes: DTrace on Linux documentation specifies that the initial value of clause-local variables is unspecified. BPF enforces a strict store-before-load policy. In DTrace on BPF clause-local variables are always initialized as 0.
Allocate space
(1) size <= 8 bytes : DONE in release 2.0.0-1.0
Notes: Clause-local variables of 8 bytes or less are allocated on the stack as a 64-bit (8 bytes) value.
(2) size > 8 bytes
Notes: Clause-local variables of more than 8 bytes are allocated as a sequence of bytes in the mem BPF map value, aligned at a 64-bit (8 byte) boundary.
Get value : DONE in release 2.0.0-1.0
(1) by value: DONE in release 2.0.0-1.0
Notes: Only for values of 8 bytes or less.
(2) by reference
(a) size <= 8
Notes: Reference to the stack location of the clause-local variable.
(b) size > 8
Notes: Reference to the memory block in the mem BPF map value for the clause-local variable.
Set value
(1) size <= 8 : DONE in release 2.0.0-1.0
Notes: Direct store to the stack location.
(2) size > bytes
Notes: Sequence of stores to the memory block in the mem BPF map value for the clause-local variable.

Global variables

Initial value: DONE in release 2.0.0-1.0
Notes: Initial value is 0.
Allocate space
Notes: Global variables are allocated as a sequence of bytes in the gvar BPF map value, aligned at a 64-bit (8 byte) boundary.
(1) size <= 8 bytes : DONE in release 2.0.0-1.0
(2) size > 8 bytes
Get value: DONE in release 2.0.0-1.0
(1) by value: DONE in release 2.0.0-1.0
(2) by reference
Notes: Reference to the memory block in the gvar BPF map value for the global variable.
(a) size <= 8
(b) size > 8
Set value
(1) size <= 8 : DONE in release 2.0.0-1.0
(2) size > bytes

Thread-local-storage variables

Key calculation
Notes: This requires a formula that converts the tuple of key values into a unique index into the thread-local storage variable table.
Initial value
Notes: Initial value is 0.
Allocate space
(1) size <= 8 bytes
(2) size > 8 bytes
Get value
(1) by value:
(2) by reference
(a) size <= 8
(b) size > 8
Set value
(a) size <= 8
(b) size > 8

Strings

Associative arrays

Aggregations

Implement aggregations as variables : TARGET release 2.0.0-1.4
Notes: Aggregations were implemented as a special kind of action. In the new design, they are a special kind of variables. This requires compiler and disassembler changes.
Allocate space : TARGET release 2.0.0-1.4
Notes: Aggregations are allocated as one or more 8-byte chunks (each storing a uint64_t) in the agg BPF per-cpu map value.
Consumer processing : TARGET release 2.0.0-1.4
Notes: Processing aggregation data requires retrieving the agg BPF map and aggregating the data across the active CPUs.
Functions
avg() : TARGET release 2.0.0-1.4
count() : TARGET release 2.0.0-1.4
llquantize() : TARGET release 2.0.0-1.4
lquantize() : TARGET release 2.0.0-1.4
max() : TARGET release 2.0.0-1.4
min() : TARGET release 2.0.0-1.4
quantize() : TARGET release 2.0.0-1.4
stddev() : TARGET release 2.0.0-1.4
sum() : TARGET release 2.0.0-1.4
Implement tuple indexed aggregations

Aggregations used to be implemented with their own specific per-cpu output buffer (alongside the regular per-cpu trace data output buffer). Space was allocated in the per-cpu buffer for the different aggregation data items (one or more uint64_t values), and the execution of aggregation actions caused these values to be updated. When the consumer retrieved the aggregation buffer for a specific cpu, a buffer swap would take place to ensure that further probe firing would record aggregation data in a cleared buffer while the consumer processed the data in the now inactive buffer.

In the BPF-based design, aggregations are stored in the singleton map value of a per-CPU BPF map (index 0). The map value will be allocated with enough space to hold all aggregations that are in use. Aggregation actions will update the values allocated to the aggregation they operate on. When the consumer is ready to process aggregation data, it will perform a bpf_map_lookup_elem() to retrieve the data for all CPUs at once.

There is a potential complexity: the legacy implementation for DTrace could ensure that the buffer swap did not happen in the midst of probe processing on that specific CPU. The new design needs to guard against processing incomplete data (e.g. for aggregation functions that use 2 data items, one has been updated while the other has not).

Speculations

Actions

breakpoint()
chill(int)
clear(@)
commit(int)
denormalize(@)
discard(int)
exit(int): DONE in release 2.0.0-1.0
freopen(@, ...): TARGET in release 2.0.0-1.3
Notes: Variant of the printf() action.
ftruncate()
func(uintptr_t)
jstack([uint32_t], [uint32_t])
mod(uintptr_t)
normalize(@, [uint64])
panic()
pcap(void*, int)
Notes: It is unclear how we can do this without kernel support, unless BPF already provides access to this data.
printa(@, ...)
printf(string, ...) DONE in release 2.0.0-1.2
Notes: Numeric values only.
raise(int) DONE in release 2.0.0-1.2
Notes: Send a signal to the running task. Use BPF helper: bpf_send_signal()
setopt(const char *, [const char *])
speculate(int)
Notes: The speculative tracing support is still under design.
stack([uint32_t])
stop()
sym(uintptr_t)
system(@, ...) : Target in release 2.0.0-1.3
Notes: Variant of the printf() action.
trace(@) DONE in release 2.0.0-1.0 and 2.0.0-1.2
Notes: 2.0.0-1.0: Raw numeric values as signed 64-bit integers
Notes: 2.0.0-1.2: Enhancement of DTrace (first implementation) behaviour by providing consistent output based on sign and bit-width of the value.
Notes: Numeric values only.
tracemem(@, size_t, [size_t])
trunc(@, [uint64_t])
uaddr(uintptr_t)
ufunc(uintptr_t)
umod(uintptr_t)
ustack([uint32_t]. [uint32_t]))
usym(uintptr_t)

Subroutines

alloca
basename
bcopy Notes: See BPF function memcpy().
cleanpath
copyin
copyinstr
copyinto
d_path Notes: Added for Linux - we need to investigate if it is still needed.
dirname
getmajor
getminor
htonl
htonll
htons
index Notes: Same as strchr().
inet_ntoa
inet_ntoa6
inet_ntop
lltostr
msgdsize: Specific to Solaris - won't get implemented.
msgsize: Specific to Solaris - won't get implemented.
mutex_owned
mutex_owner
mutex_type_adaptive
mutex_type_spin
ntohl
ntohll
ntohs
progenyof
rand
rindex Notes: Same as strrchr().
rw_iswriter
rw_read_held
rw_write_held
speculation
strchr Notes: Optimization: strchr() on constant string and char.
strjoin Notes: A sequence of 2 memcpy()s ought to be able to do this. Optimization: strjoin() on constant strings.
strlen Notes: See BPF function strnlen(). Optimization; strlen() on constant strings.
strrchr Notes: Optimization; strrchr() on constant string and char.
strstr Notes: Optimization: strstr() on constant strings.
strtok
substr Notes: A memcpy() ought to be able to do this. Optimization: substr() on a constant string.

Output formatting

Converting dtrace_eprobedesc_t into dtrace_datadesc_t

Description needed DONE (2.0.0-1.2)

Flow indent

Description needed DONE (2.0.0-1.2)

Providers

dtrace (BEGIN, END, and ERROR)

BEGIN and END probes

Create probes: DONE in release 2.0.0-1.0
Determine probe info: DONE in release 2.0.0-1.0
Implement trampoline: DONE in release 2.0.0-1.0
Probe cleanup: DONE in release 2.0.0-1.0
BEGIN probe semantics: TARGET in release 2.0.0-1.3 (BEGIN probe must fire before all other probes)
END probe semantics: TARGET in release 2.0.0-1.3 (END probe must fire after all other probes)

ERROR probe

Create probe: TARGET release 2.0.0-1.4
Determine probe info: TARGET Release 2.0.0-1.4
Implement trampoline: TARGET release 2.0.0-1.4

Timer based tracing (profile and tick)

profile probes

Create probe: DONE in release 2.0.0-1.2
Determine probe info: DONE in release 2.0.0-1.2
Implement trampoline: DONE in release 2.0.0-1.2

tick probes

Create probe: DONE in release 2.0.0-1.2
Determine probe info: DONE in release 2.0.0-1.2
Implement trampoline: DONE in release 2.0.0-1.2

System call tracing (syscall)

Discover probe points: DONE in release 2.0.0-0.8
Create probes: DONE in release 2.0.0-0.8
Determine probe info: DONE in release 2.0.0-0.8
Implement trampoline: DONE in release 2.0.0-0.8

Function Boundary Tracing (FBT)

Discover probe points: DONE in release 2.0.0-0.8
Create probes: DONE in release 2.0.0-0.8
Determine probe info: DONE in release 2.0.0-0.8
Implement trampoline: DONE in release 2.0.0-0.8
Probe cleanup: DONE in release 2.0.0-1.0

Statically Defined Tracing (SDT)

Discover probe points: DONE in release 2.0.0-0.8
Create probes: DONE in release 2.0.0-0.8
Determine probe info: DONE in release 2.0.0-0.8
Implement trampoline: DONE in release 2.0.0-0.8
Standard DTrace probes: Provide the standard set of DTrace SDT probes. This requires a kernel patch. Where the first implementation of DTrace on Linux provided a DTrace-specific way to define SDT probes and kernel tracepoints were exposed using that same mechanism, DTrace in the current form will need to do the opposite: implement the DTrace SDT probes as kernel tracepoints. It may be possible to re-use some of the existing kernel tracepoints (possibly requiring argument mapping and translators).

Userspace Process Tracing (pid and USDT)

Discover probe points
Create probes: DONE in release 2.0.0-1.2
Determine probe info
Implement trampoline
Probe cleanup

Language implementation features

Compilation of D clauses into BPF code: DONE in release 2.0.0-0.1
BPF probe program as trampoline to compiled D clauses: DONE in release 2.0.0-0.1
Linking generated BPF code with pre-compiled BPF support functions: DONE in release 2.0.0-1.0
Integration of the clause predicate as conditional into the main clause code: DONE in release 2.0.0-1.2
Support probe specifications with wildcards: DONE in release 2.0.0-1.2
Support more than one clause for a given probe: DONE in release 2.0.0-1.2
Store the DTrace BPF context somewhere other than the BPF stack: DONE in release 2.0.0-1.2
Provide correct buffer consumption semantics: TARGET release 2.0.0-1.3
Use default action only for empty clause: TARGET release 2.0.0-1.3

Compilation of D clauses into BPF code

DTrace on Linux is no longer using the in-kernel support code for DTrace and instead uses existing Linux kernel tracing features such as BPF. The D compiler (primarily the code generator and the assembler) will now generate BPF code. Whereas before clauses were divided into actions and each action was compiled into its own D program, the entire clause is now being compiled into BPF code.

The assembler requires modifications as well to handle BPF specific relocations and to account for the compilation of complete clauses. Also, the disassembler needs to be modified to support and output BPF code. DONE (2.0.0-0.1)

BPF probe program as trampoline to compiled D clauses

BPF programs are invoked from trace event triggers with a probe type specific BPF context. DTrace D clauses are expected to execute in a consistent context, especially because a single clause can be associated with probes of different types. In order to support this configuration, a probe-specific trampoline is generated by the provider code, using the BPF context to populate a generic DTrace context. Then it calls a predicate function (if the clause has a predicate), and if the predicate function returns true, the generated clause function is called.

DONE (2.0.0-0.1)

Linking generated BPF code with pre-compiled BPF support functions

There are common code sequences that are necessary for the implementation of D code using BPF. Things like retrieving the value of a variable of a particular kind, setting the value of a variable of a particular kind, string manipulation functions, etc. Many of these can be implemented in C and compiled into BPF code. These functions are linked together into the bpf_dlib.o ELF object. These pre-compiled BPF support functions are referenced by dynamically generated BPF code as function calls to external functions. This means that entries are added into the BPF symbol relocation table. These are resolved during a special linking stage that is executed prior to loading the BPF programs into the kernel.

During this linking stage the relocation records that reference external symbols are processed. The following high level mechanism is implemented:

1. For each relocation record, determine the pre-compiled BPF support function symbol it references
2. If the symbol has already been seen, continue with the next relocation record (goto step 1)
3. If the symbol is new, do the following:
   a. Mark the symbol as seen
   b. Append the executable code for the symbol to the BPF program we are processing
   c. Process the relocation records for this symbol per the algorithm described here

At the conclusion of this recursive process, there should not be any unresolved symbols left. The BPF program will contain all executable code necessary for executing the entire program.

DONE (2.0.0-1.0)

Integration of the clause predicate as conditional into the main clause code

The predicate can perform most operations that the main clause code can do, including accessing variables of all three kinds and calling subroutines. Many of these operations require access to the DTrace BPF context. However, only the main program (dt_program) was being generated with the appropriate prologue code to ensure that BPF support functions would be able to operate in a consistent execution context. While it would be possible to resolve this issue by generating the predicate as a function with prologue and epilogue as well, the more logical solution is to simply include the predicate in the main clause code as a conditional. The main program becomes:

PROLOGUE if (!predicate), goto exit MAIN PROGRAM CODE EPILOGUE exit: return 0

DONE (2.0.0-1.2)

Support probe specifications with wildcards

(See: DTrace D Compiler redesign) DONE (2.0.0-1.2)

Support more than one clause for a given probe

(See: DTrace D Compiler redesign) DONE (2.0.0-1.2)

Store the DTrace BPF context somewhere other than the BPF stack

The BPF JIT compiler generates code that depends on a single BPF stack (maximum size right now limited to 512 bytes) whereas the BPF interpreter provides a BPF stack for each function that is executing in a call chain. So, for interpreted code, the BPF stack for each function in the call chain The quite limited stack size for JIT-compiled code caused stack overrun conditions when executing D clauses because the DTrace BPF context (stored on the stack) was consuming too much precious stack space.

The solution is to decouple the DTrace BPF context from the DTrace machine state. While the DTrace BPF context must still reside on the stack, the DTrace machine state can be stored elsewhere. The choice has been made to use the 'mem' BPF map that is used to provide memory for the output buffer. The first portion of the memory used for the map value is set aside to hole the DTrace machine state, and the remainder is for the output buffer.

DONE (2.0.0-1.2)

Provide correct buffer consumption semantics

The processing of the trace buffers must adhere to specific DTrace buffer consumption semantics. The buffer that contains the BEGIN probe trace data must be processed first, so that the BEGIN probe will always be reported first in output. This means that we need to know what CPU the BEGIN probe fired on. Similarly, the buffer that contains the END probe trace data must be processed last, so that the END probe will always be reported last in output. This means we need to know what CPU the END probe fired on.

When processing the buffer for the BEGIN probe trace data it is important to also process any ERROR probe firings that relate to the BEGIN probe. On the other hand, ERROR probe trace data that is found in that same buffer but that does not correspond to the BEGIN probe firing must be processed after the BEGIN probe processing completes.

ONGOING (Target: 2.0.0-1.3)

Use default action only for empty clause

The "default action" is used for empty clauses. After the BPF port, it was also used for clauses that were not empty but simply had no data-recording actions. One can argue whether or not that is reasonable, but it is a significant departure from established precedent. Rather reasonable scripts would perform very differently with the new behavior. So, revert to the legacy behavior.

ONGOING (Target: 2.0.0-1.3)

DTrace based on BPF Implementation Plan

Introduction

User Visible Features

Built-in variables

Clause-local variables

Global variables

Thread-local-storage variables

Strings

Associative arrays

Aggregations

Speculations

Actions

Subroutines

Output formatting

Converting dtrace_eprobedesc_t into dtrace_datadesc_t

Flow indent

Providers

dtrace (BEGIN, END, and ERROR)

Timer based tracing (profile and tick)

System call tracing (syscall)

Function Boundary Tracing (FBT)

Statically Defined Tracing (SDT)

Userspace Process Tracing (pid and USDT)

Language implementation features

Compilation of D clauses into BPF code

BPF probe program as trampoline to compiled D clauses

Linking generated BPF code with pre-compiled BPF support functions

Integration of the clause predicate as conditional into the main clause code

Support probe specifications with wildcards

Support more than one clause for a given probe

Store the DTrace BPF context somewhere other than the BPF stack

Provide correct buffer consumption semantics

Use default action only for empty clause

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally