apm: Document tail-based sampling performance #770

Merged
merged 33 commits into from
Mar 18, 2025
Commits
fb34a66
WIP
carsonip Mar 13, 2025
fc76572
Add requirements
carsonip Mar 13, 2025
dda1d3e
Add fast disks
carsonip Mar 13, 2025
4b716c6
Add a note about insufficient storage
carsonip Mar 13, 2025
fedba8d
Disk rw requirements
carsonip Mar 13, 2025
d18725a
Merge branch 'main' into tbs-perf
carsonip Mar 13, 2025
048f807
Fix link
carsonip Mar 13, 2025
4bee1a1
Mention disk
carsonip Mar 14, 2025
a41a664
Add table for numbers
carsonip Mar 14, 2025
8ec6ad0
Language
carsonip Mar 14, 2025
af28057
Grammar
carsonip Mar 14, 2025
fffe207
Update table
carsonip Mar 17, 2025
8012133
Polish
carsonip Mar 17, 2025
4b9f80c
Add 8.18 numbers
carsonip Mar 17, 2025
4ae01e2
Shorten gp3 description
carsonip Mar 17, 2025
728fe1a
Add document indexing rate
carsonip Mar 17, 2025
7bc36c1
Rename
carsonip Mar 17, 2025
c48269a
Fix numbers
carsonip Mar 17, 2025
2a55116
Explain difference
carsonip Mar 17, 2025
105d9b5
Merge branch 'main' into tbs-perf
carsonip Mar 17, 2025
8df9fe8
polish
carsonip Mar 17, 2025
2fda2cf
Clean up headers
carsonip Mar 17, 2025
0949062
Fix align
carsonip Mar 17, 2025
86f0266
Grammar
carsonip Mar 17, 2025
b513a87
Fix incorrect number
carsonip Mar 18, 2025
bc5ca17
Update solutions/observability/apps/transaction-sampling.md
carsonip Mar 18, 2025
d7b6dfa
Add note on how to interpret numbers
carsonip Mar 18, 2025
2aaff33
Add note about event indexing rate
carsonip Mar 18, 2025
ff0f896
Apply suggestions from code review
carsonip Mar 18, 2025
6345c96
Spell out SSD
carsonip Mar 18, 2025
1fdedef
SSD with high IOPS
carsonip Mar 18, 2025
12ada46
Split version to header
carsonip Mar 18, 2025
3c667a8
Merge branch 'main' into tbs-perf
carsonip Mar 18, 2025
18 changes: 18 additions & 0 deletions solutions/observability/apps/transaction-sampling.md
@@ -133,6 +133,24 @@ Tail-based sampling is implemented entirely in APM Server, and will work with tr

Due to [OpenTelemetry tail-based sampling limitations](../../../solutions/observability/apps/limitations.md#apm-open-telemetry-tbs) when using [tailsamplingprocessor](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/tailsamplingprocessor), we recommend using APM Server tail-based sampling instead.

### Tail-based sampling performance and requirements [_tail_based_sampling_performance_and_requirements]

Tail-based sampling, by definition, requires temporarily storing events locally so that they can be retrieved and forwarded once a sampling decision is made.

In the APM Server implementation, events are stored temporarily on disk rather than in memory for better scalability. Tail-based sampling therefore requires local disk storage proportional to the APM event ingestion rate, plus additional memory to facilitate disk reads and writes. If the [storage limit](../../../solutions/observability/apps/transaction-sampling.md#sampling-tail-storage_limit) is insufficient, sampling is bypassed.

We recommend using fast disks, such as NVMe SSDs, when enabling tail-based sampling, because disk throughput and I/O can become the performance bottleneck for tail-based sampling and for APM event ingestion as a whole. Disk writes are proportional to the event ingest rate, and disk reads are proportional to both the event ingest rate and the sampling rate.
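
The proportionality described above can be illustrated with a back-of-envelope model. This is only a sketch, not APM Server's actual I/O accounting: the function names and the average event size are hypothetical, the model assumes each ingested event is written once and each sampled event is read once, and it ignores compression, write amplification, and caching.

```python
def disk_write_rate(events_per_s: float, avg_event_bytes: float) -> float:
    """Approximate bytes/s written: assumed proportional to the ingest rate."""
    return events_per_s * avg_event_bytes


def disk_read_rate(events_per_s: float, avg_event_bytes: float, sample_rate: float) -> float:
    """Approximate bytes/s read: assumed proportional to ingest rate times sample rate."""
    return events_per_s * avg_event_bytes * sample_rate


if __name__ == "__main__":
    # Hypothetical workload: 20,000 events/s, 1,000-byte events, 10% sample rate.
    writes = disk_write_rate(20_000, 1_000)
    reads = disk_read_rate(20_000, 1_000, 0.1)
    print(f"writes: {writes / 1e6:.1f} MB/s, reads: {reads / 1e6:.1f} MB/s")
```

Under these assumptions, doubling the ingest rate doubles both figures, while doubling the sample rate only doubles the read figure.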

To demonstrate the performance overhead and requirements, here are some numbers from a standalone APM Server 9.0 deployed on AWS EC2 under full load, receiving APM events containing only traces, assuming no backpressure from Elasticsearch and a 10% sample rate in the tail sampling policy. These numbers are for reference only and may vary depending on factors such as sampling rate, average event size, and average number of events per distributed trace.

| APM Server EC2 instance size | TBS enabled, Disk | Event ingestion rate (throughput from APM agent to APM Server) in events/s | Memory usage (max Resident Set Size) in GB | Disk usage in GB |
|:-----------------------------|:------------------------------------------------------------|----------------------------------------------------------------------------|--------------------------------------------|------------------|
| c6i.2xlarge or c6id.2xlarge | TBS disabled | 47220 | 0.98 | 0 |
| ..                           | TBS enabled, gp3 volume with the baseline of 3000 IOPS      | 21310                                                                      | 1.41                                       | 13.1             |
| .. | TBS enabled, local NVMe SSD from c6id instance | 21210 | 1.34 | 12.9 |
| c6i.4xlarge or c6id.4xlarge | TBS disabled | 142200 | 1.12 | 0 |
| ..                           | TBS enabled, gp3 volume with the baseline of 3000 IOPS      | 32410                                                                      | 1.71                                       | 19.4             |
| .. | TBS enabled, local NVMe SSD from c6id instance | 47370 | 1.73 | 23.6 |
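
One way to read the table is to express each TBS-enabled ingestion rate as a fraction of the TBS-disabled rate on the same instance size. The short script below is a sketch that only does arithmetic on the numbers copied from the table; the dictionary layout and function name are illustrative choices, not part of APM Server.

```python
# Event ingestion rates in events/s, copied from the table above.
BASELINE = {"2xlarge": 47220, "4xlarge": 142200}  # TBS disabled
TBS_ENABLED = {
    ("2xlarge", "gp3"): 21310,
    ("2xlarge", "nvme"): 21210,
    ("4xlarge", "gp3"): 32410,
    ("4xlarge", "nvme"): 47370,
}


def relative_throughput(size: str, disk: str) -> float:
    """Ingestion rate with TBS enabled as a fraction of the TBS-disabled rate."""
    return TBS_ENABLED[(size, disk)] / BASELINE[size]


if __name__ == "__main__":
    for size, disk in TBS_ENABLED:
        ratio = relative_throughput(size, disk)
        print(f"{size} {disk}: {ratio:.0%} of TBS-disabled throughput")
```

In these reference numbers, the disk type makes little difference on the 2xlarge instances, while on the 4xlarge instances the local NVMe SSD sustains a noticeably higher ingestion rate than the gp3 volume.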

## Sampled data and visualizations [_sampled_data_and_visualizations]
