Skip to content

feat(replay): Keep 30-60s instead of 0-60s of recording in replay errror mode #6924

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

mydea
Copy link
Member

@mydea mydea commented Jan 24, 2023

Currently, we keep up to 60s of recording data in buffer, for when an error occurs. This means that if you have bad luck, you may only get a second of replay recording when an error happens.

This PR changes this by making sure we keep one full checkout worth of data in the cache, meaning you should get 30-60s of recording data instead (by also reducing the checkout time to every 30s instead of every 60s).

This was a bit more involved than anticipated, but in the process I also streamlined the worker code a bit. Now, the worker is stateless. Since we already keep the pending events in the main script memory anyhow, we can just send all events along when we want to compress them. This allows us to simplify this code quite a bit, especially with the changes here where the buffer has to hold a bit more state so we can clear only the desired cache.

The only potential downside of this is that theoretically, the local cache could be very large, and we may transfer a single very large payload to the worker. However, my research showed that payloads up to 100-200MB should not be a problem, and I think we're unlikely to exceed this in 60 seconds. Worst case, compression would fail, in which case we fall back to just use the uncompressed payload.

This change also means that addEvent and clear can be sync again, as we always just write to local memory. Only finish has to be async, which also simplifies some things a bit!

This also fixes an issue where the initial event sent is usually uncompressed. This is because we immediately send the initial snapshot, at which time the worker is usually not loaded. This triggers the fallback behavior of sending the event uncompressed. With this PR, for the initial payload we wait for the worker to be loaded to send, to avoid this.

Closes #6908

ref #6923

@mydea mydea added the Package: replay Issues related to the Sentry Replay SDK label Jan 24, 2023
@mydea mydea requested review from billyvg and Lms24 January 24, 2023 19:05
@mydea mydea self-assigned this Jan 24, 2023
@mydea mydea force-pushed the fn/replay-error-mode-cache branch 2 times, most recently from 02a0131 to 751ec4a Compare January 24, 2023 19:10
this.clear();

try {
return await this._compressEvents(this._getAndIncrementId(), pendingEvents);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change means that if something goes wrong when processing, we don't error out but just send the events uncompressed.

@@ -321,11 +322,10 @@ describe('Integration | errorSampleRate', () => {
});
});

it('has correct timestamps when error occurs much later than initial pageload/checkout', async () => {
it('keeps up to the last two checkout events xxx', async () => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I tried to also add a test here that makes sure we clear out previous stuff after the previous checkout, but ran into many test issues here. So I ended up testing this in different places, mostly in addEvent.test as well as in eventBuffer.test.

@mydea mydea force-pushed the fn/replay-error-mode-cache branch from 751ec4a to a1807d0 Compare January 24, 2023 19:14
@github-actions
Copy link
Contributor

github-actions bot commented Jan 24, 2023

size-limit report 📦

Path Size
@sentry/browser - ES5 CDN Bundle (gzipped + minified) 19.85 KB (+0.01% 🔺)
@sentry/browser - ES5 CDN Bundle (minified) 61.55 KB (0%)
@sentry/browser - ES6 CDN Bundle (gzipped + minified) 18.52 KB (+0.01% 🔺)
@sentry/browser - ES6 CDN Bundle (minified) 54.85 KB (0%)
@sentry/browser - Webpack (gzipped + minified) 20.24 KB (0%)
@sentry/browser - Webpack (minified) 66.25 KB (0%)
@sentry/react - Webpack (gzipped + minified) 20.27 KB (0%)
@sentry/nextjs Client - Webpack (gzipped + minified) 47.56 KB (0%)
@sentry/browser + @sentry/tracing - ES5 CDN Bundle (gzipped + minified) 26.77 KB (+0.01% 🔺)
@sentry/browser + @sentry/tracing - ES6 CDN Bundle (gzipped + minified) 25.06 KB (+0.01% 🔺)
@sentry/replay ES6 CDN Bundle (gzipped + minified) 44.15 KB (+0.22% 🔺)
@sentry/replay - Webpack (gzipped + minified) 38.87 KB (+0.16% 🔺)
@sentry/browser + @sentry/tracing + @sentry/replay - ES6 CDN Bundle (gzipped + minified) 61.47 KB (+0.16% 🔺)

@mydea mydea changed the title feat(replay): Keep min. 60s instead of max. 60s of recording in replay errror mode feat(replay): Keep 30-60s instead of 0-60s of recording in replay errror mode Jan 24, 2023
@mydea mydea force-pushed the fn/replay-error-mode-cache branch 3 times, most recently from 5238091 to 3a8f9f0 Compare January 24, 2023 21:00
Copy link
Member

@billyvg billyvg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still want to think about this some more, but a lot of these changes make sense.

The one thing I'm not sure about yet is compressing the entire buffer at once instead of having streaming updates. It will make it difficult for us to later break-up large segments into multiple envelopes (e.g. if we wanted to ensure that segments are under 64KB so that we can use keepalive).


// Ensure we have the correct first checkout timestamp when an error occurs
if (!session.segmentId) {
replay.getContext().earliestEvent = eventBuffer.getFirstCheckoutTimestamp();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timestamp of the first checkout is not necessarily the earliest event. This is because we add performance entries at flush time and events there can happen before rrweb checkout.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, good point! I wonder what's the best way to solve this is then... I guess most correct is to check this at first flush time, and just get the earliest timestamp from the events array? WDYT?

const pendingEvents = this.pendingEvents.slice();
this.clear();

return Promise.resolve(this._finishRecording(pendingEvents));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make function async?

Suggested change
return Promise.resolve(this._finishRecording(pendingEvents));
return this._finishRecording(pendingEvents);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to return a promise here to satisfy the interface, so need to wrap this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it's async, it will return an promise in any case :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's currently not async, though!

@mydea mydea force-pushed the fn/replay-error-mode-cache branch from 3a8f9f0 to 23828b2 Compare January 30, 2023 10:45
@mydea
Copy link
Member Author

mydea commented Feb 1, 2023

Superseded this by new PR: #7025

@mydea mydea closed this Feb 1, 2023
@AbhiPrasad AbhiPrasad deleted the fn/replay-error-mode-cache branch March 27, 2025 15:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Package: replay Issues related to the Sentry Replay SDK
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Keep min. 30 seconds for replay in error mode
3 participants