feat(replay): Keep 30-60s instead of 0-60s of recording in replay error mode #6924
Conversation
Force-pushed from 02a0131 to 751ec4a
this.clear();

try {
  return await this._compressEvents(this._getAndIncrementId(), pendingEvents);
This change means that if something goes wrong when processing, we don't error out but just send the events uncompressed.
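A hedged sketch of that fallback behavior (the names and types here are illustrative stand-ins, not the actual SDK internals): if compression throws for any reason, the flush returns the raw events instead of failing.

```typescript
// Sketch of the fallback described above (illustrative names, not the
// actual SDK code): if compression fails, send the events uncompressed
// rather than erroring out.
type RecordingEvent = { type: number; timestamp: number; data?: unknown };

async function finishRecording(
  pendingEvents: RecordingEvent[],
  compressEvents: (events: RecordingEvent[]) => Promise<Uint8Array>,
): Promise<Uint8Array | string> {
  try {
    return await compressEvents(pendingEvents);
  } catch {
    // Fallback: the uncompressed JSON payload is still a valid segment body.
    return JSON.stringify(pendingEvents);
  }
}
```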
@@ -321,11 +322,10 @@ describe('Integration | errorSampleRate', () => {
    });
  });

- it('has correct timestamps when error occurs much later than initial pageload/checkout', async () => {
+ it('keeps up to the last two checkout events xxx', async () => {
Note: I tried to also add a test here that makes sure we clear out previous stuff after the previous checkout, but ran into many test issues here. So I ended up testing this in different places, mostly in addEvent.test as well as in eventBuffer.test.
Force-pushed from 751ec4a to a1807d0
size-limit report 📦
Force-pushed from 5238091 to 3a8f9f0
I still want to think about this some more, but a lot of these changes make sense.
The one thing I'm not sure about yet is compressing the entire buffer at once instead of having streaming updates. It will make it difficult for us to later break up large segments into multiple envelopes (e.g. if we wanted to ensure that segments are under 64KB so that we can use keepalive).
packages/replay/src/util/addEvent.ts
// Ensure we have the correct first checkout timestamp when an error occurs
if (!session.segmentId) {
  replay.getContext().earliestEvent = eventBuffer.getFirstCheckoutTimestamp();
The timestamp of the first checkout is not necessarily the earliest event. This is because we add performance entries at flush time and events there can happen before rrweb checkout.
hmm, good point! I wonder what the best way to solve this is then... I guess the most correct approach is to check this at first flush time, and just get the earliest timestamp from the events array? WDYT?
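A minimal sketch of that suggestion (the event shape is assumed for illustration, not the actual SDK type): at first flush, derive the earliest timestamp directly from the buffered events rather than from the first checkout.

```typescript
// Sketch: compute the earliest timestamp from the events array at flush
// time, since performance entries added at flush can predate the rrweb
// checkout. The event shape here is an assumption for illustration.
interface BufferedEvent {
  timestamp: number;
}

function getEarliestTimestamp(events: BufferedEvent[]): number | null {
  if (events.length === 0) {
    return null;
  }
  return events.reduce((min, event) => Math.min(min, event.timestamp), Infinity);
}
```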
const pendingEvents = this.pendingEvents.slice();
this.clear();

return Promise.resolve(this._finishRecording(pendingEvents));
make function async?
- return Promise.resolve(this._finishRecording(pendingEvents));
+ return this._finishRecording(pendingEvents);
We need to return a promise here to satisfy the interface, so need to wrap this.
if it's async, it will return a promise in any case :)
it's currently not async, though!
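For illustration, the difference this thread is circling: a non-async function must wrap its return value explicitly to satisfy a Promise-returning interface, while an async function wraps it automatically.

```typescript
// Non-async: the synchronous result must be wrapped explicitly to satisfy
// a Promise-returning interface.
function finishSync(events: string[]): Promise<string[]> {
  return Promise.resolve(events);
}

// Async: the return value is wrapped in a Promise automatically, so no
// explicit Promise.resolve is needed.
async function finishAsync(events: string[]): Promise<string[]> {
  return events;
}
```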
Force-pushed from 3a8f9f0 to 23828b2
Force-pushed from 4c5f3b3 to cb6be34
Superseded by new PR: #7025
Currently, we keep up to 60s of recording data in buffer, for when an error occurs. This means that if you have bad luck, you may only get a second of replay recording when an error happens.
This PR changes this by making sure we keep one full checkout worth of data in the cache, meaning you should get 30-60s of recording data instead (by also reducing the checkout time to every 30s instead of every 60s).
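The checkout-based eviction can be modeled roughly like this (a simplified sketch, not the actual eventBuffer implementation): whenever a new checkout arrives, everything before the previous checkout is dropped, so the buffer always holds one to two checkout intervals of events (30-60s at a 30s checkout interval).

```typescript
// Simplified model of the "keep up to the last two checkouts" behavior
// described above (not the actual eventBuffer code).
type BufferEvent = { timestamp: number; isCheckout?: boolean };

class ErrorModeBuffer {
  private events: BufferEvent[] = [];
  private previousCheckoutIndex = 0;

  addEvent(event: BufferEvent): void {
    if (event.isCheckout) {
      // Drop everything before the previous checkout; the previous interval
      // plus the new checkout covers 30-60s of recording.
      this.events = this.events.slice(this.previousCheckoutIndex);
      this.previousCheckoutIndex = this.events.length;
    }
    this.events.push(event);
  }

  getEvents(): BufferEvent[] {
    return this.events;
  }
}
```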
This was a bit more involved than anticipated, but in the process I also streamlined the worker code a bit. Now, the worker is stateless. Since we already keep the pending events in the main script memory anyhow, we can just send all events along when we want to compress them. This allows us to simplify this code quite a bit, especially with the changes here where the buffer has to hold a bit more state so we can clear only the desired cache.
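The stateless-worker protocol might look roughly like this (the message shape and names are assumptions for illustration; a real implementation would also run a compressor over the serialized events): every request carries the full pending event list, so the worker keeps no state between messages.

```typescript
// Sketch of a stateless compression worker handler (illustrative message
// shape, not the actual SDK worker protocol).
type CompressRequest = { id: number; events: unknown[] };
type CompressResponse = { id: number; payload: string };

function handleCompressRequest(request: CompressRequest): CompressResponse {
  // No state is read or written between requests: the full event list
  // arrives with every message, so the handler is a pure function.
  return { id: request.id, payload: JSON.stringify(request.events) };
}
```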
The only potential downside of this is that theoretically, the local cache could be very large, and we may transfer a single very large payload to the worker. However, my research showed that payloads up to 100-200MB should not be a problem, and I think we're unlikely to exceed this in 60 seconds. Worst case, compression would fail, in which case we fall back to just using the uncompressed payload.
This change also means that `addEvent` and `clear` can be sync again, as we always just write to local memory. Only `finish` has to be async, which also simplifies some things a bit!

This also fixes an issue where the initial event sent is usually uncompressed. This is because we immediately send the initial snapshot, at which time the worker is usually not loaded. This triggers the fallback behavior of sending the event uncompressed. With this PR, for the initial payload we wait for the worker to be loaded before sending, to avoid this.
Closes #6908
ref #6923