Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LocalVariables integration - race condition processing of cached frames #13415

Closed
3 tasks done
Bruno-DaSilva opened this issue Aug 17, 2024 · 11 comments · Fixed by #13501
Closed
3 tasks done

LocalVariables integration - race condition processing of cached frames #13415

Bruno-DaSilva opened this issue Aug 17, 2024 · 11 comments · Fixed by #13501
Assignees
Labels
Package: node Issues related to the Sentry Node SDK

Comments

@Bruno-DaSilva
Copy link

Bruno-DaSilva commented Aug 17, 2024

Is there an existing issue for this?

How do you use Sentry?

Sentry Saas (sentry.io)

Which SDK are you using?

@sentry/node

SDK Version

8.26.0

Framework Version

Node v20.12.2

Link to Sentry event

First error's local variables (endpoint fees/33/fee) are blank: https://clearbanc.sentry.io/issues/5724968061/events/13ac4e69d73e4cb6b79160830ef972f7/

First error's local variables are on the SECOND error's call (endpoint fees/44/fee) https://clearbanc.sentry.io/issues/5724968061/events/f697d4fe012e408d8fea4241d6d3c786/

Reproduction Example/SDK Setup

SDK setup:

Sentry.init({
    dsn: process.env.SENTRY_DSN,
    includeLocalVariables: true,
});

Unfortunately because it's a race condition it may be hard to reproduce without very specific code, error behaviour, and hardware. Hopefully you can trust that it does happen as an edge case and we can find some way to gate frame checks to wait for the debugger or something?

Steps to Reproduce

  1. startup the nodejs app fresh
  2. init sentry with includeLocalVariables: true
  3. trigger/capture an exception with local var state A
  4. trigger/capture an exception with local var state B

Expected Result

Sentry UI should show exception #1 with state A and exception #2 with state B

Actual Result

Sentry UI shows exception #1 with NO state, and exception #2 with state A.

I dug into this and it appears there is a race condition with cachedFrames where we check the LRU cache for frames before the worker sends the frame info message to populate the cache. So then the second time the error is captured, it will receive the previous frame's information that was never consumed the first time. And then the worker sends the second frame's late, repeating ad nauseum.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Aug 17, 2024
@github-actions github-actions bot added the Package: node Issues related to the Sentry Node SDK label Aug 17, 2024
@Bruno-DaSilva
Copy link
Author

I'm hoping with the digging I've done to root cause these it shouldn't require a high effort repro example -- but if it's needed I can create one (it'll just take me some time that I don't immediately have).

@Bruno-DaSilva
Copy link
Author

Bruno-DaSilva commented Aug 18, 2024

For some clarity/context:

This happens inconsistently but often in my application - I notice when on first startup, the first time an error is captured it will sometimes NOT have local variables attached to it.

Some extra logging I added shows that this is because the worker sends the callFrames after the exception is processed - a race condition.

To make this more easy to test/reproduce, you can add a temporary sleep to the worker before it sends the variables over.

You can see the result in our sentry instance here:


Personally, I'm thinking there should be some mechanism where the exception capturing half will wait for the worker to report on the exception. There's some problems with that though:

  1. What if an error is new Error()'d but never thrown? then we will wait for the worker which will never trigger on a thrown exception.
  2. How can we differentiate between the debugger/variable info from last error that happened, and the latest instance? Both have the same frame hash.
  3. How does sampling errors play into this? Will a sentry exception that is not sampled ever go through the local variable flow, to pop from the LRU? If not, then stale entries might stick around in the LRU and picked up by the next error that IS sampled.
  4. How to handle local variable rate limiting, where we stop reporting for all exceptions?

So, maybe an even more ideal mechanism comes if we have a way of somehow linking the sentry exceptions with the exact debugger frame related to it, instead of just hoping they match. Then we don't have this weird additional behaviour with frames colliding.

Thoughts?

@Lms24
Copy link
Member

Lms24 commented Aug 19, 2024

Hey @Bruno-DaSilva thanks for writing in!

We'll look into your issue next week as this week is Hackweek at Sentry (see #13421).

@timfish
Copy link
Collaborator

timfish commented Aug 27, 2024

I think you're right, the simplest solution is to delay before checking for local variables and we can include a timeout.

@timfish timfish self-assigned this Aug 27, 2024
@Bruno-DaSilva
Copy link
Author

@timfish one wrench in that I'm thinking about is it may still have a race condition, albeit a smaller one: what if two similar requests/exceptions come in at once and both hit the debugger? Both exceptions will save in the LRU in the same key so only one will persist, and then the delay will finish and it will be attached to just one of the two Sentry.captureExceptions, right?

What do you think? It's all rooted in the fact we can't match the thrown exceptions to their exact parent request from Sentry.captureException.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Aug 27, 2024
@timfish
Copy link
Collaborator

timfish commented Aug 27, 2024

Both exceptions will save in the LRU in the same key so only one will persist, and then the delay will finish and it will be attached to just one of the two

Yeah it's tricky. We could put the exception variables in a stack and only read from one end and we'd be more sure that we're getting them in the correct order. This would likely be reliable if we only had to consider uncaught exceptions. However, for caught exceptions there's nothing stopping users from calling Sentry.captureException after another exception has occurred/been caught so we can't really rely on ordering.

@Bruno-DaSilva
Copy link
Author

Bruno-DaSilva commented Aug 27, 2024

I wonder if we could generate+inject a unique uuid on exception creation or exception throw that could then be used to match? I can't immediately think of a way of doing that for every possible exception without patching node internals though...

EDIT: could we use the memory address for this (or perhaps some other very unique, unchanging id that might exist in an exception) for the exception?

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Aug 27, 2024
@timfish
Copy link
Collaborator

timfish commented Aug 27, 2024

I wonder if we could generate+inject a unique uuid on exception creation or exception throw that could then be used to match?

From the debugger we can modify almost anything in the app state so I wonder if we could write a unique ID property the error object which we could then read off later in runtime as it passes through the Sentry SDK 🤔

@timfish
Copy link
Collaborator

timfish commented Aug 27, 2024

I wonder if we could write a unique ID property the error object

So I don't think we can't do this at the point when the debugger is initially paused. We'd need to step further on in execution until the error object is part of the scope. This would again open us up to race conditions with multiple exceptions occurring closely together.

@Bruno-DaSilva
Copy link
Author

One (terrible) option is to monkeypatch global.Error. But... that perhaps is too scary and dangerous.

@getsantry getsantry bot moved this to Waiting for: Product Owner in GitHub Issues with 👀 3 Aug 27, 2024
@Bruno-DaSilva
Copy link
Author

Bruno-DaSilva commented Aug 27, 2024

From the debugger we can modify almost anything in the app state so I wonder if we could write a unique ID property the error object which we could then read off later in runtime as it passes through the Sentry SDK 🤔

@timfish I think I found a way to do what you were looking for. Check this out:

  1. Worker receives a pause event with data.objectId defined. Turns out, this objectId is for the Error object 🎉
    Image

  2. Worker calls function on the object to set a new property

await session.post('Runtime.callFunctionOn', { functionDeclaration: 'function() { this.brunosId = "asfg"; }', objectId: data.objectId });
  1. exception object grabbed by sentry.captureException contains this.brunoId which is a new property written in step 2, so long as it waits long enough for step 2 to finish
    Image

What do you think?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Package: node Issues related to the Sentry Node SDK
Projects
Archived in project
5 participants