-
Notifications
You must be signed in to change notification settings - Fork 341
[lldb][swift] Fix StackID comparisons to enable stepping out of async functions #9162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[lldb][swift] Fix StackID comparisons to enable stepping out of async functions #9162
Conversation
f109b2d
to
a1718ed
Compare
Fixed typo in commit message |
Status error; | ||
// The continuation's context is the first field of an async context. | ||
if (!process.ReadMemory(parent_ctx, &parent_ctx, ptr_size, error)) | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's probably worth adding a log here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What kind of logging information did you have in mind? Also, there is no logging presence in this file, so I'm not sure which channel we would use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the very first ReadMemory
fails, then we could log that the original context is unexpectedly bad.
if a subsequent ReadMemory
fails, log the value of the last known context, and its invalid parent context value.
both of those seem like events we should capture somewhere.
maybe @jasonmolenda has a suggestion for which channel, but maybe the step
and/or unwind
categories?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I would emit to the unwind log channel. If a user reports a bug where the async backtrace is truncated, the thing we're most likely to ask for is the unwind log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added logging
lldb/source/Target/StackID.cpp
Outdated
// or indirectly) RHS. Chase continuation pointers to check this case, until | ||
// we hit the end of the chain (parent_ctx == 0) or a safety limit in case of | ||
// an invalid continuation chain. | ||
auto max_num_frames = 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this limit be lower? Do we have an data to guide the choice of limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the absence of data, I'd go with the high number to protect against catastrophic failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am worried the number of memory reads could lead to catastrophic performance.
I added @jasonmolenda as a reviewer, because one of the lessons I learned from one of my previous StackID changes is that doing a ReadMemory
to compare StackIDs can have a non-trivial impact on performance. Doing 1024 memory reads, for a single comparison, could lead to bad problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the worst case scenario, where LHS is older and therefore RHS is not going to be found on the continuation chain, we will do as many memory reads as virtual frames on the stack.
And we only do any memory reads at all if we're comparing virtual frames with each other.
This is pretty much the price of doing business with async, AFAICT there are no other ways of comparing async frames with each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty much the price of doing business with async
I'm not suggesting there's a more efficient way (though there may be improvements to consider, such as caching).
I'm referring to "safety limit in case of an invalid continuation chain". Given invalid data/chain, is 1024 our best choice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah wait, I see the problem you're pointing out. If we have a stub that doesn't support memory regions, or doesn't distinguish between stack & heap, this will say it's not on stack. :/ We can require an affirmative "is heap" check. I'd have to see when I added that to debugserver, and idk what we get on linux.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. if there was a StackID::IsCFAOnStack and a StackID::IsCFAOnHeap, it's possible that both of these may return false, or basically "unknown".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or StackID::IsCFAOnStack should return true
unless (1) we can get the memory region, (2) the memory region is not stack, and (3) the memory region IS heap. Require a positive match of the address in a heap memory region, or else we say it is stack because we just don't know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, we can definitely explore more elaborate solutions, but I think for now we can just flip the conservative answer to be "on stack" always. It's never wrong to assume that (none of the upstream code has this stack vs heap distinction for StackIDs), we'll just fall back to the "regular" unwinding / stepping, which is undesirable but fine for those scenarios
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, you're right, that's a simple fix that will address my worry here.
lldb/source/Target/StackID.cpp
Outdated
// or indirectly) RHS. Chase continuation pointers to check this case, until | ||
// we hit the end of the chain (parent_ctx == 0) or a safety limit in case of | ||
// an invalid continuation chain. | ||
auto max_num_frames = 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the absence of data, I'd go with the high number to protect against catastrophic failure.
lldb/source/Target/StackID.cpp
Outdated
max_num_frames--) { | ||
Status error; | ||
// The continuation's context is the first field of an async context. | ||
if (!process.ReadMemory(parent_ctx, &parent_ctx, ptr_size, error)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nbd but I would use Process::ReadPointerFromMemory() here.
Status error; | ||
// The continuation's context is the first field of an async context. | ||
if (!process.ReadMemory(parent_ctx, &parent_ctx, ptr_size, error)) | ||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I would emit to the unwind log channel. If a user reports a bug where the async backtrace is truncated, the thing we're most likely to ask for is the unwind log.
lldb/source/Target/StackID.cpp
Outdated
// or indirectly) RHS. Chase continuation pointers to check this case, until | ||
// we hit the end of the chain (parent_ctx == 0) or a safety limit in case of | ||
// an invalid continuation chain. | ||
auto max_num_frames = 1024; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a perf point of view, I'm less concerned about the memory read because we should only hit this codepath when we have at least one StackID that's not in stack memory, right. We will have targets that we connect to (e.g. JTAG debuggers, maybe Linux) where we may not know if a memory region is stack or heap memory. If we started treating StackIDs in "unknown memory type" as "could be in heap" and start dereferencing them, I'd have concerns. This is REALLY only appropriate when we're working with swift stack frames but I don't think we've encoded that anywhere, so we have to consider unusual environments were this might get incorrectly triggered.
The optimization in swiftlang#7877 accidentally changed the default answer for this function. It's important to default to "On Stack" when we don't know, because that's where most CFAs live, and because we don't want to trigger the expensive heap comparisons unless we're absolutely sure the CFA is on the heap.
Currently, if we're comparing a StackID on the stack with one StackID on the heap, we perform a special kind of IsYounger comparison. However, if `IsYounger(A, b)` returns true, it must be the case that `IsYounger(B, A)` returns false. But this is not necessarily the case today because, before this patch, we would fallback to the traditional "stack" comparison, which is wrong if one of the StackIDs is on the heap.
a1718ed
to
c044ec1
Compare
For two StackIDs whose CFAs are on the heap, i.e., they are both async frames, we can decide which one is younger by finding one async context in the continuation chain of the other. A lot of stepping algorithms rely on frame comparisons. In particular, this patch fixes stepping out of async functions, and makes strides towards fixing other stepping algorithms. The explanation below is not needed to understand this patch, but it sheds some light into why the most common kind of step out fails. Consider the case where a breakpoint was set inside some async function, the program hit that breakpoint, and then a "step out" operation is performed. This is a very common use case. The thread plan looks like this: 0. Single step past breakpoint. 1. Step out LLDB performs a single step, and thread plan 0 says "continue running, I'm done". Before continuing program execution, LLDB asks all thread plans, in this case there's only the step out plan, if it should stop. To answer the "should stop" question, the step out plan compares the current frame (which is the same as when step out was pushed, since LLDB has only performed a single step so far) with its target frame (which is just one frame above). In this scenario, the current frame is async, and so the parent frame must also be async. The StepOut plan is going to compare two async StackIDs. Using the incorrect comparison, StepOut concludes the current frame is _older_ than the target frame. In other words, the program reached an older frame without ever stopping on the target frame. Something weird must have happened, and the plan concludes it is done.
c044ec1
to
8d7935d
Compare
@jasonmolenda I've added a commit fixing the default return value of "IsCFAOnHeap" to conservatively say no if we don't have information. Addressed all review comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, thanks for digging in to this corner.
This should be the final two fixes we need in order to enable stepping out of async functions.
The first commit makes the stack comparison operation symmetric, so that if
IsYounger(A, B)
returns true, thenIsYounger(B, A)
returns false.The second commit implements StackID comparisons in the case where both CFAs are on the heap.
Pasted the commit messages below for ease of reading.