Skip to content

[lldb][swift] Fix StackID comparisons to enable stepping out of async functions #9162

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

felipepiovezan
Copy link

@felipepiovezan felipepiovezan commented Aug 21, 2024

This should be the final two fixes we need in order to enable stepping out of async functions.

The first commit makes the stack comparison operation symmetric, so that if IsYounger(A, B) returns true, then IsYounger(B, A) returns false.

The second commit implements StackID comparisons in the case where both CFAs are on the heap.

Pasted the commit messages below for ease of reading.


[lldb] Make StackID comparison symmetric

Currently, if we're comparing a StackID on the stack with one StackID on the
heap, we perform a special kind of IsYounger comparison. However, if
`IsYounger(A, b)` returns true, it must be the case that `IsYounger(B, A)`
returns false. But this is not necessarily the case today because, before this
patch, we would fallback to the traditional "stack" comparison, which is wrong
if one of the StackIDs is on the heap.

[lldb] Make StackID comparison search parent heap CFAs

For two StackIDs whose CFAs are on the heap, i.e., they are both async frames,
we can decide which one is younger by finding one async context in the
continuation chain of the other.

A lot of stepping algorithms rely on frame comparisons. In particular, this
patch fixes stepping out of async functions, and makes strides towards fixing
other stepping algorithms.

The explanation below is not needed to understand this patch, but it sheds some
light into why the most common kind of step out fails.

Consider the case where a breakpoint was set inside some async function, the
program hit that breakpoint, and then a "step out" operation is performed. This
is a very common use case.

The thread plan looks like this:

0. Single step past breakpoint.
1. Step out

LLDB performs a single step, and thread plan 0 says "continue running, I'm
done". Before continuing program execution, LLDB asks all thread plans, in this
case there's only the step out plan, if it should stop.

To answer the "should stop" question, the step out plan compares the current
frame (which is the same as when step out was pushed, since LLDB has only
performed a single step so far) with its target frame (which is just one frame
above). In this scenario, the current frame is async, and so the parent frame
must also be async. The StepOut plan is going to compare two async StackIDs.

Using the incorrect comparison, StepOut concludes the current frame is _older_
than the target frame. In other words, the program reached an older frame
without ever stopping on the target frame. Something weird must have happened,
and the plan concludes it is done.

@felipepiovezan
Copy link
Author

Fixed typo in commit message

Status error;
// The continuation's context is the first field of an async context.
if (!process.ReadMemory(parent_ctx, &parent_ctx, ptr_size, error))
break;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's probably worth adding a log here

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What kind of logging information did you have in mind? Also, there is no logging presence in this file, so I'm not sure which channel we would use.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the very first ReadMemory fails, then we could log that the original context is unexpectedly bad.

if a subsequent ReadMemory fails, log the value of the last known context, and its invalid parent context value.

both of those seem like events we should capture somewhere.

maybe @jasonmolenda has a suggestion for which channel, but maybe the step and/or unwind categories?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would emit to the unwind log channel. If a user reports a bug where the async backtrace is truncated, the thing we're most likely to ask for is the unwind log.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added logging

// or indirectly) RHS. Chase continuation pointers to check this case, until
// we hit the end of the chain (parent_ctx == 0) or a safety limit in case of
// an invalid continuation chain.
auto max_num_frames = 1024;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this limit be lower? Do we have an data to guide the choice of limit?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the absence of data, I'd go with the high number to protect against catastrophic failure.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am worried the number of memory reads could lead to catastrophic performance.

I added @jasonmolenda as a reviewer, because one of the lessons I learned from one of my previous StackID changes is that doing a ReadMemory to compare StackIDs can have a non-trivial impact on performance. Doing 1024 memory reads, for a single comparison, could lead to bad problems.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the worst case scenario, where LHS is older and therefore RHS is not going to be found on the continuation chain, we will do as many memory reads as virtual frames on the stack.

And we only do any memory reads at all if we're comparing virtual frames with each other.

This is pretty much the price of doing business with async, AFAICT there are no other ways of comparing async frames with each other.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty much the price of doing business with async

I'm not suggesting there's a more efficient way (though there may be improvements to consider, such as caching).

I'm referring to "safety limit in case of an invalid continuation chain". Given invalid data/chain, is 1024 our best choice?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah wait, I see the problem you're pointing out. If we have a stub that doesn't support memory regions, or doesn't distinguish between stack & heap, this will say it's not on stack. :/ We can require an affirmative "is heap" check. I'd have to see when I added that to debugserver, and idk what we get on linux.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. if there was a StackID::IsCFAOnStack and a StackID::IsCFAOnHeap, it's possible that both of these may return false, or basically "unknown".

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or StackID::IsCFAOnStack should return true unless (1) we can get the memory region, (2) the memory region is not stack, and (3) the memory region IS heap. Require a positive match of the address in a heap memory region, or else we say it is stack because we just don't know.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, we can definitely explore more elaborate solutions, but I think for now we can just flip the conservative answer to be "on stack" always. It's never wrong to assume that (none of the upstream code has this stack vs heap distinction for StackIDs), we'll just fall back to the "regular" unwinding / stepping, which is undesirable but fine for those scenarios

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, you're right, that's a simple fix that will address my worry here.

// or indirectly) RHS. Chase continuation pointers to check this case, until
// we hit the end of the chain (parent_ctx == 0) or a safety limit in case of
// an invalid continuation chain.
auto max_num_frames = 1024;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the absence of data, I'd go with the high number to protect against catastrophic failure.

max_num_frames--) {
Status error;
// The continuation's context is the first field of an async context.
if (!process.ReadMemory(parent_ctx, &parent_ctx, ptr_size, error))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nbd but I would use Process::ReadPointerFromMemory() here.

Status error;
// The continuation's context is the first field of an async context.
if (!process.ReadMemory(parent_ctx, &parent_ctx, ptr_size, error))
break;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I would emit to the unwind log channel. If a user reports a bug where the async backtrace is truncated, the thing we're most likely to ask for is the unwind log.

// or indirectly) RHS. Chase continuation pointers to check this case, until
// we hit the end of the chain (parent_ctx == 0) or a safety limit in case of
// an invalid continuation chain.
auto max_num_frames = 1024;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a perf point of view, I'm less concerned about the memory read because we should only hit this codepath when we have at least one StackID that's not in stack memory, right. We will have targets that we connect to (e.g. JTAG debuggers, maybe Linux) where we may not know if a memory region is stack or heap memory. If we started treating StackIDs in "unknown memory type" as "could be in heap" and start dereferencing them, I'd have concerns. This is REALLY only appropriate when we're working with swift stack frames but I don't think we've encoded that anywhere, so we have to consider unusual environments were this might get incorrectly triggered.

The optimization in swiftlang#7877
accidentally changed the default answer for this function.

It's important to default to "On Stack" when we don't know, because that's where
most CFAs live, and because we don't want to trigger the expensive heap
comparisons unless we're absolutely sure the CFA is on the heap.
Currently, if we're comparing a StackID on the stack with one StackID on the
heap, we perform a special kind of IsYounger comparison. However, if
`IsYounger(A, b)` returns true, it must be the case that `IsYounger(B, A)`
returns false. But this is not necessarily the case today because, before this
patch, we would fallback to the traditional "stack" comparison, which is wrong
if one of the StackIDs is on the heap.
For two StackIDs whose CFAs are on the heap, i.e., they are both async frames,
we can decide which one is younger by finding one async context in the
continuation chain of the other.

A lot of stepping algorithms rely on frame comparisons. In particular, this
patch fixes stepping out of async functions, and makes strides towards fixing
other stepping algorithms.

The explanation below is not needed to understand this patch, but it sheds some
light into why the most common kind of step out fails.

Consider the case where a breakpoint was set inside some async function, the
program hit that breakpoint, and then a "step out" operation is performed. This
is a very common use case.

The thread plan looks like this:

0. Single step past breakpoint.
1. Step out

LLDB performs a single step, and thread plan 0 says "continue running, I'm
done". Before continuing program execution, LLDB asks all thread plans, in this
case there's only the step out plan, if it should stop.

To answer the "should stop" question, the step out plan compares the current
frame (which is the same as when step out was pushed, since LLDB has only
performed a single step so far) with its target frame (which is just one frame
above). In this scenario, the current frame is async, and so the parent frame
must also be async. The StepOut plan is going to compare two async StackIDs.

Using the incorrect comparison, StepOut concludes the current frame is _older_
than the target frame. In other words, the program reached an older frame
without ever stopping on the target frame. Something weird must have happened,
and the plan concludes it is done.
@felipepiovezan
Copy link
Author

@jasonmolenda I've added a commit fixing the default return value of "IsCFAOnHeap" to conservatively say no if we don't have information.

Addressed all review comments

Copy link

@jasonmolenda jasonmolenda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks for digging in to this corner.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants