[lldb][swift] Fix StackID comparisons to enable stepping out of async functions #9162

felipepiovezan · 2024-08-21T16:22:18Z

This should be the final two fixes we need in order to enable stepping out of async functions.

The first commit makes the stack comparison operation symmetric, so that if IsYounger(A, B) returns true, then IsYounger(B, A) returns false.

The second commit implements StackID comparisons in the case where both CFAs are on the heap.

Pasted the commit messages below for ease of reading.

[lldb] Make StackID comparison symmetric

Currently, if we're comparing a StackID on the stack with one StackID on the
heap, we perform a special kind of IsYounger comparison. However, if
`IsYounger(A, b)` returns true, it must be the case that `IsYounger(B, A)`
returns false. But this is not necessarily the case today because, before this
patch, we would fallback to the traditional "stack" comparison, which is wrong
if one of the StackIDs is on the heap.

[lldb] Make StackID comparison search parent heap CFAs

For two StackIDs whose CFAs are on the heap, i.e., they are both async frames,
we can decide which one is younger by finding one async context in the
continuation chain of the other.

A lot of stepping algorithms rely on frame comparisons. In particular, this
patch fixes stepping out of async functions, and makes strides towards fixing
other stepping algorithms.

The explanation below is not needed to understand this patch, but it sheds some
light into why the most common kind of step out fails.

Consider the case where a breakpoint was set inside some async function, the
program hit that breakpoint, and then a "step out" operation is performed. This
is a very common use case.

The thread plan looks like this:

0. Single step past breakpoint.
1. Step out

LLDB performs a single step, and thread plan 0 says "continue running, I'm
done". Before continuing program execution, LLDB asks all thread plans, in this
case there's only the step out plan, if it should stop.

To answer the "should stop" question, the step out plan compares the current
frame (which is the same as when step out was pushed, since LLDB has only
performed a single step so far) with its target frame (which is just one frame
above). In this scenario, the current frame is async, and so the parent frame
must also be async. The StepOut plan is going to compare two async StackIDs.

Using the incorrect comparison, StepOut concludes the current frame is _older_
than the target frame. In other words, the program reached an older frame
without ever stopping on the target frame. Something weird must have happened,
and the plan concludes it is done.

felipepiovezan · 2024-08-21T16:58:06Z

Fixed typo in commit message

kastiglione · 2024-08-21T17:50:20Z

lldb/source/Target/StackID.cpp

+    Status error;
+    // The continuation's context is the first field of an async context.
+    if (!process.ReadMemory(parent_ctx, &parent_ctx, ptr_size, error))
+      break;


it's probably worth adding a log here

What kind of logging information did you have in mind? Also, there is no logging presence in this file, so I'm not sure which channel we would use.

if the very first ReadMemory fails, then we could log that the original context is unexpectedly bad.

if a subsequent ReadMemory fails, log the value of the last known context, and its invalid parent context value.

both of those seem like events we should capture somewhere.

maybe @jasonmolenda has a suggestion for which channel, but maybe the step and/or unwind categories?

Yeah I would emit to the unwind log channel. If a user reports a bug where the async backtrace is truncated, the thing we're most likely to ask for is the unwind log.

added logging

kastiglione · 2024-08-21T17:54:20Z

lldb/source/Target/StackID.cpp

+  // or indirectly) RHS. Chase continuation pointers to check this case, until
+  // we hit the end of the chain (parent_ctx == 0) or a safety limit in case of
+  // an invalid continuation chain.
+  auto max_num_frames = 1024;


should this limit be lower? Do we have an data to guide the choice of limit?

In the absence of data, I'd go with the high number to protect against catastrophic failure.

I am worried the number of memory reads could lead to catastrophic performance.

I added @jasonmolenda as a reviewer, because one of the lessons I learned from one of my previous StackID changes is that doing a ReadMemory to compare StackIDs can have a non-trivial impact on performance. Doing 1024 memory reads, for a single comparison, could lead to bad problems.

In the worst case scenario, where LHS is older and therefore RHS is not going to be found on the continuation chain, we will do as many memory reads as virtual frames on the stack.

And we only do any memory reads at all if we're comparing virtual frames with each other.

This is pretty much the price of doing business with async, AFAICT there are no other ways of comparing async frames with each other.

This is pretty much the price of doing business with async

I'm not suggesting there's a more efficient way (though there may be improvements to consider, such as caching).

I'm referring to "safety limit in case of an invalid continuation chain". Given invalid data/chain, is 1024 our best choice?

Ah wait, I see the problem you're pointing out. If we have a stub that doesn't support memory regions, or doesn't distinguish between stack & heap, this will say it's not on stack. :/ We can require an affirmative "is heap" check. I'd have to see when I added that to debugserver, and idk what we get on linux.

e.g. if there was a StackID::IsCFAOnStack and a StackID::IsCFAOnHeap, it's possible that both of these may return false, or basically "unknown".

Or StackID::IsCFAOnStack should return true unless (1) we can get the memory region, (2) the memory region is not stack, and (3) the memory region IS heap. Require a positive match of the address in a heap memory region, or else we say it is stack because we just don't know.

I mean, we can definitely explore more elaborate solutions, but I think for now we can just flip the conservative answer to be "on stack" always. It's never wrong to assume that (none of the upstream code has this stack vs heap distinction for StackIDs), we'll just fall back to the "regular" unwinding / stepping, which is undesirable but fine for those scenarios

Yep, you're right, that's a simple fix that will address my worry here.

adrian-prantl · 2024-08-21T18:12:49Z

lldb/source/Target/StackID.cpp

+  // or indirectly) RHS. Chase continuation pointers to check this case, until
+  // we hit the end of the chain (parent_ctx == 0) or a safety limit in case of
+  // an invalid continuation chain.
+  auto max_num_frames = 1024;


In the absence of data, I'd go with the high number to protect against catastrophic failure.

jasonmolenda · 2024-08-21T23:51:30Z

lldb/source/Target/StackID.cpp

+       max_num_frames--) {
+    Status error;
+    // The continuation's context is the first field of an async context.
+    if (!process.ReadMemory(parent_ctx, &parent_ctx, ptr_size, error))


nbd but I would use Process::ReadPointerFromMemory() here.

jasonmolenda · 2024-08-21T23:55:06Z

lldb/source/Target/StackID.cpp

+    Status error;
+    // The continuation's context is the first field of an async context.
+    if (!process.ReadMemory(parent_ctx, &parent_ctx, ptr_size, error))
+      break;


Yeah I would emit to the unwind log channel. If a user reports a bug where the async backtrace is truncated, the thing we're most likely to ask for is the unwind log.

jasonmolenda · 2024-08-22T00:05:32Z

lldb/source/Target/StackID.cpp

+  // or indirectly) RHS. Chase continuation pointers to check this case, until
+  // we hit the end of the chain (parent_ctx == 0) or a safety limit in case of
+  // an invalid continuation chain.
+  auto max_num_frames = 1024;


From a perf point of view, I'm less concerned about the memory read because we should only hit this codepath when we have at least one StackID that's not in stack memory, right. We will have targets that we connect to (e.g. JTAG debuggers, maybe Linux) where we may not know if a memory region is stack or heap memory. If we started treating StackIDs in "unknown memory type" as "could be in heap" and start dereferencing them, I'd have concerns. This is REALLY only appropriate when we're working with swift stack frames but I don't think we've encoded that anywhere, so we have to consider unusual environments were this might get incorrectly triggered.

The optimization in swiftlang#7877 accidentally changed the default answer for this function. It's important to default to "On Stack" when we don't know, because that's where most CFAs live, and because we don't want to trigger the expensive heap comparisons unless we're absolutely sure the CFA is on the heap.

Currently, if we're comparing a StackID on the stack with one StackID on the heap, we perform a special kind of IsYounger comparison. However, if `IsYounger(A, b)` returns true, it must be the case that `IsYounger(B, A)` returns false. But this is not necessarily the case today because, before this patch, we would fallback to the traditional "stack" comparison, which is wrong if one of the StackIDs is on the heap.

For two StackIDs whose CFAs are on the heap, i.e., they are both async frames, we can decide which one is younger by finding one async context in the continuation chain of the other. A lot of stepping algorithms rely on frame comparisons. In particular, this patch fixes stepping out of async functions, and makes strides towards fixing other stepping algorithms. The explanation below is not needed to understand this patch, but it sheds some light into why the most common kind of step out fails. Consider the case where a breakpoint was set inside some async function, the program hit that breakpoint, and then a "step out" operation is performed. This is a very common use case. The thread plan looks like this: 0. Single step past breakpoint. 1. Step out LLDB performs a single step, and thread plan 0 says "continue running, I'm done". Before continuing program execution, LLDB asks all thread plans, in this case there's only the step out plan, if it should stop. To answer the "should stop" question, the step out plan compares the current frame (which is the same as when step out was pushed, since LLDB has only performed a single step so far) with its target frame (which is just one frame above). In this scenario, the current frame is async, and so the parent frame must also be async. The StepOut plan is going to compare two async StackIDs. Using the incorrect comparison, StepOut concludes the current frame is _older_ than the target frame. In other words, the program reached an older frame without ever stopping on the target frame. Something weird must have happened, and the plan concludes it is done.

felipepiovezan · 2024-08-23T21:42:32Z

@jasonmolenda I've added a commit fixing the default return value of "IsCFAOnHeap" to conservatively say no if we don't have information.

Addressed all review comments

jasonmolenda

Looks good, thanks for digging in to this corner.

felipepiovezan requested review from kastiglione and adrian-prantl August 21, 2024 16:22

felipepiovezan requested a review from JDevlieghere as a code owner August 21, 2024 16:22

felipepiovezan force-pushed the felipe/step_out_fixes branch from f109b2d to a1718ed Compare August 21, 2024 16:57

kastiglione requested a review from jasonmolenda August 21, 2024 17:48

kastiglione reviewed Aug 21, 2024

View reviewed changes

adrian-prantl approved these changes Aug 21, 2024

View reviewed changes

jasonmolenda reviewed Aug 22, 2024

View reviewed changes

felipepiovezan added 2 commits August 23, 2024 13:22

felipepiovezan force-pushed the felipe/step_out_fixes branch from a1718ed to c044ec1 Compare August 23, 2024 21:05

felipepiovezan force-pushed the felipe/step_out_fixes branch from c044ec1 to 8d7935d Compare August 23, 2024 21:40

jasonmolenda approved these changes Aug 23, 2024

View reviewed changes

felipepiovezan merged commit 14f6cd5 into swiftlang:stable/20240723 Sep 4, 2024

felipepiovezan deleted the felipe/step_out_fixes branch September 4, 2024 14:13

felipepiovezan mentioned this pull request Sep 4, 2024

[cherry-pick][lldb][swift] Fix StackID comparisons to enable stepping out of async functions #9211

Merged

[lldb][swift] Fix StackID comparisons to enable stepping out of async functions #9162

[lldb][swift] Fix StackID comparisons to enable stepping out of async functions #9162

Uh oh!

Conversation

felipepiovezan commented Aug 21, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

felipepiovezan commented Aug 21, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

felipepiovezan commented Aug 23, 2024

Uh oh!

jasonmolenda left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

felipepiovezan commented Aug 21, 2024 •

edited

Loading