Disable ThreadPlanSingleThreadTimeout during step over breakpoint #104532

jeffreytan81 · 2024-08-16T01:22:00Z

This PR fixes another race condition in #90930. The failure was found by @labath with this log: https://paste.debian.net/hidden/30235a5c/:

dotest_wrapper.  <  15> send packet: $z0,224505,1#65
...
b-remote.async>  <  22> send packet: $vCont;s:p1dcf.1dcf#4c
intern-state     GDBRemoteClientBase::Lock::Lock sent packet: \x03
b-remote.async>  < 818> read packet: $T13thread:p1dcf.1dcf;name:a.out;threads:1dcf,1dd2;jstopinfo:5b7b226e616d65223a22612e6f7574222c22726561736f6e223a227369676e616c222c227369676e616c223a31392c22746964223a373633317d2c7b226e616d65223a22612e6f7574222c22746964223a373633347d5d;thread-pcs:0000000000224505,00007f4e4302119a;00:0000000000000000;01:0000000000000000;02:0100000000000000;03:0000000000000000;04:9084997dfc7f0000;05:a8742a0000000000;06:b084997dfc7f0000;07:6084997dfc7f0000;08:0000000000000000;09:00d7e5424e7f0000;0a:d0d9e5424e7f0000;0b:0202000000000000;0c:80cc290000000000;0d:d8cc1c434e7f0000;0e:2886997dfc7f0000;0f:0100000000000000;10:0545220000000000;11:0602000000000000;12:3300000000000000;13:0000000000000000;14:0000000000000000;15:2b00000000000000;16:80fbe5424e7f0000;17:0000000000000000;18:0000000000000000;19:0000000000000000;reason:signal;#b9

It shows that an async interrupt ("\x03") was sent immediately after the vCont;s packet that single-steps over the breakpoint at address 0x224505 (which had been disabled before the vCont). The subsequent stop was still at the original PC (0x224505); the thread had not moved forward.
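
For readers who want to check the quoted packets by hand, here is a minimal sketch rather than LLDB code. It relies on two protocol facts: the "#xx" trailer of a gdb-remote packet is the modulo-256 sum of the payload bytes, and the jstopinfo field is hex-encoded JSON. The helper names PacketChecksum, HexDigit and DecodeHexAscii are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Modulo-256 sum of the payload bytes, i.e. the value written after '#'.
std::uint8_t PacketChecksum(const std::string &payload) {
  unsigned sum = 0;
  for (unsigned char c : payload)
    sum += c;
  return static_cast<std::uint8_t>(sum & 0xff);
}

// Value of a single hex digit. The input is assumed to be valid hex.
int HexDigit(char c) {
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return c - 'a' + 10;
  assert(c >= 'A' && c <= 'F');
  return c - 'A' + 10;
}

// Turn a hex-encoded ASCII string (two hex digits per character) into text.
std::string DecodeHexAscii(const std::string &hex) {
  assert(hex.size() % 2 == 0);
  std::string out;
  for (std::size_t i = 0; i < hex.size(); i += 2)
    out.push_back(
        static_cast<char>(HexDigit(hex[i]) * 16 + HexDigit(hex[i + 1])));
  return out;
}
```

Applied to the log, PacketChecksum("z0,224505,1") gives 0x65 and PacketChecksum("vCont;s:p1dcf.1dcf") gives 0x4c, matching the trailers shown. Decoding the first array element of jstopinfo yields {"name":"a.out","reason":"signal","signal":19,"tid":7631}: tid 7631 is 0x1dcf, the thread being stepped, and signal 19 agrees with the T13 stop reply, which is what an interrupt-induced stop (SIGSTOP on x86-64 Linux) would look like.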

The investigation shows that the failure happens when the timeout is short and the async interrupt is sent to lldb-server immediately after vCont: ptrace() resumes the debuggee and the async interrupt then stops it right away, so the debuggee gets no chance to execute and advance the PC. It therefore stops immediately at the original PC. ThreadPlanStepOverBreakpoint does not expect the PC to be unchanged and reports a stop at the original location.
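
The "original PC" can be read straight out of the stop reply in the log. The sketch below is an illustration, not LLDB code. It assumes that register 0x10 in that packet is the program counter, which is consistent with the first thread-pcs entry, and that register values are sent as hex bytes in little-endian order. DecodeLittleEndianHex and StepMadeNoProgress are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Parse a register value sent as hex bytes in little-endian byte order.
std::uint64_t DecodeLittleEndianHex(const std::string &hex) {
  assert(hex.size() % 2 == 0 && hex.size() <= 16);
  std::uint64_t value = 0;
  // Walk the byte pairs from last to first: the last pair is the most
  // significant byte, so it must be shifted in first.
  for (std::size_t i = hex.size(); i > 0; i -= 2) {
    const std::string byte = hex.substr(i - 2, 2);
    value = (value << 8) | std::stoul(byte, nullptr, 16);
  }
  return value;
}

// True when a single step left the thread at the breakpoint address it was
// supposed to step past.
bool StepMadeNoProgress(std::uint64_t breakpoint_addr,
                        std::uint64_t pc_after_step) {
  return pc_after_step == breakpoint_addr;
}
```

With the value from the log, DecodeLittleEndianHex("0545220000000000") is 0x224505, the same address as in the earlier z0,224505,1 packet, so StepMadeNoProgress(0x224505, ...) is true: the step was interrupted before the instruction executed.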

To fix this, the PR prevents ThreadPlanSingleThreadTimeout from being created during ThreadPlanStepOverBreakpoint by introducing a new SupportsResumeOthers() method, for which ThreadPlanStepOverBreakpoint returns false. This makes sense because we should never resume other threads while stepping over a breakpoint anyway; otherwise they might miss the breakpoint.
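
The gating can be modeled in isolation as follows. This is a simplified sketch, not the real lldb_private classes: Plan, StepOverRangePlan, StepOverBreakpointPlan and WouldPushTimeoutPlan are hypothetical stand-ins, and only the two virtual methods relevant to this PR are modeled.

```cpp
#include <cassert>

// Hypothetical stand-in for a thread plan; only the two queries that the
// timeout plan consults are modeled here.
class Plan {
public:
  virtual ~Plan() = default;
  virtual bool StopOthers() { return false; }
  // Defaults to true, like the default added to ThreadPlan in this PR.
  virtual bool SupportsResumeOthers() { return true; }
};

// Hypothetical ordinary single-thread step: runs one thread and tolerates
// the other threads being resumed after a timeout.
class StepOverRangePlan : public Plan {
public:
  bool StopOthers() override { return true; }
};

// Stand-in for the step-over-breakpoint plan: runs one thread and must not
// have the others resumed while the breakpoint is disabled.
class StepOverBreakpointPlan : public Plan {
public:
  bool StopOthers() override { return true; }
  bool SupportsResumeOthers() override { return false; }
};

// Mirrors the two early returns in the patched functions: a timeout plan is
// only considered when the current plan stops other threads and allows them
// to be resumed later.
bool WouldPushTimeoutPlan(Plan &current) {
  if (!current.StopOthers())
    return false;
  if (!current.SupportsResumeOthers())
    return false;
  return true;
}
```

In this model WouldPushTimeoutPlan returns true for StepOverRangePlan and false for StepOverBreakpointPlan, so no timeout, and therefore no async interrupt, is armed while the breakpoint is being stepped over.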

llvmbot · 2024-08-16T01:22:29Z

@llvm/pr-subscribers-lldb

Author: None (jeffreytan81)

Changes

This PR fixes another potential race condition in #90930. The failure was found by @labath with this log: https://paste.debian.net/hidden/30235a5c/.

The investigation shows that the failure happens when the timeout is short and the async interrupt is sent to lldb-server immediately after vCont: ptrace() resumes the debuggee and the async interrupt then stops it right away, so the debuggee gets no chance to execute and advance the PC. It therefore stops immediately at the original PC. ThreadPlanStepOverBreakpoint does not expect the PC to be unchanged and reports a stop at the original location.

To fix this, the PR prevents ThreadPlanSingleThreadTimeout from being created during ThreadPlanStepOverBreakpoint by introducing a new SupportsResumeOthers() method, for which ThreadPlanStepOverBreakpoint returns false. This makes sense because we should never resume other threads while stepping over a breakpoint anyway; otherwise they might miss the breakpoint.

Full diff: https://github.com/llvm/llvm-project/pull/104532.diff

4 Files Affected:

(modified) lldb/include/lldb/Target/ThreadPlan.h (+7-1)
(modified) lldb/include/lldb/Target/ThreadPlanStepOverBreakpoint.h (+1)
(modified) lldb/source/Target/ThreadPlanSingleThreadTimeout.cpp (+6)
(modified) lldb/source/Target/ThreadPlanStepOverBreakpoint.cpp (+6)

diff --git a/lldb/include/lldb/Target/ThreadPlan.h b/lldb/include/lldb/Target/ThreadPlan.h
index c336b6bb37df1b..d6da484f4fc137 100644
--- a/lldb/include/lldb/Target/ThreadPlan.h
+++ b/lldb/include/lldb/Target/ThreadPlan.h
@@ -385,7 +385,13 @@ class ThreadPlan : public std::enable_shared_from_this<ThreadPlan>,
   virtual void SetStopOthers(bool new_value);
 
   virtual bool StopOthers();
-  
+
+  // Returns true if the thread plan supports ThreadPlanSingleThreadTimeout to
+  // resume other threads after timeout. If the thread plan returns false it
+  // will prevent ThreadPlanSingleThreadTimeout from being created when this
+  // thread plan is alive.
+  virtual bool SupportsResumeOthers() { return true; }
+
   virtual bool ShouldRunBeforePublicStop() { return false; }
 
   // This is the wrapper for DoWillResume that does generic ThreadPlan logic,
diff --git a/lldb/include/lldb/Target/ThreadPlanStepOverBreakpoint.h b/lldb/include/lldb/Target/ThreadPlanStepOverBreakpoint.h
index 1f3aff45c49abe..0da8dbf44ffd8a 100644
--- a/lldb/include/lldb/Target/ThreadPlanStepOverBreakpoint.h
+++ b/lldb/include/lldb/Target/ThreadPlanStepOverBreakpoint.h
@@ -23,6 +23,7 @@ class ThreadPlanStepOverBreakpoint : public ThreadPlan {
   void GetDescription(Stream *s, lldb::DescriptionLevel level) override;
   bool ValidatePlan(Stream *error) override;
   bool ShouldStop(Event *event_ptr) override;
+  bool SupportsResumeOthers() override;
   bool StopOthers() override;
   lldb::StateType GetPlanRunState() override;
   bool WillStop() override;
diff --git a/lldb/source/Target/ThreadPlanSingleThreadTimeout.cpp b/lldb/source/Target/ThreadPlanSingleThreadTimeout.cpp
index 806ba95c508b7c..71be81365a2668 100644
--- a/lldb/source/Target/ThreadPlanSingleThreadTimeout.cpp
+++ b/lldb/source/Target/ThreadPlanSingleThreadTimeout.cpp
@@ -76,6 +76,9 @@ void ThreadPlanSingleThreadTimeout::PushNewWithTimeout(Thread &thread,
   if (!thread.GetCurrentPlan()->StopOthers())
     return;
 
+  if (!thread.GetCurrentPlan()->SupportsResumeOthers())
+    return;
+
   auto timeout_plan = new ThreadPlanSingleThreadTimeout(thread, info);
   ThreadPlanSP thread_plan_sp(timeout_plan);
   auto status = thread.QueueThreadPlan(thread_plan_sp,
@@ -102,6 +105,9 @@ void ThreadPlanSingleThreadTimeout::ResumeFromPrevState(Thread &thread,
   if (!thread.GetCurrentPlan()->StopOthers())
     return;
 
+  if (!thread.GetCurrentPlan()->SupportsResumeOthers())
+    return;
+
   auto timeout_plan = new ThreadPlanSingleThreadTimeout(thread, info);
   ThreadPlanSP thread_plan_sp(timeout_plan);
   auto status = thread.QueueThreadPlan(thread_plan_sp,
diff --git a/lldb/source/Target/ThreadPlanStepOverBreakpoint.cpp b/lldb/source/Target/ThreadPlanStepOverBreakpoint.cpp
index f88a2b895931cd..97c27ad4cd0493 100644
--- a/lldb/source/Target/ThreadPlanStepOverBreakpoint.cpp
+++ b/lldb/source/Target/ThreadPlanStepOverBreakpoint.cpp
@@ -103,6 +103,12 @@ bool ThreadPlanStepOverBreakpoint::ShouldStop(Event *event_ptr) {
 
 bool ThreadPlanStepOverBreakpoint::StopOthers() { return true; }
 
+// The ThreadPlanSingleThreadTimeout can interrupt and resume all threads during
+// stepping, which may cause them to miss breakpoint. Therefore, we should
+// prevent the creation of ThreadPlanSingleThreadTimeout during a step-over
+// breakpoint.
+bool ThreadPlanStepOverBreakpoint::SupportsResumeOthers() { return false; }
+
 StateType ThreadPlanStepOverBreakpoint::GetPlanRunState() {
   return eStateStepping;
 }

clayborg

Seems pretty clean, but we need to get Jim's opinion in case there is a better way to do this that he would prefer.

clayborg · 2024-08-16T07:04:03Z

lldb/source/Target/ThreadPlanStepOverBreakpoint.cpp

+// The ThreadPlanSingleThreadTimeout can interrupt and resume all threads during
+// stepping, which may cause them to miss breakpoint. Therefore, we should
+// prevent the creation of ThreadPlanSingleThreadTimeout during a step-over
+// breakpoint.


It might be a bit more clear to say something like:

// This thread plan does a single instruction step over a breakpoint instruction
// and needs to not resume other threads, so return false to stop the
// ThreadPlanSingleThreadTimeout from timing out and trying to resume all
// threads. If all threads get resumed before we disable, single step and
// re-enable the breakpoint, we can miss breakpoints on other threads.

labath · 2024-08-16T07:43:24Z

Thanks for looking into this. I'll also defer to Jim, but I'll note two things:

1. Stepping over a breakpoint can still block -- if the breakpoint is on a syscall instruction, and the syscall blocks. However, the problem with missed breakpoints is also real, and probably more important than blocked steps. As I don't see a way to resume other threads while not risking them missing breakpoints, this may still be the right thing to do.
2. I was amused by the hedging in "potential race condition". The race is real, not potential. The fact that you win the race most of the time doesn't mean it doesn't exist. :P

DavidSpickett · 2024-08-16T08:59:27Z

with this log

If there are a few (< 10 maybe) specific lines from the log that are the key ones, please include those in the PR description. Just because there's a slim chance someone in the future might have their own strange timeouts and it would be a useful clue.

If the specific packets are not useful to see then no worries.

jeffreytan81 · 2024-08-16T21:54:53Z

@labath

stepping over a breakpoint can still block

Yes, I discussed this concern with @clayborg before writing it. However, this is not a new issue—the default 'step over' (next command) uses ThreadPlanStepOverBreakpoint to trace a single instruction across a breakpoint without resuming other threads, so it could, in theory, encounter the syscall deadlock as well.

@jimingham, let me know if this PR makes sense to you.

jimingham · 2024-08-27T01:32:08Z

This seems an okay solution for now. We really don't want to miss breakpoint hits if we can at all avoid it.

This isn't a regression, it's just one case where the proposed enhancement to stepping doesn't enhance stepping.

We haven't had many (any?) reports of single stepping over a breakpoint blocking because we're running only one thread. Mostly that's because the actual instructions that might block are in system libraries that people tend to next over.

Note, there's no inherent problem with the instruction we're stepping over not returning, it's only an issue if WE cause it to not return. By "block" we really mean "artificially block because the thread that should have caused the return was suspended by lldb". So this would have to be something like a read that was expecting another thread in the program to be the writer, which also limits the scope of the problem.

In any case, if we are going to implement true "non-stop" debugging we will need to avoid the "remove and replace instructions" dance and instead leave the traps always in place. Since we are contractually obligated not to stop the other threads from the outside, that's the only way we can avoid missing breakpoint hits. We'll have to come up with some scheme to execute the instruction that was under the breakpoint out of place, so that we never have to remove the traps. At that point, ThreadPlanStepOverBreakpoint will be fine with running the other threads, and this problem will go away.

jeffreytan81 · 2024-08-28T18:34:40Z

@jimingham, right, I am pretty sure that the Microsoft Visual Studio debugger implements step-over-breakpoint the same way as lldb, so it is standard behavior, and I haven't heard anyone really complain about it.

Based on this, if there are no further concerns, could someone approve the PR so that we can land the fix?

jimingham

LGTM

Fix async interrupt for step over breakpoint

e864542

jeffreytan81 requested review from labath, clayborg and jimingham August 16, 2024 01:22

jeffreytan81 requested a review from JDevlieghere as a code owner August 16, 2024 01:22

llvmbot added the lldb label Aug 16, 2024

jeffreytan81 mentioned this pull request Aug 16, 2024

Fix single thread stepping timeout race condition #104195

Merged

clayborg reviewed Aug 16, 2024

Update comment

029d203

jimingham approved these changes Aug 28, 2024

jeffreytan81 merged commit 38b252a into llvm:main Aug 28, 2024
7 checks passed
