[lldb] Correctly annotate threads at a bp site as hitting it #82709

jasonmolenda · 2024-02-22T23:50:37Z

This is next in my series of "fix the racey tests that fail on greendragon" addressing the failure of TestConcurrentManyBreakpoints.py where we set a breakpoint in a function that 100 threads execute, and we check that we hit the breakpoint 100 times. But sometimes it is only hit 99 times, and the test fails.

When we hit a software breakpoint, the pc value for the thread is the address of the breakpoint instruction - as if it had not been hit yet. And because a user might ADD a breakpoint for the current pc from the commandline, when we go to resume execution, any thread that is sitting at a breakpoint site will be silently advanced past the breakpoint instruction (disable bp, instruction step that thread, re-enable bp) before resuming -- whether that thread has hit its breakpoint or not.

What this test is exposing is that there is another corner case, a thread that is sitting at a breakpoint site but has not yet executed the breakpoint instruction. The thread will have no stop reason, no mach exception, so it will not be recorded as having hit the breakpoint (because it hasn't yet). But when we resume execution, because it is sitting at a breakpoint site, we advance past it and miss the breakpoint hit.

In 2016 Abhishek Aggarwal handled a similar issue with a patch in ProcessGDBRemote::SetThreadStopInfo(), adding a breakpoint StopInfo for a thread sitting at a breakpoint site that has no stop reason. debugserver's jThreadsInfo would not correctly execute Abhishek's code though because it would respond with "reason":"none" for a thread with no stop reason, and SetThreadStopInfo() expected an empty reason here. The first part of my patch is to clear the reason if it is "none" so we flow through the code correctly.

On Darwin, though, our stop reply packet (Txx...) includes the threads, thread-pcs, and jstopinfo keys, which give us the tids for all current threads, the pc values for those threads, and jstopinfo has a JSON dictionary with the mach exceptions for all threads that have a mach exception. In
ProcessGDBRemote::CalculateThreadStopInfo() we set the StopInfo for each thread for a private stop and if we have jstopinfo it is the source of all the StopInfos. I have to add the same logic here, to give the thread a breakpoint StopInfo even though it hasn't executed the breakpoint yet. In this case we are very early in thread construction and I only have the information in the Txx stop reply packet -- tids, pcs, and jstopinfo, so I can't use the normal general mechanisms of going through the RegisterContext to get the pc, it's a bit different.

If I hack debugserver to not issue jstopinfo, CalculateThreadStopInfo will fall back to sending qThreadStopInfo for each thread and going through ProcessGDBRemote::SetThreadStopInfo() to set the stop infos (and with the reason:none fix, use Abhishek's code).

rdar://110549165

This is next in my series of "fix the racey tests that fail on greendragon" addressing the failure of TestConcurrentManyBreakpoints.py where we set a breakpoint in a function that 100 threads execute, and we check that we hit the breakpoint 100 times. But sometimes it is only hit 99 times, and the test fails. When we hit a software breakpoint, the pc value for the thread is the address of the breakpoint instruction - as if it had not been hit yet. And because a user might ADD a breakpoint for the current pc from the commandline, when we go to resume execution, any thread that is sitting at a breakpoint site will be silently advanced past the breakpoint instruction (disable bp, instruction step that thread, re-enable bp) before resuming. What this test is exposing is that there is another corner case, a thread that is sitting at a breakpoint site but has not yet executed the breakpoint instruction. The thread will have no stop reason, no mach exception, so it will not be recorded as having hit the breakpoint (because it hasn't yet). But when we resume execution, because it is sitting at a breakpoint site, we advance past it and miss the breakpoint hit. In 2016 Abhishek Aggarwal handled a similar issue with a patch in `ProcessGDBRemote::SetThreadStopInfo()`, adding a breakpoint StopInfo for a thread sitting at a breakpoint site that has no stop reason. debugserver's `jThreadsInfo` would not correctly execute Abhishek's code though because it would respond with `"reason":"none"` for a thread with no stop reason, and `SetThreadStopInfo()` expected an empty reason here. The first part of my patch is to clear the `reason` if it is `"none"` so we flow through the code correctly. On Darwin, though, our stop reply packet (Txx...) includes the `threads`, `thread-pcs`, and `jstopinfo` keys, which give us the tids for all current threads, the pc values for those threads, and `jstopinfo` has a JSON dictionary with the mach exceptions for all threads that have a mach exception. In `ProcessGDBRemote::CalculateThreadStopInfo()` we set the StopInfo for each thread for a private stop and if we have `jstopinfo` it is the source of all the StopInfos. I have to add the same logic here, to give the thread a breakpoint StopInfo even though it hasn't executed the breakpoint yet. In this case we are very early in thread construction and I only have the information in the Txx stop reply packet -- tids, pcs, and jstopinfo, so I can't use the normal general mechanisms of going through the RegisterContext to get the pc, it's a bit different. If I hack debugserver to not issue `jstopinfo`, `CalculateThreadStopInfo` will fall back to sending `qThreadStopInfo` for each thread and going through `ProcessGDBRemote::SetThreadStopInfo()` to set the stop infos (and with the `reason:none` fix, use Abhishek's code). rdar://110549165

llvmbot · 2024-02-22T23:51:06Z

@llvm/pr-subscribers-lldb

Author: Jason Molenda (jasonmolenda)

Changes

This is next in my series of "fix the racey tests that fail on greendragon" addressing the failure of TestConcurrentManyBreakpoints.py where we set a breakpoint in a function that 100 threads execute, and we check that we hit the breakpoint 100 times. But sometimes it is only hit 99 times, and the test fails.

When we hit a software breakpoint, the pc value for the thread is the address of the breakpoint instruction - as if it had not been hit yet. And because a user might ADD a breakpoint for the current pc from the commandline, when we go to resume execution, any thread that is sitting at a breakpoint site will be silently advanced past the breakpoint instruction (disable bp, instruction step that thread, re-enable bp) before resuming.

What this test is exposing is that there is another corner case, a thread that is sitting at a breakpoint site but has not yet executed the breakpoint instruction. The thread will have no stop reason, no mach exception, so it will not be recorded as having hit the breakpoint (because it hasn't yet). But when we resume execution, because it is sitting at a breakpoint site, we advance past it and miss the breakpoint hit.

In 2016 Abhishek Aggarwal handled a similar issue with a patch in ProcessGDBRemote::SetThreadStopInfo(), adding a breakpoint StopInfo for a thread sitting at a breakpoint site that has no stop reason. debugserver's jThreadsInfo would not correctly execute Abhishek's code though because it would respond with "reason":"none" for a thread with no stop reason, and SetThreadStopInfo() expected an empty reason here. The first part of my patch is to clear the reason if it is "none" so we flow through the code correctly.

On Darwin, though, our stop reply packet (Txx...) includes the threads, thread-pcs, and jstopinfo keys, which give us the tids for all current threads, the pc values for those threads, and jstopinfo has a JSON dictionary with the mach exceptions for all threads that have a mach exception. In
ProcessGDBRemote::CalculateThreadStopInfo() we set the StopInfo for each thread for a private stop and if we have jstopinfo it is the source of all the StopInfos. I have to add the same logic here, to give the thread a breakpoint StopInfo even though it hasn't executed the breakpoint yet. In this case we are very early in thread construction and I only have the information in the Txx stop reply packet -- tids, pcs, and jstopinfo, so I can't use the normal general mechanisms of going through the RegisterContext to get the pc, it's a bit different.

If I hack debugserver to not issue jstopinfo, CalculateThreadStopInfo will fall back to sending qThreadStopInfo for each thread and going through ProcessGDBRemote::SetThreadStopInfo() to set the stop infos (and with the reason:none fix, use Abhishek's code).

rdar://110549165

Full diff: https://github.com/llvm/llvm-project/pull/82709.diff

2 Files Affected:

(modified) lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp (+27-6)
(modified) lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h (+1-1)

diff --git a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
index 629b191f3117aa..eabbc8ad433212 100644
--- a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
+++ b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
@@ -1600,6 +1600,26 @@ bool ProcessGDBRemote::CalculateThreadStopInfo(ThreadGDBRemote *thread) {
     // has no stop reason.
     thread->GetRegisterContext()->InvalidateIfNeeded(true);
     if (!GetThreadStopInfoFromJSON(thread, m_jstopinfo_sp)) {
+      // If a thread is stopped at a breakpoint site, set that as the stop
+      // reason even if it hasn't executed the breakpoint instruction yet.
+      // We will silently step over the breakpoint when we resume execution
+      // and miss the fact that this thread hit the breakpoint.
+      const size_t num_thread_ids = m_thread_ids.size();
+      for (size_t i = 0; i < num_thread_ids; i++) {
+        if (m_thread_ids[i] == thread->GetID() && m_thread_pcs.size() > i) {
+          addr_t pc = m_thread_pcs[i];
+          lldb::BreakpointSiteSP bp_site_sp =
+              thread->GetProcess()->GetBreakpointSiteList().FindByAddress(pc);
+          if (bp_site_sp) {
+            if (bp_site_sp->ValidForThisThread(*thread)) {
+              thread->SetStopInfo(
+                  StopInfo::CreateStopReasonWithBreakpointSiteID(
+                      *thread, bp_site_sp->GetID()));
+              return true;
+            }
+          }
+        }
+      }
       thread->SetStopInfo(StopInfoSP());
     }
     return true;
@@ -1630,7 +1650,7 @@ void ProcessGDBRemote::ParseExpeditedRegisters(
 
 ThreadSP ProcessGDBRemote::SetThreadStopInfo(
     lldb::tid_t tid, ExpeditedRegisterMap &expedited_register_map,
-    uint8_t signo, const std::string &thread_name, const std::string &reason,
+    uint8_t signo, const std::string &thread_name, std::string reason,
     const std::string &description, uint32_t exc_type,
     const std::vector<addr_t> &exc_data, addr_t thread_dispatch_qaddr,
     bool queue_vars_valid, // Set to true if queue_name, queue_kind and
@@ -1722,6 +1742,8 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
     } else {
       bool handled = false;
       bool did_exec = false;
+      if (reason == "none")
+        reason.clear();
       if (!reason.empty()) {
         if (reason == "trace") {
           addr_t pc = thread_sp->GetRegisterContext()->GetPC();
@@ -1864,11 +1886,10 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
         lldb::BreakpointSiteSP bp_site_sp =
             thread_sp->GetProcess()->GetBreakpointSiteList().FindByAddress(pc);
 
-        // If the current pc is a breakpoint site then the StopInfo should be
-        // set to Breakpoint even though the remote stub did not set it as such.
-        // This can happen when the thread is involuntarily interrupted (e.g.
-        // due to stops on other threads) just as it is about to execute the
-        // breakpoint instruction.
+        // If a thread is stopped at a breakpoint site, set that as the stop
+        // reason even if it hasn't executed the breakpoint instruction yet.
+        // We will silently step over the breakpoint when we resume execution
+        // and miss the fact that this thread hit the breakpoint.
         if (bp_site_sp && bp_site_sp->ValidForThisThread(*thread_sp)) {
           thread_sp->SetStopInfo(StopInfo::CreateStopReasonWithBreakpointSiteID(
               *thread_sp, bp_site_sp->GetID()));
diff --git a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
index c1ea1cc7905587..f8b49c318180e3 100644
--- a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
+++ b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
@@ -366,7 +366,7 @@ class ProcessGDBRemote : public Process,
   lldb::ThreadSP
   SetThreadStopInfo(lldb::tid_t tid,
                     ExpeditedRegisterMap &expedited_register_map, uint8_t signo,
-                    const std::string &thread_name, const std::string &reason,
+                    const std::string &thread_name, std::string reason,
                     const std::string &description, uint32_t exc_type,
                     const std::vector<lldb::addr_t> &exc_data,
                     lldb::addr_t thread_dispatch_qaddr, bool queue_vars_valid,

JDevlieghere

Makes sense to me, but I'll defer to @jimingham as he's the expert.

jimingham

This seems safe to me, and it can't hurt to take care of this corner case here regardless of what the higher levels of lldb does.

It bugs me a little that we're treating what seems like a general problem down in the GDBRemote process plugin. But anyway, if we are going to make this a process plugin policy: that the plugins are responsible for producing breakpoint stop reasons both for when the underlying system generates one and when we see "artificial" hits like this we should probably say so somewhere.

With the couple doc comments aside, I'm fine with this.

jimingham · 2024-02-23T17:26:56Z

lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp

@@ -1722,6 +1742,8 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
    } else {
      bool handled = false;
      bool did_exec = false;
+      if (reason == "none")


This definitely needs a comment, or maybe somewhere else saying that none == empty, so we always convert none to empty.

I updated the patch where I handle this a little more cleanly, checking that reason has a value, and that the value is not "none", and added a short comment.

Treat `reason:none` as equivalent to no `reason` more cleanly in ProcessGDBRemote::SetThreadStopInfo. Add documentation to the Thread base class `CalculateStopInfo` method noting that the case of a thread at a breakpoint site which has not yet executed the breakpoint must be handled.

jasonmolenda · 2024-02-23T22:00:33Z

This seems safe to me, and it can't hurt to take care of this corner case here regardless of what the higher levels of lldb does.

This all falls under Thread::CalculateStopInfo() - ThreadGDBRemote::CalculateStopInfo is calling into ProcessGDBRemote where we do this. The base class method is pure virtual, so I'm not sure there's a higher level place where this logic could be situated. I updated the PR with a comment on Thread::CalculateStopInfo() specifically detailing this issue, so people writing new subclasses can (hopefully) be aware of it.

) This is next in my series of "fix the racey tests that fail on greendragon" addressing the failure of TestConcurrentManyBreakpoints.py where we set a breakpoint in a function that 100 threads execute, and we check that we hit the breakpoint 100 times. But sometimes it is only hit 99 times, and the test fails. When we hit a software breakpoint, the pc value for the thread is the address of the breakpoint instruction - as if it had not been hit yet. And because a user might ADD a breakpoint for the current pc from the commandline, when we go to resume execution, any thread that is sitting at a breakpoint site will be silently advanced past the breakpoint instruction (disable bp, instruction step that thread, re-enable bp) before resuming -- whether that thread has hit its breakpoint or not. What this test is exposing is that there is another corner case, a thread that is sitting at a breakpoint site but has not yet executed the breakpoint instruction. The thread will have no stop reason, no mach exception, so it will not be recorded as having hit the breakpoint (because it hasn't yet). But when we resume execution, because it is sitting at a breakpoint site, we advance past it and miss the breakpoint hit. In 2016 Abhishek Aggarwal handled a similar issue with a patch in `ProcessGDBRemote::SetThreadStopInfo()`, adding a breakpoint StopInfo for a thread sitting at a breakpoint site that has no stop reason. debugserver's `jThreadsInfo` would not correctly execute Abhishek's code though because it would respond with `"reason":"none"` for a thread with no stop reason, and `SetThreadStopInfo()` expected an empty reason here. The first part of my patch is to clear the `reason` if it is `"none"` so we flow through the code correctly. On Darwin, though, our stop reply packet (Txx...) includes the `threads`, `thread-pcs`, and `jstopinfo` keys, which give us the tids for all current threads, the pc values for those threads, and `jstopinfo` has a JSON dictionary with the mach exceptions for all threads that have a mach exception. In `ProcessGDBRemote::CalculateThreadStopInfo()` we set the StopInfo for each thread for a private stop and if we have `jstopinfo` it is the source of all the StopInfos. I have to add the same logic here, to give the thread a breakpoint StopInfo even though it hasn't executed the breakpoint yet. In this case we are very early in thread construction and I only have the information in the Txx stop reply packet -- tids, pcs, and jstopinfo, so I can't use the normal general mechanisms of going through the RegisterContext to get the pc, it's a bit different. If I hack debugserver to not issue `jstopinfo`, `CalculateThreadStopInfo` will fall back to sending `qThreadStopInfo` for each thread and going through `ProcessGDBRemote::SetThreadStopInfo()` to set the stop infos (and with the `reason:none` fix, use Abhishek's code). rdar://110549165 (cherry picked from commit 87fadb3)

jasonmolenda requested a review from jimingham February 22, 2024 23:50

jasonmolenda requested a review from JDevlieghere as a code owner February 22, 2024 23:50

llvmbot added the lldb label Feb 22, 2024

JDevlieghere reviewed Feb 23, 2024

View reviewed changes

jimingham approved these changes Feb 23, 2024

View reviewed changes

jasonmolenda merged commit 87fadb3 into llvm:main Feb 23, 2024

jasonmolenda deleted the add-breakpoint-stopinfo-to-thread-at-breakpoint-site branch February 23, 2024 22:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[lldb] Correctly annotate threads at a bp site as hitting it #82709

[lldb] Correctly annotate threads at a bp site as hitting it #82709

Uh oh!

jasonmolenda commented Feb 22, 2024 •

edited

Loading

Uh oh!

llvmbot commented Feb 22, 2024

Uh oh!

JDevlieghere left a comment

Uh oh!

jimingham left a comment

Uh oh!

jimingham Feb 23, 2024

Uh oh!

jasonmolenda Feb 23, 2024

Uh oh!

jasonmolenda commented Feb 23, 2024

Uh oh!

Uh oh!

[lldb] Correctly annotate threads at a bp site as hitting it #82709

[lldb] Correctly annotate threads at a bp site as hitting it #82709

Uh oh!

Conversation

jasonmolenda commented Feb 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

llvmbot commented Feb 22, 2024

Uh oh!

JDevlieghere left a comment

Choose a reason for hiding this comment

Uh oh!

jimingham left a comment

Choose a reason for hiding this comment

Uh oh!

jimingham Feb 23, 2024

Choose a reason for hiding this comment

Uh oh!

jasonmolenda Feb 23, 2024

Choose a reason for hiding this comment

Uh oh!

jasonmolenda commented Feb 23, 2024

Uh oh!

Uh oh!

jasonmolenda commented Feb 22, 2024 •

edited

Loading