Skip to content

[lldb] Correctly annotate threads at a bp site as hitting it #82709

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

jasonmolenda
Copy link
Collaborator

@jasonmolenda jasonmolenda commented Feb 22, 2024

This is next in my series of "fix the racey tests that fail on greendragon" addressing the failure of TestConcurrentManyBreakpoints.py where we set a breakpoint in a function that 100 threads execute, and we check that we hit the breakpoint 100 times. But sometimes it is only hit 99 times, and the test fails.

When we hit a software breakpoint, the pc value for the thread is the address of the breakpoint instruction - as if it had not been hit yet. And because a user might ADD a breakpoint for the current pc from the commandline, when we go to resume execution, any thread that is sitting at a breakpoint site will be silently advanced past the breakpoint instruction (disable bp, instruction step that thread, re-enable bp) before resuming -- whether that thread has hit its breakpoint or not.

What this test is exposing is that there is another corner case, a thread that is sitting at a breakpoint site but has not yet executed the breakpoint instruction. The thread will have no stop reason, no mach exception, so it will not be recorded as having hit the breakpoint (because it hasn't yet). But when we resume execution, because it is sitting at a breakpoint site, we advance past it and miss the breakpoint hit.

In 2016 Abhishek Aggarwal handled a similar issue with a patch in ProcessGDBRemote::SetThreadStopInfo(), adding a breakpoint StopInfo for a thread sitting at a breakpoint site that has no stop reason. debugserver's jThreadsInfo would not correctly execute Abhishek's code though because it would respond with "reason":"none" for a thread with no stop reason, and SetThreadStopInfo() expected an empty reason here. The first part of my patch is to clear the reason if it is "none" so we flow through the code correctly.

On Darwin, though, our stop reply packet (Txx...) includes the threads, thread-pcs, and jstopinfo keys, which give us the tids for all current threads, the pc values for those threads, and jstopinfo has a JSON dictionary with the mach exceptions for all threads that have a mach exception. In
ProcessGDBRemote::CalculateThreadStopInfo() we set the StopInfo for each thread for a private stop and if we have jstopinfo it is the source of all the StopInfos. I have to add the same logic here, to give the thread a breakpoint StopInfo even though it hasn't executed the breakpoint yet. In this case we are very early in thread construction and I only have the information in the Txx stop reply packet -- tids, pcs, and jstopinfo, so I can't use the normal general mechanisms of going through the RegisterContext to get the pc, it's a bit different.

If I hack debugserver to not issue jstopinfo, CalculateThreadStopInfo will fall back to sending qThreadStopInfo for each thread and going through ProcessGDBRemote::SetThreadStopInfo() to set the stop infos (and with the reason:none fix, use Abhishek's code).

rdar://110549165

This is next in my series of "fix the racey tests that fail on
greendragon" addressing the failure of TestConcurrentManyBreakpoints.py
where we set a breakpoint in a function that 100 threads execute,
and we check that we hit the breakpoint 100 times.  But sometimes
it is only hit 99 times, and the test fails.

When we hit a software breakpoint, the pc value for the thread is
the address of the breakpoint instruction - as if it had not been
hit yet.  And because a user might ADD a breakpoint for the current
pc from the commandline, when we go to resume execution, any thread
that is sitting at a breakpoint site will be silently advanced past
the breakpoint instruction (disable bp, instruction step that thread,
re-enable bp) before resuming.

What this test is exposing is that there is another corner case, a
thread that is sitting at a breakpoint site but has not yet executed
the breakpoint instruction.  The thread will have no stop reason,
no mach exception, so it will not be recorded as having hit the
breakpoint (because it hasn't yet).  But when we resume execution,
because it is sitting at a breakpoint site, we advance past it and
miss the breakpoint hit.

In 2016 Abhishek Aggarwal handled a similar issue with a patch in
`ProcessGDBRemote::SetThreadStopInfo()`, adding a breakpoint StopInfo
for a thread sitting at a breakpoint site that has no stop reason.
debugserver's `jThreadsInfo` would not correctly execute Abhishek's
code though because it would respond with `"reason":"none"` for a
thread with no stop reason, and `SetThreadStopInfo()` expected an
empty reason here.  The first part of my patch is to clear the
`reason` if it is `"none"` so we flow through the code correctly.

On Darwin, though, our stop reply packet (Txx...) includes the
`threads`, `thread-pcs`, and `jstopinfo` keys, which give us the
tids for all current threads, the pc values for those threads, and
`jstopinfo` has a JSON dictionary with the mach exceptions for all
threads that have a mach exception.  In
`ProcessGDBRemote::CalculateThreadStopInfo()` we set the StopInfo
for each thread for a private stop and if we have `jstopinfo` it
is the source of all the StopInfos.  I have to add the same logic
here, to give the thread a breakpoint StopInfo even though it hasn't
executed the breakpoint yet.  In this case we are very early in
thread construction and I only have the information in the Txx stop
reply packet -- tids, pcs, and jstopinfo, so I can't use the normal
general mechanisms of going through the RegisterContext to get the
pc, it's a bit different.

If I hack debugserver to not issue `jstopinfo`, `CalculateThreadStopInfo`
will fall back to sending `qThreadStopInfo` for each thread and
going through `ProcessGDBRemote::SetThreadStopInfo()` to set the
stop infos (and with the `reason:none` fix, use Abhishek's code).

rdar://110549165
@llvmbot
Copy link
Member

llvmbot commented Feb 22, 2024

@llvm/pr-subscribers-lldb

Author: Jason Molenda (jasonmolenda)

Changes

This is next in my series of "fix the racey tests that fail on greendragon" addressing the failure of TestConcurrentManyBreakpoints.py where we set a breakpoint in a function that 100 threads execute, and we check that we hit the breakpoint 100 times. But sometimes it is only hit 99 times, and the test fails.

When we hit a software breakpoint, the pc value for the thread is the address of the breakpoint instruction - as if it had not been hit yet. And because a user might ADD a breakpoint for the current pc from the commandline, when we go to resume execution, any thread that is sitting at a breakpoint site will be silently advanced past the breakpoint instruction (disable bp, instruction step that thread, re-enable bp) before resuming.

What this test is exposing is that there is another corner case, a thread that is sitting at a breakpoint site but has not yet executed the breakpoint instruction. The thread will have no stop reason, no mach exception, so it will not be recorded as having hit the breakpoint (because it hasn't yet). But when we resume execution, because it is sitting at a breakpoint site, we advance past it and miss the breakpoint hit.

In 2016 Abhishek Aggarwal handled a similar issue with a patch in ProcessGDBRemote::SetThreadStopInfo(), adding a breakpoint StopInfo for a thread sitting at a breakpoint site that has no stop reason. debugserver's jThreadsInfo would not correctly execute Abhishek's code though because it would respond with "reason":"none" for a thread with no stop reason, and SetThreadStopInfo() expected an empty reason here. The first part of my patch is to clear the reason if it is "none" so we flow through the code correctly.

On Darwin, though, our stop reply packet (Txx...) includes the threads, thread-pcs, and jstopinfo keys, which give us the tids for all current threads, the pc values for those threads, and jstopinfo has a JSON dictionary with the mach exceptions for all threads that have a mach exception. In
ProcessGDBRemote::CalculateThreadStopInfo() we set the StopInfo for each thread for a private stop and if we have jstopinfo it is the source of all the StopInfos. I have to add the same logic here, to give the thread a breakpoint StopInfo even though it hasn't executed the breakpoint yet. In this case we are very early in thread construction and I only have the information in the Txx stop reply packet -- tids, pcs, and jstopinfo, so I can't use the normal general mechanisms of going through the RegisterContext to get the pc, it's a bit different.

If I hack debugserver to not issue jstopinfo, CalculateThreadStopInfo will fall back to sending qThreadStopInfo for each thread and going through ProcessGDBRemote::SetThreadStopInfo() to set the stop infos (and with the reason:none fix, use Abhishek's code).

rdar://110549165


Full diff: https://github.com/llvm/llvm-project/pull/82709.diff

2 Files Affected:

  • (modified) lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp (+27-6)
  • (modified) lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h (+1-1)
diff --git a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
index 629b191f3117aa..eabbc8ad433212 100644
--- a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
+++ b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
@@ -1600,6 +1600,26 @@ bool ProcessGDBRemote::CalculateThreadStopInfo(ThreadGDBRemote *thread) {
     // has no stop reason.
     thread->GetRegisterContext()->InvalidateIfNeeded(true);
     if (!GetThreadStopInfoFromJSON(thread, m_jstopinfo_sp)) {
+      // If a thread is stopped at a breakpoint site, set that as the stop
+      // reason even if it hasn't executed the breakpoint instruction yet.
+      // We will silently step over the breakpoint when we resume execution
+      // and miss the fact that this thread hit the breakpoint.
+      const size_t num_thread_ids = m_thread_ids.size();
+      for (size_t i = 0; i < num_thread_ids; i++) {
+        if (m_thread_ids[i] == thread->GetID() && m_thread_pcs.size() > i) {
+          addr_t pc = m_thread_pcs[i];
+          lldb::BreakpointSiteSP bp_site_sp =
+              thread->GetProcess()->GetBreakpointSiteList().FindByAddress(pc);
+          if (bp_site_sp) {
+            if (bp_site_sp->ValidForThisThread(*thread)) {
+              thread->SetStopInfo(
+                  StopInfo::CreateStopReasonWithBreakpointSiteID(
+                      *thread, bp_site_sp->GetID()));
+              return true;
+            }
+          }
+        }
+      }
       thread->SetStopInfo(StopInfoSP());
     }
     return true;
@@ -1630,7 +1650,7 @@ void ProcessGDBRemote::ParseExpeditedRegisters(
 
 ThreadSP ProcessGDBRemote::SetThreadStopInfo(
     lldb::tid_t tid, ExpeditedRegisterMap &expedited_register_map,
-    uint8_t signo, const std::string &thread_name, const std::string &reason,
+    uint8_t signo, const std::string &thread_name, std::string reason,
     const std::string &description, uint32_t exc_type,
     const std::vector<addr_t> &exc_data, addr_t thread_dispatch_qaddr,
     bool queue_vars_valid, // Set to true if queue_name, queue_kind and
@@ -1722,6 +1742,8 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
     } else {
       bool handled = false;
       bool did_exec = false;
+      if (reason == "none")
+        reason.clear();
       if (!reason.empty()) {
         if (reason == "trace") {
           addr_t pc = thread_sp->GetRegisterContext()->GetPC();
@@ -1864,11 +1886,10 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
         lldb::BreakpointSiteSP bp_site_sp =
             thread_sp->GetProcess()->GetBreakpointSiteList().FindByAddress(pc);
 
-        // If the current pc is a breakpoint site then the StopInfo should be
-        // set to Breakpoint even though the remote stub did not set it as such.
-        // This can happen when the thread is involuntarily interrupted (e.g.
-        // due to stops on other threads) just as it is about to execute the
-        // breakpoint instruction.
+        // If a thread is stopped at a breakpoint site, set that as the stop
+        // reason even if it hasn't executed the breakpoint instruction yet.
+        // We will silently step over the breakpoint when we resume execution
+        // and miss the fact that this thread hit the breakpoint.
         if (bp_site_sp && bp_site_sp->ValidForThisThread(*thread_sp)) {
           thread_sp->SetStopInfo(StopInfo::CreateStopReasonWithBreakpointSiteID(
               *thread_sp, bp_site_sp->GetID()));
diff --git a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
index c1ea1cc7905587..f8b49c318180e3 100644
--- a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
+++ b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
@@ -366,7 +366,7 @@ class ProcessGDBRemote : public Process,
   lldb::ThreadSP
   SetThreadStopInfo(lldb::tid_t tid,
                     ExpeditedRegisterMap &expedited_register_map, uint8_t signo,
-                    const std::string &thread_name, const std::string &reason,
+                    const std::string &thread_name, std::string reason,
                     const std::string &description, uint32_t exc_type,
                     const std::vector<lldb::addr_t> &exc_data,
                     lldb::addr_t thread_dispatch_qaddr, bool queue_vars_valid,

Copy link
Member

@JDevlieghere JDevlieghere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me, but I'll defer to @jimingham as he's the expert.

Copy link
Collaborator

@jimingham jimingham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems safe to me, and it can't hurt to take care of this corner case here regardless of what the higher levels of lldb does.

It bugs me a little that we're treating what seems like a general problem down in the GDBRemote process plugin. But anyway, if we are going to make this a process plugin policy: that the plugins are responsible for producing breakpoint stop reasons both for when the underlying system generates one and when we see "artificial" hits like this we should probably say so somewhere.

With the couple doc comments aside, I'm fine with this.

@@ -1722,6 +1742,8 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
} else {
bool handled = false;
bool did_exec = false;
if (reason == "none")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely needs a comment, or maybe somewhere else saying that none == empty, so we always convert none to empty.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the patch where I handle this a little more cleanly, checking that reason has a value, and that the value is not "none", and added a short comment.

Treat `reason:none` as equivalent to no `reason`
more cleanly in ProcessGDBRemote::SetThreadStopInfo.

Add documentation to the Thread base class `CalculateStopInfo`
method noting that the case of a thread at a breakpoint site
which has not yet executed the breakpoint must be handled.
@jasonmolenda
Copy link
Collaborator Author

This seems safe to me, and it can't hurt to take care of this corner case here regardless of what the higher levels of lldb does.

This all falls under Thread::CalculateStopInfo() - ThreadGDBRemote::CalculateStopInfo is calling into ProcessGDBRemote where we do this. The base class method is pure virtual, so I'm not sure there's a higher level place where this logic could be situated. I updated the PR with a comment on Thread::CalculateStopInfo() specifically detailing this issue, so people writing new subclasses can (hopefully) be aware of it.

@jasonmolenda jasonmolenda merged commit 87fadb3 into llvm:main Feb 23, 2024
@jasonmolenda jasonmolenda deleted the add-breakpoint-stopinfo-to-thread-at-breakpoint-site branch February 23, 2024 22:45
jasonmolenda added a commit to jasonmolenda/llvm-project that referenced this pull request Feb 24, 2024
)

This is next in my series of "fix the racey tests that fail on
greendragon" addressing the failure of TestConcurrentManyBreakpoints.py
where we set a breakpoint in a function that 100 threads execute, and we
check that we hit the breakpoint 100 times. But sometimes it is only hit
99 times, and the test fails.

When we hit a software breakpoint, the pc value for the thread is the
address of the breakpoint instruction - as if it had not been hit yet.
And because a user might ADD a breakpoint for the current pc from the
commandline, when we go to resume execution, any thread that is sitting
at a breakpoint site will be silently advanced past the breakpoint
instruction (disable bp, instruction step that thread, re-enable bp)
before resuming -- whether that thread has hit its breakpoint or not.

What this test is exposing is that there is another corner case, a
thread that is sitting at a breakpoint site but has not yet executed the
breakpoint instruction. The thread will have no stop reason, no mach
exception, so it will not be recorded as having hit the breakpoint
(because it hasn't yet). But when we resume execution, because it is
sitting at a breakpoint site, we advance past it and miss the breakpoint
hit.

In 2016 Abhishek Aggarwal handled a similar issue with a patch in
`ProcessGDBRemote::SetThreadStopInfo()`, adding a breakpoint StopInfo
for a thread sitting at a breakpoint site that has no stop reason.
debugserver's `jThreadsInfo` would not correctly execute Abhishek's code
though because it would respond with `"reason":"none"` for a thread with
no stop reason, and `SetThreadStopInfo()` expected an empty reason here.
The first part of my patch is to clear the `reason` if it is `"none"` so
we flow through the code correctly.

On Darwin, though, our stop reply packet (Txx...) includes the
`threads`, `thread-pcs`, and `jstopinfo` keys, which give us the tids
for all current threads, the pc values for those threads, and
`jstopinfo` has a JSON dictionary with the mach exceptions for all
threads that have a mach exception. In
`ProcessGDBRemote::CalculateThreadStopInfo()` we set the StopInfo for
each thread for a private stop and if we have `jstopinfo` it is the
source of all the StopInfos. I have to add the same logic here, to give
the thread a breakpoint StopInfo even though it hasn't executed the
breakpoint yet. In this case we are very early in thread construction
and I only have the information in the Txx stop reply packet -- tids,
pcs, and jstopinfo, so I can't use the normal general mechanisms of
going through the RegisterContext to get the pc, it's a bit different.

If I hack debugserver to not issue `jstopinfo`,
`CalculateThreadStopInfo` will fall back to sending `qThreadStopInfo`
for each thread and going through
`ProcessGDBRemote::SetThreadStopInfo()` to set the stop infos (and with
the `reason:none` fix, use Abhishek's code).

rdar://110549165
(cherry picked from commit 87fadb3)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants