Skip to content

Commit 1b02227

Browse files
committed
[lldb] Detect a Darwin kernel issue and work around it (llvm#81573)
On arm64 machines, when there is a hardware breakpoint or watchpoint set, and lldb has instruction-stepped a thread, and then done a Process::Resume, we will sometimes receive an extra "instruction step completed" mach exception and the pc has not advanced. From a user's perspective, they hit Continue and lldb stops again at the same spot. From the testsuite's perspective, this has been a constant source of testsuite failures for any test using hardware watchpoints and breakpoints, the arm64 CI bots seem especially good at hitting this issue. Jim and I have been slowly looking at this for a few months now, and finally I decided to try to detect this situation in lldb and silently resume the process again when it happens. We were already detecting this "got an insn-step finished mach exception but this thread was not instruction stepping" combination in StopInfoMachException where we take the mach exception and create a StopInfo object for it. We had a lot of logging we used to understand the failure as it was hit on the bots in assert builds. This patch adds a new case to `Thread::GetPrivateStopInfo()` to call the StopInfo's (new) `IsContinueInterrupted()` method. In StopInfoMachException, where we previously had logging for assert builds, I now note it in an ivar, and when `Thread::GetPrivateStopInfo()` asks if this has happened, we check all of the combination of events that this comes up: We have a hardware breakpoint or watchpoint, we were not instruction stepping this thread but got an insn-step mach exception, the pc is the same as the previous stop's pc. And in that case, `Thread::GetPrivateStopInfo()` returns no StopInfo -- indicating that this thread would like to resume execution. The `Thread` object has two StackFrameLists, `m_curr_frames_sp` and `m_prev_frames_sp`. When a thread resumes execution, we move `m_curr_frames_sp` in to `m_prev_frames_sp` and when it stops executing, w euse `m_prev_frames_sp` to seed the new `m_curr_frames_sp` if most of the stack is the same as before. In this same location, I now save the Thread's RegisterContext::GetPC into an ivar, `m_prev_framezero_pc`. StopInfoMachException needs this information to check all of the conditions I outlined above for `IsContinueInterrupted`. This has passed exhaustive testing and we do not have any testsuite failures for hardware watchpoints and breakpoints due to this kernel bug with the patch in place. In focusing on these tests for thousands of runs, I have found two other uncommon race conditions for the TestConcurrent* tests on arm64. TestConcurrentManyBreakpoints.py (which uses no hardware watchpoint/breakpoints) will sometimes only have 99 breakpoints when it expects 100, and any of the concurrent tests using the shared harness (I've seen it in TestConcurrentWatchBreakDelay.py, TestConcurrentTwoBreakpointsOneSignal.py, TestConcurrentSignalDelayWatch.py) can fail when the test harness checks that there is only one thread still running at the end, and it finds two -- one of them under pthread_exit / pthread_terminate. Both of these failures happen on github main without my changes, and with my changes - they are unrelated race conditions in these tests, and I'm sure I'll be looking into them at some point if they hit the CI bots with frequency. On my computer, these are in the 0.3-0.5% of the time class. But the CI bots do have different timing. (cherry picked from commit aab48c9)
1 parent f39c292 commit 1b02227

File tree

5 files changed

+101
-31
lines changed

5 files changed

+101
-31
lines changed

lldb/include/lldb/Target/StopInfo.h

Lines changed: 5 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -79,6 +79,11 @@ class StopInfo : public std::enable_shared_from_this<StopInfo> {
7979

8080
virtual bool IsValidForOperatingSystemThread(Thread &thread) { return true; }
8181

82+
/// A Continue operation can result in a false stop event
83+
/// before any execution has happened. We need to detect this
84+
/// and silently continue again one more time.
85+
virtual bool WasContinueInterrupted(Thread &thread) { return false; }
86+
8287
// Sometimes the thread plan logic will know that it wants a given stop to
8388
// stop or not, regardless of what the ordinary logic for that StopInfo would
8489
// dictate. The main example of this is the ThreadPlanCallFunction, which

lldb/include/lldb/Target/Thread.h

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -11,6 +11,7 @@
1111

1212
#include <memory>
1313
#include <mutex>
14+
#include <optional>
1415
#include <string>
1516
#include <vector>
1617

@@ -1251,6 +1252,16 @@ class Thread : public std::enable_shared_from_this<Thread>,
12511252

12521253
lldb::ValueObjectSP GetSiginfoValue();
12531254

1255+
/// Request the pc value the thread had when previously stopped.
1256+
///
1257+
/// When the thread performs execution, it copies the current RegisterContext
1258+
/// GetPC() value. This method returns that value, if it is available.
1259+
///
1260+
/// \return
1261+
/// The PC value before execution was resumed. May not be available;
1262+
/// an empty std::optional is returned in that case.
1263+
std::optional<lldb::addr_t> GetPreviousFrameZeroPC();
1264+
12541265
protected:
12551266
friend class ThreadPlan;
12561267
friend class ThreadList;
@@ -1331,6 +1342,9 @@ class Thread : public std::enable_shared_from_this<Thread>,
13311342
///populated after a thread stops.
13321343
lldb::StackFrameListSP m_prev_frames_sp; ///< The previous stack frames from
13331344
///the last time this thread stopped.
1345+
std::optional<lldb::addr_t>
1346+
m_prev_framezero_pc; ///< Frame 0's PC the last
1347+
/// time this thread was stopped.
13341348
int m_resume_signal; ///< The signal that should be used when continuing this
13351349
///thread.
13361350
lldb::StateType m_resume_state; ///< This state is used to force a thread to

lldb/source/Plugins/Process/Utility/StopInfoMachException.cpp

Lines changed: 57 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,8 @@
2626
#include "lldb/Target/Thread.h"
2727
#include "lldb/Target/ThreadPlan.h"
2828
#include "lldb/Target/UnixSignals.h"
29+
#include "lldb/Utility/LLDBLog.h"
30+
#include "lldb/Utility/Log.h"
2931
#include "lldb/Utility/StreamString.h"
3032
#include <optional>
3133

@@ -600,6 +602,7 @@ StopInfoSP StopInfoMachException::CreateStopReasonWithMachException(
600602
if (exc_type == 0)
601603
return StopInfoSP();
602604

605+
bool not_stepping_but_got_singlestep_exception = false;
603606
uint32_t pc_decrement = 0;
604607
ExecutionContext exe_ctx(thread.shared_from_this());
605608
Target *target = exe_ctx.GetTargetPtr();
@@ -729,30 +732,8 @@ StopInfoSP StopInfoMachException::CreateStopReasonWithMachException(
729732
// is set
730733
is_actual_breakpoint = true;
731734
is_trace_if_actual_breakpoint_missing = true;
732-
#ifndef NDEBUG
733-
if (thread.GetTemporaryResumeState() != eStateStepping) {
734-
StreamString s;
735-
s.Printf("CreateStopReasonWithMachException got EXC_BREAKPOINT [1,0] "
736-
"indicating trace event, but thread is not tracing, it has "
737-
"ResumeState %d",
738-
thread.GetTemporaryResumeState());
739-
if (RegisterContextSP regctx = thread.GetRegisterContext()) {
740-
if (const RegisterInfo *ri = regctx->GetRegisterInfoByName("esr")) {
741-
uint32_t esr =
742-
(uint32_t)regctx->ReadRegisterAsUnsigned(ri, UINT32_MAX);
743-
if (esr != UINT32_MAX) {
744-
s.Printf(" esr value: 0x%" PRIx32, esr);
745-
}
746-
}
747-
}
748-
thread.GetProcess()->DumpPluginHistory(s);
749-
llvm::report_fatal_error(s.GetData());
750-
lldbassert(
751-
false &&
752-
"CreateStopReasonWithMachException got EXC_BREAKPOINT [1,0] "
753-
"indicating trace event, but thread was not doing a step.");
754-
}
755-
#endif
735+
if (thread.GetTemporaryResumeState() != eStateStepping)
736+
not_stepping_but_got_singlestep_exception = true;
756737
}
757738
if (exc_code == 0x102) // EXC_ARM_DA_DEBUG
758739
{
@@ -839,6 +820,56 @@ StopInfoSP StopInfoMachException::CreateStopReasonWithMachException(
839820
break;
840821
}
841822

842-
return StopInfoSP(new StopInfoMachException(thread, exc_type, exc_data_count,
843-
exc_code, exc_sub_code));
823+
return std::make_shared<StopInfoMachException>(
824+
thread, exc_type, exc_data_count, exc_code, exc_sub_code,
825+
not_stepping_but_got_singlestep_exception);
826+
}
827+
828+
// Detect an unusual situation on Darwin where:
829+
//
830+
// 0. We did an instruction-step before this.
831+
// 1. We have a hardware breakpoint or watchpoint set.
832+
// 2. We resumed the process, but not with an instruction-step.
833+
// 3. The thread gets an "instruction-step completed" mach exception.
834+
// 4. The pc has not advanced - it is the same as before.
835+
//
836+
// This method returns true for that combination of events.
837+
bool StopInfoMachException::WasContinueInterrupted(Thread &thread) {
838+
Log *log = GetLog(LLDBLog::Step);
839+
840+
// We got an instruction-step completed mach exception but we were not
841+
// doing an instruction step on this thread.
842+
if (!m_not_stepping_but_got_singlestep_exception)
843+
return false;
844+
845+
RegisterContextSP reg_ctx_sp(thread.GetRegisterContext());
846+
std::optional<addr_t> prev_pc = thread.GetPreviousFrameZeroPC();
847+
if (!reg_ctx_sp || !prev_pc)
848+
return false;
849+
850+
// The previous pc value and current pc value are the same.
851+
if (*prev_pc != reg_ctx_sp->GetPC())
852+
return false;
853+
854+
// We have a watchpoint -- this is the kernel bug.
855+
ProcessSP process_sp = thread.GetProcess();
856+
if (process_sp->GetWatchpointResourceList().GetSize()) {
857+
LLDB_LOGF(log,
858+
"Thread stopped with insn-step completed mach exception but "
859+
"thread was not stepping; there is a hardware watchpoint set.");
860+
return true;
861+
}
862+
863+
// We have a hardware breakpoint -- this is the kernel bug.
864+
auto &bp_site_list = process_sp->GetBreakpointSiteList();
865+
for (auto &site : bp_site_list.Sites()) {
866+
if (site->IsHardware() && site->IsEnabled()) {
867+
LLDB_LOGF(log,
868+
"Thread stopped with insn-step completed mach exception but "
869+
"thread was not stepping; there is a hardware breakpoint set.");
870+
return true;
871+
}
872+
}
873+
874+
return false;
844875
}

lldb/source/Plugins/Process/Utility/StopInfoMachException.h

Lines changed: 9 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -31,9 +31,12 @@ class StopInfoMachException : public StopInfo {
3131
// Constructors and Destructors
3232
StopInfoMachException(Thread &thread, uint32_t exc_type,
3333
uint32_t exc_data_count, uint64_t exc_code,
34-
uint64_t exc_subcode)
34+
uint64_t exc_subcode,
35+
bool not_stepping_but_got_singlestep_exception)
3536
: StopInfo(thread, exc_type), m_exc_data_count(exc_data_count),
36-
m_exc_code(exc_code), m_exc_subcode(exc_subcode) {}
37+
m_exc_code(exc_code), m_exc_subcode(exc_subcode),
38+
m_not_stepping_but_got_singlestep_exception(
39+
not_stepping_but_got_singlestep_exception) {}
3740

3841
~StopInfoMachException() override = default;
3942

@@ -58,10 +61,14 @@ class StopInfoMachException : public StopInfo {
5861
uint64_t exc_code, uint64_t exc_sub_code, uint64_t exc_sub_sub_code,
5962
bool pc_already_adjusted = true, bool adjust_pc_if_needed = false);
6063

64+
bool WasContinueInterrupted(Thread &thread) override;
65+
6166
protected:
6267
uint32_t m_exc_data_count;
6368
uint64_t m_exc_code;
6469
uint64_t m_exc_subcode;
70+
71+
bool m_not_stepping_but_got_singlestep_exception;
6572
};
6673

6774
} // namespace lldb_private

lldb/source/Target/Thread.cpp

Lines changed: 16 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -223,7 +223,7 @@ Thread::Thread(Process &process, lldb::tid_t tid, bool use_invalid_index_id)
223223
: process.GetNextThreadIndexID(tid)),
224224
m_reg_context_sp(), m_state(eStateUnloaded), m_state_mutex(),
225225
m_frame_mutex(), m_curr_frames_sp(), m_prev_frames_sp(),
226-
m_resume_signal(LLDB_INVALID_SIGNAL_NUMBER),
226+
m_prev_framezero_pc(), m_resume_signal(LLDB_INVALID_SIGNAL_NUMBER),
227227
m_resume_state(eStateRunning), m_temporary_resume_state(eStateRunning),
228228
m_unwinder_up(), m_destroy_called(false),
229229
m_override_should_notify(eLazyBoolCalculate),
@@ -252,6 +252,7 @@ void Thread::DestroyThread() {
252252
std::lock_guard<std::recursive_mutex> guard(m_frame_mutex);
253253
m_curr_frames_sp.reset();
254254
m_prev_frames_sp.reset();
255+
m_prev_framezero_pc.reset();
255256
}
256257

257258
void Thread::BroadcastSelectedFrameChange(StackID &new_frame_id) {
@@ -440,6 +441,12 @@ lldb::StopInfoSP Thread::GetPrivateStopInfo(bool calculate) {
440441
}
441442
}
442443
}
444+
445+
// If we were resuming the process and it was interrupted,
446+
// return no stop reason. This thread would like to resume.
447+
if (m_stop_info_sp && m_stop_info_sp->WasContinueInterrupted(*this))
448+
return {};
449+
443450
return m_stop_info_sp;
444451
}
445452

@@ -1471,16 +1478,22 @@ StackFrameListSP Thread::GetStackFrameList() {
14711478
return m_curr_frames_sp;
14721479
}
14731480

1481+
std::optional<addr_t> Thread::GetPreviousFrameZeroPC() {
1482+
return m_prev_framezero_pc;
1483+
}
1484+
14741485
void Thread::ClearStackFrames() {
14751486
std::lock_guard<std::recursive_mutex> guard(m_frame_mutex);
14761487

14771488
GetUnwinder().Clear();
1489+
m_prev_framezero_pc.reset();
1490+
if (RegisterContextSP reg_ctx_sp = GetRegisterContext())
1491+
m_prev_framezero_pc = reg_ctx_sp->GetPC();
14781492

14791493
// Only store away the old "reference" StackFrameList if we got all its
14801494
// frames:
14811495
// FIXME: At some point we can try to splice in the frames we have fetched
1482-
// into
1483-
// the new frame as we make it, but let's not try that now.
1496+
// into the new frame as we make it, but let's not try that now.
14841497
if (m_curr_frames_sp && m_curr_frames_sp->GetAllFramesFetched())
14851498
m_prev_frames_sp.swap(m_curr_frames_sp);
14861499
m_curr_frames_sp.reset();

0 commit comments

Comments
 (0)