Skip to content

New ThreadPlanSingleThreadTimeout to resolve potential deadlock in single thread stepping #90930

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Aug 6, 2024

Conversation

jeffreytan81
Copy link
Contributor

@jeffreytan81 jeffreytan81 commented May 3, 2024

This PR introduces a new ThreadPlanSingleThreadTimeout that will be used to address potential deadlock during single-thread stepping.

While debugging a target with a non-trivial number of threads (around 5000 threads in one example target), we noticed that a simple step over can take as long as 10 seconds. Enabling single-thread stepping mode significantly reduces the stepping time to around 3 seconds. However, this can introduce deadlock if we try to step over a method that depends on other threads to release a lock.

To address this issue, we introduce a new ThreadPlanSingleThreadTimeout that can be controlled by the target.process.thread.single-thread-plan-timeout setting during single-thread stepping mode. The concept involves counting the elapsed time since the last internal stop to detect overall stepping progress. Once a timeout occurs, we assume the target is not making progress due to a potential deadlock, as mentioned above. We then send a new async interrupt, resume all threads, and ThreadPlanSingleThreadTimeout completes its task.

To support this design, the major changes made in this PR are:

  1. ThreadPlanSingleThreadTimeout is popped during every internal stop and reset (re-pushed) to the top of the stack (as a leaf node) during resume. This is achieved by always returning true from ThreadPlanSingleThreadTimeout::DoPlanExplainsStop() and ThreadPlanSingleThreadTimeout::MischiefManaged().
  2. A new thread-specific async interrupt stop is introduced, which can be detected/consumed by ThreadPlanSingleThreadTimeout.
  3. The clearing of branch breakpoints in the range thread plan has been moved from DoPlanExplainsStop() to ShouldStop(), as it is not guaranteed that it will be called.

The detailed design is discussed in the RFC below:
https://discourse.llvm.org/t/improve-single-thread-stepping/74599

Copy link

github-actions bot commented May 6, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@jeffreytan81 jeffreytan81 marked this pull request as ready for review May 6, 2024 21:56
@jeffreytan81 jeffreytan81 requested a review from JDevlieghere as a code owner May 6, 2024 21:56
@llvmbot llvmbot added the lldb label May 6, 2024
@llvmbot
Copy link
Member

llvmbot commented May 6, 2024

@llvm/pr-subscribers-lldb

Author: None (jeffreytan81)

Changes

This PR introduces a new ThreadPlanSingleThreadTimeout that will be used to address potential deadlock during single-thread stepping.

While debugging a target with a non-trivial number of threads (around 5000 threads in one example target), we noticed that a simple step over can take as long as 10 seconds. Enabling single-thread stepping mode significantly reduces the stepping time to around 3 seconds. However, this can introduce deadlock if we try to step over a method that depends on other threads to release a lock.

To address this issue, we introduce a new ThreadPlanSingleThreadTimeout that can be controlled by the target.process.thread.single-thread-plan-timeout setting during single-thread stepping mode. The concept involves counting the elapsed time since the last internal stop to detect overall stepping progress. Once a timeout occurs, we assume the target is not making progress due to a potential deadlock, as mentioned above. We then send a new async interrupt, resume all threads, and ThreadPlanSingleThreadTimeout completes its task.

To support this design, the major changes made in this PR are:

  1. ThreadPlanSingleThreadTimeout is popped during every internal stop and reset (re-pushed) to the top of the stack (as a leaf node) during resume. This is achieved by always returning true from ThreadPlanSingleThreadTimeout::DoPlanExplainsStop() and ThreadPlanSingleThreadTimeout::MischiefManaged().
  2. A new thread-specific async interrupt stop is introduced, which can be detected/consumed by ThreadPlanSingleThreadTimeout.
  3. The clearing of branch breakpoints in the range thread plan has been moved from DoPlanExplainsStop() to ShouldStop(), as it is not guaranteed that it will be called.

The detailed design is discussed in the RFC below:
https://discourse.llvm.org/t/improve-single-thread-stepping/74599


Patch is 49.90 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/90930.diff

29 Files Affected:

  • (modified) lldb/include/lldb/Target/Process.h (+9-2)
  • (modified) lldb/include/lldb/Target/StopInfo.h (+4)
  • (modified) lldb/include/lldb/Target/Thread.h (+2)
  • (modified) lldb/include/lldb/Target/ThreadPlan.h (+5-3)
  • (added) lldb/include/lldb/Target/ThreadPlanSingleThreadTimeout.h (+93)
  • (modified) lldb/include/lldb/Target/ThreadPlanStepOut.h (+1)
  • (modified) lldb/include/lldb/Target/ThreadPlanStepOverRange.h (+6)
  • (modified) lldb/include/lldb/Target/ThreadPlanStepRange.h (+5)
  • (modified) lldb/include/lldb/lldb-enumerations.h (+1)
  • (modified) lldb/source/API/SBThread.cpp (+6)
  • (modified) lldb/source/Interpreter/CommandInterpreter.cpp (+1-1)
  • (modified) lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp (+2)
  • (modified) lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp (+50-11)
  • (modified) lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h (+6)
  • (modified) lldb/source/Target/CMakeLists.txt (+1)
  • (modified) lldb/source/Target/Process.cpp (+18-5)
  • (modified) lldb/source/Target/StopInfo.cpp (+37)
  • (modified) lldb/source/Target/TargetProperties.td (+4)
  • (modified) lldb/source/Target/Thread.cpp (+8-1)
  • (modified) lldb/source/Target/ThreadPlan.cpp (+1)
  • (added) lldb/source/Target/ThreadPlanSingleThreadTimeout.cpp (+234)
  • (modified) lldb/source/Target/ThreadPlanStepInRange.cpp (+1)
  • (modified) lldb/source/Target/ThreadPlanStepOverRange.cpp (+18-2)
  • (modified) lldb/source/Target/ThreadPlanStepRange.cpp (+45-26)
  • (added) lldb/test/API/functionalities/single-thread-step/Makefile (+4)
  • (added) lldb/test/API/functionalities/single-thread-step/TestSingleThreadStepTimeout.py (+123)
  • (added) lldb/test/API/functionalities/single-thread-step/main.cpp (+62)
  • (modified) lldb/tools/lldb-dap/JSONUtils.cpp (+3)
  • (modified) lldb/tools/lldb-dap/LLDBUtils.cpp (+1)
diff --git a/lldb/include/lldb/Target/Process.h b/lldb/include/lldb/Target/Process.h
index aac0cf51680a9e..7e758dbb9f6457 100644
--- a/lldb/include/lldb/Target/Process.h
+++ b/lldb/include/lldb/Target/Process.h
@@ -1312,11 +1312,13 @@ class Process : public std::enable_shared_from_this<Process>,
 
   size_t GetThreadStatus(Stream &ostrm, bool only_threads_with_stop_reason,
                          uint32_t start_frame, uint32_t num_frames,
-                         uint32_t num_frames_with_source,
-                         bool stop_format);
+                         uint32_t num_frames_with_source, bool stop_format);
 
   void SendAsyncInterrupt();
 
+  // Send an async interrupt and receive stop from a specific /p thread.
+  void SendAsyncInterrupt(Thread *thread);
+
   // Notify this process class that modules got loaded.
   //
   // If subclasses override this method, they must call this version before
@@ -3102,6 +3104,11 @@ void PruneThreadPlans();
                            // Resume will only request a resume, using this
                            // flag to check.
 
+  lldb::tid_t m_interrupt_tid; /// The tid of the thread that issued the async
+                               /// interrupt, used by thread plan timeout. It
+                               /// can be LLDB_INVALID_THREAD_ID to indicate
+                               /// user level async interrupt.
+
   /// This is set at the beginning of Process::Finalize() to stop functions
   /// from looking up or creating things during or after a finalize call.
   std::atomic<bool> m_finalizing;
diff --git a/lldb/include/lldb/Target/StopInfo.h b/lldb/include/lldb/Target/StopInfo.h
index d1848fcbbbdb19..fae90364deaf0a 100644
--- a/lldb/include/lldb/Target/StopInfo.h
+++ b/lldb/include/lldb/Target/StopInfo.h
@@ -123,6 +123,10 @@ class StopInfo : public std::enable_shared_from_this<StopInfo> {
                              const char *description = nullptr,
                              std::optional<int> code = std::nullopt);
 
+  static lldb::StopInfoSP
+  CreateStopReasonWithInterrupt(Thread &thread, int signo,
+                                const char *description);
+
   static lldb::StopInfoSP CreateStopReasonToTrace(Thread &thread);
 
   static lldb::StopInfoSP
diff --git a/lldb/include/lldb/Target/Thread.h b/lldb/include/lldb/Target/Thread.h
index c17bddf4d98b85..584093348b4c6a 100644
--- a/lldb/include/lldb/Target/Thread.h
+++ b/lldb/include/lldb/Target/Thread.h
@@ -57,6 +57,8 @@ class ThreadProperties : public Properties {
   bool GetStepOutAvoidsNoDebug() const;
 
   uint64_t GetMaxBacktraceDepth() const;
+
+  uint64_t GetSingleThreadPlanTimeout() const;
 };
 
 class Thread : public std::enable_shared_from_this<Thread>,
diff --git a/lldb/include/lldb/Target/ThreadPlan.h b/lldb/include/lldb/Target/ThreadPlan.h
index bf68a42e54d18f..640e997caf7b3d 100644
--- a/lldb/include/lldb/Target/ThreadPlan.h
+++ b/lldb/include/lldb/Target/ThreadPlan.h
@@ -302,7 +302,8 @@ class ThreadPlan : public std::enable_shared_from_this<ThreadPlan>,
     eKindStepInRange,
     eKindRunToAddress,
     eKindStepThrough,
-    eKindStepUntil
+    eKindStepUntil,
+    eKindSingleThreadTimeout,
   };
 
   virtual ~ThreadPlan();
@@ -483,6 +484,8 @@ class ThreadPlan : public std::enable_shared_from_this<ThreadPlan>,
     return m_takes_iteration_count;
   }
 
+  virtual lldb::StateType GetPlanRunState() = 0;
+
 protected:
   // Constructors and Destructors
   ThreadPlan(ThreadPlanKind kind, const char *name, Thread &thread,
@@ -522,8 +525,6 @@ class ThreadPlan : public std::enable_shared_from_this<ThreadPlan>,
     GetThread().SetStopInfo(stop_reason_sp);
   }
 
-  virtual lldb::StateType GetPlanRunState() = 0;
-
   bool IsUsuallyUnexplainedStopReason(lldb::StopReason);
 
   Status m_status;
@@ -590,6 +591,7 @@ class ThreadPlanNull : public ThreadPlan {
   const Status &GetStatus() { return m_status; }
 
 protected:
+  friend class ThreadPlanSingleThreadTimeout;
   bool DoPlanExplainsStop(Event *event_ptr) override;
 
   lldb::StateType GetPlanRunState() override;
diff --git a/lldb/include/lldb/Target/ThreadPlanSingleThreadTimeout.h b/lldb/include/lldb/Target/ThreadPlanSingleThreadTimeout.h
new file mode 100644
index 00000000000000..89db966b29bff3
--- /dev/null
+++ b/lldb/include/lldb/Target/ThreadPlanSingleThreadTimeout.h
@@ -0,0 +1,93 @@
+//===-- ThreadPlanSingleThreadTimeout.h -------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+#ifndef LLDB_TARGET_THREADPLANSINGLETHREADTIMEOUT_H
+#define LLDB_TARGET_THREADPLANSINGLETHREADTIMEOUT_H
+
+#include "lldb/Target/Thread.h"
+#include "lldb/Target/ThreadPlan.h"
+#include "lldb/Utility/Event.h"
+#include "lldb/Utility/LLDBLog.h"
+#include "lldb/Utility/State.h"
+
+#include <thread>
+
+namespace lldb_private {
+
+//
+// Thread plan used by single thread execution to issue timeout. This is useful
+// to detect potential deadlock in single thread execution. The timeout measures
+// the elapsed time from the last internal stop and got reset by each internal
+// stops to ensure we are accurately detecting execution not moving forward.
+// This means this thread plan  can be created/destroyed multiple times by the
+// parent execution plan.
+//
+// When timeout happens, the thread plan resolves the potential deadlock by
+// issuing a thread specific async interrupt to enter stop state, then all
+// threads execution are resumed to resolve the potential deadlock.
+//
+class ThreadPlanSingleThreadTimeout : public ThreadPlan {
+public:
+  ~ThreadPlanSingleThreadTimeout() override;
+
+  // Create a new instance from fresh new state.
+  static void CreateNew(Thread &thread);
+  // Reset and create a new instance from the previous state.
+  static void ResetFromPrevState(Thread &thread);
+
+  void GetDescription(Stream *s, lldb::DescriptionLevel level) override;
+  bool ValidatePlan(Stream *error) override { return true; }
+  bool WillStop() override;
+  void DidPop() override;
+
+  bool DoPlanExplainsStop(Event *event_ptr) override;
+
+  lldb::StateType GetPlanRunState() override;
+  static void TimeoutThreadFunc(ThreadPlanSingleThreadTimeout *self);
+
+  bool MischiefManaged() override;
+
+  bool ShouldStop(Event *event_ptr) override;
+  void SetStopOthers(bool new_value) override;
+  bool StopOthers() override;
+
+private:
+  ThreadPlanSingleThreadTimeout(Thread &thread);
+
+  static bool IsAlive();
+
+  enum class State {
+    WaitTimeout,    // Waiting for timeout.
+    AsyncInterrupt, // Async interrupt has been issued.
+    Done,           // Finished resume all threads.
+  };
+
+  static std::mutex s_mutex;
+  static ThreadPlanSingleThreadTimeout *s_instance;
+  static State s_prev_state;
+
+  bool HandleEvent(Event *event_ptr);
+  void HandleTimeout();
+
+  static std::string StateToString(State state);
+
+  ThreadPlanSingleThreadTimeout(const ThreadPlanSingleThreadTimeout &) = delete;
+  const ThreadPlanSingleThreadTimeout &
+  operator=(const ThreadPlanSingleThreadTimeout &) = delete;
+
+  State m_state;
+  std::mutex m_mutex;
+  std::condition_variable m_wakeup_cv;
+  // Whether the timer thread should exit or not.
+  bool m_exit_flag;
+  std::thread m_timer_thread;
+};
+
+} // namespace lldb_private
+
+#endif // LLDB_TARGET_THREADPLANSINGLETHREADTIMEOUT_H
diff --git a/lldb/include/lldb/Target/ThreadPlanStepOut.h b/lldb/include/lldb/Target/ThreadPlanStepOut.h
index b1d8769f7c546e..013c675afc33d0 100644
--- a/lldb/include/lldb/Target/ThreadPlanStepOut.h
+++ b/lldb/include/lldb/Target/ThreadPlanStepOut.h
@@ -30,6 +30,7 @@ class ThreadPlanStepOut : public ThreadPlan, public ThreadPlanShouldStopHere {
   bool ValidatePlan(Stream *error) override;
   bool ShouldStop(Event *event_ptr) override;
   bool StopOthers() override;
+  void SetStopOthers(bool new_value) override { m_stop_others = new_value; }
   lldb::StateType GetPlanRunState() override;
   bool WillStop() override;
   bool MischiefManaged() override;
diff --git a/lldb/include/lldb/Target/ThreadPlanStepOverRange.h b/lldb/include/lldb/Target/ThreadPlanStepOverRange.h
index 8585ac62f09b35..faa0ab596a2836 100644
--- a/lldb/include/lldb/Target/ThreadPlanStepOverRange.h
+++ b/lldb/include/lldb/Target/ThreadPlanStepOverRange.h
@@ -27,7 +27,9 @@ class ThreadPlanStepOverRange : public ThreadPlanStepRange,
   ~ThreadPlanStepOverRange() override;
 
   void GetDescription(Stream *s, lldb::DescriptionLevel level) override;
+  void SetStopOthers(bool new_value) override;
   bool ShouldStop(Event *event_ptr) override;
+  void DidPush() override;
 
 protected:
   bool DoPlanExplainsStop(Event *event_ptr) override;
@@ -42,8 +44,12 @@ class ThreadPlanStepOverRange : public ThreadPlanStepRange,
 
   void SetupAvoidNoDebug(LazyBool step_out_avoids_code_without_debug_info);
   bool IsEquivalentContext(const SymbolContext &context);
+  // Clear and create a new ThreadPlanSingleThreadTimeout to detect if program
+  // is moving forward.
+  void ResetSingleThreadTimeout();
 
   bool m_first_resume;
+  lldb::RunMode m_run_mode;
 
   ThreadPlanStepOverRange(const ThreadPlanStepOverRange &) = delete;
   const ThreadPlanStepOverRange &
diff --git a/lldb/include/lldb/Target/ThreadPlanStepRange.h b/lldb/include/lldb/Target/ThreadPlanStepRange.h
index 2fe88527710006..ced84c03e83cf5 100644
--- a/lldb/include/lldb/Target/ThreadPlanStepRange.h
+++ b/lldb/include/lldb/Target/ThreadPlanStepRange.h
@@ -58,8 +58,13 @@ class ThreadPlanStepRange : public ThreadPlan {
   // run' plan, then just single step.
   bool SetNextBranchBreakpoint();
 
+  // Whether the input stop info is caused by the next branch breakpoint.
+  bool IsNextBranchBreakpointStop(lldb::StopInfoSP stop_info_sp);
+
   void ClearNextBranchBreakpoint();
 
+  void ClearNextBranchBreakpointExplainedStop();
+
   bool NextRangeBreakpointExplainsStop(lldb::StopInfoSP stop_info_sp);
 
   SymbolContext m_addr_context;
diff --git a/lldb/include/lldb/lldb-enumerations.h b/lldb/include/lldb/lldb-enumerations.h
index 15e45857186091..2000931913c81b 100644
--- a/lldb/include/lldb/lldb-enumerations.h
+++ b/lldb/include/lldb/lldb-enumerations.h
@@ -253,6 +253,7 @@ enum StopReason {
   eStopReasonFork,
   eStopReasonVFork,
   eStopReasonVForkDone,
+  eStopReasonInterrupt, ///< Thread requested interrupt
 };
 
 /// Command Return Status Types.
diff --git a/lldb/source/API/SBThread.cpp b/lldb/source/API/SBThread.cpp
index ac3e2cd25daa94..c6925096e3cb1a 100644
--- a/lldb/source/API/SBThread.cpp
+++ b/lldb/source/API/SBThread.cpp
@@ -192,6 +192,9 @@ size_t SBThread::GetStopReasonDataCount() {
         case eStopReasonSignal:
           return 1;
 
+        case eStopReasonInterrupt:
+          return 1;
+
         case eStopReasonException:
           return 1;
 
@@ -261,6 +264,9 @@ uint64_t SBThread::GetStopReasonDataAtIndex(uint32_t idx) {
         case eStopReasonSignal:
           return stop_info_sp->GetValue();
 
+        case eStopReasonInterrupt:
+          return stop_info_sp->GetValue();
+
         case eStopReasonException:
           return stop_info_sp->GetValue();
 
diff --git a/lldb/source/Interpreter/CommandInterpreter.cpp b/lldb/source/Interpreter/CommandInterpreter.cpp
index 4c58ecc3c1848f..d327822db7bd37 100644
--- a/lldb/source/Interpreter/CommandInterpreter.cpp
+++ b/lldb/source/Interpreter/CommandInterpreter.cpp
@@ -2500,7 +2500,7 @@ bool CommandInterpreter::DidProcessStopAbnormally() const {
     const StopReason reason = stop_info->GetStopReason();
     if (reason == eStopReasonException ||
         reason == eStopReasonInstrumentation ||
-        reason == eStopReasonProcessorTrace)
+        reason == eStopReasonProcessorTrace || reason == eStopReasonInterrupt)
       return true;
 
     if (reason == eStopReasonSignal) {
diff --git a/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp b/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp
index ae1a77e5be8321..b3c0359b7dcf7b 100644
--- a/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp
+++ b/lldb/source/Plugins/Process/gdb-remote/GDBRemoteCommunicationServerLLGS.cpp
@@ -714,6 +714,8 @@ static const char *GetStopReasonString(StopReason stop_reason) {
     return "vfork";
   case eStopReasonVForkDone:
     return "vforkdone";
+  case eStopReasonInterrupt:
+    return "async interrupt";
   case eStopReasonInstrumentation:
   case eStopReasonInvalid:
   case eStopReasonPlanComplete:
diff --git a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
index 871683a605686f..b93aad8808f5bd 100644
--- a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
+++ b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp
@@ -1730,14 +1730,24 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
       thread_sp = memory_thread_sp;
 
     if (exc_type != 0) {
-      const size_t exc_data_size = exc_data.size();
-
-      thread_sp->SetStopInfo(
-          StopInfoMachException::CreateStopReasonWithMachException(
-              *thread_sp, exc_type, exc_data_size,
-              exc_data_size >= 1 ? exc_data[0] : 0,
-              exc_data_size >= 2 ? exc_data[1] : 0,
-              exc_data_size >= 3 ? exc_data[2] : 0));
+      // For thread plan async interrupt, creating stop info on the
+      // original async interrupt request thread instead. If interrupt thread
+      // does not exist anymore we fallback to current signal receiving thread
+      // instead.
+      ThreadSP interrupt_thread;
+      if (m_interrupt_tid != LLDB_INVALID_THREAD_ID)
+        interrupt_thread = HandleThreadAsyncInterrupt(signo, description);
+      if (interrupt_thread)
+        thread_sp = interrupt_thread;
+      else {
+        const size_t exc_data_size = exc_data.size();
+        thread_sp->SetStopInfo(
+            StopInfoMachException::CreateStopReasonWithMachException(
+                *thread_sp, exc_type, exc_data_size,
+                exc_data_size >= 1 ? exc_data[0] : 0,
+                exc_data_size >= 2 ? exc_data[1] : 0,
+                exc_data_size >= 3 ? exc_data[2] : 0));
+      }
     } else {
       bool handled = false;
       bool did_exec = false;
@@ -1936,9 +1946,20 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
                   *thread_sp, signo, description.c_str()));
           }
         }
-        if (!handled)
-          thread_sp->SetStopInfo(StopInfo::CreateStopReasonWithSignal(
-              *thread_sp, signo, description.c_str()));
+        if (!handled) {
+          // For thread plan async interrupt, creating stop info on the
+          // original async interrupt request thread instead. If interrupt
+          // thread does not exist anymore we fallback to current signal
+          // receiving thread instead.
+          ThreadSP interrupt_thread;
+          if (m_interrupt_tid != LLDB_INVALID_THREAD_ID)
+            interrupt_thread = HandleThreadAsyncInterrupt(signo, description);
+          if (interrupt_thread)
+            thread_sp = interrupt_thread;
+          else
+            thread_sp->SetStopInfo(StopInfo::CreateStopReasonWithSignal(
+                *thread_sp, signo, description.c_str()));
+        }
       }
 
       if (!description.empty()) {
@@ -1957,6 +1978,24 @@ ThreadSP ProcessGDBRemote::SetThreadStopInfo(
   return thread_sp;
 }
 
+ThreadSP
+ProcessGDBRemote::HandleThreadAsyncInterrupt(uint8_t signo,
+                                             const std::string &description) {
+  ThreadSP thread_sp;
+  {
+    std::lock_guard<std::recursive_mutex> guard(m_thread_list_real.GetMutex());
+    thread_sp = m_thread_list_real.FindThreadByProtocolID(m_interrupt_tid,
+                                                          /*can_update=*/false);
+  }
+  if (thread_sp)
+    thread_sp->SetStopInfo(StopInfo::CreateStopReasonWithInterrupt(
+        *thread_sp, signo, description.c_str()));
+  // Clear m_interrupt_tid regardless we can find original interrupt thread or
+  // not.
+  m_interrupt_tid = LLDB_INVALID_THREAD_ID;
+  return thread_sp;
+}
+
 lldb::ThreadSP
 ProcessGDBRemote::SetThreadStopInfo(StructuredData::Dictionary *thread_dict) {
   static constexpr llvm::StringLiteral g_key_tid("tid");
diff --git a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
index 610a1ee0b34d2f..615831dd7e6f60 100644
--- a/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
+++ b/lldb/source/Plugins/Process/gdb-remote/ProcessGDBRemote.h
@@ -440,6 +440,12 @@ class ProcessGDBRemote : public Process,
   void HandleStopReply() override;
   void HandleAsyncStructuredDataPacket(llvm::StringRef data) override;
 
+  // Handle thread specific async interrupt and return the original thread
+  // that requested the async interrupt. It can be null if original thread
+  // has exited.
+  lldb::ThreadSP HandleThreadAsyncInterrupt(uint8_t signo,
+                                            const std::string &description);
+
   void SetThreadPc(const lldb::ThreadSP &thread_sp, uint64_t index);
   using ModuleCacheKey = std::pair<std::string, std::string>;
   // KeyInfo for the cached module spec DenseMap.
diff --git a/lldb/source/Target/CMakeLists.txt b/lldb/source/Target/CMakeLists.txt
index cf4818eae3eb8b..d1209d29245d54 100644
--- a/lldb/source/Target/CMakeLists.txt
+++ b/lldb/source/Target/CMakeLists.txt
@@ -59,6 +59,7 @@ add_lldb_library(lldbTarget
   ThreadPlanCallUserExpression.cpp
   ThreadPlanPython.cpp
   ThreadPlanRunToAddress.cpp
+  ThreadPlanSingleThreadTimeout.cpp
   ThreadPlanShouldStopHere.cpp
   ThreadPlanStepInRange.cpp
   ThreadPlanStepInstruction.cpp
diff --git a/lldb/source/Target/Process.cpp b/lldb/source/Target/Process.cpp
index 30c240b064b59c..3e5e4fe58b4b3b 100644
--- a/lldb/source/Target/Process.cpp
+++ b/lldb/source/Target/Process.cpp
@@ -446,7 +446,8 @@ Process::Process(lldb::TargetSP target_sp, ListenerSP listener_sp,
       m_memory_cache(*this), m_allocated_memory_cache(*this),
       m_should_detach(false), m_next_event_action_up(), m_public_run_lock(),
       m_private_run_lock(), m_currently_handling_do_on_removals(false),
-      m_resume_requested(false), m_finalizing(false), m_destructing(false),
+      m_resume_requested(false), m_interrupt_tid(LLDB_INVALID_THREAD_ID),
+      m_finalizing(false), m_destructing(false),
       m_clear_thread_plans_on_stop(false), m_force_next_event_delivery(false),
       m_last_broadcast_state(eStateInvalid), m_destroy_in_process(false),
       m_can_interpret_function_calls(false), m_run_thread_plan_lock(),
@@ -863,6 +864,7 @@ bool Process::HandleProcessStateChangedEvent(
             case eStopReasonThreadExiting:
             case eStopReasonInstrumentation:
             case eStopReasonProcessorTrace:
+            case eStopReasonInterrupt:
               if (!other_thread)
                 other_thread = thread;
               break;
@@ -3679,7 +3681,13 @@ void Process::ControlPrivateStateThread(uint32_t signal) {
   }
 }
 
-void Process::SendAsyncInterrupt() {
+void Process::SendAsyncInterrupt() { SendAsyncInterrupt(nullptr); }
+
+void Process::SendAsyncInterrupt(Thread *thread) {
+  if (thread != nullptr)
+    m_interrupt_tid = thread->GetProtocolID();
+  else
+    m_interrupt_tid = LLDB_INVALID_THREAD_ID;
   if (PrivateStateThreadIsValid())
     m_private_state_broadcaster.BroadcastEvent(Process::eBroadcastBitInterrupt,
                                                nullptr);
@@ -3905,9 +3913,14 @@ thread_result_t Process::RunPrivateStateThread(bool is_secondary_thread) {
 
       if (interrupt_requested) {
         if (StateIsStoppedState(internal_state, true)) {
-          // We requested the interrupt, so mark this as such in the stop event
-          // so clients can tell an interrupted process from a natural stop
-          ProcessEventData::SetInterruptedInEvent(event_sp.get(), true);
+          // Oly mark interrupt event if it is not thread specific async
+          // interrupt.
+          if (m_interrupt_tid == LLDB_INVALID_THREAD_ID) {
+            // We requested the interrupt, so mark this as such in the stop
+            // event so clients can tell an interrupted process from ...
[truncated]

@jeffreytan81
Copy link
Contributor Author

@jimingham, friendly ping. @clayborg mentioned that you are the sole domain expert on ThreadPlan. Can you help to review this change? Thanks

@jimingham
Copy link
Collaborator

Thanks for working up this patch. I should be able to get time to look at this next week.


void SendAsyncInterrupt();

// Send an async interrupt and receive stop from a specific /p thread.
void SendAsyncInterrupt(Thread *thread);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need two API's here? Why not have thread default to nullptr and document that that means we let the system choose which thread gets the interrupt? It's also worth noting here that the "receive stop from specific thread" is artificial, so you might say something like "and attribute the stop to a specific thread" instead to make it clear we don't require the ability to actually have that thread receive the signal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose overloading because I remember reading some guideline to prefer overloading than default parameter. Google finds one:
https://stackoverflow.com/questions/15876590/overload-a-method-or-use-default-values-c

But I do not care that much really, so changed and made the comment more clear.

@@ -590,6 +591,7 @@ class ThreadPlanNull : public ThreadPlan {
const Status &GetStatus() { return m_status; }

protected:
friend class ThreadPlanSingleThreadTimeout;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this needed for? It's a little odd having a derived class friended to the base class, since you could make whatever the derived class needed protected rather than private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I thought there is a reason I can't remember, apparently not. Removed. Thanks.

//
// Thread plan used by single thread execution to issue timeout. This is useful
// to detect potential deadlock in single thread execution. The timeout measures
// the elapsed time from the last internal stop and got reset by each internal
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"gets reset by each internal stop"

// to detect potential deadlock in single thread execution. The timeout measures
// the elapsed time from the last internal stop and got reset by each internal
// stops to ensure we are accurately detecting execution not moving forward.
// This means this thread plan can be created/destroyed multiple times by the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"may" is better than "can" and there's one too many spaces.

Done, // Finished resume all threads.
};

static std::mutex s_mutex;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll have to see how you use this as I go along, but having statics here worries me. How are you going to support the case where lldb has two targets, each of which are running using this timeout feature?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I was storing some static states so that we can ensure:

  1. Singleton (only one ThreadPlanSingleThreadTimeout is alive/used anytime).
  2. If there is ever a race that, the ThreadPlanSingleThreadTimeout is in State::AsyncInterrupt but it got popped by another internal stop before async interrupt received. I want to remember the previous/last state, and the newly created ThreadPlanSingleThreadTimeout can be re-created from previous state.

Instead, I will look into storing these static states into parent controlling ThreadPlan (ThreadPlanStepOverRange in this case).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be better, but if you find you do need a singleton, you could also store that either in the Thread or the Process, depending on the scope at which you need them.

@jeffreytan81
Copy link
Contributor Author

@jimingham, I have revised the PR, let me know when you can review again. Thanks.

void SendAsyncInterrupt();
/// Send an async interrupt request.
///
/// If \a thread is specified the async interrupt stop will be attributed the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"will be attributed to the" - missing "to"

/// specified thread.
///
/// \param[in] thread
/// The thread from which to attribute the async interrupt stop to.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The thread the async interrupt will be attributed to"


~ThreadPlanSingleThreadTimeout() override;

// Create a new instance from fresh new state.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this API doesn't return the Created thread plan, then this API has to do more than just create it - it also pushes the plan, and sometimes doesn't. So CreateNew is not a good name, and the description of what the function does is too abbreviated.

@@ -42,8 +45,13 @@ class ThreadPlanStepOverRange : public ThreadPlanStepRange,

void SetupAvoidNoDebug(LazyBool step_out_avoids_code_without_debug_info);
bool IsEquivalentContext(const SymbolContext &context);
// Clear and create a new ThreadPlanSingleThreadTimeout to detect if program
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ThreadPlanStepOverRange isn't the only thread plan that we want to have this behavior for. For instance, it would be great to do this for ThreadPlanStepOut, which we don't even try to run on one thread, IIRC, and in order not to duplicate code, we will want to switch the ThreadPlanCallUserExpression over to this method as well. So the code to handle the interrupt has to be shareable - maybe a mix-in class that does the management of the interrupt?

// Handle thread specific async interrupt and return the original thread
// that requested the async interrupt. It can be null if original thread
// has exited.
lldb::ThreadSP HandleThreadAsyncInterrupt(uint8_t signo,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't see any reason why the ability to manage this interrupt event is GDBRemote specific. So this should be a virtual method in Process, and overridden here.

// We requested the interrupt, so mark this as such in the stop event
// so clients can tell an interrupted process from a natural stop
ProcessEventData::SetInterruptedInEvent(event_sp.get(), true);
// Oly mark interrupt event if it is not thread specific async
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only

bool ShouldStop(Event *event_ptr) override {
ThreadSP thread_sp(m_thread_wp.lock());
if (thread_sp)
return thread_sp->GetProcess()->GetUnixSignals()->GetShouldStop(m_value);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right. This is a signal that you are using to implement your async interrupt. So you shouldn't be governed by whether the user would or would not want to stop for that signal if it had been generated by some other means.

@jeffreytan81 jeffreytan81 force-pushed the single_thread_timeout branch from a423f34 to b3c2c5e Compare May 30, 2024 21:58
@jeffreytan81
Copy link
Contributor Author

@jimingham , let me know any other comments/thoughts you have.

@jimingham
Copy link
Collaborator

jimingham commented Jun 6, 2024 via email

@jeffreytan81
Copy link
Contributor Author

@jimingham, hope WWDC is going well. Do you have time to review this now? Thanks.

@@ -2838,6 +2844,14 @@ void PruneThreadPlans();
return std::nullopt;
}

/// Handle thread specific async interrupt and return the original thread
/// that requested the async interrupt. It can be null if original thread
/// has exited.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should say what the description string is for.


namespace lldb_private {

// Mixin class that provides capability for ThreadPlan that supports single
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grammar:

// Mixin class that provides the capability for ThreadPlan to support single
// thread execution that resumes all threads after a timeout.

@jimingham
Copy link
Collaborator

Ack, sorry, I missed that you had updated this. My notifications from GitHub are somewhat flakey...

I didn't understand why you are checking both "IsLeafPlan" and "MischiefManaged", why does MischiefManaged get to override IsLeafPlan?

Copy link
Collaborator

@jimingham jimingham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM at this point.

@jeffreytan81 jeffreytan81 merged commit f838fa8 into llvm:main Aug 6, 2024
6 checks passed
@jimingham
Copy link
Collaborator

jimingham commented Aug 6, 2024

@jeffreytan81:

This patch almost certainly caused this asan failure:

https://green.lab.llvm.org/job/llvm.org/view/LLDB/job/lldb-cmake-sanitized/581/consoleText

Can you take a look?

@adrian-prantl
Copy link
Collaborator

@jimingham Should we revert the patch until this is fixed or are there more patches depending on this?

@jimingham
Copy link
Collaborator

This is a pretty big patch, since it's only failing on ASAN bots, I'd give @jeffreytan81 a bit to see if it's something obvious before reverting that. reverts aren't 100% free, they can make cherrypicking a hassle, so...

@jeffreytan81
Copy link
Contributor Author

@jimingham, @adrian-prantl , thanks for letting me know. Surprised that I did not get failure email.
Do you know how I can reproduce the failure?

@jimingham
Copy link
Collaborator

jimingham commented Aug 6, 2024

The console logs for the bots usually have the CMake command used to configure the tree for the test. The one I referred to above has:

'/usr/local/bin/cmake' '-G' 'Ninja' '/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake-sanitized/llvm-project/llvm' '-DCMAKE_BUILD_TYPE=Release' '-DCMAKE_EXPORT_COMPILE_COMMANDS=ON' '-DLLVM_TARGETS_TO_BUILD=X86;ARM;AArch64' '-DCMAKE_INSTALL_PREFIX=/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake-sanitized/lldb-install' '-DCMAKE_MAKE_PROGRAM=/usr/local/bin/ninja' '-DLLDB_TEST_USER_ARGS=--build-dir;/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake-sanitized/lldb-build/lldb-test-build.noindex;-t;--env;TERM=vt100' '-DLLDB_ENFORCE_STRICT_TEST_REQUIREMENTS=On' '-DLLDB_ENABLE_PYTHON=On' '-DLLDB_ENABLE_LZMA=Off' '-DLIBCXX_HARDENING_MODE=none' '-DLLVM_ENABLE_ASSERTIONS:BOOL=TRUE' '-DLLVM_ENABLE_MODULES=Off' '-DLLVM_ENABLE_PROJECTS=clang;lld;lldb' '-DLLVM_LIT_ARGS=-v --time-tests --shuffle --xunit-xml-output=/Users/ec2-user/jenkins/workspace/llvm.org/lldb-cmake-sanitized/test/results.xml -v -j 3 --timeout 1200' '-DLLVM_VERSION_PATCH=99' '-DLLVM_ENABLE_RUNTIMES=libcxx;libcxxabi;libunwind' '-DLLVM_TARGETS_TO_BUILD=X86' '-DLLVM_USE_SANITIZER=Address;Undefined'

I think the only part of that that's relevant for you is -DLLVM_USE_SANITIZER=Address;Undefined - you should be able to just add that to your CMakery, then rebuild and run the test. But if you are unlucky, you might need to build in Release mode, sometimes these failures don't happen in unoptimized builds.

@jeffreytan81
Copy link
Contributor Author

I think I understand the lifetime issue that caused the ASAN failure. I will fix it.

jeffreytan81 added a commit that referenced this pull request Aug 7, 2024
This PR fixes the ASAN failure in
#90930.

The original PR made the assumption that parent
`ThreadPlanStepOverRange`'s lifetime will always be longer than
`ThreadPlanSingleThreadTimeout` leaf plan so it passes the
`m_timeout_info` as reference to it.
From the ASAN failure, it seems that this assumption may not be true
(likely the thread stack is holding a strong reference to the leaf
plan).

This PR fixes this lifetime issue by using shared pointer instead of
passing by reference.

---------

Co-authored-by: jeffreytan81 <[email protected]>
@jeffreytan81
Copy link
Contributor Author

Should fixed by #102208.

@labath
Copy link
Collaborator

labath commented Aug 12, 2024

The new test is still hanging occasionally (e.g. https://lab.llvm.org/buildbot/#/builders/162/builds/3718/steps/6/logs/stdio ). I don't know if it's a problem with the code being tested, or the test itself, but it looks like there's still some issues here.

@jeffreytan81
Copy link
Contributor Author

@labath, unfortunately the failing log does not provide much information to troubleshoot. Is it possible to gather more information of the failure? Like is it only this lldb-x86_64-debian bot timeout or including other bots? And is it consistently failing or occasionally (race condition)? I tried to navigate the buildbot UI but couldn't make sense or find answers to these questions.

We only have CentOS9 Linux machines and have run this PR clean build multiple times but couldn't reproduce the failure/hang.

Maybe we could try to disable the tests for Debian Linux (if it only happens in Debian bot)? Any suggestion is welcome.

@labath
Copy link
Collaborator

labath commented Aug 13, 2024

It's not just this bot. I also found failures on aarch64 and arm. I don't see it on the mac bots, but I don't know how to look further into the history for those. The only way I can reproduce the problem is the same way that you can -- run the test many many times and try to catch it happening. For a relatively rare failure like this you may need to run it hundreds of times (and also put some background load on the system to make scheduling more nondeterministic) to catch it. I'm pretty sure CentOS is not immune to this bug.

@jeffreytan81
Copy link
Contributor Author

Thanks for sharing. I have written a shell script to run TestSingleThreadStepTimeout.py in a loop infinitely until failing. Unfortunately, it has been running for 127 iterations (more than 7 hours) on my CentOS9 machine, haven't seen a single failure/timeout yet. I am not confident that I can reproduce with the linux machine I have.

Assuming it is only linux platforms, If this is blocking stuff, I am thinking of disabling the tests for Linux platform until a solid reproduce can be found to investigate. Let me know if you have any other thoughts.

@labath
Copy link
Collaborator

labath commented Aug 14, 2024

This is not specific to linux. I've now figured out how to scroll the green dragon output, and I've found at least three failures on mac as well.

I don't think disabling the test everywhere is a solution. If we can't get this resolved quickly (O(days)), we should revert the patch and continue the investigation offline. I've captured the gdb packet log from one of the failing runs, maybe that can get you started?

@jeffreytan81
Copy link
Contributor Author

The GDB packet log is very useful (especially with a good run log at the top to compare).

I read with @clayborg . The log indicates that "self.thread.StepOver(lldb.eOnlyThisThread)" sent a "$vCont;c:p1a24.1a24" to continue single thread at the end but lacking a "\x03" async interrupt packet (which can be found in the good log at the top). The async interrupt should be sent by ThreadPlanSingleThreadTimeout::TimeoutThreadFunc() timer thread after timeout. Further reading the code, we found that there is a potential race that timer thread checks the m_isAlive flag before it was initialized.

Created #104195 to fix the race.

jeffreytan81 added a commit that referenced this pull request Aug 15, 2024
This PR fixes a potential race condition in
#90930.

This race can happen because the original code set `m_info->m_isAlive =
true` **after** the timer thread is created. So if there is a context
switch happens and timer thread checks `m_info->m_isAlive` before main
thread got a chance to run `m_info->m_isAlive = true`, the timer thread
may treat `ThreadPlanSingleThreadTimeout` as not alive and simply exit
resulting in async interrupt never being sent to resume all threads
(deadlock).

The PR fixes the race by initializing all states **before** worker timer
thread creates.

Co-authored-by: jeffreytan81 <[email protected]>
jeffreytan81 added a commit that referenced this pull request Aug 28, 2024
…04532)

This PR fixes another race condition in
#90930. The failure was found
by @labath with this log: https://paste.debian.net/hidden/30235a5c/:
```
dotest_wrapper.  <  15> send packet: $z0,224505,1#65
...
b-remote.async>  <  22> send packet: $vCont;s:p1dcf.1dcf#4c
intern-state     GDBRemoteClientBase::Lock::Lock sent packet: \x03
b-remote.async>  < 818> read packet: $T13thread:p1dcf.1dcf;name:a.out;threads:1dcf,1dd2;jstopinfo:5b7b226e616d65223a22612e6f7574222c22726561736f6e223a227369676e616c222c227369676e616c223a31392c22746964223a373633317d2c7b226e616d65223a22612e6f7574222c22746964223a373633347d5d;thread-pcs:0000000000224505,00007f4e4302119a;00:0000000000000000;01:0000000000000000;02:0100000000000000;03:0000000000000000;04:9084997dfc7f0000;05:a8742a0000000000;06:b084997dfc7f0000;07:6084997dfc7f0000;08:0000000000000000;09:00d7e5424e7f0000;0a:d0d9e5424e7f0000;0b:0202000000000000;0c:80cc290000000000;0d:d8cc1c434e7f0000;0e:2886997dfc7f0000;0f:0100000000000000;10:0545220000000000;11:0602000000000000;12:3300000000000000;13:0000000000000000;14:0000000000000000;15:2b00000000000000;16:80fbe5424e7f0000;17:0000000000000000;18:0000000000000000;19:0000000000000000;reason:signal;#b9
```
It shows an async interrupt "\x03" was sent immediately after `vCont;s`
single step over breakpoint at address `0x224505` (which was disabled
before vCont). And the later stop was still at the original PC
(0x224505) not moving forward.

The investigation shows the failure happens when timeout is short and
async interrupt is sent to lldb-server immediately after vCont so
ptrace() resumes and then async interrupts debuggee immediately so
debuggee does not get a chance to execute and move PC. So it enters stop
mode immediately at original PC. `ThreadPlanStepOverBreakpoint` does not
expect PC not moving and reports stop at the original place.

To fix this, the PR prevents `ThreadPlanSingleThreadTimeout` from being
created during `ThreadPlanStepOverBreakpoint` by introduces a new
`SupportsResumeOthers()` method and `ThreadPlanStepOverBreakpoint`
returns false for it. This makes sense because we should never resume
threads during step over breakpoint anyway otherwise it might cause
other threads to miss breakpoint.

---------

Co-authored-by: jeffreytan81 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants