[lldb][test] Don't call SBDebugger::Terminate if TestMultipleDebuggers times out #143732

Merged: 2 commits, Jun 13, 2025

@DavidSpickett (Collaborator) commented Jun 11, 2025

Fixes #101162

This test did this:

  • SBDebugger::Initialize
  • Spawn a bunch of threads that do:
    • SBDebugger::Create
    • some work
    • SBDebugger::Destroy
  • Wait on those threads to finish, then call SBDebugger::Terminate and exit, or
  • Reach a time limit before all the threads finish, then call SBDebugger::Terminate and exit.

The problem was that in the timeout case, calling SBDebugger::Terminate destroys data being used by threads that are still running. I suspect the test assumed those threads would be so broken that they were stuck, but when the machine is merely heavily loaded, one of them might still read that data before the whole program exits.

This means that what should have been a timeout sometimes becomes a crash, which explains why we saw both timeouts and various signals on the AArch64 Linux bot. It depends on the timing.

So I'm changing it not to call SBDebugger::Terminate in the timeout case. We will have to tweak the timeout value based on what happens on the buildbot, but at least we will know that a timeout means machine load, not an lldb bug.

Also use _exit instead of exit, to skip more cleanup that might cause a crash.
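
To make the timeout path described above concrete, here is a minimal, self-contained sketch of the pattern. It is not the code in multi-process-driver.cpp: `spawn_workers`, `wait_for_workers`, and `exit_for_result` are hypothetical names, the sleep stands in for the SBDebugger::Create / work / SBDebugger::Destroy sequence, and the place where the real driver would call SBDebugger::Terminate is only marked with a comment.

```cpp
#include <atomic>
#include <chrono>
#include <cstdlib>
#include <memory>
#include <thread>
#include <unistd.h> // _exit (POSIX)

using Counter = std::shared_ptr<std::atomic<int>>;

// Starts `total` detached workers. Each one sleeps for `work_time` (a
// stand-in for the real per-thread debug session) and then bumps the counter.
// Every worker holds its own reference to the counter, so the counter stays
// valid for as long as any worker is still running.
Counter spawn_workers(int total, std::chrono::milliseconds work_time) {
  auto finished = std::make_shared<std::atomic<int>>(0);
  for (int i = 0; i < total; ++i) {
    std::thread([finished, work_time] {
      std::this_thread::sleep_for(work_time);
      finished->fetch_add(1);
    }).detach();
  }
  return finished;
}

// Polls until `finished` reaches `total` or `timeout` elapses.
// Returns true only if every worker finished in time.
bool wait_for_workers(const std::atomic<int> &finished, int total,
                      std::chrono::milliseconds timeout) {
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (finished.load() < total) {
    if (std::chrono::steady_clock::now() >= deadline)
      return false;
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
  }
  return true;
}

// Ends the process according to the result of wait_for_workers.
[[noreturn]] void exit_for_result(bool all_finished) {
  if (all_finished) {
    // No worker is running any more, so global teardown is safe here
    // (this is where the real driver would call SBDebugger::Terminate).
    std::exit(0);
  }
  // Timeout: workers may still be using shared state. Skip global teardown
  // and leave via _exit so that no exit-time cleanup runs underneath them.
  _exit(1);
}
```

A driver's main would end with something like `exit_for_result(wait_for_workers(*finished, total, timeout))`. The counter is held through a shared_ptr so that workers still running after a timeout never touch freed memory, which is the same kind of hazard this PR avoids by skipping SBDebugger::Terminate.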

@llvmbot (Member) commented Jun 11, 2025

@llvm/pr-subscribers-lldb

Author: David Spickett (DavidSpickett)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/143732.diff

2 Files Affected:

  • (modified) lldb/test/API/api/multiple-debuggers/TestMultipleDebuggers.py (-2)
  • (modified) lldb/test/API/api/multiple-debuggers/multi-process-driver.cpp (+3-1)
diff --git a/lldb/test/API/api/multiple-debuggers/TestMultipleDebuggers.py b/lldb/test/API/api/multiple-debuggers/TestMultipleDebuggers.py
index 1fd4806cd74f4..f0a3893f53aab 100644
--- a/lldb/test/API/api/multiple-debuggers/TestMultipleDebuggers.py
+++ b/lldb/test/API/api/multiple-debuggers/TestMultipleDebuggers.py
@@ -12,8 +12,6 @@
 class TestMultipleSimultaneousDebuggers(TestBase):
     NO_DEBUG_INFO_TESTCASE = True
 
-    # Sometimes times out on Linux, see https://github.com/llvm/llvm-project/issues/101162.
-    @skipIfLinux
     @skipIfNoSBHeaders
     @skipIfWindows
     @skipIfHostIncompatibleWithTarget
diff --git a/lldb/test/API/api/multiple-debuggers/multi-process-driver.cpp b/lldb/test/API/api/multiple-debuggers/multi-process-driver.cpp
index 64728fb7c29a1..fcec9bae0ed9c 100644
--- a/lldb/test/API/api/multiple-debuggers/multi-process-driver.cpp
+++ b/lldb/test/API/api/multiple-debuggers/multi-process-driver.cpp
@@ -296,6 +296,8 @@ int main (int argc, char **argv)
                  NUMBER_OF_SIMULTANEOUS_DEBUG_SESSIONS);
     }
 
-    SBDebugger::Terminate();
+    // We do not call SBDebugger::Terminate() here because it will destroy
+    // data that might be being used by threads that are still running. Which
+    // would change the timeout into an unrelated crash.
     exit (1);
 }

@DavidSpickett (Collaborator, Author) commented

Side note: this test doesn't actually care if all the threads are successful in what they do, but it's been that way forever and I don't want to get any more involved than I have to :)

@DavidSpickett (Collaborator, Author) commented

This makes the assumption that child threads are torn down before their parent, which might be a faulty one.

I looked for a way to cancel std::threads but it seems like you have to get a native handle to it and use the specific OS APIs, which I'd rather not complicate the test with.

@labath (Collaborator) left a comment

exit() will also run some cleanup functions and could cause things to crash. You could one-up it to _exit(), but ultimately, there's no way to guarantee that misbehaving code will umm... behave in a certain way.
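
The difference between exit() and _exit() mentioned here can be observed with a small POSIX-only experiment. The sketch below is an illustration, not part of the test: `cleanup_ran_in_child` and `report_cleanup` are made-up names, and an atexit() handler stands in for whatever cleanup would otherwise run at process exit.

```cpp
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace {
int g_notify_fd = -1; // write end of the pipe, used by the handler below

// Registered with atexit() in the child; tells the parent that cleanup ran.
void report_cleanup() {
  const char byte = 'x';
  if (write(g_notify_fd, &byte, 1) != 1) {
    // Nothing useful to do on failure in this sketch.
  }
}
} // namespace

// Forks a child that registers an atexit() handler and then leaves via
// exit() or _exit(). Returns true if the handler ran in the child.
bool cleanup_ran_in_child(bool use_underscore_exit) {
  int fds[2];
  if (pipe(fds) != 0)
    std::abort();

  const pid_t pid = fork();
  if (pid < 0)
    std::abort();

  if (pid == 0) {
    // Child: keep only the write end and register the handler.
    close(fds[0]);
    g_notify_fd = fds[1];
    std::atexit(report_cleanup);
    if (use_underscore_exit)
      _exit(0);  // terminates immediately, atexit handlers are skipped
    std::exit(0); // runs atexit handlers before terminating
  }

  // Parent: one byte means the handler ran, end-of-file means it did not.
  close(fds[1]);
  char byte = 0;
  const bool ran = read(fds[0], &byte, 1) == 1;
  close(fds[0]);
  int status = 0;
  waitpid(pid, &status, 0);
  return ran;
}
```

Calling `cleanup_ran_in_child(false)` reports that the handler ran, while `cleanup_ran_in_child(true)` reports that it did not. That is the extra cleanup _exit() skips, though as noted it narrows the window rather than guaranteeing anything about misbehaving threads.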

@DavidSpickett (Collaborator, Author) commented

Then what I'm doing is narrowing the window in which crashes could happen, but not closing it.

_exit seems safer for the timeout situation. If there's a genuine bug then yes, anything could happen, but at least we'd be rewarded for looking into it when it did.

If there's still flakiness, I'll declare this impossible to get 100% right and disable it again.

I did think I could send a signal from the main thread, but other threads continue running so that doesn't work. Unless it's SIGSTOP, but you can't catch that, and I don't know what return code we'd end up with.
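
The SIGSTOP remark can be checked directly: POSIX does not allow a handler to be installed for SIGSTOP (or SIGKILL), and sigaction fails for both. A minimal sketch, with `can_install_handler` and `ignore_signal` as hypothetical helper names:

```cpp
#include <signal.h>

// No-op handler used only to probe whether installation is permitted.
extern "C" void ignore_signal(int) {}

// Tries to install a handler for `signo` and restores the previous
// disposition afterwards. Returns false if the system refuses, which POSIX
// requires for SIGKILL and SIGSTOP (sigaction fails with EINVAL).
bool can_install_handler(int signo) {
  struct sigaction action = {};
  action.sa_handler = ignore_signal;
  sigemptyset(&action.sa_mask);

  struct sigaction previous = {};
  if (sigaction(signo, &action, &previous) != 0)
    return false;

  sigaction(signo, &previous, nullptr); // put the old disposition back
  return true;
}
```

An ordinary signal such as SIGUSR1 accepts a handler, whereas SIGSTOP and SIGKILL do not, so a test driver has no way to intercept SIGSTOP and turn it into an exit code of its own choosing.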

DavidSpickett merged commit addd98f into llvm:main on Jun 13, 2025 (7 checks passed)
DavidSpickett deleted the lldb-multiple branch on June 13, 2025 08:31
@llvm-ci (Collaborator) commented Jun 13, 2025

LLVM Buildbot has detected a new failure on builder lldb-aarch64-ubuntu running on linaro-lldb-aarch64-ubuntu while building lldb at step 6 "test".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/59/builds/19295

Here is the relevant piece of the build log for the reference
Step 6 (test) failure: build (failure)
...
PASS: lldb-api :: functionalities/thread/concurrent_events/TestConcurrentCrashWithWatchpoint.py (632 of 2259)
PASS: lldb-api :: functionalities/thread/concurrent_events/TestConcurrentCrashWithSignal.py (633 of 2259)
PASS: lldb-api :: functionalities/thread/concurrent_events/TestConcurrentBreakpointsDelayedBreakpointOneWatchpoint.py (634 of 2259)
PASS: lldb-api :: functionalities/thread/concurrent_events/TestConcurrentCrashWithWatchpointBreakpointSignal.py (635 of 2259)
PASS: lldb-api :: functionalities/thread/concurrent_events/TestConcurrentDelaySignalBreak.py (636 of 2259)
PASS: lldb-api :: functionalities/thread/concurrent_events/TestConcurrentDelayWatchBreak.py (637 of 2259)
PASS: lldb-api :: functionalities/thread/concurrent_events/TestConcurrentDelaySignalWatch.py (638 of 2259)
PASS: lldb-api :: functionalities/thread/concurrent_events/TestConcurrentDelayedCrashWithBreakpointSignal.py (639 of 2259)
PASS: lldb-api :: functionalities/thread/concurrent_events/TestConcurrentDelayedCrashWithBreakpointWatchpoint.py (640 of 2259)
UNRESOLVED: lldb-api :: api/multiple-debuggers/TestMultipleDebuggers.py (641 of 2259)
******************** TEST 'lldb-api :: api/multiple-debuggers/TestMultipleDebuggers.py' FAILED ********************
Script:
--
/usr/bin/python3.10 /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env LLVM_LIBS_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./lib --env LLVM_INCLUDE_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/include --env LLVM_TOOLS_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin --arch aarch64 --build-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex --lldb-module-cache-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/lldb --compiler /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/clang --dsymutil /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/dsymutil --make /usr/bin/gmake --llvm-tools-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin --lldb-obj-root /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/tools/lldb --lldb-libs-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./lib --cmake-build-type Release /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/api/multiple-debuggers -p TestMultipleDebuggers.py
--
Exit Code: 1

Command Output (stdout):
--
lldb version 21.0.0git (https://github.com/llvm/llvm-project.git revision addd98f7a5b964a5a5860d65f327f3fc3b7e0a42)
  clang revision addd98f7a5b964a5a5860d65f327f3fc3b7e0a42
  llvm revision addd98f7a5b964a5a5860d65f327f3fc3b7e0a42
Skipping the following test categories: ['libc++', 'dsym', 'gmodules', 'debugserver', 'objc']

--
Command Output (stderr):
--
FAIL: LLDB (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/clang-aarch64) :: test_multiple_debuggers (TestMultipleDebuggers.TestMultipleSimultaneousDebuggers)
======================================================================
ERROR: test_multiple_debuggers (TestMultipleDebuggers.TestMultipleSimultaneousDebuggers)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 149, in wrapper
    return func(*args, **kwargs)
  File "/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/packages/Python/lldbsuite/test/decorators.py", line 149, in wrapper
    return func(*args, **kwargs)
  File "/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/api/multiple-debuggers/TestMultipleDebuggers.py", line 34, in test_multiple_debuggers
    subprocess.check_call([self.driver_exe, self.inferior_exe])
  File "/usr/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/api/multiple-debuggers/TestMultipleDebuggers.test_multiple_debuggers/multi-process-driver', '/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/api/multiple-debuggers/TestMultipleDebuggers.test_multiple_debuggers/testprog']' returned non-zero exit status 1.
Config=aarch64-/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/bin/clang
----------------------------------------------------------------------
Ran 1 test in 121.848s

FAILED (errors=1)

--


@DavidSpickett (Collaborator, Author) commented

Timed out on the very first build 🤣

On second thought this test is better disabled. It's more correct than it was, but it's never going to be truly stable.

06c7835
