Add a test for evicting unreachable modules from the global module cache #74894
Conversation
@llvm/pr-subscribers-lldb

Author: None (jimingham)

Changes

When you debug a binary, then change it, rebuild, and rerun in lldb without quitting lldb, the Modules in the Global Module Cache for the old binary (and its .o files, if used) become "unreachable". Nothing in lldb is holding them alive, and they have already been unlinked. lldb will properly discard them if there is no other Target referencing them.

However, this only works in simple cases at present. If you have several Targets that reference the same modules, it's pretty easy to end up stranding Modules that are no longer reachable, and if you use a sequence of SBDebuggers, unreachable modules can also get stranded. If you run a long-lived lldb process and are iteratively developing on a large code base, lldb's memory fills up with useless Modules.

This patch adds a test for the mode that currently works:

(lldb) target create foo
(lldb) run
(lldb) run

In that case, we do delete the unreachable Modules.

The next step will be to add tests for the cases where we fail to do this, then see how to safely/efficiently evict unreachable modules in those cases as well.

Full diff: https://github.com/llvm/llvm-project/pull/74894.diff

4 Files Affected:
diff --git a/lldb/test/API/python_api/global_module_cache/Makefile b/lldb/test/API/python_api/global_module_cache/Makefile
new file mode 100644
index 0000000000000..22f1051530f87
--- /dev/null
+++ b/lldb/test/API/python_api/global_module_cache/Makefile
@@ -0,0 +1 @@
+include Makefile.rules
diff --git a/lldb/test/API/python_api/global_module_cache/TestGlobalModuleCache.py b/lldb/test/API/python_api/global_module_cache/TestGlobalModuleCache.py
new file mode 100644
index 0000000000000..ff74d09a12818
--- /dev/null
+++ b/lldb/test/API/python_api/global_module_cache/TestGlobalModuleCache.py
@@ -0,0 +1,110 @@
+"""
+Test the use of the global module cache in lldb
+"""
+
+import lldb
+
+from lldbsuite.test.decorators import *
+from lldbsuite.test.lldbtest import *
+from lldbsuite.test import lldbutil
+import os
+import shutil
+from pathlib import Path
+import time
+
+
+class GlobalModuleCacheTestCase(TestBase):
+ # NO_DEBUG_INFO_TESTCASE = True
+
+ def check_counter_var(self, thread, value):
+ frame = thread.frames[0]
+ var = frame.FindVariable("counter")
+ self.assertTrue(var.GetError().Success(), "Got counter variable")
+ self.assertEqual(var.GetValueAsUnsigned(), value, "This was one-print")
+
+ def copy_to_main(self, src, dst):
+ # We are relying on the source file being newer than the .o file from
+ # a previous build, so sleep a bit here to ensure that the touch is later.
+ time.sleep(2)
+ try:
+ shutil.copy(src, dst)
+ except:
+ self.fail(f"Could not copy {src} to {dst}")
+ Path(dst).touch()
+
+ # The rerun tests indicate rerunning on Windows doesn't really work, so
+ # this one won't either.
+ @skipIfWindows
+ def test_OneTargetOneDebugger(self):
+ # Make sure that if we have one target, and we run, then
+ # change the binary and rerun, the binary (and any .o files
+ # if using dwarf in .o file debugging) get removed from the
+ # shared module cache. They are no longer reachable.
+ debug_style = self.getDebugInfo()
+
+ # Before we do anything, clear the global module cache so we don't
+ # see objects from other runs:
+ lldb.SBDebugger.MemoryPressureDetected()
+
+ # Set up the paths for our two versions of main.c:
+ main_c_path = os.path.join(self.getBuildDir(), "main.c")
+ one_print_path = os.path.join(self.getSourceDir(), "one-print.c")
+ two_print_path = os.path.join(self.getSourceDir(), "two-print.c")
+ main_filespec = lldb.SBFileSpec(main_c_path)
+
+ # First copy the one-print.c to main.c in the build folder and
+ # build our a.out from there:
+ self.copy_to_main(one_print_path, main_c_path)
+ self.build(dictionary={"C_SOURCES": main_c_path, "EXE": "a.out"})
+
+ (target, process, thread, bkpt) = lldbutil.run_to_source_breakpoint(
+ self, "return counter;", main_filespec
+ )
+
+ # Make sure we ran the version we intended here:
+ self.check_counter_var(thread, 1)
+ process.Kill()
+
+ # Now copy two-print.c over main.c, rebuild, and rerun:
+ # os.unlink(target.GetExecutable().fullpath)
+ self.copy_to_main(two_print_path, main_c_path)
+
+ self.build(dictionary={"C_SOURCES": main_c_path, "EXE": "a.out"})
+ error = lldb.SBError()
+ (_, process, thread, _) = lldbutil.run_to_breakpoint_do_run(self, target, bkpt)
+ # In two-print.c counter will be 2:
+ self.check_counter_var(thread, 2)
+
+ num_a_dot_out_entries = 1
+ # For dSYM's there will be two lines of output, one for the a.out and one
+ # for the dSYM.
+ if debug_style == "dsym":
+ num_a_dot_out_entries += 1
+
+ self.check_image_list_result(num_a_dot_out_entries, 1)
+
+ def check_image_list_result(self, num_a_dot_out, num_main_dot_o):
+ # Now look at the global module list, there should only be one a.out, and if we are
+ # doing dwarf in .o file, there should only be one .o file:
+ image_cmd_result = lldb.SBCommandReturnObject()
+ interp = self.dbg.GetCommandInterpreter()
+ interp.HandleCommand("image list -g", image_cmd_result)
+ image_list_str = image_cmd_result.GetOutput()
+ image_list = image_list_str.splitlines()
+ found_a_dot_out = 0
+ found_main_dot_o = 0
+
+ for line in image_list:
+ # FIXME: force this to be at the end of the string:
+ if "a.out" in line:
+ found_a_dot_out += 1
+ if "main.o" in line:
+ found_main_dot_o += 1
+
+ self.assertEqual(
+ num_a_dot_out, found_a_dot_out, "Got the right number of a.out's"
+ )
+ if found_main_dot_o > 0:
+ self.assertEqual(
+ num_main_dot_o, found_main_dot_o, "Got the right number of main.o's"
+ )
diff --git a/lldb/test/API/python_api/global_module_cache/one-print.c b/lldb/test/API/python_api/global_module_cache/one-print.c
new file mode 100644
index 0000000000000..5a572ca7c65fe
--- /dev/null
+++ b/lldb/test/API/python_api/global_module_cache/one-print.c
@@ -0,0 +1,8 @@
+#include <stdio.h>
+
+int
+main() {
+ int counter = 0;
+ printf("I only print one time: %d.\n", counter++);
+ return counter;
+}
diff --git a/lldb/test/API/python_api/global_module_cache/two-print.c b/lldb/test/API/python_api/global_module_cache/two-print.c
new file mode 100644
index 0000000000000..ce6cb84c5c46e
--- /dev/null
+++ b/lldb/test/API/python_api/global_module_cache/two-print.c
@@ -0,0 +1,9 @@
+#include <stdio.h>
+
+int
+main() {
+ int counter = 0;
+ printf("I print one time: %d.\n", counter++);
+ printf("I print two times: %d.\n", counter++);
+ return counter;
+}
Note, I thought about adding an SBDebugger::GetSharedModules or something, but I don't actually think that's a good thing to give external access to. Some day we should make an SBTestAPI with things that are useful for testing but don't belong in the SB API library, so we can make this sort of testing easier; but for this test, grubbing through the command output is not all that bad.
I wouldn't mind exposing such an API for the global module list as static functions (get count + get at index) on SBDebugger. One issue is that only real modules will show up here. This works fine for all modules on Darwin, since executables, dSYM files, and the .o files are each represented by an lldb_private::Module. It is not true for .dwo files, though, as they share the original module for the executable and only create a new ObjectFile, not a new Module, for each. Not a big deal though. This API would be useful for testing, and also for people who want to know what modules are currently in the global module cache taking up memory in the LLDB process. To test this, you might run your first executable, check the number of shared modules (and access them if needed), and then run:
As this used to get rid of orphaned modules in the past, and I think it still does, you could then check the shared module count and list again after creating and running the second executable to see what you find.
If I knew how to hand out SBReadOnlyModule objects, I think this would be a fine API. But I am wary of adding an API where one SBDebugger could inadvertently change another SBDebugger's environment behind its back. We do allow this by accident today: for instance, Debugger1 turns off "external symbol search", but Debugger2 has it on and loads the same module as Debugger1, forcing symbols on Debugger1. But IMO that's a bug, and I'd rather not add more ways to write that bug. Fortunately, the output of
I added a test for the case where one target has the old version and another target has the new one. In this case, I don't think (short of calling MemoryPressureDetected by hand) that we'll ever get rid of these unreachable Modules.
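For reference, a minimal sketch of what "MemoryPressureDetected by hand" plus a re-check of the shared cache looks like from Python, using only calls that already appear in this patch; the helper name and the "a.out" basename are illustrative, not part of the change:

import lldb

def count_shared_modules(debugger, basename):
    # There is no dedicated SB API for the global module list, so parse the
    # textual output of "image list -g", the same approach the test takes.
    result = lldb.SBCommandReturnObject()
    debugger.GetCommandInterpreter().HandleCommand("image list -g", result)
    return sum(basename in line for line in result.GetOutput().splitlines())

# Inside an API test, self.dbg already has a target, so the command is valid:
#   before = count_shared_modules(self.dbg, "a.out")
#   lldb.SBDebugger.MemoryPressureDetected()  # ask lldb to drop unreachable modules
#   after = count_shared_modules(self.dbg, "a.out")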
You can test this locally with the following command:

darker --check --diff -r 32d535195ec5d9b0caf03fee13f796fc8c66a79f...90119f1d3955029931939d529a9f4b3c113b8284 lldb/test/API/python_api/global_module_cache/TestGlobalModuleCache.py

View the diff from darker here.

--- TestGlobalModuleCache.py 2023-12-12 19:26:09.000000 +0000
+++ TestGlobalModuleCache.py 2023-12-12 19:28:53.039653 +0000
@@ -9,10 +9,11 @@
from lldbsuite.test import lldbutil
import os
import shutil
from pathlib import Path
import time
+
class GlobalModuleCacheTestCase(TestBase):
# NO_DEBUG_INFO_TESTCASE = True
def check_counter_var(self, thread, value):
@@ -97,10 +98,11 @@
else:
if one_target:
new_debugger = lldb.SBDebugger().Create()
self.old_debugger = self.dbg
self.dbg = new_debugger
+
def cleanupDebugger(self):
lldb.SBDebugger.Destroy(self.dbg)
self.dbg = self.old_debugger
self.old_debugger = None
@@ -128,11 +130,11 @@
lldb.SBDebugger.MemoryPressureDetected()
error_after_mpd = self.check_image_list_result(num_a_dot_out_entries, 1)
fail_msg = ""
if error != "":
fail_msg = "Error before MPD: " + error
-
+
if error_after_mpd != "":
fail_msg = fail_msg + "\nError after MPD: " + error_after_mpd
if fail_msg != "":
self.fail(fail_msg)
@@ -157,13 +159,15 @@
# FIXME: force this to be at the end of the string:
if "a.out" in line:
found_a_dot_out += 1
if "main.o" in line:
found_main_dot_o += 1
-
+
if num_a_dot_out != found_a_dot_out:
return f"Got {found_a_dot_out} number of a.out's, expected {num_a_dot_out}"
-
+
if found_main_dot_o > 0 and num_main_dot_o != found_main_dot_o:
- return f"Got {found_main_dot_o} number of main.o's, expected {num_main_dot_o}"
+ return (
+ f"Got {found_main_dot_o} number of main.o's, expected {num_main_dot_o}"
+ )
return ""
This test crashes on Arm and AArch64:
Though not consistently. I'm going to revert this until I have time to figure out why.
Nothing that test does should cause a crash. It's fine to revert for investigation, but it seems like a good test in that it is uncovering a real bug (though not the one intended).
Jim
On Dec 13, 2023, at 3:34 AM, David Spickett wrote:
This test crashes on Arm and AArch64:
******************** TEST 'lldb-api :: python_api/global_module_cache/TestGlobalModuleCache.py' FAILED ********************
Script:
--
/usr/bin/python3.8 /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/dotest.py -u CXXFLAGS -u CFLAGS --env ARCHIVER=/usr/local/bin/llvm-ar --env OBJCOPY=/usr/bin/llvm-objcopy --env LLVM_LIBS_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./lib --env LLVM_INCLUDE_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/include --env LLVM_TOOLS_DIR=/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin --arch aarch64 --build-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex --lldb-module-cache-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-lldb/lldb-api --clang-module-cache-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lldb-test-build.noindex/module-cache-clang/lldb-api --executable /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/lldb --compiler /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/clang --dsymutil /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin/dsymutil --llvm-tools-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./bin --lldb-libs-dir /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/./lib /home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/llvm-project/lldb/test/API/python_api/global_module_cache -p TestGlobalModuleCache.py
--
Exit Code: -6
Command Output (stdout):
--
lldb version 18.0.0git (https://github.com/llvm/llvm-project.git revision bb18611)
clang revision bb18611
llvm revision bb18611
--
Command Output (stderr):
--
python3.8: ../llvm-project/lldb/source/Target/ThreadPlanStack.cpp:151: lldb::ThreadPlanSP lldb_private::ThreadPlanStack::PopPlan(): Assertion `m_plans.size() > 1 && "Can't pop the base thread plan"' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
#0 0x0000ffff7f3d8d60 llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x51d7d60)
#1 0x0000ffff7f3d6d1c llvm::sys::RunSignalHandlers() (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x51d5d1c)
#2 0x0000ffff7f3d9494 SignalHandler(int) (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x51d8494)
#3 0x0000ffff85fee7dc (linux-vdso.so.1+0x7dc)
#4 0x0000ffff85e7cd78 raise /build/glibc-RIFKjK/glibc-2.31/signal/../sysdeps/unix/sysv/linux/raise.c:51:1
#5 0x0000ffff85e69aac abort /build/glibc-RIFKjK/glibc-2.31/stdlib/abort.c:81:7
#6 0x0000ffff85e76490 __assert_fail_base /build/glibc-RIFKjK/glibc-2.31/assert/assert.c:89:7
#7 0x0000ffff85e764f4 (/lib/aarch64-linux-gnu/libc.so.6+0x2d4f4)
#8 0x0000ffff7ef6f744 lldb_private::ThreadPlanStack::PopPlan() (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x4d6e744)
#9 0x0000ffff7ef482f8 lldb_private::Thread::PopPlan() (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x4d472f8)
#10 0x0000ffff7ef47ca8 lldb_private::Thread::ShouldStop(lldb_private::Event*) (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x4d46ca8)
#11 0x0000ffff7ef544c8 lldb_private::ThreadList::ShouldStop(lldb_private::Event*) (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x4d534c8)
#12 0x0000ffff7eed3c30 lldb_private::Process::ShouldBroadcastEvent(lldb_private::Event*) (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x4cd2c30)
#13 0x0000ffff7eecd220 lldb_private::Process::HandlePrivateEvent(std::shared_ptr<lldb_private::Event>&) (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x4ccc220)
#14 0x0000ffff7eed4eb4 lldb_private::Process::RunPrivateStateThread(bool) (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x4cd3eb4)
#15 0x0000ffff7eee06f0 std::_Function_handler<void* (), lldb_private::Process::StartPrivateStateThread(bool)::$_0>::_M_invoke(std::_Any_data const&) (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x4cdf6f0)
#16 0x0000ffff7edd3f80 lldb_private::HostNativeThreadBase::ThreadCreateTrampoline(void*) (/home/tcwg-buildbot/worker/lldb-aarch64-ubuntu/build/lib/python3.8/site-packages/lldb/_lldb.cpython-38-aarch64-linux-gnu.so+0x4bd2f80)
#17 0x0000ffff85e1f624 start_thread /build/glibc-RIFKjK/glibc-2.31/nptl/pthread_create.c:477:8
#18 0x0000ffff85f1a49c /build/glibc-RIFKjK/glibc-2.31/misc/../sysdeps/unix/sysv/linux/aarch64/clone.S:81:0
Though not consistently.
I'm going to revert this until I have time to figure out why.
import time

class GlobalModuleCacheTestCase(TestBase):
    # NO_DEBUG_INFO_TESTCASE = True
why is this commented out?
process.Kill()

    # Now copy two-print.c over main.c, rebuild, and rerun:
    # os.unlink(target.GetExecutable().fullpath)
Also commented out code here.
…odule cache (#74894)"

This reverts commit 35dacf2 and relands the original change with two additions so I can debug the failure on Arm/AArch64:

* Enable lldb step logging in the tests.
* Assert that the current plan is not the base plan at the spot I believe is calling PopPlan.

These will be removed and replaced with a proper fix once I see some failures on the bots; I couldn't reproduce it locally. (Also, no sign of it on the x86_64 bot.)
Failures didn't reproduce locally, so this is back in with an extra assert and logging enabled, so I can catch it if it happens again.
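For context, a rough sketch of how that step logging can be enabled from inside an API test, assuming the standard "lldb step" log channel; the log-file name is illustrative:

# Route the "lldb step" log channel to a file for the duration of the test,
# so thread-plan push/pop activity is captured alongside any failure.
log_file = self.getBuildArtifact("step.log")  # TestBase helper for a per-test path
self.runCmd(f"log enable -f {log_file} lldb step")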
I've caught it happening on Arm. We appear to be trying to pop the base plan after finishing a single step; I'm going to read through the code tomorrow to figure out why. In the meantime, I've removed all my modifications and skipped the test on Arm and AArch64 Linux.
It came from the test case I copied this one from and wasn't appropriate here. It should have been deleted, not commented out.
Jim
This test does kill off a process by destroying the Debugger that owned the process. It would be interesting to see if changing the test to explicitly kill the process, then its target, then the Debugger makes the crash go away. That should not be necessary, of course, but if it fixes the problem it will help narrow down the cause.
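A sketch of the explicit teardown ordering being suggested, assuming the second-debugger variant of the test keeps its debugger in new_debugger; the names are illustrative:

# Tear down in a deterministic order instead of letting SBDebugger.Destroy
# reap the process and target implicitly.
process.Kill()                         # 1. stop the inferior
new_debugger.DeleteTarget(target)      # 2. then drop the target
lldb.SBDebugger.Destroy(new_debugger)  # 3. finally destroy the debugger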
My extra logging resulted in this:
With the assert going off here, which is the last place I'd expect it to be with all the
Though the original assert was m_plans.size() > 1 && "Can't pop the base thread plan", not that there was exactly 1 plan on the stack. So current_plan might be the single step plan as expected, but then when we go to pop it off the stack, the stack has already been emptied/destroyed.
Trying this as a756dc4.
The problem will require a more involved fix, I think; I've written up the details in #76057.
This can happen if you don't succeed in shutting down the private state thread as part of exiting, so that it is still trying to operate on a process that has been finalized.
Not sure why the shutdown ordering seems wrong in the Linux case, however.
Jim
Note I'm also seeing some flakiness on the public macOS buildbots: https://green.lab.llvm.org/job/llvm.org/view/LLDB/job/lldb-cmake-matrix/322/execution/node/69/log/
When you debug a binary, then change it, rebuild, and rerun in lldb without quitting lldb, the Modules in the Global Module Cache for the old binary (and its .o files, if used) become "unreachable". Nothing in lldb is holding them alive, and they have already been unlinked. lldb will properly discard them if there is no other Target referencing them.
However, this only works in simple cases at present. If you have several Targets that reference the same modules, it's pretty easy to end up stranding Modules that are no longer reachable, and if you use a sequence of SBDebuggers, unreachable modules can also get stranded. If you run a long-lived lldb process and are iteratively developing on a large code base, lldb's memory fills up with useless Modules.
This patch adds a test for the mode that currently works:
(lldb) target create foo
(lldb) run
(lldb) run
In that case, we do delete the unreachable Modules.
The next step will be to add tests for the cases where we fail to do this, then see how to safely/efficiently evict unreachable modules in those cases as well.
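For illustration, the same workflow sketched minimally at the SB API level; "foo" is an illustrative executable path and error checking is omitted:

import lldb

dbg = lldb.SBDebugger.Create()
target = dbg.CreateTarget("foo")                 # (lldb) target create foo
process = target.LaunchSimple(None, None, None)  # (lldb) run
process.Kill()
# ... rebuild "foo" outside of lldb ...
process = target.LaunchSimple(None, None, None)  # (lldb) run
# The Modules for the old "foo" are now unreachable; with a single target and
# a single debugger they should have been dropped from the global module cache.
lldb.SBDebugger.Destroy(dbg)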