Cp/fix flakey test concurrent tests #8274


Conversation

jasonmolenda

No description provided.

The concurrent tests all do a pthread_join at the end, and
concurrent_base.py stops after that pthread_join and sanity checks that
only one thread is running. On macOS, after pthread_join() has completed,
there can be an extra thread still running that is asynchronously
completing the details of that thread's termination; this causes
testsuite failures. When this happens, we see that the second thread is in:

```
frame #0: 0x0000000180ce7700 libsystem_kernel.dylib`__ulock_wake + 8
frame #1: 0x0000000180d25ad4 libsystem_pthread.dylib`_pthread_joiner_wake + 52
frame #2: 0x0000000180d23c18 libsystem_pthread.dylib`_pthread_terminate + 384
frame #3: 0x0000000180d23a98 libsystem_pthread.dylib`_pthread_terminate_invoke + 92
frame #4: 0x0000000180d26740 libsystem_pthread.dylib`_pthread_exit + 112
frame #5: 0x0000000180d26040 libsystem_pthread.dylib`_pthread_start + 148
```

None of the functions from the test file are present on this thread.

In this patch, instead of counting the number of threads, I iterate over
the threads looking for functions from our test file (by name) and only
count threads that have at least one of them.
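
For illustration, here is a minimal sketch of that counting logic using
lldb's Python SB API; the real change lives in the test harness
(concurrent_base.py), and the function names passed in below are hypothetical.

```python
# Sketch only -- not the actual concurrent_base.py change.
import lldb

def count_threads_in_test_code(process, test_function_names):
    """Count threads that have at least one frame in one of the test's functions."""
    count = 0
    for thread in process:       # lldb.SBProcess iterates over its SBThreads
        for frame in thread:     # lldb.SBThread iterates over its SBFrames
            name = frame.GetFunctionName()
            if name and name in test_function_names:
                count += 1
                break            # one matching frame is enough for this thread
    return count

# Hypothetical use in the test's sanity check:
#   running = count_threads_in_test_code(process, {"main", "breakpoint_func"})
#   assert running == 1
```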

It's a lower-frequency failure than the Darwin kernel bug that causes an
extra instruction-step mach exception when hardware breakpoints/watchpoints
are used, but once I fixed that, this came up as the next most common
failure for these tests.

rdar://110555062
(cherry picked from commit dbc40b3)
On arm64 machines, when there is a hardware breakpoint or watchpoint
set, and lldb has instruction-stepped a thread, and then done a
Process::Resume, we will sometimes receive an extra "instruction step
completed" mach exception and the pc has not advanced. From a user's
perspective, they hit Continue and lldb stops again at the same spot.
From the testsuite's perspective, this has been a constant source of
testsuite failures for any test using hardware watchpoints and breakpoints;
the arm64 CI bots seem especially good at hitting this issue.

Jim and I have been slowly looking at this for a few months now, and
finally I decided to try to detect this situation in lldb and silently
resume the process again when it happens.

We were already detecting this "got an insn-step finished mach exception
but this thread was not instruction stepping" combination in
StopInfoMachException, where we take the mach exception and create a
StopInfo object for it. We had a lot of logging there that we used to
understand the failure as it was hit on the bots in assert builds.

This patch adds a new case to `Thread::GetPrivateStopInfo()` to call the
StopInfo's (new) `IsContinueInterrupted()` method. In
StopInfoMachException, where we previously had logging for assert
builds, I now note it in an ivar, and when
`Thread::GetPrivateStopInfo()` asks if this has happened, we check for the
full combination of events under which this comes up: we have a hardware
breakpoint or watchpoint, we were not instruction-stepping this thread but
got an insn-step mach exception, and the pc is the same as the previous
stop's pc. When all of those hold, `Thread::GetPrivateStopInfo()` returns no
StopInfo -- indicating that this thread would like to resume execution.

The `Thread` object has two StackFrameLists, `m_curr_frames_sp` and
`m_prev_frames_sp`. When a thread resumes execution, we move
`m_curr_frames_sp` in to `m_prev_frames_sp` and when it stops executing,
we use `m_prev_frames_sp` to seed the new `m_curr_frames_sp` if most of
the stack is the same as before.

In this same location, I now save the Thread's RegisterContext::GetPC
into an ivar, `m_prev_framezero_pc`. StopInfoMachException needs this
information to check all of the conditions I outlined above for
`IsContinueInterrupted`.
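
As a self-contained model of that flow (not the lldb source -- the real
logic is C++ in `Thread::GetPrivateStopInfo()` and StopInfoMachException,
and the names below are illustrative):

```python
# Model of the decision only; member and method names are illustrative.
class ThreadModel:
    def __init__(self):
        self.prev_framezero_pc = None   # plays the role of m_prev_framezero_pc

    def will_resume(self, frame_zero_pc):
        # Analogous to the spot where m_curr_frames_sp moves into
        # m_prev_frames_sp: remember frame 0's pc before the thread runs.
        self.prev_framezero_pc = frame_zero_pc

    def is_continue_interrupted(self, stop_pc, got_insn_step_exception,
                                was_stepping, has_hw_break_or_watchpoint):
        # The spurious-stop signature: a hardware breakpoint/watchpoint exists,
        # we got an "instruction step completed" mach exception without having
        # asked this thread to step, and the pc did not move since the resume.
        return (has_hw_break_or_watchpoint
                and got_insn_step_exception
                and not was_stepping
                and stop_pc == self.prev_framezero_pc)
```

When that check is true, the thread reports no StopInfo and the process is
silently resumed, as described above.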

This has passed exhaustive testing and we do not have any testsuite
failures for hardware watchpoints and breakpoints due to this kernel bug
with the patch in place. In focusing on these tests for thousands of
runs, I have found two other uncommon race conditions for the
TestConcurrent* tests on arm64. TestConcurrentManyBreakpoints.py (which
uses no hardware watchpoint/breakpoints) will sometimes only have 99
breakpoints when it expects 100, and any of the concurrent tests using
the shared harness (I've seen it in
TestConcurrentWatchBreakDelay.py,
TestConcurrentTwoBreakpointsOneSignal.py,
TestConcurrentSignalDelayWatch.py) can fail when the test harness checks
that there is only one thread still running at the end, and it finds two
-- one of them under pthread_exit / pthread_terminate. Both of these
failures happen on github main both with and without my changes; they are
unrelated race conditions in these tests, and I'm sure I'll be looking into
them at some point if they hit the CI bots with any frequency. On my
computer, each happens roughly 0.3-0.5% of the time, but the CI bots do
have different timing.

(cherry picked from commit aab48c9)

This is the next in my series of "fix the racy tests that fail on
greendragon" patches, addressing the failure of TestConcurrentManyBreakpoints.py
where we set a breakpoint in a function that 100 threads execute, and we
check that we hit the breakpoint 100 times. But sometimes it is only hit
99 times, and the test fails.

When we hit a software breakpoint, the pc value for the thread is the
address of the breakpoint instruction - as if it had not been hit yet.
And because a user might ADD a breakpoint for the current pc from the
command line, when we go to resume execution, any thread that is sitting
at a breakpoint site will be silently advanced past the breakpoint
instruction (disable bp, instruction step that thread, re-enable bp)
before resuming -- whether that thread has hit its breakpoint or not.
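
A rough model of that pre-resume step (not lldb's implementation; the names
here are made up):

```python
# Model only: decide, per thread, what lldb-style logic does before a resume.
def plan_resume(thread_pcs, breakpoint_sites):
    """thread_pcs: {tid: pc}; breakpoint_sites: set of addresses with traps inserted."""
    actions = {}
    for tid, pc in thread_pcs.items():
        if pc in breakpoint_sites:
            # Applied to every thread sitting on a breakpoint site, whether or
            # not it has actually executed the trap yet -- which is how a
            # not-yet-hit breakpoint can be silently stepped over and missed.
            actions[tid] = "disable trap, instruction-step, re-enable trap"
        else:
            actions[tid] = "just resume"
    return actions
```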

What this test is exposing is that there is another corner case, a
thread that is sitting at a breakpoint site but has not yet executed the
breakpoint instruction. The thread will have no stop reason, no mach
exception, so it will not be recorded as having hit the breakpoint
(because it hasn't yet). But when we resume execution, because it is
sitting at a breakpoint site, we advance past it and miss the breakpoint
hit.

In 2016 Abhishek Aggarwal handled a similar issue with a patch in
`ProcessGDBRemote::SetThreadStopInfo()`, adding a breakpoint StopInfo
for a thread sitting at a breakpoint site that has no stop reason.
Abhishek's code was not reached on Darwin, though, because debugserver's
`jThreadsInfo` responds with `"reason":"none"` for a thread with no stop
reason, while `SetThreadStopInfo()` expected an empty reason here.
The first part of my patch is to clear the `reason` if it is `"none"` so
we flow through the code correctly.
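
A minimal sketch of that normalization (the real change is C++ in
`ProcessGDBRemote::SetThreadStopInfo()`; the dictionary shape here is
illustrative):

```python
# Sketch only: treat debugserver's "reason":"none" like an empty/missing reason.
def normalize_stop_reason(thread_dict):
    if thread_dict.get("reason") == "none":
        thread_dict["reason"] = ""
    return thread_dict

# With the reason cleared, a thread at a breakpoint site that reports no stop
# reason can flow into the existing code that synthesizes a breakpoint StopInfo.
```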

On Darwin, though, our stop reply packet (Txx...) includes the
`threads`, `thread-pcs`, and `jstopinfo` keys, which give us the tids
for all current threads, the pc values for those threads, and
`jstopinfo` has a JSON dictionary with the mach exceptions for all
threads that have a mach exception. In
`ProcessGDBRemote::CalculateThreadStopInfo()` we set the StopInfo for
each thread for a private stop and if we have `jstopinfo` it is the
source of all the StopInfos. I have to add the same logic here, to give
the thread a breakpoint StopInfo even though it hasn't executed the
breakpoint yet. In this case we are very early in thread construction and
I only have the information in the Txx stop reply packet -- tids, pcs, and
jstopinfo -- so I can't use the normal general mechanism of going through
the RegisterContext to get the pc; it's a bit different.
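
Sketching that idea as a small model (not lldb source; the keys and names
are illustrative): given only the tids, per-thread pcs, and jstopinfo from
the stop reply, a thread with no mach exception whose pc is on a breakpoint
site still gets a breakpoint stop reason.

```python
# Model only -- the real logic is C++ in ProcessGDBRemote::CalculateThreadStopInfo().
def stop_reason_for_thread(tid, thread_pcs, jstopinfo_by_tid, breakpoint_sites):
    if tid in jstopinfo_by_tid:
        return jstopinfo_by_tid[tid]       # the thread has a real mach exception
    pc = thread_pcs[tid]
    if pc in breakpoint_sites:
        # The thread sits at a breakpoint it has not executed yet; credit it
        # with a breakpoint hit so the pre-resume step-over does not lose it.
        return {"reason": "breakpoint", "address": pc}
    return None                            # genuinely no stop reason
```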

If I hack debugserver to not issue `jstopinfo`,
`CalculateThreadStopInfo` will fall back to sending `qThreadStopInfo`
for each thread and going through
`ProcessGDBRemote::SetThreadStopInfo()` to set the stop infos (and with
the `reason:none` fix, use Abhishek's code).

rdar://110549165
(cherry picked from commit 87fadb3)
@jasonmolenda
Author

@swift-ci test

@jasonmolenda jasonmolenda merged commit 7dd93d0 into swiftlang:stable/20230725 Feb 24, 2024
@jasonmolenda jasonmolenda deleted the cp/fix-flakey-TestConcurrent-tests branch February 24, 2024 07:18