Skip to content

[lldb] Added an exponential algorithm for the sleep time in connect_to_debug_monitor() #118222

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Dec 2, 2024

Conversation

slydiman
Copy link
Contributor

@slydiman slydiman commented Dec 1, 2024

See #118032 for details.

@llvmbot
Copy link
Member

llvmbot commented Dec 1, 2024

@llvm/pr-subscribers-lldb

Author: Dmitry Vasilyev (slydiman)

Changes

See #118032 for details.


Full diff: https://github.com/llvm/llvm-project/pull/118222.diff

1 Files Affected:

  • (modified) lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py (+1-1)
diff --git a/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py b/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py
index 8c8e4abed0b454..67b07ff4ddd998 100644
--- a/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py
+++ b/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py
@@ -388,7 +388,7 @@ def connect_to_debug_monitor(self, attach_pid=None):
         # We're using a random port algorithm to try not to collide with other ports,
         # and retry a max # times.
         attempts = 0
-        MAX_ATTEMPTS = 20
+        MAX_ATTEMPTS = 30
 
         while attempts < MAX_ATTEMPTS:
             server = self.launch_debug_monitor(attach_pid=attach_pid)

@labath
Copy link
Collaborator

labath commented Dec 2, 2024

20 attempts is already more than what I would consider reasonable. Increasing it doesn't sound like a good idea. If the problem is lingering sockets, then I think a better approach is to slow down the rate of socket creation, not churn out more of them.

While this code tries to back off, it doesn't implement the (standard?) exponential backoff algorithm. I think that's important, as without it, it just ends up churning out more and more connections at a relative fast rate (time.sleep(random.randint(1, 5)) gives you an average of 3s, so this may even be faster than "running a test" -- which also generates a new socket. And since this happens a few times, I wouldn't be surprised if that's during the times that the random generator generates a locally-lower-than-average sequence of random numbers).

So, the thing I'd do is use a proper exponential algorithm and reduce the retry count (so that the total wait time ends up being slightly more than what it is now), and also increase the range of ports that we're picking from (I don't know how the 12000 + random.randint(0, 3999) thingy was chosen, but I think we could increase that any issues).

(Of course, a much better solution would be to either let lldb-server pick a free port (and communicate it back to the test somehow) or to reuse the socket multiplexing approach we (you) recently implemented. That may not work for all tests (we still need to test the user-provided-port scenario somehow), but 99% of the tests don't care about the connection method, o if we can solve the problem for the majority of tests, then this alleviates the pressure on the minority that does need it.)

…TS to 16.

The sleep time will be 3, 3.6, 4.3, ..., 38, 46.
The total sleep time is 262 + 0.5*MAX_CONNECT_ATTEMPTS*MAX_ATTEMPTS = 342.
Note the timeout is 600, so the rest time must be enough.
Increased the port range (12000..20000).
@slydiman
Copy link
Contributor Author

slydiman commented Dec 2, 2024

Updated the patch:
Added an exponential algorithm for the sleep time. Reduced MAX_ATTEMPTS to 16.
The sleep time will be 3, 3.6, 4.3, ..., 38, 46.
The total sleep time is 262 + 0.5 * MAX_CONNECT_ATTEMPTS * MAX_ATTEMPTS = 342.
Note the timeout is 600, so the rest time must be enough.
Increased the port range (12000..20000).

@slydiman slydiman changed the title [lldb] Increase MAX_ATTEMPTS in connect_to_debug_monitor() [lldb] Added an exponential algorithm for the sleep time in connect_to_debug_monitor() Dec 2, 2024
Copy link
Collaborator

@labath labath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's the general idea, although 16 attempts @ 262s sounds a bit too much (that's like almost three times as long as the first version with 30*3s). I worry that the tests will take too long to time out if they not able to establish a connection due to some other reason (like, imagine a bad patch causes lldb-server to never accept a connection). I'd suggest starting out with 10attempts@77 seconds. That's still higher than the initial ~60 seconds, and I really hope that should be enough since doubling the port range sort of doubles the number of attempts as well (by halving the chance of collision).

@slydiman slydiman merged commit a951e76 into llvm:main Dec 2, 2024
7 checks passed
@JDevlieghere
Copy link
Member

Thanks, I suggested something similar in #118032.

slydiman added a commit to slydiman/llvm-zorg that referenced this pull request Jan 7, 2025
API tests llvm-project/lldb/test/API/tools/lldb-server/* with the category `lldb-server` run lldb-server locally on the host with a random port.
Sometimes we got errors like
https://lab.llvm.org/buildbot/#/builders/195/builds/3145
https://lab.llvm.org/buildbot/#/builders/197/builds/312
The patch llvm/llvm-project#118222 improved the situation but did not fix it completely.
slydiman added a commit to llvm/llvm-zorg that referenced this pull request Jan 13, 2025
)

API tests llvm-project/lldb/test/API/tools/lldb-server/* with the
category `lldb-server` run lldb-server locally on the host with a random
port. Sometimes we got errors like
https://lab.llvm.org/buildbot/#/builders/195/builds/3145
https://lab.llvm.org/buildbot/#/builders/197/builds/312
The patch llvm/llvm-project#118222 improved the
situation but did not fix it completely.
@slydiman slydiman deleted the connect_to_debug_monitor_MAX_ATTEMPTS branch April 18, 2025 15:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants