Skip to content

[lldb] Fix broken pipe error #127100

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

sga-sc
Copy link
Contributor

@sga-sc sga-sc commented Feb 13, 2025

During LLDB testing on slow machines with the remote-linux platform all tests from llgs category fail with python exception BrokenPipeError. The main reason of these failures is slow start of lldb-server in gdbserver mode. Due to this desired gdbserver socket does not have time to open by the time the Python script tries to establish a connection.

List of failed tests:

TestAppleSimulatorOSType.py
TestGdbRemoteAttach.py
TestGdbRemoteAuxvSupport.py
TestGdbRemoteCompletion.py
TestGdbRemoteExitCode.py
TestGdbRemoteExpeditedRegisters.py
TestGdbRemoteHostInfo.py
TestGdbRemoteKill.py
TestGdbRemoteLaunch.py
TestGdbRemoteModuleInfo.py
TestGdbRemotePlatformFile.py
TestGdbRemoteProcessInfo.py
TestGdbRemoteRegisterState.py
TestGdbRemoteSaveCore.py
TestGdbRemoteSingleStep.py
TestGdbRemoteThreadsInStopReply.py
TestGdbRemote_qThreadStopInfo.py
TestGdbRemote_vCont.py
TestLldbGdbServer.py
TestNonStop.py
TestPtyServer.py
TestGdbRemoteAttachWait.py
TestGdbRemoteConnection.py
TestStubSetSID.py
TestGdbRemoteAbort.py
TestGdbRemoteSegFault.py
TestGdbRemoteLibrariesSvr4Support.py
TestGdbRemoteMemoryAllocation.py
TestGdbRemoteMemoryTagging.py
TestGdbRemoteGPacket.py
TestGdbRemoteTargetXmlPacket.py
TestGdbRemote_QPassSignals.py
TestGdbRemoteThreadName.py
TestPartialResume.py
TestSignal.py

This patch implements an additional check for the opened socket on lldb-server side and fixes this error.

@sga-sc sga-sc requested a review from JDevlieghere as a code owner February 13, 2025 17:54
@llvmbot llvmbot added the lldb label Feb 13, 2025
@llvmbot
Copy link
Member

llvmbot commented Feb 13, 2025

@llvm/pr-subscribers-lldb

Author: Georgiy Samoylov (sga-sc)

Changes

During LLDB testing on slow machines with the remote-linux platform all tests from llgs category fail with python exception BrokenPipeError. The main reason of these failures is slow start of lldb-server in gdbserver mode. Due to this desired gdbserver socket does not have time to open by the time the Python script tries to establish a connection.

List of failed tests:

TestAppleSimulatorOSType.py
TestGdbRemoteAttach.py
TestGdbRemoteAuxvSupport.py
TestGdbRemoteCompletion.py
TestGdbRemoteExitCode.py
TestGdbRemoteExpeditedRegisters.py
TestGdbRemoteHostInfo.py
TestGdbRemoteKill.py
TestGdbRemoteLaunch.py
TestGdbRemoteModuleInfo.py
TestGdbRemotePlatformFile.py
TestGdbRemoteProcessInfo.py
TestGdbRemoteRegisterState.py
TestGdbRemoteSaveCore.py
TestGdbRemoteSingleStep.py
TestGdbRemoteThreadsInStopReply.py
TestGdbRemote_qThreadStopInfo.py
TestGdbRemote_vCont.py
TestLldbGdbServer.py
TestNonStop.py
TestPtyServer.py
TestGdbRemoteAttachWait.py
TestGdbRemoteConnection.py
TestStubSetSID.py
TestGdbRemoteAbort.py
TestGdbRemoteSegFault.py
TestGdbRemoteLibrariesSvr4Support.py
TestGdbRemoteMemoryAllocation.py
TestGdbRemoteMemoryTagging.py
TestGdbRemoteGPacket.py
TestGdbRemoteTargetXmlPacket.py
TestGdbRemote_QPassSignals.py
TestGdbRemoteThreadName.py
TestPartialResume.py
TestSignal.py

This patch implements an additional check for the opened socket on lldb-server side and fixes this error.


Full diff: https://github.com/llvm/llvm-project/pull/127100.diff

1 Files Affected:

  • (modified) lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py (+30-12)
diff --git a/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py b/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py
index cbe430c92fa7f..8ee41563da59d 100644
--- a/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py
+++ b/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py
@@ -343,6 +343,22 @@ def get_target_byte_order(self):
         target = self.dbg.CreateTarget(inferior_exe_path)
         return target.GetByteOrder()
 
+    def is_port_opened(self):
+        connect_port = self.port
+    
+        err, retcode, cmd_output = self.run_platform_command(f"netstat -ltn | grep {connect_port} | grep LISTEN")
+        
+        self.assertTrue(
+            err.Success(),
+            "Failed to get opened tcp sockets: %s, retcode: %d"
+            % (err.GetCString(), retcode),
+        )
+        
+        if retcode == 0:
+            return True
+        else:
+            return False
+
     def launch_debug_monitor(self, attach_pid=None, logfile=None):
         if self.reverse_connect:
             family, type, proto, _, addr = socket.getaddrinfo(
@@ -397,21 +413,23 @@ def connect_to_debug_monitor(self, attach_pid=None):
             # Schedule debug monitor to be shut down during teardown.
             logger = self.logger
 
-            connect_attemps = 0
+            connect_attempts = 0
             MAX_CONNECT_ATTEMPTS = 10
 
-            while connect_attemps < MAX_CONNECT_ATTEMPTS:
-                # Create a socket to talk to the server
-                try:
-                    logger.info("Connect attempt %d", connect_attemps + 1)
-                    self.sock = self.create_socket()
-                    self._server = Server(self.sock, server)
-                    return server
-                except _ConnectionRefused as serr:
-                    # Ignore, and try again.
-                    pass
+            while connect_attempts < MAX_CONNECT_ATTEMPTS:
+                if self.is_port_opened():
+                    # Create a socket to talk to the server
+                    try:
+                        logger.info("Connect attempt %d", connect_attempts + 1)
+                        self.sock = self.create_socket()
+                        self._server = Server(self.sock, server)
+                        return server
+                    except _ConnectionRefused as serr:
+                        # Ignore, and try again.
+                        pass
+                        
                 time.sleep(0.5)
-                connect_attemps += 1
+                connect_attempts += 1
 
             # We should close the server here to be safe.
             server.terminate()

Copy link

github-actions bot commented Feb 13, 2025

✅ With the latest revision this PR passed the Python code formatter.

def is_port_opened(self):
connect_port = self.port

err, retcode, cmd_output = self.run_platform_command(f"netstat -ltn | grep {connect_port} | grep LISTEN")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to do this from Python? This adds a hidden dependency on netstat, which isn't available on all supported platforms (i.e. Windows). Similarly, don't pipe things to grep, do the filtering in Python.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem here is that the command executes on the remote machine, so they only way to access it is via the SBPlatform API. (But I also think all of this shouldn't be necessary)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Jonas about dependency on netstat. But llgs tests themselves are very linux-oriented:

I also agree that filtering should be done using Python

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That may be true, but I don't think its a reason to make things worse, particularly if we don't understand why does this work. I'll also note that the code you point to in a block of code that does not execute on darwin (the other main user of these tests) while this code does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this function

@mgorny
Copy link
Member

mgorny commented Feb 13, 2025

I'm thoroughly confused by this. The idea is that if the port is not yet open, then Server should fail to connect, and the test should retry, correct? If you need to explicitly check if the port is already open, then it sounds like the code doesn't work correctly.

@labath
Copy link
Collaborator

labath commented Feb 14, 2025

I agree with Jonas and Michał. I think we should take a step back. Can you explain where exactly does this BrokenPipeError come from (an exception backtrace for example)?

The code here assumes that we get a "connection refused" error if the port wasn't open. Maybe we need to catch "broken pipe" as well (but first I want to understand how does it get triggered).

And if that isn't enough, we can increase the sleep time or change it into an exponential backoff thingy.

# Create a socket to talk to the server
try:
logger.info("Connect attempt %d", connect_attemps + 1)
self.sock = self.create_socket()
Copy link
Contributor Author

@sga-sc sga-sc Feb 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python somehow manages to create a socket when the server-side port is not yet open. Because of this, the first time we try to read data from the socket the code crashes with an exception. For example TestGdbRemoteHostInfo.py:

4: test_qHostInfo_returns_at_least_one_key_val_pair_llgs (TestGdbRemoteHostInfo.TestGdbRemoteHostInfo.test_qHostInfo_returns_at_least_one_key_val_pair_llgs) ... ERROR
FAIL: LLDB (/home/jorik/work/llvm-project/build/Release/bin/clang-rv64gc) :: test_qHostInfo_returns_at_least_one_key_val_pair_llgs (TestGdbRemoteHostInfo.TestGdbRemoteHostInfo.test_qHostInfo_returns_at_least_one_key_val_pair_llgs)
2025-02-14 12:04:31,718 WARNING  failed to send kill packet to debug monitor: <class 'BrokenPipeError'>; ignoring

======================================================================
ERROR: test_qHostInfo_returns_at_least_one_key_val_pair_llgs (TestGdbRemoteHostInfo.TestGdbRemoteHostInfo.test_qHostInfo_returns_at_least_one_key_val_pair_llgs)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jorik/work/llvm-project/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py", line 48, in test_method
    return attrvalue(self)
           ^^^^^^^^^^^^^^^
  File "/home/jorik/work/llvm-project/lldb/test/API/tools/lldb-server/TestGdbRemoteHostInfo.py", line 122, in test_qHostInfo_returns_at_least_one_key_val_pair
    self.get_qHostInfo_response()
  File "/home/jorik/work/llvm-project/lldb/test/API/tools/lldb-server/TestGdbRemoteHostInfo.py", line 90, in get_qHostInfo_response
    self.do_handshake()
  File "/home/jorik/work/llvm-project/lldb/packages/Python/lldbsuite/test/tools/lldb-server/gdbremote_testcase.py", line 546, in do_handshake
    self.assertEqual(server.get_normal_packet(), b"+")
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jorik/work/llvm-project/lldb/packages/Python/lldbsuite/test/tools/lldb-server/lldbgdbserverutils.py", line 943, in get_normal_packet
    frame = self.get_raw_normal_packet()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jorik/work/llvm-project/lldb/packages/Python/lldbsuite/test/tools/lldb-server/lldbgdbserverutils.py", line 932, in get_raw_normal_packet
    return self._read(self._normal_queue)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jorik/work/llvm-project/lldb/packages/Python/lldbsuite/test/tools/lldb-server/lldbgdbserverutils.py", line 885, in _read
    new_bytes = self._sock.recv(4096)
                ^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting, but doesn't quite explain how could the connection be established without the other side being present. We may need to dig deeper.

One way I can see this happening is if there is some kind of a port forwarder sitting between the test and the server. Are you sure you don't have one of those?

We already have a hackaround for that, but it's only enabled on android. Could you check what happens if you remove the android check (and maybe increase the timeout in the sleep call)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@labath Thank you for your advice! After removing check for android platform tests passed. I execute tests on qemu-system-riscv64 and it's port forwarding behaves probably like android's (judging by passed tests). What should I do with this check for android?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. Glad we could sort that out. I think we can remove that android check (and reword the comment to be more generic). I think this is a general property/design bug of port forwarders (they only attempt the outgoing connection after they receive the incoming one, at which point it is too late to abort, so the only thing they really can do is drop the connection). The only reason that check is android-specific is because we didn't need it anywhere else (yet).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed.

@sga-sc sga-sc force-pushed the users/sga-sc/fix_broken_pipe_error_lldb branch from 8844cd6 to 9aab567 Compare February 14, 2025 12:53
Copy link
Collaborator

@labath labath left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thanks. (But be sure to wait for Jonas since he clicked the "request changes" button)

Copy link
Member

@JDevlieghere JDevlieghere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@sga-sc
Copy link
Contributor Author

sga-sc commented Feb 14, 2025

Thank you for a review, guys, can you merge it, please? I don't have merge rights

@JDevlieghere JDevlieghere merged commit 1042bd7 into llvm:main Feb 14, 2025
7 checks passed
joaosaffran pushed a commit to joaosaffran/llvm-project that referenced this pull request Feb 14, 2025
During LLDB testing on slow machines with the remote-linux platform all
tests from llgs category fail with python exception `BrokenPipeError`.
The main reason of these failures is slow start of lldb-server in
gdbserver mode. Due to this desired gdbserver socket does not have time
to open by the time the Python script tries to establish a connection.

List of failed tests:

```
TestAppleSimulatorOSType.py
TestGdbRemoteAttach.py
TestGdbRemoteAuxvSupport.py
TestGdbRemoteCompletion.py
TestGdbRemoteExitCode.py
TestGdbRemoteExpeditedRegisters.py
TestGdbRemoteHostInfo.py
TestGdbRemoteKill.py
TestGdbRemoteLaunch.py
TestGdbRemoteModuleInfo.py
TestGdbRemotePlatformFile.py
TestGdbRemoteProcessInfo.py
TestGdbRemoteRegisterState.py
TestGdbRemoteSaveCore.py
TestGdbRemoteSingleStep.py
TestGdbRemoteThreadsInStopReply.py
TestGdbRemote_qThreadStopInfo.py
TestGdbRemote_vCont.py
TestLldbGdbServer.py
TestNonStop.py
TestPtyServer.py
TestGdbRemoteAttachWait.py
TestGdbRemoteConnection.py
TestStubSetSID.py
TestGdbRemoteAbort.py
TestGdbRemoteSegFault.py
TestGdbRemoteLibrariesSvr4Support.py
TestGdbRemoteMemoryAllocation.py
TestGdbRemoteMemoryTagging.py
TestGdbRemoteGPacket.py
TestGdbRemoteTargetXmlPacket.py
TestGdbRemote_QPassSignals.py
TestGdbRemoteThreadName.py
TestPartialResume.py
TestSignal.py
```

This patch implements an additional check for the opened socket on
lldb-server side and fixes this error.
sivan-shani pushed a commit to sivan-shani/llvm-project that referenced this pull request Feb 24, 2025
During LLDB testing on slow machines with the remote-linux platform all
tests from llgs category fail with python exception `BrokenPipeError`.
The main reason of these failures is slow start of lldb-server in
gdbserver mode. Due to this desired gdbserver socket does not have time
to open by the time the Python script tries to establish a connection.

List of failed tests:

```
TestAppleSimulatorOSType.py
TestGdbRemoteAttach.py
TestGdbRemoteAuxvSupport.py
TestGdbRemoteCompletion.py
TestGdbRemoteExitCode.py
TestGdbRemoteExpeditedRegisters.py
TestGdbRemoteHostInfo.py
TestGdbRemoteKill.py
TestGdbRemoteLaunch.py
TestGdbRemoteModuleInfo.py
TestGdbRemotePlatformFile.py
TestGdbRemoteProcessInfo.py
TestGdbRemoteRegisterState.py
TestGdbRemoteSaveCore.py
TestGdbRemoteSingleStep.py
TestGdbRemoteThreadsInStopReply.py
TestGdbRemote_qThreadStopInfo.py
TestGdbRemote_vCont.py
TestLldbGdbServer.py
TestNonStop.py
TestPtyServer.py
TestGdbRemoteAttachWait.py
TestGdbRemoteConnection.py
TestStubSetSID.py
TestGdbRemoteAbort.py
TestGdbRemoteSegFault.py
TestGdbRemoteLibrariesSvr4Support.py
TestGdbRemoteMemoryAllocation.py
TestGdbRemoteMemoryTagging.py
TestGdbRemoteGPacket.py
TestGdbRemoteTargetXmlPacket.py
TestGdbRemote_QPassSignals.py
TestGdbRemoteThreadName.py
TestPartialResume.py
TestSignal.py
```

This patch implements an additional check for the opened socket on
lldb-server side and fixes this error.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants