scp_send silently fails to send all data for some targets (see bug #226) #337

Closed
wynnw opened this issue Feb 16, 2022 · 8 comments

@wynnw

wynnw commented Feb 16, 2022

Describe the bug
Same thing as the closed bug #226 - scp operations are not transferring the whole file at times. This bug was closed as can't reproduce. I realize this is really hard to deal with as this appears to be a timing issue with gevent greenlets and non-blocking I/O. I can't really debug it as just using a debugger introduces more time and then the problem never manifests.

Expected behavior
Destination files match source files.

Actual behaviour
At least one target server's destination file is missing a certain number of bytes at the end.
For a file of 102912000 bytes I typically have one or more servers that only have 102891520 bytes (a difference of 20480 bytes), and that smaller number is very consistent.

Additional information
I've run into this a bunch today as I've begun using this library. It's hard to reproduce every time, but I've got a path where sending 100 MB files to two hosts leaves the last few bytes missing on just one of them about 50% of the time. Still investigating, but this is a real issue...

If I add a sleep(2) at the end of the data loop in _scp_send I never have the problem, so that's interesting.
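
For illustration, a minimal sketch of the shape of that workaround, assuming a simplified chunked write loop rather than parallel-ssh's actual _scp_send internals (the channel object, chunk size and helper name here are hypothetical):

    # Hypothetical, simplified SCP-style write loop; not the real
    # parallel-ssh _scp_send implementation.
    from gevent import sleep

    def send_file(channel, local_path, chunk_size=32768):
        with open(local_path, "rb") as fh:
            while True:
                data = fh.read(chunk_size)
                if not data:
                    break
                # the real client retries non-blocking writes (EAGAIN) until
                # the whole chunk has been written to the channel
                channel.write(data)
        # the workaround: yield to gevent for long enough that the remote
        # side can drain the channel before it gets closed
        sleep(2)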

This is against openssh-server 7.4p1-21 running on amazonlinux2, in a virtualenv using Python 3.10.1 with:

parallel-ssh==2.8.0
ssh-python==0.9.0
ssh2-python==0.27.0

The code I'm running is just:

    def put(self, local, remote=None, preserve_mode=True):  # pylint: disable=arguments-differ
        # use a parallel-ssh group to do this
        import os
        from pssh.clients import ParallelSSHClient
        from gevent import joinall
        from logging import DEBUG
        from pssh.utils import logger, enable_logger

        enable_logger(logger, DEBUG)

        if not remote:
            remote = "/tmp/"
        if not local:
            raise ValueError("You must specify a filepath")
        if local[0] != "/":
            local = os.path.join(os.getcwd(), local)
        if not os.path.exists(local):
            raise ValueError(f"The specified filepath: {local} does not exist")

        hosts = []
        for cnxn in self:
            hosts.append(cnxn.host)
        pclient = ParallelSSHClient(hosts)
        ops = pclient.scp_send(local, remote)
        joinall(ops, raise_error=False)

Using a file that is 99 MB and sending to two servers. This is code integrated with some fabric2 code.
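
As a sanity check, the mismatch can be confirmed by comparing checksums after the transfer. A minimal sketch, assuming md5sum is available on the targets and using placeholder hosts and paths:

    import hashlib
    from gevent import joinall
    from pssh.clients import ParallelSSHClient

    LOCAL = "/path/to/local/file"   # placeholder paths
    REMOTE = "/tmp/file"
    HOSTS = ["host1", "host2"]

    with open(LOCAL, "rb") as fh:
        local_md5 = hashlib.md5(fh.read()).hexdigest()

    client = ParallelSSHClient(HOSTS)
    joinall(client.scp_send(LOCAL, REMOTE), raise_error=True)

    # compare every remote checksum against the local one
    output = client.run_command(f"md5sum {REMOTE}")
    client.join(output)
    for host_out in output:
        remote_md5 = list(host_out.stdout)[0].split()[0]
        print(host_out.host, "OK" if remote_md5 == local_md5 else "MISMATCH")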

In further testing, I'm able to reproduce the problem with a 10561-byte text file and 4 servers, and quite often one or two of the servers gets a 0-byte file. If I add the gevent sleep call at the end of the _scp_send data/eagain_write loop, like sleep(2), I never see the problem. Seems like a channel can get closed before all the data is flushed out?

This is an awesome library, you've done great work. I can't really use it though until I get this problem figured out as it happens at least 50% of the time.

@wynnw
Author

wynnw commented Feb 16, 2022

As a side note, I haven't been able to reproduce these errors with the asyncssh 2.9.0 project that uses asyncio instead of gevent.

@JonasToth

I have seen a similar issue when uploading a tar file to many devices.
For some of them the files were consistently corrupted, and all bad files resulted in the same hash value.

@pkittenis
Member

Hi there,

Thanks for the interest and report. Since there have been multiple reports on this issue, I do believe there is one. Having gone through the code again, can see where there is potential for a race condition which sounds like the behaviour here. The difference in the written vs total bytes is exactly 2k as well, which is the internal buffer size of libssh2. That can't be an accident.

I am curious if you see the same behaviour with copy_file, the SFTP copy. I'd expect not. And you cannot have used SCP with asyncssh since it does not support it, it uses SFTP.

Another question, are you able to reproduce this against a local SSH server or only remote? I still need to be able to reproduce in a test to confirm a fix works, that would help in doing so. The existing tests do not use such a large file size, will try again with 100MB.
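
For reference, the SFTP-based copy mentioned above only differs in the method called; a minimal sketch with placeholder hosts and paths:

    from gevent import joinall
    from pssh.clients import ParallelSSHClient

    client = ParallelSSHClient(["host1", "host2"])   # placeholder hosts
    # copy_file transfers over SFTP rather than SCP
    greenlets = client.copy_file("/path/to/local/file", "/tmp/file")
    joinall(greenlets, raise_error=True)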

@wynnw
Author

wynnw commented Mar 1, 2022

I can't use sftp with my systems; not sure why, and I haven't deep-dived on that. So I am using scp (I was before with paramiko), and now I'm using scp with asyncssh as it does support it.
https://asyncssh.readthedocs.io/en/latest/index.html#scp-client
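
For comparison, a minimal sketch of the asyncssh SCP path referred to above (host and paths are placeholders):

    import asyncio
    import asyncssh

    async def upload():
        # asyncssh.scp copies over SCP; a (connection, path) tuple is a valid destination
        async with asyncssh.connect("host1") as conn:   # placeholder host
            await asyncssh.scp("/path/to/local/file", (conn, "/tmp/file"))

    asyncio.run(upload())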

@wynnw
Author

wynnw commented Mar 1, 2022

I've been focused on this inside AWS. It reproduces very consistently for me within the same EC2 region and when going across EC2 regions. So that's more overhead than a local ssh server, of course. I haven't tried on my own home LAN.

@JonasToth

> Thanks for the interest and report. Since there have been multiple reports on this issue, I do believe there is one. Having gone through the code again, can see where there is potential for a race condition which sounds like the behaviour here. The difference in the written vs total bytes is exactly 2k as well, which is the internal buffer size of libssh2. That can't be an accident.

Yes, that is approximately the number of bytes missing in my case.
I uploaded ~2 MB tars and they were damaged only at the end. I verified that with a hexdump of the files. And the missing bytes weren't that much; 2K seems about right.

The remote side in my production system uses dropbear over a mobile connection. Maybe such an environment is more prone to the condition as well?

We use toxiproxy for some tests; maybe its rate-limiting helps with reproduction.

> I am curious if you see the same behaviour with copy_file, the SFTP copy. I'd expect not. And you cannot have used SCP with asyncssh since it does not support it, it uses SFTP.

AFAIK our dropbear does not support sftp properly / there were issues. That's why we used scp.

> Another question, are you able to reproduce this against a local SSH server or only remote? I still need to be able to reproduce in a test to confirm a fix works, that would help in doing so. The existing tests do not use such a large file size, will try again with 100MB.

https://github.com/shopify/toxiproxy

Maybe you can play around with timings/bandwidth using toxiproxy. Local setups might be "too good" under normal circumstances.
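
A minimal sketch of putting toxiproxy in front of a local sshd to slow the connection down for a repro attempt, assuming the toxiproxy HTTP API is listening on its default port 8474 and the requests library is installed; the proxy name, ports and rate are arbitrary:

    import requests

    API = "http://127.0.0.1:8474"

    # proxy local port 2222 to the real sshd on port 22
    requests.post(f"{API}/proxies", json={
        "name": "ssh_slow",
        "listen": "127.0.0.1:2222",
        "upstream": "127.0.0.1:22",
    })

    # throttle client-to-server traffic (the upload direction) to ~100 KB/s
    requests.post(f"{API}/proxies/ssh_slow/toxics", json={
        "type": "bandwidth",
        "stream": "upstream",
        "attributes": {"rate": 100},
    })

    # then point ParallelSSHClient at 127.0.0.1 with port=2222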

@pkittenis
Member

pkittenis commented Mar 20, 2022

Resolved in 2.9.0. Really resolved in 2.9.1.

Triggered by the client going out of scope, which caused a race condition if the client is not kept alive.
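
To illustrate the scoping issue described above, a sketch of the pattern that could trigger it versus one that keeps the client referenced until the copy finishes (placeholder hosts and paths; this is an illustration, not the library's internals):

    from gevent import joinall
    from pssh.clients import ParallelSSHClient

    def risky(hosts, local, remote):
        # the client goes out of scope as soon as this returns; joining the
        # greenlets afterwards could then race with its teardown
        return ParallelSSHClient(hosts).scp_send(local, remote)

    def safe(hosts, local, remote):
        client = ParallelSSHClient(hosts)
        greenlets = client.scp_send(local, remote)
        # keep `client` alive until all transfers have completed
        joinall(greenlets, raise_error=True)
        return client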

scp_send should also be a lot faster now, 10x or more.

Please reply on this issue if still having problems, will look into it asap.

@pkittenis pkittenis reopened this Mar 22, 2022
@wynnw
Author

wynnw commented Mar 23, 2022

Nice fix - thanks!
