Skip to content

bpo-29883: Add UDP support to Windows Proactor Event Loop #1067

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 20 commits into from

Conversation

ameily
Copy link
Contributor

@ameily ameily commented Apr 9, 2017

This PR adds UDP support to the Windows Proactor Event Loop. Several changes were performed:

  • Added wrappers around Winsock UDP functions:
    • WSAConnect - ConnectEx doesn't support binding a UDP endpoint to a socket so I had to add this. WSAConnect completes immediately for UDP sockets.
    • WSARecvFrom
    • WSASendTo
  • Implement _make_datagram_transport() and the corresponding UDP transport class _ProactorDatagramTransport.
    • Bound UDP endpoints can call WSARecv and WSASend. Unbound UDP endpoints require calls to WSARecvFrom and WSASendTo.
  • Several new unit tests exercising UDP proactor support.

I've verified that the change can be cleanly applied to master, 3.5, and 3.6. So, back porting this feature will be straightforward, which I would like personally since I require this feature on a Python 3.5 project.

One of the SSL unit tests is failing and I've verified that the same test fails with the same error in master.

cpython\lib\asyncio\sslproto.py:335: ResourceWarning: unclosed transport
<asyncio.sslproto._SSLProtocolTransport object at 0x00000171EABA8A90>
  source=self)

https://bugs.python.org/issue29883

@mention-bot
Copy link

@ameily, thanks for your PR! By analyzing the history of the files in this pull request, we identified @1st1, @zooba and @birkenfeld to be potential reviewers.

@njsmith
Copy link
Contributor

njsmith commented Apr 10, 2017

You might be able to remove some code by using socket.connect in place of _overlapped.WSAConnect. Not sure.

If I'm reading this correctly, your sendto implementation kicks off a WSASendTo, then waits for it to complete, kicks off another one, waits for that to complete, etc., just like you'd expect from TCP, right? It turns out that this can be really really inefficient for overlapped UDP. Chrome actually switched from doing UDP via IOCP to doing it via non-blocking sockets because of this (in one of the test reports in that thread they were getting 250 KB/s with IOCP, 8000 KB/s with non-blocking I/O; that's not a typo. And note that testing on loopback doesn't work; you need to test on a real network.). The hypothesis is that this is because overlapped UDP effectively disables buffering, like, the completion notification doesn't mean "this packet was queued to the send buffer" like it does for overlapped TCP, but rather "this packet was actually transmitted on the network". Or something.

I'm not sure what the best solution is. AFAICT libuv's solution is that it just keeps starting WSASendTo calls without waiting for previous ones to complete, though this must eventually go wrong somehow if you keep queueing send calls faster than the network card can process them...

You might also find this interested as a general overview of the challenges of making IOCP work well: python-trio/trio#52

@njsmith
Copy link
Contributor

njsmith commented Apr 10, 2017

Oops, forgot to include this link to more discussion of the chromium thing: https://groups.google.com/a/chromium.org/forum/#!topic/net-dev/VTPH1D8M6Ds

@ameily
Copy link
Contributor Author

ameily commented Apr 10, 2017

@njsmith Thanks for the feedback and resources.

You might be able to remove some code by using socket.connect in place of _overlapped.WSAConnect.

I can take a look at socket.connect() and see if that'll work instead of WSAConnect.

If I'm reading this correctly, your sendto implementation kicks off a WSASendTo, then waits for it to complete, kicks off another one, waits for that to complete, etc., just like you'd expect from TCP, right?

You are correct. I didn't want to run into the problem you described where we just spam the network with UDP packets. I can change it to only allow a certain number of outstanding WSASendTo operations at one time. If you're okay with this solution, what is an acceptable default number of outstanding send operations?

testing on loopback doesn't work; you need to test on a real network

Yup, I've been testing on a real network. I wrote a simple script to query a network's Windows systems NetBIOS name, which operates over UDP. The loopback address is only used in the unit tests, which I modeled after the TCP unit tests where nothing is actually written to the network.


I understand that IOCP UDP performance may not be the best but I think it's something that should be in the main CPython if possible. I need it on a project where I'm performing asynchronous TCP and UDP operations and the Selector event loop is adding a substantial amount of overhead (~80% in some cases). I can add a disclaimer in the documentation that IOCP sockets aren't the best option for high performance/traffic UDP applications.

I'll also take a look at the resources you linked to see if there are other ways I can improve the implementation.

@njsmith
Copy link
Contributor

njsmith commented Apr 10, 2017

I can change it to only allow a certain number of outstanding WSASendTo operations at one time. If you're okay with this solution, what is an acceptable default number of outstanding send operations?

I don't know. It's something I have been trying to figure out as well :-). Like I said, AFAICT from reading the source libuv doesn't appear to put any upper bound on the number of outstanding send operations, so maybe that works?

There's even an argument that one should handle UDP by making non-blocking sendto calls and if it returns EWOULDBLOCK, silently discard the packet. That way you don't have to worry about select :-)

@ameily
Copy link
Contributor Author

ameily commented Apr 20, 2017

I've done some benchmarking to help determine the best design for high efficiency UDP over IOCP.

  • Each test was sending 1,048,576 packets to a single IP address. The IP has a static ARP rule set to hopefully improve measurement accuracy.
  • Each test has two applications:
    • loop.call_soon(): UDP packets are sent one at a time with loop.call_soon(). This is to emulate a simple server where a UDP packet is sent as a response based on an incoming request from the event loop.
    • Immediate: UDP packets are sent within the same loop callback (for i in range(...): transport.sendto(...)). This is to blast the network and fill the socket's outgoing queue.
  • Each test has two time measurements, using time.perf_counter().
    • Time in Python: amount of time spent, in seconds, within Python from the first call to sendto to the last call to sendto.
    • Actual Capture Time: amount of time, in seconds, from the first packet written on the wire to the last packet written on the wire.
Design Application Time in Python Actual Capture Time Notes
IOCP with buffering loop.call_soon() 34.243 34.242 The current design where only a single sendto operation is active
IOCP with buffering Immediate 0.965 31.692 The current design where only a single sendto operation is active
IOCP no buffering loop.call_soon() 30.640 30.640 The modified current design where an unlimited number of sendto operations are active
IOCP no buffering Immediate 22.014 20.329 The modified current design where an unlimited number of sendto operations are active
Sync loop.call_soon() 23.058 23.058 Synchronous call to sendto
Sync Immediate 13.896 13.896 Synchronous call to sendto

I was surprised that the synchronous design did not drop any packets and sendto never returned EWOULDBLOCK.

The benchmark script I wrote is here: https://gist.github.com/ameily/3b801a32b0c896a2c4791abcf56cc3e1. Each design I tested requires some modification of the core Python source code but I used the same script for each design.

@larryhastings
Copy link
Contributor

FYI, since this is a new feature, it's not eligible for a backport to 3.5, or even 3.6. New features can't go into any released version of Python.

(unless Guido asks for it, which is rare)

@1st1 1st1 self-requested a review December 8, 2017 22:12
@1st1
Copy link
Member

1st1 commented Dec 8, 2017

@larryhastings Yes, this will be a 3.7-only thing.

@1st1
Copy link
Member

1st1 commented Dec 8, 2017

@ameily Somehow I missed this PR :( Let's work on it and get it merged asap. Could you please rebase it against the current master?

@ameily
Copy link
Contributor Author

ameily commented Dec 16, 2017

I've rebased onto upstream master. Now, one of the tests is timing out:

test_sock_client_ops (test.test_asyncio.test_events.ProactorEventLoopTests) ... Timeout (0:20:00)!

I've narrowed it down to a call to sock_recv_into() which I did not modify for this PR. So, I don't think it's my code causing the issue, but I could be wrong.

@ameily
Copy link
Contributor Author

ameily commented Sep 15, 2018

This PR is now rebased onto upstream master.

@vstinner
Copy link
Member

Amazing, thanks!

@vladima: Would you be interested to review this PR?

@asvetlov: Here is the PR that we were talking about ;-)

@asvetlov
Copy link
Contributor

asvetlov commented Sep 17, 2018

Cool!
Is it ready to review?

@ameily
Copy link
Contributor Author

ameily commented Sep 18, 2018

@asvetlov from my perspective, yes, this PR is ready for review.

Copy link
Member

@zooba zooba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a couple of linter-style comments, but apart from the borrowed references, it looks pretty good to me.

@ameily
Copy link
Contributor Author

ameily commented Sep 22, 2018

@zooba I believe I addressed your feedback and made my changes PEP 7 compliant.

@ameily
Copy link
Contributor Author

ameily commented Sep 26, 2018

@asvetlov Sorry, force of habit. I changed my calls to just be super().

Copy link
Member

@1st1 1st1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing I'd like to fix in this PR is to remove mock tests completely and replace them with proper functional tests. Most of existing mock tests in asyncio will either be removed before 3.8 or replaced with proper functional tests.

Mock tests are tightly coupled with the implementation and prevent us from freely refactoring it. IMO It's better to have one or two functional tests than an array of mock tests.

I recommend you to take a the uvloop project and copy the UDP tests from there and maybe come up with a couple more, if needed.

@ameily ameily force-pushed the fix-issue-29883 branch from 9c9f7d2 to 0c935b0 Compare May 2, 2019 22:17
@ameily
Copy link
Contributor Author

ameily commented May 2, 2019

@vstinner I believe I fixed the memory leak. I also rebased onto upstream/master.

python  -m test test_asyncio -R 3:3 -m test.test_asyncio.test_events.ProactorEventLoopTests.test_create_datagram_endpoint

0:00:00 load avg: 0.00 [1/1] test_asyncio
beginning 6 repetitions
123456
......
test_asyncio passed

== Tests result: SUCCESS ==

1 test OK.

Total duration: 3 sec 282 ms
Tests result: SUCCESS

@vstinner
Copy link
Member

vstinner commented May 6, 2019

@vstinner I believe I fixed the memory leak. I also rebased onto upstream/master.

Strange. I don't know why it leaks on my Windows VM.


This PR has other flaws:

  • protocol is not paused/resumed
  • _proactor.recv() is called with an hardcoded size: 4096 bytes. It should be possible to configure that. Jumbo frames allows up to 9000 bytes per network packet. By the way, how did you pick the number 4096? _SelectorDatagramTransport uses 256 KiB by default: maybe it's way too much, I don't know.
  • repr() uses len(self._buffer) which is wrong for a collections.deque
  • not sure if it's an issue: abort() method is missing. Is it part of asyncio API or not?

I have a WIP change on top of this PR, but I didn't have time to finish it yet :-(

@asvetlov
Copy link
Contributor

asvetlov commented May 7, 2019

@vstinner could you provide a link to your WIP?
It can help a lot.

I think we should be consistent.
Let's use 256 KiB for both Unix and Windows implementations.
Can change it later in the one place.
Putting this constant into asyncio/constants.py would be nice.

abort() is a part of DatagramTransport interface.

@alexchandel
Copy link
Contributor

@vstinner which function originates the leak? afaict Overlapped_getresult does not leak.

@alexchandel
Copy link
Contributor

@ameily looks like it failed some tests in test_asyncio, all similar to:

======================================================================
ERROR: test_fatal_error_connected (test.test_asyncio.test_proactor_events.ProactorDatagramTransportTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/vsts/agent/2.150.3/work/1/s/Lib/unittest/mock.py", line 1226, in patched
    return func(*args, **keywargs)
  File "/Users/vsts/agent/2.150.3/work/1/s/Lib/test/test_asyncio/test_proactor_events.py", line 1008, in test_fatal_error_connected
    transport = self.datagram_transport(address=('0.0.0.0', 1))
  File "/Users/vsts/agent/2.150.3/work/1/s/Lib/test/test_asyncio/test_proactor_events.py", line 740, in datagram_transport
    transport = _SelectorDatagramTransport(self.loop, self.sock,
NameError: name '_SelectorDatagramTransport' is not defined

@asvetlov
Copy link
Contributor

Closing.
The work is finished by #13440
Thank you very much @ameily and all reviewers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.