Don't send events on close() #3374

bogdanm · 2016-12-06T12:06:36Z

Don't send events on close()
It's currently possible to generate a socket event when a non-blocking socket is closed:

_pending is set to 0 in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L22
when the socket is created.
close() calls event() in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/Socket.cpp#L66
event() increments _pending, and since _pending is 1 it will call _callback() in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L167

However, if send() (for example) is called, this can happen:

send() is called and sets _pending to 0.
when the data is sent, event() is called, which sets _pending to 1 and calls _callback().
if close() is called at this point, there won't be an event generated for close() anymore,
since _pending will be set to 2.

(same thing for recv)

This PR changes the initial value of _pending to 1 instead of 0, so that
events are never generated for close().

bogdanm · 2016-12-06T12:07:32Z

@marcuschangarm @0xc0170

bogdanm · 2016-12-06T12:08:36Z

Fixes #3367

0xc0170 · 2016-12-06T14:28:11Z

cc @c1728p9 @geky

geky · 2016-12-06T17:29:01Z

These socket functions are tricky (just looking at the git blame has a different commit for almost every single line), and I'm a little hesitant on having the close event only trigger sometimes (it will still trigger when a socket is in a recv/send loop), but having the initial state of the socket = state after a successful receive is logically sound to me.

This patch seems good to go 👍

marcuschangarm · 2016-12-06T17:34:49Z

@geky Funny you should say that, what started this was we only saw the close event sometimes and not all the time! 😄

I'm fine either way, but it should be consistent!

From @bogdanm it sounds like close() on other systems is non-blocking and doesn't generate events.

geky · 2016-12-06T17:40:02Z

Ah! just commented on #3367

There isn't a "close" event. The event generated by close is really a "recv/send" event to inform the user that the result of a recv call has changed (in this case erroring since the socket is closed).

marcuschangarm · 2016-12-06T17:42:47Z

Excellent, sounds like we are all in agreement then! 😃

geky · 2016-12-06T19:43:32Z

Actually, could this be updated to be consistent in all three socket classes?

bogdanm · 2016-12-07T09:57:03Z

Done. I force-pushed the modified version that changes TCPSocket, TCPServer and UDPSocket.

0xc0170 · 2016-12-07T10:11:31Z

/morph test-nightly

mbed-bot · 2016-12-07T12:34:13Z

Result: FAILURE

Your command has finished executing! Here's what you wrote!

/morph test-nightly

Output

mbed Build Number: 1208

Test failed!

0xc0170 · 2016-12-07T13:06:08Z

@bogdanm The test reports K64F-GCC_ARM.features-feature_lwip-tests-mbedmicro-net-nist_internet_time_service.features-feature_lwip-tests-mbedmicro-net-nist_internet_time_service (from K64F-GCC_ARM) .

Restarting , just to be certain. I checked nighly 3 days ago, was green.

/morph test-nightly

bogdanm · 2016-12-07T14:48:09Z

Please do not merge yet. There might still be issues with this PR.

0xc0170 · 2016-12-07T15:33:22Z

Please do not merge yet. There might still be issues with this PR.

Let us know. If this is critical, it shall be integrated within today. Can you @bogdanm specify what are the issues?

It's currently possible to generate a socket event when a non-blocking socket is closed: 1. _pending is set to 0 in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L22 when the socket is created. 2. close() calls event() in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/Socket.cpp#L66 3. event() increments _pending, and since _pending is 1 it will call _callback() in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L167 However, if send() (for example) is called, this can happen: - send() is called and sets _pending to 0. - when the data is sent, event() is called, which sets _pending to 1 and calls _callback(). - if close() is called at this point, there won't be an event generated for close() anymore, since _pending will be set to 2. Same thing for recv. Also, same thing for TCPServer and UDPSocket. This PR changes the initial value of _pending to 1 instead of 0, so that events are never generated for close().

bogdanm · 2016-12-07T15:54:19Z

Alright, pushed yet another version that actually sets _pending to 1 in all cases. This version should be good to go.

geky · 2016-12-07T15:57:57Z

/morph test-nightly

mbed-bot · 2016-12-07T16:05:55Z

Result: FAILURE

Your command has finished executing! Here's what you wrote!

/morph test-nightly

geky · 2016-12-07T17:09:58Z

/morph test-nightly

bogdanm · 2016-12-07T17:14:15Z

I think I might've confused the CI by doing a force push ?

geky · 2016-12-07T17:15:08Z

It should handle force pushes correctly, I'm hoping this was just a spurious issue

mbed-bot · 2016-12-07T18:13:45Z

Result: SUCCESS

Your command has finished executing! Here's what you wrote!

/morph test-nightly

Output

mbed Build Number: 1217

All builds and test passed!

geky · 2016-12-07T18:25:30Z

Yep looks like it, should be good to go! @0xc0170

mbed-bot · 2016-12-07T19:35:57Z

Result: FAILURE

Your command has finished executing! Here's what you wrote!

/morph test-nightly

Output

mbed Build Number: 1218

Test failed!

bogdanm · 2016-12-08T09:30:49Z

Looks like this failed because of the same problematic NIST test.

Release mbed OS 5.3.0-rc3 Ports for Upcoming Targets Fixes and Changes 3373: Fix DEBUG target keyword for GCC_ARM #3373 3374: Don't send events on close() #3374

kjbracey · 2017-01-20T07:30:49Z

mbed client guys are seeing networking issues with 5.3.2 and lwIP Ethernet, and analysis so far seems to suggest they're caused by one of this change or #3457 or #3458.

Trying to get to the bottom of it, and can't actually see a way any of these could directly cause their problem, either in isolation or in combination - it's possible there's just a pre-existing unreliability, either in NSAPI or the app.

But while going over all three PRs with a fine-tooth comb to investigate, this change stood out as wrong to me. _pending non-zero means "we have sent up a callback since the last major network call, so don't send another".

By setting it to non-zero on start-up, you're suppressing events until they make a network call that sets _pending to zero. After that, original behaviour resumes.

That will certainly have the effect of making an immediate close() after open not generate an callback, but it hits both more and less than you were intending.

AFAICT, TCPServer sockets that just listen won't get a connection callback, UDPSockets that just bind and just wait for incoming data won't get a receive callback.

And it doesn't totally suppress events from close - it's still possible that close could send an event. (eg close immediately after non-blocking send).

I think the correct fix is to set _callback to NULL in the actual close call, before calling event(), to always suppress the close callback, without affecting the rest of the event system.

Does this analysis sound correct? If it is, it must mean we don't have a test coverage hole - no non-blocking server-type use. (CC @mikaleppanen who's checking that)

Still, I don't see how this change affects the observed problem.

(And I'm also a bit wary that this _pending mechanism might be suppressing too many events, what with a wide variety of activity and events for both RX and TX directions. I kind of wonder whether we should be making sure _pending gets set to 0 on every single socket operation. It's vaguely possible something like that might be the mbed client problem.)

geky · 2017-01-20T12:48:49Z

That analysis sounds correct to me.

My earlier comment on the initial state of the socket seems wrong now. On a leave from a recv, the _pending flag should be zero unless a packet was recved during the call.

Either way, zeroing the callback is a much more robust solution, will make a pr when I can.

kjbracey · 2017-01-20T13:24:16Z

I think we got down to the basic mbed-client update problem - they were inadvertently relying on lwIP's "connected" event being suppressed by this change.

#3457 made the connect() call zero _pending, so the connected event was let through again, and that extra event confused their state machine.

geky · 2017-01-20T13:40:24Z

Ah! That's good to know.

Given the netsocket api/implementations, spurious interrupts are still a very real possibility. Consider what would happen if a small packet was recieved between these two lines:
https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L163-L164

Unfortunately, even if we strictly controlled when interrupts were let through, the netsocket api would need a bit of restructuring/testing before we could garuntee when interrupts occur. State transistions based off the socket's attach function are probably a programming error and should instead be based of the return value of the socket's send/recv.

@kjbracey-arm, thanks for clearing this up.

0xc0170 · 2017-01-20T13:43:15Z

Thank you @kjbracey-arm ! 👍

kjbracey · 2017-01-20T14:00:43Z

Agreed - abstraction layers can't control when the underlying stack generates events, and any attempt to interfere to improve the situation can make the situation worse. As this change did.

I did see another report of a bad pattern - calling send() an HTTP request, expecting a subsequent event from it, and hence not bothering to do anything when that event was received. Then when response data arrived, its event is suppressed because _pending is still one.

It' s arguable that _pending is being unhelpful rather than helpful in that case, but the application is definitely in the wrong - it can't expect distinct "sent" and "received" events. (I would argue you can't necessarily expect "sent" events at all, unless you know you filled the output buffer and got EWOULDBLOCK).

geky · 2017-01-20T14:11:21Z

The intention of the _pending flag (the name may be contributing to the confusion) is to help network stack implementations rather than users. It allows drivers to interrupt whenever without worrying about overflowing queues at higher levels.

I was thinking of how to help users avoid these mistakes in the future. One option would be to change the _pending limit from 1 to something like 3 in debug mode, but that may be borderline crazy.

kjbracey · 2017-01-20T14:26:28Z

That's what I kind of figured - the ESP8266 or similar attaching it straight to the serial interrupt.

I can't help but feel that such stacks should be exercising a little more constraint. My gut feeling is that an abstraction layer choosing to suppress events from lower layers (by deducing which ones are spurious) is potentially error-prone, and ESP8266 should not rely on that behaviour.

geky · 2017-01-20T15:31:02Z

Maybe, though it wouldn't be the easiest change to the ESP8266 driver for it to understand what type of events it's recieving (there is currently no state, the driver is just reactive to interrupts).

The ESP8266 could do the coalescing, but it's convenient to have coalesce in the mbed socket layer since the code can be reused accross any combination of network stack + user code.

geky · 2017-01-20T15:36:05Z

I'm also concerned we may be making a documentation error. It seems the name/usage of callbacks/similarity to other attach functions in mbed is causing quite a bit of confusion on what the attach function actually provides.

This may be remedied by introducing higher-level stream based apis. Would a select function be more useful to the client team? (I can't say we're going to provide one soon, and the attach function probably won't go away, but I'm curious if it's worth consideration).

kjbracey · 2017-01-23T07:01:46Z

As it happens, I am working on a revision of the FileHandle/Stream stuff for other reasons - there is certainly overlap here, including the possiblity of sockets being FileHandles, and I want to discuss that.

Even with that though, it's not clear that majorly any improved event functionality is implementable in general on the sort of network stacks we have to deal with. I'm thinking we need a solid "poll" for FileHandle, and am sure we could do that for driver-based devices, but no idea how many network stacks could reliably provide that. I still think the extremely restricted callback semantics we've already specified are probably the lowest common denominator.

The only real ESP8266 event is packet reception isn't it? The chip spontaneously sends data when a packet arrives. So a serial interrupt is fine for that, except you should possible be masking off serial interrupts except when idle, so issuing socket calls doesn't generate events.

On the other hand, there is no "socket has become writable" event. I guess people just haven't noticed that one.

geky approved these changes Dec 6, 2016

View reviewed changes

bogdanm mentioned this pull request Dec 7, 2016

A socket close operation doesn't always send an event #3367

Closed

0xc0170 added the needs: CI label Dec 7, 2016

geky added ready for merge and removed needs: CI labels Dec 7, 2016

0xc0170 added release-version: 5.3.0-rc3 and removed ready for merge labels Dec 8, 2016

0xc0170 merged commit 1b019c3 into ARMmbed:master Dec 8, 2016

geky mentioned this pull request Jan 20, 2017

nsapi: Change initial state of sockets to allow events #3619

Merged

Don't send events on close() #3374

Don't send events on close() #3374

Uh oh!

Conversation

bogdanm commented Dec 6, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bogdanm commented Dec 6, 2016

Uh oh!

bogdanm commented Dec 6, 2016

Uh oh!

0xc0170 commented Dec 6, 2016

Uh oh!

geky commented Dec 6, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

marcuschangarm commented Dec 6, 2016

Uh oh!

geky commented Dec 6, 2016

Uh oh!

marcuschangarm commented Dec 6, 2016

Uh oh!

geky commented Dec 6, 2016

Uh oh!

bogdanm commented Dec 7, 2016

Uh oh!

0xc0170 commented Dec 7, 2016

Uh oh!

mbed-bot commented Dec 7, 2016

Result: FAILURE

Output

Uh oh!

0xc0170 commented Dec 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bogdanm commented Dec 7, 2016

Uh oh!

0xc0170 commented Dec 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bogdanm commented Dec 7, 2016

Uh oh!

geky commented Dec 7, 2016

Uh oh!

mbed-bot commented Dec 7, 2016

Result: FAILURE

Uh oh!

geky commented Dec 7, 2016

Uh oh!

bogdanm commented Dec 7, 2016

Uh oh!

geky commented Dec 7, 2016

Uh oh!

mbed-bot commented Dec 7, 2016

Result: SUCCESS

Output

Uh oh!

geky commented Dec 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mbed-bot commented Dec 7, 2016

Result: FAILURE

Output

Uh oh!

bogdanm commented Dec 8, 2016

Uh oh!

kjbracey commented Jan 20, 2017

Uh oh!

geky commented Jan 20, 2017

Uh oh!

kjbracey commented Jan 20, 2017

Uh oh!

geky commented Jan 20, 2017

Uh oh!

0xc0170 commented Jan 20, 2017

Uh oh!

kjbracey commented Jan 20, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

geky commented Jan 20, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

bogdanm commented Dec 6, 2016 •

edited

Loading

geky commented Dec 6, 2016 •

edited

Loading

0xc0170 commented Dec 7, 2016 •

edited

Loading

0xc0170 commented Dec 7, 2016 •

edited

Loading

geky commented Dec 7, 2016 •

edited

Loading

kjbracey commented Jan 20, 2017 •

edited

Loading

geky commented Jan 20, 2017 •

edited

Loading