Skip to content

Don't send events on close() #3374

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Dec 8, 2016
Merged

Don't send events on close() #3374

merged 1 commit into from
Dec 8, 2016

Conversation

bogdanm
Copy link
Contributor

@bogdanm bogdanm commented Dec 6, 2016

Don't send events on close()
It's currently possible to generate a socket event when a non-blocking socket is closed:

  1. _pending is set to 0 in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L22
    when the socket is created.
  2. close() calls event() in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/Socket.cpp#L66
  3. event() increments _pending, and since _pending is 1 it will call _callback() in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L167

However, if send() (for example) is called, this can happen:

  • send() is called and sets _pending to 0.
  • when the data is sent, event() is called, which sets _pending to 1 and calls _callback().
  • if close() is called at this point, there won't be an event generated for close() anymore,
    since _pending will be set to 2.

(same thing for recv)

This PR changes the initial value of _pending to 1 instead of 0, so that
events are never generated for close().

@bogdanm
Copy link
Contributor Author

bogdanm commented Dec 6, 2016

@marcuschangarm @0xc0170

@bogdanm
Copy link
Contributor Author

bogdanm commented Dec 6, 2016

Fixes #3367

@0xc0170
Copy link
Contributor

0xc0170 commented Dec 6, 2016

cc @c1728p9 @geky

@geky
Copy link
Contributor

geky commented Dec 6, 2016

These socket functions are tricky (just looking at the git blame has a different commit for almost every single line), and I'm a little hesitant on having the close event only trigger sometimes (it will still trigger when a socket is in a recv/send loop), but having the initial state of the socket = state after a successful receive is logically sound to me.

This patch seems good to go 👍

@marcuschangarm
Copy link
Contributor

@geky Funny you should say that, what started this was we only saw the close event sometimes and not all the time! 😄

I'm fine either way, but it should be consistent!

From @bogdanm it sounds like close() on other systems is non-blocking and doesn't generate events.

@geky
Copy link
Contributor

geky commented Dec 6, 2016

Ah! just commented on #3367

There isn't a "close" event. The event generated by close is really a "recv/send" event to inform the user that the result of a recv call has changed (in this case erroring since the socket is closed).

@marcuschangarm
Copy link
Contributor

Excellent, sounds like we are all in agreement then! 😃

@geky
Copy link
Contributor

geky commented Dec 6, 2016

Actually, could this be updated to be consistent in all three socket classes?

@bogdanm
Copy link
Contributor Author

bogdanm commented Dec 7, 2016

Done. I force-pushed the modified version that changes TCPSocket, TCPServer and UDPSocket.

@0xc0170
Copy link
Contributor

0xc0170 commented Dec 7, 2016

/morph test-nightly

@mbed-bot
Copy link

mbed-bot commented Dec 7, 2016

Result: FAILURE

Your command has finished executing! Here's what you wrote!

/morph test-nightly

Output

mbed Build Number: 1208

Test failed!

@0xc0170
Copy link
Contributor

0xc0170 commented Dec 7, 2016

@bogdanm The test reports K64F-GCC_ARM.features-feature_lwip-tests-mbedmicro-net-nist_internet_time_service.features-feature_lwip-tests-mbedmicro-net-nist_internet_time_service (from K64F-GCC_ARM) .

Restarting , just to be certain. I checked nighly 3 days ago, was green.

/morph test-nightly

@bogdanm
Copy link
Contributor Author

bogdanm commented Dec 7, 2016

Please do not merge yet. There might still be issues with this PR.

@0xc0170
Copy link
Contributor

0xc0170 commented Dec 7, 2016

Please do not merge yet. There might still be issues with this PR.

Let us know. If this is critical, it shall be integrated within today. Can you @bogdanm specify what are the issues?

It's currently possible to generate a socket event when a non-blocking socket is closed:

1. _pending is set to 0 in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L22
   when the socket is created.
2. close() calls event() in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/Socket.cpp#L66
3. event() increments _pending, and since _pending is 1 it will call _callback() in https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L167

However, if send() (for example) is called, this can happen:

- send() is called and sets _pending to 0.
- when the data is sent, event() is called, which sets _pending to 1 and calls _callback().
- if close() is called at this point, there won't be an event generated for close() anymore,
  since _pending will be set to 2.

Same thing for recv. Also, same thing for TCPServer and UDPSocket.

This PR changes the initial value of _pending to 1 instead of 0, so that
events are never generated for close().
@bogdanm
Copy link
Contributor Author

bogdanm commented Dec 7, 2016

Alright, pushed yet another version that actually sets _pending to 1 in all cases. This version should be good to go.

@geky
Copy link
Contributor

geky commented Dec 7, 2016

/morph test-nightly

@mbed-bot
Copy link

mbed-bot commented Dec 7, 2016

Result: FAILURE

Your command has finished executing! Here's what you wrote!

/morph test-nightly

@geky
Copy link
Contributor

geky commented Dec 7, 2016

/morph test-nightly

@bogdanm
Copy link
Contributor Author

bogdanm commented Dec 7, 2016

I think I might've confused the CI by doing a force push ?

@geky
Copy link
Contributor

geky commented Dec 7, 2016

It should handle force pushes correctly, I'm hoping this was just a spurious issue

@mbed-bot
Copy link

mbed-bot commented Dec 7, 2016

Result: SUCCESS

Your command has finished executing! Here's what you wrote!

/morph test-nightly

Output

mbed Build Number: 1217

All builds and test passed!

@geky
Copy link
Contributor

geky commented Dec 7, 2016

Yep looks like it, should be good to go! @0xc0170

@mbed-bot
Copy link

mbed-bot commented Dec 7, 2016

Result: FAILURE

Your command has finished executing! Here's what you wrote!

/morph test-nightly

Output

mbed Build Number: 1218

Test failed!

@bogdanm
Copy link
Contributor Author

bogdanm commented Dec 8, 2016

Looks like this failed because of the same problematic NIST test.

@0xc0170 0xc0170 merged commit 1b019c3 into ARMmbed:master Dec 8, 2016
bulislaw added a commit that referenced this pull request Dec 15, 2016
Release mbed OS 5.3.0-rc3

Ports for Upcoming Targets

Fixes and Changes

3373: Fix DEBUG target keyword for GCC_ARM #3373
3374: Don't send events on close() #3374
@kjbracey
Copy link
Contributor

mbed client guys are seeing networking issues with 5.3.2 and lwIP Ethernet, and analysis so far seems to suggest they're caused by one of this change or #3457 or #3458.

Trying to get to the bottom of it, and can't actually see a way any of these could directly cause their problem, either in isolation or in combination - it's possible there's just a pre-existing unreliability, either in NSAPI or the app.

But while going over all three PRs with a fine-tooth comb to investigate, this change stood out as wrong to me. _pending non-zero means "we have sent up a callback since the last major network call, so don't send another".

By setting it to non-zero on start-up, you're suppressing events until they make a network call that sets _pending to zero. After that, original behaviour resumes.

That will certainly have the effect of making an immediate close() after open not generate an callback, but it hits both more and less than you were intending.

AFAICT, TCPServer sockets that just listen won't get a connection callback, UDPSockets that just bind and just wait for incoming data won't get a receive callback.

And it doesn't totally suppress events from close - it's still possible that close could send an event. (eg close immediately after non-blocking send).

I think the correct fix is to set _callback to NULL in the actual close call, before calling event(), to always suppress the close callback, without affecting the rest of the event system.

Does this analysis sound correct? If it is, it must mean we don't have a test coverage hole - no non-blocking server-type use. (CC @mikaleppanen who's checking that)

Still, I don't see how this change affects the observed problem.

(And I'm also a bit wary that this _pending mechanism might be suppressing too many events, what with a wide variety of activity and events for both RX and TX directions. I kind of wonder whether we should be making sure _pending gets set to 0 on every single socket operation. It's vaguely possible something like that might be the mbed client problem.)

@geky
Copy link
Contributor

geky commented Jan 20, 2017

That analysis sounds correct to me.

My earlier comment on the initial state of the socket seems wrong now. On a leave from a recv, the _pending flag should be zero unless a packet was recved during the call.

Either way, zeroing the callback is a much more robust solution, will make a pr when I can.

@kjbracey
Copy link
Contributor

I think we got down to the basic mbed-client update problem - they were inadvertently relying on lwIP's "connected" event being suppressed by this change.

#3457 made the connect() call zero _pending, so the connected event was let through again, and that extra event confused their state machine.

@geky
Copy link
Contributor

geky commented Jan 20, 2017

Ah! That's good to know.

Given the netsocket api/implementations, spurious interrupts are still a very real possibility. Consider what would happen if a small packet was recieved between these two lines:
https://github.com/ARMmbed/mbed-os/blob/master/features/netsocket/TCPSocket.cpp#L163-L164

Unfortunately, even if we strictly controlled when interrupts were let through, the netsocket api would need a bit of restructuring/testing before we could garuntee when interrupts occur. State transistions based off the socket's attach function are probably a programming error and should instead be based of the return value of the socket's send/recv.

@kjbracey-arm, thanks for clearing this up.

@0xc0170
Copy link
Contributor

0xc0170 commented Jan 20, 2017

Thank you @kjbracey-arm ! 👍

@kjbracey
Copy link
Contributor

kjbracey commented Jan 20, 2017

Agreed - abstraction layers can't control when the underlying stack generates events, and any attempt to interfere to improve the situation can make the situation worse. As this change did.

I did see another report of a bad pattern - calling send() an HTTP request, expecting a subsequent event from it, and hence not bothering to do anything when that event was received. Then when response data arrived, its event is suppressed because _pending is still one.

It' s arguable that _pending is being unhelpful rather than helpful in that case, but the application is definitely in the wrong - it can't expect distinct "sent" and "received" events. (I would argue you can't necessarily expect "sent" events at all, unless you know you filled the output buffer and got EWOULDBLOCK).

@geky
Copy link
Contributor

geky commented Jan 20, 2017

The intention of the _pending flag (the name may be contributing to the confusion) is to help network stack implementations rather than users. It allows drivers to interrupt whenever without worrying about overflowing queues at higher levels.

I was thinking of how to help users avoid these mistakes in the future. One option would be to change the _pending limit from 1 to something like 3 in debug mode, but that may be borderline crazy.

@kjbracey
Copy link
Contributor

That's what I kind of figured - the ESP8266 or similar attaching it straight to the serial interrupt.

I can't help but feel that such stacks should be exercising a little more constraint. My gut feeling is that an abstraction layer choosing to suppress events from lower layers (by deducing which ones are spurious) is potentially error-prone, and ESP8266 should not rely on that behaviour.

@geky
Copy link
Contributor

geky commented Jan 20, 2017

Maybe, though it wouldn't be the easiest change to the ESP8266 driver for it to understand what type of events it's recieving (there is currently no state, the driver is just reactive to interrupts).

The ESP8266 could do the coalescing, but it's convenient to have coalesce in the mbed socket layer since the code can be reused accross any combination of network stack + user code.

@geky
Copy link
Contributor

geky commented Jan 20, 2017

I'm also concerned we may be making a documentation error. It seems the name/usage of callbacks/similarity to other attach functions in mbed is causing quite a bit of confusion on what the attach function actually provides.

This may be remedied by introducing higher-level stream based apis. Would a select function be more useful to the client team? (I can't say we're going to provide one soon, and the attach function probably won't go away, but I'm curious if it's worth consideration).

@kjbracey
Copy link
Contributor

As it happens, I am working on a revision of the FileHandle/Stream stuff for other reasons - there is certainly overlap here, including the possiblity of sockets being FileHandles, and I want to discuss that.

Even with that though, it's not clear that majorly any improved event functionality is implementable in general on the sort of network stacks we have to deal with. I'm thinking we need a solid "poll" for FileHandle, and am sure we could do that for driver-based devices, but no idea how many network stacks could reliably provide that. I still think the extremely restricted callback semantics we've already specified are probably the lowest common denominator.

The only real ESP8266 event is packet reception isn't it? The chip spontaneously sends data when a packet arrives. So a serial interrupt is fine for that, except you should possible be masking off serial interrupts except when idle, so issuing socket calls doesn't generate events.

On the other hand, there is no "socket has become writable" event. I guess people just haven't noticed that one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants