Skip to content

Cellular: fix ATHandler destructor possible crash on delete #10684

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 4, 2019

Conversation

jarvte
Copy link
Contributor

@jarvte jarvte commented May 28, 2019

Description

In some multithread cases there is possibility that process_oob function was called after ATHandler was deleted. Fix is to cancel eventqueue if it's holding an event and wait if oob processing is ongoing.

Pull request type

[X] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

@kjbracey-arm @mirelachirica

Release Notes

@ciarmcom ciarmcom requested review from kjbracey, mirelachirica and a team May 28, 2019 11:00
@ciarmcom
Copy link
Member

@jarvte, thank you for your changes.
@mirelachirica @kjbracey-arm @ARMmbed/mbed-os-wan @ARMmbed/mbed-os-maintainers please review.

set_file_handle(NULL);

// cancel possible oob event from queue and wait if oob is ongoing
_queue.cancel(_event_id);
Copy link
Contributor

@kjbracey kjbracey May 28, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the cancel reports success, then you need to set oob_queued to false, no? Maybe make this conditional on oob_queued, to avoid stale event_id.

(Could _oob_queued be replaced by a null event_id check? Don't need separate variables?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As cancel is void function we decided not to call it.
Replaced _oob_queued and _oob_pending with _event_id.
Force pushed changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equeue should be handling stale event_ids.

Each event gets its own generation count which is packed into the id. If the generation count doesn't match equeue_cancel is a noop.

equeue_cancel(0) is also a noop, much like free(NULL).

set_file_handle(NULL);

while (_event_id > 0) {
#ifdef AT_HANDLER_MUTEX
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are in a bit of a pickle here.

We can't tell if cancel worked, so if we do call it we don't know if we need to wait for event_id to be cleared. That's why it got removed.

But if the OOBs are running on the same thread as this destructor, which I believe is quite possible in some event-driven apps, we must call cancel (and we can assume it would work in that case). We can't wait - we'd deadlock the event queue.

As a minimal change, this could be

#ifdef AT_HANDLER_MUTEX
    while (_event_id > 0) {
        _oobCv.wait();
    }
#else 
    if (_event_id > 0) { 
        event_cancel(_event_id);
    }
#endif 

But that still isn't any good if using an event queue with AT_HANDLER_MUTEX defined.

Ideally I'd like a a success/error return from cancel. But I recall this was discussed before - @kivaisan was looking at it? I'm not sure if the conclusion was that it wasn't possible to get a solid return, or that a different solution was found.

Copy link
Contributor

@geky geky Jun 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something that would be better fixed by using the event's destructor?
https://github.com/ARMmbed/mbed-os/blob/master/events/equeue/equeue.h#L161

This tells you when the event has expired, either by completing or being canceled.

It's indirectly exposed through the C++ API via function objects, though it's messy:

bool dead = false;
class DeathHook {
    void operator()(void) {
        // do stuff
    }
    ~DeathHook() {
        dead = true;
    }
};

int id = queue.call(DeathHook());
queue.cancel(id);
printf("is dead? %d\n", dead);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does that example work at present? You're passing a temporary by value, and that temporary gets copied then destroyed, so I think dead is set immediately after the call().

I think it could work if EventQueue was upgraded to use C++11 perfect forwarding. Or, more simply, if the constructor/destructor reference counted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You know, I actually have no idea. I think you're right this would need reference counting to work.

Alternatively we could poke an explicit destructor-register function through the C++ API like in the C one.

Or you might be able to extend the Event class.

@@ -212,7 +213,8 @@ class ATHandler {
protected:
void event();
#ifdef AT_HANDLER_MUTEX
PlatformMutex _fileHandleMutex;
rtos::Mutex _fileHandleMutex;
rtos::ConditionVariable _oobCv;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The purpose of _oobCv is unclear to me, could add here some description. If _oobCv is supposed to guard _event_id then event() and unlock() need locks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oobCv notifies oob completion (it's notified when event_id becomes -1).

I agree that it is assuming event_id is protected by the _fileHandleMutex. That is held as unlock is entered, so access there is fine. If event() can be called without that held, then yes it should have a lock.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

event is actually called from IRQ context, which means mutex is not possible. But it also means it's pretty safe - this is no worse than the previous _oob_queued version, at least. Suggest not changing at this point.

@0xc0170
Copy link
Contributor

0xc0170 commented Aug 14, 2019

@jarvte Any update for this PR, is this for 5.14?

@jarvte
Copy link
Contributor Author

jarvte commented Aug 14, 2019

@jarvte Any update for this PR, is this for 5.14?

Yes, for 5.14. No time to do this now.

@0xc0170
Copy link
Contributor

0xc0170 commented Aug 26, 2019

5.14 is closing soon, and this has not seen any update. Should we keep the label here or this shall be closed as "will be updated later and reopened" ?

@mirelachirica mirelachirica force-pushed the fix_athandler_destr branch 2 times, most recently from f318f60 to 40faf90 Compare August 28, 2019 07:34
@0xc0170
Copy link
Contributor

0xc0170 commented Aug 29, 2019

@mirelachirica @jarvte We need reviews here and the status. If no update made by later today, we'll remove 5.14 label and move this to the next minor version. We are close to the code freeze.
Let us know

@mirelachirica
Copy link
Contributor

I have rebased the branch and it builds/passes now UT. What comes to rest of the content and changes to ATHandler, they need approval from @kjbracey-arm, @geky, @AnttiKauppila.

@kjbracey
Copy link
Contributor

Couple of my and @geky's comments are still outstanding - use the callback helper to simplify code; use 0 as invalid event-id, not positive/negative comparisons.

@0xc0170
Copy link
Contributor

0xc0170 commented Aug 30, 2019

Removed 5.14 label, can move to the next patch release.

@jarvte
Copy link
Contributor Author

jarvte commented Aug 30, 2019

we are investigating. unittests were executed @jarvte ? Looks like they fail but can't spot the failure.

@0xc0170 I made the original pr but @mirelachirica is working on this one now. @mirelachirica ?

@mirelachirica
Copy link
Contributor

we are investigating. unittests were executed @jarvte ? Looks like they fail but can't spot the failure.

I have executed UT localy and they build and pass. I do not see what is the failure here.

@mirelachirica
Copy link
Contributor

mirelachirica commented Aug 30, 2019

It is a valgrind error. I will check it.

==12455== Memcheck, a memory error detector
==12455== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12455== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==12455== Command: ./features-cellular-framework-AT-athandler -o junit
==12455==
==12455== Conditional jump or move depends on uninitialised value(s)
==12455== at 0x1D055A: mbed::ATHandler::unlock() (ATHandler.cpp:376)
...
==12455== Uninitialised value was created by a stack allocation
==12455== at 0x1AD76B: TestATHandler_test_ATHandler_cmd_start_Test::TestBody() (athandlertest.cpp:352)

==12455== For counts of detected and suppressed errors, rerun with: -v
==12455== ERROR SUMMARY: 2 errors from 1 contexts (suppressed: 0 from 0)

@kjbracey
Copy link
Contributor

kjbracey commented Aug 30, 2019

_recv_pos and _recv_len aren't reset on construction.

Yes they are, via a reset_buffer. Hmm. Next...

@mirelachirica
Copy link
Contributor

Valgrind errors for athandler unittest are now fixed.

@0xc0170
Copy link
Contributor

0xc0170 commented Sep 2, 2019

CI restarted

@mbed-ci
Copy link

mbed-ci commented Sep 2, 2019

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 6
Build artifacts

@@ -0,0 +1,11 @@
#include "cmsis_os2.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code files need license on top of them, please add to all new files

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added license where it was missing.

In some multithread cases there is possibility that process_oob function
was called after ATHandler was deleted. Fix is to wait if oob processing
is ongoing.
@0xc0170
Copy link
Contributor

0xc0170 commented Sep 3, 2019

CI started

@mbed-ci
Copy link

mbed-ci commented Sep 3, 2019

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 7
Build artifacts

@0xc0170 0xc0170 merged commit aba0760 into ARMmbed:master Sep 4, 2019
@jarvte jarvte deleted the fix_athandler_destr branch January 17, 2020 12:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants