LoRaWAN: Fixing premature RX2 closure #10196

hasnainvirk · 2019-03-22T13:50:51Z

Description

A bug while setting up RX start timers would result in premature closusre of RX2 window. The 'ack_Timeout_timer' would be invoked prematurely and
at that time RX2 window may be being demodulating. This resulted in
massive instability with any test that relied on Confirmed traffic or
lower data rates.

To fix the issue, we must know the length of the RX window in
milliseconds and for this purpose we have extended the
'get_rx_window_params(...)' API. The length of the time the window
may remain open must be accounted for while setting up
'ack_timeout_timer'.

While calculating ack timeout, we were ending up getting a random value
which may become less than 2 seconds. This is not allowed as per v1.0.2
specification.

To fix the issue we now take the random number from 0 to 2000 ms and
then add that to the fixed 2000 ms ack timeout value, guaranteeing a
value at least equal to 2000 ms.

Compliance Testing

Compliance test were run with Nucleo-476RG and SX1276. All tests passed.

Pull request type

[X] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

Release Notes

While calculating ack timeout, we were ending up getting a random value which may become less than 2 seconds. This is not allowed as per v1.0.2 specification. To fix the issue we now take the random number from 0 to 2000 ms and then add that to the fixed 2000 ms ack timeout value, guaranteeing a value at least equal to 2000 ms.

A bug while setting up RX start timers would result in premature closusre of RX2 window. The 'ack_Timeout_timer' would be invoked prematurely and at that time RX2 window may be being demodulating. This resulted in massive instability with any test that relied on Confirmed traffic or lower data rates. To fix the issue, we must know the length of the RX window in milliseconds and for this purpose we have extended the 'get_rx_window_params(...)' API. The length of the time the window may remain open must be accounted for while setting up 'ack_timeout_timer'.

get_rx_window_params() API is changed. To reflect that changing the stub.

hasnainvirk · 2019-03-22T13:52:15Z

@AnttiKauppila @kjbracey-arm Please review. @kimlep01 @0xc0170 @cmonr FYI.

ciarmcom · 2019-03-22T14:00:39Z

@hasnainvirk, thank you for your changes.
@ARMmbed/mbed-os-wan @ARMmbed/mbed-os-maintainers please review.

hasnainvirk · 2019-03-22T14:43:08Z

@0xc0170 Failure doesn't seem to be related to the PR.

cmonr · 2019-03-26T02:53:16Z

CI started

cmonr · 2019-03-26T02:53:44Z

Whoops. Started CI early, since still waiting on reviewers, but will let it go since load is low.

mbed-ci · 2019-03-26T06:45:59Z

Test run: SUCCESS

Summary: 13 of 13 test jobs passed
Build number : 1
Build artifacts

cmonr · 2019-03-26T15:39:25Z

LGTM.

@hasnainvirk @AnttiKauppila Marked this for 5.13, since the public function has changed.

Will merge in a bit.

LoRaWAN: Fixing premature RX2 closure

AnttiKauppila · 2019-03-27T07:47:48Z

@cmonr Internal function has been modified and API is untouched. This should go to next patch release.

cmonr · 2019-03-27T19:35:59Z

@AnttiKauppila Release label updated.

I knew that ATHandler was internal, but so is LoRaPHY? Is that because users should be expected to interface with LoRaWAN/LoRa, and not the Phy?

CC @ARMmbed/mbed-os-maintainers

mattbrown015 · 2019-11-05T10:52:20Z

Hi @hasnainvirk,

I've been away from Mbed-OS and LoRaWAN for a while, which is why I haven't been bothering you! :-)

I just came across this issue and wondered if I should find a way to get this fix in to our software build.

We released our product back in January '19 and haven't done much with it since then. We're still building with a version of Mbed-OS from Jan '19 (#d20b591).

We were pretty happy but as we've accrued operating hours we have seen a few failures. We've had over 100 devices operating for more than a month so that must be quite a lot of hours.

We've seen a few devices 'go quiet'. When we get them back we can see that LoRaWANInterface::send is returning LORAWAN_STATUS_WOULD_BLOCK and LoRaWANInterface::cancel_sending doesn't work. It appears the LoRaWAN stack is stuck in this state until we restart the device. Our application code appears to be working fine as it samples the sensor, writes the debug output to serial, etc.

I've not managed to reproduce the problem in the lab and therefore catch the debug output at the point of the first failure. Hence it could be anything causing the problem, including hardware issues like an unhappy crystal.

Do you think this issue would cause the LoRaWAN stack to get stuck in a state where it permanently thinks it is sending?

I'll see if I can find a way to artificially mess with the timers to make what you describe in the issue happen.

Thanks,
Matt

0xc0170 · 2019-11-05T11:56:16Z

@mattbrown015 Please create a new issue once you get details or open a question on our forum.

@hasnainvirk is not anymore working on Mbed OS, but will include here @ARMmbed/mbed-os-wan

mattbrown015 · 2019-11-05T12:04:41Z

Hi @0xc0170

@hasnainvirk is not anymore working on Mbed OS

Oh, that's a shame but I'm sure his absence will be more than covered by the rest of the team. Thanks for letting me know.

Please create a new issue once you get details

Yes, I haven't done that so far because I've got so little evidence I'm not sure what to say.

I saw this issue and #9951 and got a little over excited. Now I've thought about it some more it doesn't seem very likely that it is what is causing my problem.

Thanks,
Matt

mattbrown015 · 2019-11-28T11:28:04Z

I appreciate this is a very old PR and Hasnain is no longer on the project but I thought I'd add some notes. Mainly for my own benefit but perhaps others will come here and be interested.

The first part of the PR says:

This resulted in massive instability with any test that relied on Confirmed traffic or lower data rates.

This is not my experience but I think this might be because my end-device is connected to TTN. The fact that my RX2 downlinks are working doesn't mean that this PR isn't correct.

I set up my end-device to transmit large confirmed messages at DR0/SF12. The packet payload was 39 bytes and uplink TOA was 2466ms.

Pretty much every uplink that I watched was confirmed as you would expect in an RX2 downlink.

I also cheated and hard coded LoRaPHY::get_ack_timeout to return -1000 i.e. make the ack timeout as short as possible. This made no difference and every uplink was still confirmed.

I think perhaps, the difference between my test and Hasnain's is that my downlinks are transmitted at DR3/SF9, TOA 165ms. Hasnain's failing tests may well be transmitting the downlink at the same SF as the uplink i.e. DR0/SF12 and therefore have a much larger TOA and demodulation time.

I also tried making ack_timeout_timer timeout so short it would definitely fire during the RX2 window. This messed up the RX2 window and meant the acks weren't received but it didn't cause the stack to crash.

The second part of this PR says:

While calculating ack timeout, we were ending up getting a random value which may become less than 2 seconds. This is not allowed as per v1.0.2 specification.

I don't understand this statement from Hasnain. From LoRaWAN10220160812_1316_1Specification.pdf:

If an end-device does not receive a frame with the ACK bit set in one of the two receive windows immediately following the uplink transmission it may resend the same frame with the same payload and frame counter again at least ACK_TIMEOUT seconds after the second reception window.

and from LoRaWAN Regional Parameters v1.0.2rB as published_1939_1.pdf:

ACK_TIMEOUT 2 +/- 1 s (random delay between 1 and 3 seconds)

Hence I think ACK_TIMEOUT should be at least 1 second.

So previously LoRaPHY::get_ack_timeout would return -1 to 1 giving a total timeout of between 1 and 3. Now LoRaPHY::get_ack_timeout returns 0 to 1 giving a total timeout of between 2 and 3.

Does this make any practical difference? Probably not! Hopefully any end-device using confirmed uplinks expects to receive the ack and hence the ack timeout isn't used very often.

Hasnain Virk added 3 commits March 22, 2019 15:15

Unit test fix LoRaPHY_stub

4c1b8a8

get_rx_window_params() API is changed. To reflect that changing the stub.

ciarmcom requested review from a team March 22, 2019 14:00

ciarmcom added the needs: review label Mar 22, 2019

AnttiKauppila approved these changes Mar 26, 2019

View reviewed changes

cmonr approved these changes Mar 26, 2019

View reviewed changes

cmonr added release-version: 5.13.0-rc1 ready for merge and removed needs: review labels Mar 26, 2019

cmonr pushed a commit to cmonr/mbed-os that referenced this pull request Mar 26, 2019

Merge pull request ARMmbed#10196 from hasnainvirk/premature_rx2_fix

f4890bc

LoRaWAN: Fixing premature RX2 closure

cmonr merged commit a0a265a into ARMmbed:master Mar 27, 2019

cmonr removed the ready for merge label Mar 27, 2019

hasnainvirk deleted the premature_rx2_fix branch March 27, 2019 06:37

hasnainvirk mentioned this pull request Mar 27, 2019

LoRaWAN stack halts when encountering 'RX1 Demodulation ongoing, skip RX2 window opening' #9951

Closed

cmonr added release-version: 5.12.1 and removed release-version: 5.13.0-rc1 labels Mar 27, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

LoRaWAN: Fixing premature RX2 closure #10196

LoRaWAN: Fixing premature RX2 closure #10196

Uh oh!

hasnainvirk commented Mar 22, 2019 •

edited

Loading

Uh oh!

hasnainvirk commented Mar 22, 2019 •

edited

Loading

Uh oh!

ciarmcom commented Mar 22, 2019

Uh oh!

hasnainvirk commented Mar 22, 2019

Uh oh!

cmonr commented Mar 26, 2019

Uh oh!

cmonr commented Mar 26, 2019

Uh oh!

mbed-ci commented Mar 26, 2019

Uh oh!

cmonr commented Mar 26, 2019

Uh oh!

AnttiKauppila commented Mar 27, 2019

Uh oh!

cmonr commented Mar 27, 2019

Uh oh!

mattbrown015 commented Nov 5, 2019

Uh oh!

0xc0170 commented Nov 5, 2019

Uh oh!

mattbrown015 commented Nov 5, 2019

Uh oh!

mattbrown015 commented Nov 28, 2019

Uh oh!

Uh oh!

LoRaWAN: Fixing premature RX2 closure #10196

LoRaWAN: Fixing premature RX2 closure #10196

Uh oh!

Conversation

hasnainvirk commented Mar 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Compliance Testing

Pull request type

Reviewers

Release Notes

Uh oh!

hasnainvirk commented Mar 22, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ciarmcom commented Mar 22, 2019

Uh oh!

hasnainvirk commented Mar 22, 2019

Uh oh!

cmonr commented Mar 26, 2019

Uh oh!

cmonr commented Mar 26, 2019

Uh oh!

mbed-ci commented Mar 26, 2019

Test run: SUCCESS

Uh oh!

cmonr commented Mar 26, 2019

Uh oh!

AnttiKauppila commented Mar 27, 2019

Uh oh!

cmonr commented Mar 27, 2019

Uh oh!

mattbrown015 commented Nov 5, 2019

Uh oh!

0xc0170 commented Nov 5, 2019

Uh oh!

mattbrown015 commented Nov 5, 2019

Uh oh!

mattbrown015 commented Nov 28, 2019

Uh oh!

Uh oh!

hasnainvirk commented Mar 22, 2019 •

edited

Loading

hasnainvirk commented Mar 22, 2019 •

edited

Loading