Skip to content

LoRaWAN: Fixing premature RX2 closure #10196

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Mar 27, 2019

Conversation

hasnainvirk
Copy link
Contributor

@hasnainvirk hasnainvirk commented Mar 22, 2019

Description

A bug while setting up RX start timers would result in premature closusre of RX2 window. The 'ack_Timeout_timer' would be invoked prematurely and
at that time RX2 window may be being demodulating. This resulted in
massive instability with any test that relied on Confirmed traffic or
lower data rates.

To fix the issue, we must know the length of the RX window in
milliseconds and for this purpose we have extended the
'get_rx_window_params(...)' API. The length of the time the window
may remain open must be accounted for while setting up
'ack_timeout_timer'.

While calculating ack timeout, we were ending up getting a random value
which may become less than 2 seconds. This is not allowed as per v1.0.2
specification.

To fix the issue we now take the random number from 0 to 2000 ms and
then add that to the fixed 2000 ms ack timeout value, guaranteeing a
value at least equal to 2000 ms.

Compliance Testing

Compliance test were run with Nucleo-476RG and SX1276. All tests passed.

Pull request type

[X] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

Reviewers

Release Notes

Hasnain Virk added 3 commits March 22, 2019 15:15
While calculating ack timeout, we were ending up getting a random value
which may become less than 2 seconds. This is not allowed as per v1.0.2
specification.

To fix the issue we now take the random number from 0 to 2000 ms and
then add that to the fixed 2000 ms ack timeout value, guaranteeing a
value at least equal to 2000 ms.
A bug while setting up RX start timers would result in premature closusre
of RX2 window. The 'ack_Timeout_timer' would be invoked prematurely and
at that time RX2 window may be being demodulating. This resulted in
massive instability with any test that relied on Confirmed traffic or
lower data rates.

To fix the issue, we must know the length of the RX window in
milliseconds and for this purpose we have extended the
'get_rx_window_params(...)' API. The length of the time the window
may remain open must be accounted for while setting up
'ack_timeout_timer'.
get_rx_window_params() API is changed. To reflect that changing the
stub.
@hasnainvirk
Copy link
Contributor Author

hasnainvirk commented Mar 22, 2019

@AnttiKauppila @kjbracey-arm Please review. @kimlep01 @0xc0170 @cmonr FYI.

@ciarmcom ciarmcom requested review from a team March 22, 2019 14:00
@ciarmcom
Copy link
Member

@hasnainvirk, thank you for your changes.
@ARMmbed/mbed-os-wan @ARMmbed/mbed-os-maintainers please review.

@hasnainvirk
Copy link
Contributor Author

@0xc0170 Failure doesn't seem to be related to the PR.

@cmonr
Copy link
Contributor

cmonr commented Mar 26, 2019

CI started

@cmonr
Copy link
Contributor

cmonr commented Mar 26, 2019

Whoops. Started CI early, since still waiting on reviewers, but will let it go since load is low.

@mbed-ci
Copy link

mbed-ci commented Mar 26, 2019

Test run: SUCCESS

Summary: 13 of 13 test jobs passed
Build number : 1
Build artifacts

@cmonr
Copy link
Contributor

cmonr commented Mar 26, 2019

LGTM.

@hasnainvirk @AnttiKauppila Marked this for 5.13, since the public function has changed.

Will merge in a bit.

cmonr pushed a commit to cmonr/mbed-os that referenced this pull request Mar 26, 2019
@cmonr cmonr merged commit a0a265a into ARMmbed:master Mar 27, 2019
@hasnainvirk hasnainvirk deleted the premature_rx2_fix branch March 27, 2019 06:37
@AnttiKauppila
Copy link

@cmonr Internal function has been modified and API is untouched. This should go to next patch release.

@cmonr
Copy link
Contributor

cmonr commented Mar 27, 2019

@AnttiKauppila Release label updated.

I knew that ATHandler was internal, but so is LoRaPHY? Is that because users should be expected to interface with LoRaWAN/LoRa, and not the Phy?

CC @ARMmbed/mbed-os-maintainers

@mattbrown015
Copy link
Contributor

Hi @hasnainvirk,

I've been away from Mbed-OS and LoRaWAN for a while, which is why I haven't been bothering you! :-)

I just came across this issue and wondered if I should find a way to get this fix in to our software build.

We released our product back in January '19 and haven't done much with it since then. We're still building with a version of Mbed-OS from Jan '19 (#d20b591).

We were pretty happy but as we've accrued operating hours we have seen a few failures. We've had over 100 devices operating for more than a month so that must be quite a lot of hours.

We've seen a few devices 'go quiet'. When we get them back we can see that LoRaWANInterface::send is returning LORAWAN_STATUS_WOULD_BLOCK and LoRaWANInterface::cancel_sending doesn't work. It appears the LoRaWAN stack is stuck in this state until we restart the device. Our application code appears to be working fine as it samples the sensor, writes the debug output to serial, etc.

I've not managed to reproduce the problem in the lab and therefore catch the debug output at the point of the first failure. Hence it could be anything causing the problem, including hardware issues like an unhappy crystal.

Do you think this issue would cause the LoRaWAN stack to get stuck in a state where it permanently thinks it is sending?

I'll see if I can find a way to artificially mess with the timers to make what you describe in the issue happen.

Thanks,
Matt

@0xc0170
Copy link
Contributor

0xc0170 commented Nov 5, 2019

@mattbrown015 Please create a new issue once you get details or open a question on our forum.

@hasnainvirk is not anymore working on Mbed OS, but will include here @ARMmbed/mbed-os-wan

@mattbrown015
Copy link
Contributor

Hi @0xc0170

@hasnainvirk is not anymore working on Mbed OS

Oh, that's a shame but I'm sure his absence will be more than covered by the rest of the team. Thanks for letting me know.

Please create a new issue once you get details

Yes, I haven't done that so far because I've got so little evidence I'm not sure what to say.

I saw this issue and #9951 and got a little over excited. Now I've thought about it some more it doesn't seem very likely that it is what is causing my problem.

Thanks,
Matt

@mattbrown015
Copy link
Contributor

I appreciate this is a very old PR and Hasnain is no longer on the project but I thought I'd add some notes. Mainly for my own benefit but perhaps others will come here and be interested.

The first part of the PR says:

This resulted in massive instability with any test that relied on Confirmed traffic or lower data rates.

This is not my experience but I think this might be because my end-device is connected to TTN. The fact that my RX2 downlinks are working doesn't mean that this PR isn't correct.

I set up my end-device to transmit large confirmed messages at DR0/SF12. The packet payload was 39 bytes and uplink TOA was 2466ms.

Pretty much every uplink that I watched was confirmed as you would expect in an RX2 downlink.

I also cheated and hard coded LoRaPHY::get_ack_timeout to return -1000 i.e. make the ack timeout as short as possible. This made no difference and every uplink was still confirmed.

I think perhaps, the difference between my test and Hasnain's is that my downlinks are transmitted at DR3/SF9, TOA 165ms. Hasnain's failing tests may well be transmitting the downlink at the same SF as the uplink i.e. DR0/SF12 and therefore have a much larger TOA and demodulation time.

I also tried making ack_timeout_timer timeout so short it would definitely fire during the RX2 window. This messed up the RX2 window and meant the acks weren't received but it didn't cause the stack to crash.

The second part of this PR says:

While calculating ack timeout, we were ending up getting a random value which may become less than 2 seconds. This is not allowed as per v1.0.2 specification.

I don't understand this statement from Hasnain. From LoRaWAN10220160812_1316_1Specification.pdf:

If an end-device does not receive a frame with the ACK bit set in one of the two receive windows immediately following the uplink transmission it may resend the same frame with the same payload and frame counter again at least ACK_TIMEOUT seconds after the second reception window.

and from LoRaWAN Regional Parameters v1.0.2rB as published_1939_1.pdf:

ACK_TIMEOUT 2 +/- 1 s (random delay between 1 and 3 seconds)

Hence I think ACK_TIMEOUT should be at least 1 second.

So previously LoRaPHY::get_ack_timeout would return -1 to 1 giving a total timeout of between 1 and 3. Now LoRaPHY::get_ack_timeout returns 0 to 1 giving a total timeout of between 2 and 3.

Does this make any practical difference? Probably not! Hopefully any end-device using confirmed uplinks expects to receive the ack and hence the ack timeout isn't used very often.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants