-
Notifications
You must be signed in to change notification settings - Fork 3k
LoRaWAN: Fixing premature RX2 closure #10196
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
While calculating ack timeout, we were ending up getting a random value which may become less than 2 seconds. This is not allowed as per v1.0.2 specification. To fix the issue we now take the random number from 0 to 2000 ms and then add that to the fixed 2000 ms ack timeout value, guaranteeing a value at least equal to 2000 ms.
A bug while setting up RX start timers would result in premature closusre of RX2 window. The 'ack_Timeout_timer' would be invoked prematurely and at that time RX2 window may be being demodulating. This resulted in massive instability with any test that relied on Confirmed traffic or lower data rates. To fix the issue, we must know the length of the RX window in milliseconds and for this purpose we have extended the 'get_rx_window_params(...)' API. The length of the time the window may remain open must be accounted for while setting up 'ack_timeout_timer'.
get_rx_window_params() API is changed. To reflect that changing the stub.
@AnttiKauppila @kjbracey-arm Please review. @kimlep01 @0xc0170 @cmonr FYI. |
@hasnainvirk, thank you for your changes. |
@0xc0170 Failure doesn't seem to be related to the PR. |
CI started |
Whoops. Started CI early, since still waiting on reviewers, but will let it go since load is low. |
Test run: SUCCESSSummary: 13 of 13 test jobs passed |
LGTM. @hasnainvirk @AnttiKauppila Marked this for 5.13, since the public function has changed. Will merge in a bit. |
LoRaWAN: Fixing premature RX2 closure
@cmonr Internal function has been modified and API is untouched. This should go to next patch release. |
@AnttiKauppila Release label updated. I knew that ATHandler was internal, but so is LoRaPHY? Is that because users should be expected to interface with LoRaWAN/LoRa, and not the Phy? CC @ARMmbed/mbed-os-maintainers |
Hi @hasnainvirk, I've been away from Mbed-OS and LoRaWAN for a while, which is why I haven't been bothering you! :-) I just came across this issue and wondered if I should find a way to get this fix in to our software build. We released our product back in January '19 and haven't done much with it since then. We're still building with a version of Mbed-OS from Jan '19 (#d20b591). We were pretty happy but as we've accrued operating hours we have seen a few failures. We've had over 100 devices operating for more than a month so that must be quite a lot of hours. We've seen a few devices 'go quiet'. When we get them back we can see that I've not managed to reproduce the problem in the lab and therefore catch the debug output at the point of the first failure. Hence it could be anything causing the problem, including hardware issues like an unhappy crystal. Do you think this issue would cause the LoRaWAN stack to get stuck in a state where it permanently thinks it is sending? I'll see if I can find a way to artificially mess with the timers to make what you describe in the issue happen. Thanks, |
@mattbrown015 Please create a new issue once you get details or open a question on our forum. @hasnainvirk is not anymore working on Mbed OS, but will include here @ARMmbed/mbed-os-wan |
Hi @0xc0170
Oh, that's a shame but I'm sure his absence will be more than covered by the rest of the team. Thanks for letting me know.
Yes, I haven't done that so far because I've got so little evidence I'm not sure what to say. I saw this issue and #9951 and got a little over excited. Now I've thought about it some more it doesn't seem very likely that it is what is causing my problem. Thanks, |
I appreciate this is a very old PR and Hasnain is no longer on the project but I thought I'd add some notes. Mainly for my own benefit but perhaps others will come here and be interested. The first part of the PR says:
This is not my experience but I think this might be because my end-device is connected to TTN. The fact that my RX2 downlinks are working doesn't mean that this PR isn't correct. I set up my end-device to transmit large confirmed messages at DR0/SF12. The packet payload was 39 bytes and uplink TOA was 2466ms. Pretty much every uplink that I watched was confirmed as you would expect in an RX2 downlink. I also cheated and hard coded I think perhaps, the difference between my test and Hasnain's is that my downlinks are transmitted at DR3/SF9, TOA 165ms. Hasnain's failing tests may well be transmitting the downlink at the same SF as the uplink i.e. DR0/SF12 and therefore have a much larger TOA and demodulation time. I also tried making The second part of this PR says:
I don't understand this statement from Hasnain. From LoRaWAN10220160812_1316_1Specification.pdf:
and from LoRaWAN Regional Parameters v1.0.2rB as published_1939_1.pdf:
Hence I think ACK_TIMEOUT should be at least 1 second. So previously Does this make any practical difference? Probably not! Hopefully any end-device using confirmed uplinks expects to receive the ack and hence the ack timeout isn't used very often. |
Description
A bug while setting up RX start timers would result in premature closusre of RX2 window. The 'ack_Timeout_timer' would be invoked prematurely and
at that time RX2 window may be being demodulating. This resulted in
massive instability with any test that relied on Confirmed traffic or
lower data rates.
While calculating ack timeout, we were ending up getting a random value
which may become less than 2 seconds. This is not allowed as per v1.0.2
specification.
Compliance Testing
Compliance test were run with Nucleo-476RG and SX1276. All tests passed.
Pull request type
Reviewers
Release Notes