Library gets stuck in OPMODE 888 #547
Comments
Thanks for the report @jpmeijers. If you can get a link map, and then print out the contents of the LMIC job queues, that will help determine the state of the LMIC, which will in turn help figure out where we got stuck. The contents of It's a little awkward; you need to add a helper utility in oslmic.c, because the key structure is static and therefore hidden. In oslmic.c:
osjob_t ** os_getJobQueue(bit_t fTimed) {
    return fTimed ? &OS.scheduledjobs : &OS.runnablejobs;
}
Then in your sketch:
extern "C" { osjob_t ** os_getJobQueue(bit_t); }
if (failure) {
    Serial.print("now="); Serial.println(os_getTime());
    for (bit_t fType = 0; fType < 2; ++fType) {
        // os_getJobQueue() returns the address of the queue head; dereference it to get the first job
        osjob_t *pJob = *os_getJobQueue(fType);
        // fType == 0 selects the runnable queue, fType == 1 the scheduled (timed) queue
        Serial.println(fType ? "scheduled queue:" : "runnable queue:");
        for (; pJob != NULL; pJob = pJob->next) {
            Serial.print("0x"); Serial.print((uintptr_t) pJob, HEX);
            // osjob_t stores its callback in .func
            Serial.print(": .func=0x"); Serial.print((uintptr_t) pJob->func, HEX);
            Serial.print(", deadline="); Serial.println(pJob->deadline);
        }
    }
}
Which BSP are you using, by the way? The Arduino Zero BSP? |
Strangely enough, I get a different issue when running the device on my desk. I can't yet reproduce the stuck-in-mode-888 state that I've seen in the field. After running for between 30 and 60 minutes the link dead state was detected:
A while later I started seeing something else happening. Every time I enqueue a packet for TX, LMIC goes into the "prevent TX lining up after beacon" mode. This can take 2 minutes before it passed and the TX is performed.
and only two minutes later
|
Thinking about this, it could be that I'm running at SF12 now, and then being blocked from the channel for 2 minutes is correct. So this might not yet be the issue I'm looking for. |
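When reading opmode values such as the 888 in the issue title or the "prevent TX lining up after beacon" state mentioned above, it can help to split the value into its individual flag bits. The following is a minimal host-side sketch, not part of LMIC: the helper name describeOpmode is hypothetical, the value is assumed to be printed in hexadecimal, and the flag masks are written from memory of the opmode enum in the library's lmic.h, so check them against your copy before relying on them.

```cpp
#include <cstdint>
#include <string>

// One opmode flag bit and its symbolic name.
struct OpmodeFlag {
    uint16_t mask;
    const char *name;
};

// ASSUMPTION: masks as recalled from the opmode enum in lmic.h; verify locally.
static const OpmodeFlag kOpmodeFlags[] = {
    {0x0001, "OP_SCAN"},     {0x0002, "OP_TRACK"},    {0x0004, "OP_JOINING"},
    {0x0008, "OP_TXDATA"},   {0x0010, "OP_POLL"},     {0x0020, "OP_REJOIN"},
    {0x0040, "OP_SHUTDOWN"}, {0x0080, "OP_TXRXPEND"}, {0x0100, "OP_RNDTX"},
    {0x0200, "OP_PINGINI"},  {0x0400, "OP_PINGABLE"}, {0x0800, "OP_NEXTCHNL"},
    {0x1000, "OP_LINKDEAD"}, {0x2000, "OP_TESTMODE"}, {0x4000, "OP_UNJOIN"},
};

// Hypothetical helper: returns the names of all set bits joined by '|',
// in ascending mask order, or "OP_NONE" when no bit is set.
std::string describeOpmode(uint16_t opmode) {
    std::string out;
    for (const OpmodeFlag &flag : kOpmodeFlags) {
        if (opmode & flag.mask) {
            if (!out.empty()) out += '|';
            out += flag.name;
        }
    }
    return out.empty() ? std::string("OP_NONE") : out;
}
```

Under those assumptions, describeOpmode(0x888) reports OP_TXDATA, OP_TXRXPEND and OP_NEXTCHNL together, which suggests that more than one flag is set when the log shows 888.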
Thanks for testing.
When you say "lining up after beacon" what are you referring to? Are you running in Class B mode? In Class A, the delays between packets are inserted due to duty-cycle limitations. I think you're referring to the comment on Here are the conditions:
You might want to add |
I let the previous code run since yesterday, and it stopped running at 20:51 (local time).
The log output stopped at 20:15:48, and the last packet was also received by TTN at the same time. From the code that I posted originally it seems like we are stuck in my tx_blocking function, and specifically inside:
Could it be that |
Yes, I believe I'm reproducing the issue I saw in the field. After the previous lockup I reset the board, after which it ran just over 2 hours before it locked up again. |
Well, no, there's no loop in I suggest you add (after incrementing
if (loopCount % 1000 == 0) { // <--- adjust the 1000 as needed
    // consider this, too:
    // Serial.print(millis()); -- this will show whether time is changing but limited to a small range
    Serial.print('.'); // <=== this is just to make the code path visible. Proves that we're not stuck.
}
It is possible that time is not advancing, which could happen because interrupts are disabled. Also, the fact that you're stuck in the |
So far the debugging output still points to My current theory is that brownouts are causing either a lockup or a reset of the board sometimes when the LoRa radio transmits. On some boards I have brownout detection disabled, which would explain the board locking up when a brownout happens. On other boards I have seen unexplained resets, which can also be explained by brownout detection resetting the board. However, I still do not know why I only started to see this recently. Maybe all the fixes applied to LMIC also made the radio consume more power. My boards clearly do not have good enough filtering on Vcc, and it seems like my current batch of Duracell AA batteries is not as good as it should be. Be wary of 12-pack "special offer" batteries. I have now added an extra 10µF cap just after the batteries. I'll report back in a couple of hours on whether that helped. |
Any news? Certainly, the LMIC will do more back-to-back transmits, because if there's a downlink that needs an ack, it will immediately reply (2.3.2 LMIC would very likely ignore it). I can see that there's a need for power management in this area, so perhaps we can teach the LMIC to be less aggressive. However, I think we should first move the LMIC to an explicit FSM (rather than an implicit one defined by the callback sequence). (And first of all, of course, fix #524!) |
It's difficult to say if it's still occurring or not. At least it's much less with the extra cap on Vcc than before. I had a board running fine on a serial terminal logging the output until just above 813 million milliseconds. The logging however stopped due to load shedding (power outage) and the laptop battery running out. The board however did lock up at some point after that because there is an LED that should flash that just remained on. So I still can't report anything. I was unaware of #524. My code depends a lot on interrupts, especially the sleep logic that uses the WDT ISR. Those could quite possibly interfere. I totally agree about using an FSM. It does however sound like a lot of extra work to change. |
I'm once again trying to debug this further. Even with all the extra caps (currently 220uF on the RFM95) I still get that the board locks up while running I also tried pushing the SPI frequency down to 100kHz without any positive effect. It maybe even locked up during an earlier transmit. Also I noticed that the lockup happens during the 3rd call to |
Hello guys, sorry to barge in like that. But I have a few questions:
On ARM Cortex-M0+, micros() and delay() are based on a SysTick interrupt counter. If you don't allow interrupts often enough and for long enough, you don't get correct micros. The value will still change (the hardware register changes), but the millis part (the interrupt counter) will not, so time can easily appear to jump into the past from time to time. My "beast" is an STM32L051C8x-based custom-made board with an SX1276 as a modem. I never saw any lockups as you describe them, but my calls to os_runloop_once() are spaced more than 10 ms apart and I govern the LoRa JOIN/TX/RX/DutyCycle states from outside os_runloop, thanks to the event callback. |
Thanks for the comments @altishchenko. Definitely a few things to think about there. Funny thing is that I have this issue on both an ATSAMD21 and an ATmega1284p board. Roughly what I found on the ATmega1284p was the following: As a sanity check I swapped out LMIC for the Matthijs Kooijman one, trying to change as little as possible in my code. I did this yesterday and so far the board has been running for 63000s. |
@jpmeijers While debugging this issue, can you still try delay()? Even delay(2) might do the trick. This is needed to check whether disabling IRQs too often and for too long may be the cause of the resulting race condition and lockup. |
I've added 10ms delays between the |
@jpmeijers , @terrillmoore
In my RTE this leads to HardFault() exception and CPU reset, your mileage may vary. As a quick measure, I suggest adding the following lines to
Here flag |
@jpmeijers How is your board doing? Still up and running? |
I believe it's positive. The board runs for hours on end without locking up. I however get a brown out detection restart sometimes. I haven't seen the brown out on the kooijman library. And considering I have 220uF on the radio's Vcc this is very strange. |
I've enabled LMIC_USE_INTERRUPTS today. With that enabled my board does not want to join the network. I double-checked the pinout of the three DIOs and buzzed them through to make sure they are connected. DIO0=PB4, DIO1=PB3, DIO2=PB2 on the ATmega1284p. These pins all have PCINTs associated with them. Switching LMIC_USE_INTERRUPTS off makes the board join again. Update: |
@jpmeijers Good that it is working somehow, bad about the interrupts; I can't help you there. I am running on a custom-made dedicated module where every pin is at my command. Though I do have this: for some reason, with interrupts enabled (and I have them always on), my module refuses to join at the default JoinDr of the library (most probably it misses the window). Is the BOR you are getting on the main CPU or on the modem? Or is it a CPU BOR caused by the modem? Could it be somewhere around the TCXO wakeup times, if there is a TCXO at all? My modem is an SX1276 and with my board design it usually requires around 2 ms to come to its senses, so I have this in my code:
|
@jpmeijers The other library you use does not use interrupts. |
Closing issue for now as many things changed between the OP and the current v3.2.0 version. |
Description
When using this fork of LMIC my board will get stuck, unable to transmit LoRa messages after either a couple of minutes, but sometimes up to an hour after startup and a successful OTAA join.
I will block for 60 seconds for LMIC to send out data, after which I will give up and continue executing the rest of my code.
os_runloop_once() is still called very frequently, but the LMIC state will remain in 888 (OP_TXRXPEND), even at the start of the next time I want to transmit a message.
setup()
loop()
tx_blocking(data, dataLength, port)
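The "block for up to 60 seconds, then give up" behaviour described above can be sketched as a bounded polling loop. This is not the reporter's tx_blocking; it is a minimal illustration with hypothetical names (waitUntilDone, nowMs, step, done), written so it can be exercised on a host with a fake clock. It uses std::function for brevity, which may not be available on small AVR toolchains.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical helper: calls `step` repeatedly until `done` returns true or
// `timeoutMs` milliseconds have elapsed according to `nowMs`.
// Returns true on completion, false on timeout.
// The unsigned subtraction keeps the elapsed-time check correct even when
// the millisecond counter wraps around.
bool waitUntilDone(const std::function<uint32_t()> &nowMs,
                   const std::function<void()> &step,
                   const std::function<bool()> &done,
                   uint32_t timeoutMs) {
    const uint32_t start = nowMs();
    while (!done()) {
        if (static_cast<uint32_t>(nowMs() - start) >= timeoutMs) {
            return false;  // gave up; the caller decides how to recover
        }
        step();  // in an LMIC sketch this would be os_runloop_once()
    }
    return true;
}
```

In a sketch, nowMs would wrap millis(), step would call os_runloop_once(), and done could test that OP_TXRXPEND is no longer set in LMIC.opmode. Note that the loop only times out if the clock keeps advancing, so it cannot by itself recover from the case discussed in this thread where time stops because interrupts are disabled.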
Serial output
Environment