-
Notifications
You must be signed in to change notification settings - Fork 24
[ST - NUCLEO_F429 + wifi] : not able to connect to device connector when trace is on #28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@geky Can you please have a look into this ? This is a serious issue as all of our WiFi driver use this as reference design for porting and bug on this driver will result in having all of other ported drivers to carry the same issue. |
Did a bit of investigation, I think we've tracked down the issue. Short story: Long story: Since then, one of our new members took over the issue earlier this week and we were able to narrow down the issue to the fact that we were sending a larger packet then necessary for the DNS query. Since the DNS header contains the number of entries in the packet, we thought this was innocuous, but apparently some DNS servers use the packet size instead, causing spurious format errors based on the contents on the uninitialized memory. @kjbracey-arm actually put up a fix for this as a part of a separate issue ARMmbed/mbed-os#4274 So why does this issue crop up after an arbitrary delay? My best guess is that the memory is used by some other process in the meanwhile, changing the uninitialized memory from innocuous zeros to more problematic data. @anttiylitokola can you check if updating the firmware fixes these issues? If so I'll create an esp8266 release to indicate that we're currently at a stable point. |
@geky, No luck with the latest firmware either. I think the firmware update went through fine since I got the green finish button in ESP8266 download tool. Tried also with mbed-os patch (ARMmbed/mbed-os#4274) but no difference there. Sometimes the DNS query goes through but then the next send event fails with socket error. |
@anttiylitokola, thanks for checking, the mbed OS patch should have fixed the issue on the old firmware, so the issue must be elsewhere. Will continue looking. Sidenote: I can't reproduce the issue with your delay patch, but can reproduce ARMmbed/mbed-os-example-client#223 with trace enabled (on NUCLEO_F429ZI). |
Hm, this line seems to be the problem: The trace log halts after that line, and no more logging comes from trace or the esp8266 (the esp8266 continues the dns resolution on the K64F):
Comment out that line and the client example proceeds to work just fine:
Dropping into the debugger quickly reveals some red flags. Here's the stack trace:
Well that looks like interrupt context. Here's how socket_event_handler is registered:
Now I can't vouch for the pal, but the socket attach function (here, recently renamed to sigio, this is where the esp8266 becomes involved) does warn about interrupt context:
So a trace call is happening in interrupt context? That doesn't sound right. @jupe (I'm not really sure who's responsible for trace at the moment, so sorry to bother you), trace can't really support interrupt contexts right? Looking around it seems to end up calling the stdlib's puts function (here) and snprintf function (here), both of which are probably too heavy for interrupt contexts. Oh huh, looks like there was a similar issue with tr_debug in mbed-client's irq context: What a small world. So I'm guessing the solution is to just move those debug statements out of debugging. The M2MConnectionSecurityPimpl class already uses the nanostack event loop for this purpose, so either these debug statements can be moved there, or just dropped if they're not needed. It looks like the source of this issue came from here ARMmbed/mbed-client-classic#65. It seems like a pretty innocent mistake. @anttiylitokola, but this leaves me a bit confused, because the delay patch doesn't seem related at all. With a clean checkout and your patch applied, I can't reproduce your issue with the NUCLEO_F429ZI+GCC_ARM or K64F+GCC_ARM on any of the build profiles. Am I missing something? Do you still run into the issue with ARMmbed/mbed-os#4274 applied and the tr_debug lines commented out? |
yeah its not safe to call traces from interrupt context because it uses puts which takes awhile. Hmm could we detect somehow when call traces from irq and add some irq prefix for those prints to avoid "mysterious" problems in future? |
Hi @geky : my status is in ARMmbed/mbed-os-example-client#223 |
@geky, I'm still hitting the same problem. Just made a fresh clone of mbed-cloud-client-example-source-internal(TAG: RR1.2.0-EA) repo and the problem still occurs on my end. Firmware is updated, tr_debug removed from the interrupt context and no luck. Sometimes the bootstrap flow goes through but then it fails to mDS registration... What makes this weird is when I'm running mbed-client-testapp binary on the same device using exact same release and the server environment it works :) |
@jupe, ooh, that's a good idea, we were thinking about just asserting if in interrupt context, but didn't want to prevent actually debugging traces in irq. @anttiylitokola, that seems to be a different repository that I don't have access to, does https://github.com/ARMmbed/mbed-cloud-client-example also show the issue? |
@geky - you can use the repo you mentioned and use that same tag, you'll get the same content. |
I would vote for that assert, if it gives a clear enough error note! @jupe |
Thanks @JanneKiiskila, @teetak01 walked me through a setup that reproduces the issue, so I can at least see the issue happening. Still trying to track down exactly where it goes wrong. In regards to the assert, we can't simple assert on interrupts being enabled, since the serial callback still allows other interrupts to occur. But, we can assert on the state of the $sp register. |
It looks like this patch solved this issue: I can't take credit for finding this issue, but it seems that the mbed-cloud-client-example had a busy-loop that didn't leave enough processor time for the esp8266 to handle serial communication. An alternative solution may be to give the example thread a lower priority, but I'm not exactly sure what the mbed-cloud-client threading model is. Let me know if I'm missing anything, but if there aren't any other issues with the esp8266, I'll go ahead and slap on a new release tag and close this issue. |
Yes, indeed that patch did solve the issue. I think we now need to think if But, personally I feel this could be closed now. |
A busy loop shouldn't be the end of the world, if there's nothing in a lower-priority thread than the thing busy looping. In an RTOS like this where you have only hard thread priorities, you need to have some sort of system view of "highest thread priority in which bulk processing is permitted", because that will totally starve lower-priority threads. We don't really have any sort of system view on that I'm aware of, but in practice most code runs at Normal, and hence that could include unhelpful app spinning like this, or long-running crypto operations. So anything running below Normal could potentially get 100% starved. If that busy loop was at Normal, I would expect the ESP8266+serial drivers to cope, even if they use Normal priority. I'm slightly surprised that timeslicing between Normal isn't sufficient - it might need to move something up to higher than normal. I would normally expect core interrupt work to be deferred to RealTime. There might be some links to discussions about waking here too - lack of proper serial wake could be impacting performance. |
You're still going to have to end up putting the network data somewhere. Even if we give the ESP8266 its own high-priority thread, you will still have to deal with data arriving faster than the application can process it. You're going to actually need to move whatever component handles network data (client?) to a higher priority thread to avoid starvation. Now, the real solution if we want the network layer to handle starvation would be to rely on TCP flow control. Unfortunately, the ESP8266 doesn't provide software flow control and relies on the RTS/CTS serial lines, which aren't exposed on most of the ESP8266 packages. I'm open to other suggestions, also note UDP does not have this issue. |
Time moves forward, I'm going to go ahead and close this. There a few improvements coming in, so here is the version with the current state of things: v.1.3 |
There is now some timing issue most probably in the driver side. With following patch client gets stuck even the traces are disabled. With 5ms delay problem occurs every time, changing to 2ms client works fine.
[delay.zip]
Can someone in the driver side check this a bit more?
Cloned from ARMmbed/mbed-os-example-client#223.
Check the last comment from the above issue.
The text was updated successfully, but these errors were encountered: