Skip to content

[ST - NUCLEO_F429 + wifi] : not able to connect to device connector when trace is on #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
anttiylitokola opened this issue May 2, 2017 · 17 comments
Assignees
Labels

Comments

@anttiylitokola
Copy link

anttiylitokola commented May 2, 2017

There is now some timing issue most probably in the driver side. With following patch client gets stuck even the traces are disabled. With 5ms delay problem occurs every time, changing to 2ms client works fine.
[delay.zip]
Can someone in the driver side check this a bit more?

Cloned from ARMmbed/mbed-os-example-client#223.

Check the last comment from the above issue.

@yogpan01
Copy link

yogpan01 commented May 2, 2017

@geky Can you please have a look into this ? This is a serious issue as all of our WiFi driver use this as reference design for porting and bug on this driver will result in having all of other ported drivers to carry the same issue.
@MarceloSalazar FYI. I think you have observed lot of these issues with different WiFi drivers and combination of new boards and ESP drivers, please prirotize this issue.

@geky
Copy link
Contributor

geky commented May 5, 2017

Did a bit of investigation, I think we've tracked down the issue.

Short story:
Update to the most recent version of the esp8266 driver, this has been fixed since #21. This will require a firmware update.

Long story:
We found this odd DNS issue a while back when our network here in Austin changed ARMmbed/mbed-os#3926. But it didn't occur consistently, making it difficult to track down. As a part of the update to the 2.0 firmware, we adopted AT+CIPDOMAIN as a workaround to move the DNS queries to the device side.

Since then, one of our new members took over the issue earlier this week and we were able to narrow down the issue to the fact that we were sending a larger packet then necessary for the DNS query. Since the DNS header contains the number of entries in the packet, we thought this was innocuous, but apparently some DNS servers use the packet size instead, causing spurious format errors based on the contents on the uninitialized memory.

@kjbracey-arm actually put up a fix for this as a part of a separate issue ARMmbed/mbed-os#4274

So why does this issue crop up after an arbitrary delay? My best guess is that the memory is used by some other process in the meanwhile, changing the uninitialized memory from innocuous zeros to more problematic data.

@anttiylitokola can you check if updating the firmware fixes these issues?

If so I'll create an esp8266 release to indicate that we're currently at a stable point.

@anttiylitokola
Copy link
Author

@geky, No luck with the latest firmware either. I think the firmware update went through fine since I got the green finish button in ESP8266 download tool.
Firmware is taken from here:
https://developer.mbed.org/media/uploads/sarahmarshy/esp8266_v2.0_firmware.zip

Tried also with mbed-os patch (ARMmbed/mbed-os#4274) but no difference there. Sometimes the DNS query goes through but then the next send event fails with socket error.

@geky
Copy link
Contributor

geky commented May 8, 2017

@anttiylitokola, thanks for checking, the mbed OS patch should have fixed the issue on the old firmware, so the issue must be elsewhere. Will continue looking.

Sidenote: I can't reproduce the issue with your delay patch, but can reproduce ARMmbed/mbed-os-example-client#223 with trace enabled (on NUCLEO_F429ZI).

@geky
Copy link
Contributor

geky commented May 8, 2017

Hm, this line seems to be the problem:
https://github.com/ARMmbed/mbed-client-classic/blob/6b5141649ffb109e5f0a201b0866b6aff795e6d5/source/m2mconnectionhandlerpimpl.cpp#L86

The trace log halts after that line, and no more logging comes from trace or the esp8266 (the esp8266 continues the dns resolution on the K64F):

[DBG ][mClt]: init_socket - port 0
[DBG ][mClt]: init_socket - OUT
[DBG ][mClt]: resolve_server_address - Using TCP
AT> AT+CIPSTART=0,
[DBG ][mClt]: send_receive_event : _socket_state: ESocketStateConnectBeingCalled, ignoring event

Comment out that line and the client example proceeds to work just fine:

[DBG ][mClt]: init_socket - OUT
[DBG ][mClt]: resolve_server_address - Using TCP
AT> AT+CIPSTART=0,"TCP","169.45.82.18",5684
AT<
AT< AT+CIPSTART=0,"TCP","169.45.82.18",5684
AT< 0,CONNECT
AT<
AT= OK
[INFO][mClt]: pal_connect(): success
[DBG ][mClt]: M2MConnectionSecurityPimpl::init
[DBG ][mClt]: M2MConnectionSecurityPimpl::init - ret 0

Dropping into the debugger quickly reveals some red flags. Here's the stack trace:

#0     frame=0x2002fe28 pc=0x0000ad38 in M2MConnectionHandlerPimpl::send_receive_event (this=0x2000bfc8) at ./mbed-client/mbed-client-classic/source/m2mconnectionhandlerpimpl.cpp:87
#1-4   frame=0x2002fe30 pc=0x0000ada0 in socket_event_handler () at ./mbed-client/mbed-client-classic/source/m2mconnectionhandlerpimpl.cpp:105
#5-9   frame=0x2002fe78 pc=0x00042334 in TCPSocket::event (this=0x2000e680) at ./mbed-os/features/netsocket/TCPSocket.cpp:202
#10-14 frame=0x2002fed0 pc=0x00007d90 in ESP8266Interface::event (this=0x20000b9c <wifi>) at ./easy-connect/esp8266-driver/ESP8266Interface.cpp:279
#15-19 frame=0x2002ff20 pc=0x0000667e in BufferedSerial::rxIrq (this=0x20000ba4 <wifi+8>) at ./easy-connect/esp8266-driver/ESP8266/ATParser/BufferedSerial/BufferedSerial.cpp:123
#20    frame=0x2002ff78 pc=0x0001d058 in mbed::SerialBase::_irq_handler (id=536873892, irq_type=RxIrq) at ./mbed-os/drivers/SerialBase.cpp:88
#21    frame=0x2002ff98 pc=0x00048ce0 in uart_irq (transmit_empty=0, receive_full=32, index=3) at ./mbed-os/targets/TARGET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/serial_api.c:147
#22    frame=0x2002ffa8 pc=0x00048d98 in uart3_irq () at ./mbed-os/targets/TARGET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/serial_api.c:172
#23    <signal handler called>

Well that looks like interrupt context.

Here's how socket_event_handler is registered:

Now I can't vouch for the pal, but the socket attach function (here, recently renamed to sigio, this is where the esp8266 becomes involved) does warn about interrupt context:

The callback may be called in an interrupt context and should not
perform expensive operations such as recv/send calls.

So a trace call is happening in interrupt context? That doesn't sound right.

@jupe (I'm not really sure who's responsible for trace at the moment, so sorry to bother you), trace can't really support interrupt contexts right? Looking around it seems to end up calling the stdlib's puts function (here) and snprintf function (here), both of which are probably too heavy for interrupt contexts.

Oh huh, looks like there was a similar issue with tr_debug in mbed-client's irq context:
ARMmbed/mbed-client-classic#31

What a small world.

So I'm guessing the solution is to just move those debug statements out of debugging. The M2MConnectionSecurityPimpl class already uses the nanostack event loop for this purpose, so either these debug statements can be moved there, or just dropped if they're not needed.

It looks like the source of this issue came from here ARMmbed/mbed-client-classic#65. It seems like a pretty innocent mistake.

@anttiylitokola, but this leaves me a bit confused, because the delay patch doesn't seem related at all. With a clean checkout and your patch applied, I can't reproduce your issue with the NUCLEO_F429ZI+GCC_ARM or K64F+GCC_ARM on any of the build profiles. Am I missing something? Do you still run into the issue with ARMmbed/mbed-os#4274 applied and the tr_debug lines commented out?

@jupe
Copy link

jupe commented May 9, 2017

yeah its not safe to call traces from interrupt context because it uses puts which takes awhile. Hmm could we detect somehow when call traces from irq and add some irq prefix for those prints to avoid "mysterious" problems in future?

@adustm
Copy link
Member

adustm commented May 9, 2017

Hi @geky : my status is in ARMmbed/mbed-os-example-client#223

@anttiylitokola
Copy link
Author

@geky, I'm still hitting the same problem. Just made a fresh clone of mbed-cloud-client-example-source-internal(TAG: RR1.2.0-EA) repo and the problem still occurs on my end. Firmware is updated, tr_debug removed from the interrupt context and no luck. Sometimes the bootstrap flow goes through but then it fails to mDS registration...

What makes this weird is when I'm running mbed-client-testapp binary on the same device using exact same release and the server environment it works :)

@geky
Copy link
Contributor

geky commented May 9, 2017

@jupe, ooh, that's a good idea, we were thinking about just asserting if in interrupt context, but didn't want to prevent actually debugging traces in irq.

@anttiylitokola, that seems to be a different repository that I don't have access to, does https://github.com/ARMmbed/mbed-cloud-client-example also show the issue?

@JanneKiiskila
Copy link

@geky - you can use the repo you mentioned and use that same tag, you'll get the same content.

@JanneKiiskila
Copy link

I would vote for that assert, if it gives a clear enough error note! @jupe

@geky
Copy link
Contributor

geky commented May 11, 2017

Thanks @JanneKiiskila, @teetak01 walked me through a setup that reproduces the issue, so I can at least see the issue happening. Still trying to track down exactly where it goes wrong.

In regards to the assert, we can't simple assert on interrupts being enabled, since the serial callback still allows other interrupts to occur. But, we can assert on the state of the $sp register.

@geky
Copy link
Contributor

geky commented May 18, 2017

It looks like this patch solved this issue:
https://github.com/ARMmbed/mbed-cloud-client-example-internal/pull/12/commits/7bcf1f5cc393e68e2c848b38fab5d3b68fea8c62

I can't take credit for finding this issue, but it seems that the mbed-cloud-client-example had a busy-loop that didn't leave enough processor time for the esp8266 to handle serial communication. An alternative solution may be to give the example thread a lower priority, but I'm not exactly sure what the mbed-cloud-client threading model is.

Let me know if I'm missing anything, but if there aren't any other issues with the esp8266, I'll go ahead and slap on a new release tag and close this issue.

@JanneKiiskila
Copy link

JanneKiiskila commented May 19, 2017

Yes, indeed that patch did solve the issue. I think we now need to think if
a) having a busy loop that is a bug in the app, and thus app fix is the right way to solve OR
b) the system (mbed OS) should be resilient enough to handle situations like that?

But, personally I feel this could be closed now.

@kjbracey
Copy link

A busy loop shouldn't be the end of the world, if there's nothing in a lower-priority thread than the thing busy looping.

In an RTOS like this where you have only hard thread priorities, you need to have some sort of system view of "highest thread priority in which bulk processing is permitted", because that will totally starve lower-priority threads.

We don't really have any sort of system view on that I'm aware of, but in practice most code runs at Normal, and hence that could include unhelpful app spinning like this, or long-running crypto operations. So anything running below Normal could potentially get 100% starved.

If that busy loop was at Normal, I would expect the ESP8266+serial drivers to cope, even if they use Normal priority. I'm slightly surprised that timeslicing between Normal isn't sufficient - it might need to move something up to higher than normal. I would normally expect core interrupt work to be deferred to RealTime.

There might be some links to discussions about waking here too - lack of proper serial wake could be impacting performance.

@geky
Copy link
Contributor

geky commented May 19, 2017

You're still going to have to end up putting the network data somewhere.

Even if we give the ESP8266 its own high-priority thread, you will still have to deal with data arriving faster than the application can process it. You're going to actually need to move whatever component handles network data (client?) to a higher priority thread to avoid starvation.

Now, the real solution if we want the network layer to handle starvation would be to rely on TCP flow control. Unfortunately, the ESP8266 doesn't provide software flow control and relies on the RTS/CTS serial lines, which aren't exposed on most of the ESP8266 packages. I'm open to other suggestions, also note UDP does not have this issue.

@geky
Copy link
Contributor

geky commented Jul 6, 2017

Time moves forward, I'm going to go ahead and close this. There a few improvements coming in, so here is the version with the current state of things: v.1.3

@geky geky closed this as completed Jul 6, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants