[ST - NUCLEO_F429 + wifi] : not able to connect to device connector when trace is on #28

anttiylitokola · 2017-05-02T07:56:20Z

There is now some timing issue most probably in the driver side. With following patch client gets stuck even the traces are disabled. With 5ms delay problem occurs every time, changing to 2ms client works fine.
[delay.zip]
Can someone in the driver side check this a bit more?

Cloned from ARMmbed/mbed-os-example-client#223.

Check the last comment from the above issue.

yogpan01 · 2017-05-02T08:28:09Z

@geky Can you please have a look into this ? This is a serious issue as all of our WiFi driver use this as reference design for porting and bug on this driver will result in having all of other ported drivers to carry the same issue.
@MarceloSalazar FYI. I think you have observed lot of these issues with different WiFi drivers and combination of new boards and ESP drivers, please prirotize this issue.

geky · 2017-05-05T16:45:59Z

Did a bit of investigation, I think we've tracked down the issue.

Short story:
Update to the most recent version of the esp8266 driver, this has been fixed since #21. This will require a firmware update.

Long story:
We found this odd DNS issue a while back when our network here in Austin changed ARMmbed/mbed-os#3926. But it didn't occur consistently, making it difficult to track down. As a part of the update to the 2.0 firmware, we adopted AT+CIPDOMAIN as a workaround to move the DNS queries to the device side.

Since then, one of our new members took over the issue earlier this week and we were able to narrow down the issue to the fact that we were sending a larger packet then necessary for the DNS query. Since the DNS header contains the number of entries in the packet, we thought this was innocuous, but apparently some DNS servers use the packet size instead, causing spurious format errors based on the contents on the uninitialized memory.

@kjbracey-arm actually put up a fix for this as a part of a separate issue ARMmbed/mbed-os#4274

So why does this issue crop up after an arbitrary delay? My best guess is that the memory is used by some other process in the meanwhile, changing the uninitialized memory from innocuous zeros to more problematic data.

@anttiylitokola can you check if updating the firmware fixes these issues?

If so I'll create an esp8266 release to indicate that we're currently at a stable point.

anttiylitokola · 2017-05-06T09:56:51Z

@geky, No luck with the latest firmware either. I think the firmware update went through fine since I got the green finish button in ESP8266 download tool.
Firmware is taken from here:
https://developer.mbed.org/media/uploads/sarahmarshy/esp8266_v2.0_firmware.zip

Tried also with mbed-os patch (ARMmbed/mbed-os#4274) but no difference there. Sometimes the DNS query goes through but then the next send event fails with socket error.

geky · 2017-05-08T18:38:13Z

@anttiylitokola, thanks for checking, the mbed OS patch should have fixed the issue on the old firmware, so the issue must be elsewhere. Will continue looking.

Sidenote: I can't reproduce the issue with your delay patch, but can reproduce ARMmbed/mbed-os-example-client#223 with trace enabled (on NUCLEO_F429ZI).

geky · 2017-05-08T22:41:43Z

Hm, this line seems to be the problem:
https://github.com/ARMmbed/mbed-client-classic/blob/6b5141649ffb109e5f0a201b0866b6aff795e6d5/source/m2mconnectionhandlerpimpl.cpp#L86

The trace log halts after that line, and no more logging comes from trace or the esp8266 (the esp8266 continues the dns resolution on the K64F):

[DBG ][mClt]: init_socket - port 0
[DBG ][mClt]: init_socket - OUT
[DBG ][mClt]: resolve_server_address - Using TCP
AT> AT+CIPSTART=0,
[DBG ][mClt]: send_receive_event : _socket_state: ESocketStateConnectBeingCalled, ignoring event

Comment out that line and the client example proceeds to work just fine:

[DBG ][mClt]: init_socket - OUT
[DBG ][mClt]: resolve_server_address - Using TCP
AT> AT+CIPSTART=0,"TCP","169.45.82.18",5684
AT<
AT< AT+CIPSTART=0,"TCP","169.45.82.18",5684
AT< 0,CONNECT
AT<
AT= OK
[INFO][mClt]: pal_connect(): success
[DBG ][mClt]: M2MConnectionSecurityPimpl::init
[DBG ][mClt]: M2MConnectionSecurityPimpl::init - ret 0

Dropping into the debugger quickly reveals some red flags. Here's the stack trace:

#0     frame=0x2002fe28 pc=0x0000ad38 in M2MConnectionHandlerPimpl::send_receive_event (this=0x2000bfc8) at ./mbed-client/mbed-client-classic/source/m2mconnectionhandlerpimpl.cpp:87
#1-4   frame=0x2002fe30 pc=0x0000ada0 in socket_event_handler () at ./mbed-client/mbed-client-classic/source/m2mconnectionhandlerpimpl.cpp:105
#5-9   frame=0x2002fe78 pc=0x00042334 in TCPSocket::event (this=0x2000e680) at ./mbed-os/features/netsocket/TCPSocket.cpp:202
#10-14 frame=0x2002fed0 pc=0x00007d90 in ESP8266Interface::event (this=0x20000b9c <wifi>) at ./easy-connect/esp8266-driver/ESP8266Interface.cpp:279
#15-19 frame=0x2002ff20 pc=0x0000667e in BufferedSerial::rxIrq (this=0x20000ba4 <wifi+8>) at ./easy-connect/esp8266-driver/ESP8266/ATParser/BufferedSerial/BufferedSerial.cpp:123
#20    frame=0x2002ff78 pc=0x0001d058 in mbed::SerialBase::_irq_handler (id=536873892, irq_type=RxIrq) at ./mbed-os/drivers/SerialBase.cpp:88
#21    frame=0x2002ff98 pc=0x00048ce0 in uart_irq (transmit_empty=0, receive_full=32, index=3) at ./mbed-os/targets/TARGET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/serial_api.c:147
#22    frame=0x2002ffa8 pc=0x00048d98 in uart3_irq () at ./mbed-os/targets/TARGET_Freescale/TARGET_MCUXpresso_MCUS/TARGET_MCU_K64F/serial_api.c:172
#23    <signal handler called>

Well that looks like interrupt context.

Here's how socket_event_handler is registered:

M2MConnectionHandlerPimpl calls pal_asynchronousSocket - m2mconnectionhandlerpimpl.cpp:711
pal_asynchronousSocket calls pal_plat_asynchronousSocket - pal_network.c:372
pal_plat_asynchronousSocket calls Socket::attach - pal_plat_network.cpp:729

Now I can't vouch for the pal, but the socket attach function (here, recently renamed to sigio, this is where the esp8266 becomes involved) does warn about interrupt context:

The callback may be called in an interrupt context and should not
perform expensive operations such as recv/send calls.

So a trace call is happening in interrupt context? That doesn't sound right.

@jupe (I'm not really sure who's responsible for trace at the moment, so sorry to bother you), trace can't really support interrupt contexts right? Looking around it seems to end up calling the stdlib's puts function (here) and snprintf function (here), both of which are probably too heavy for interrupt contexts.

Oh huh, looks like there was a similar issue with tr_debug in mbed-client's irq context:
ARMmbed/mbed-client-classic#31

What a small world.

So I'm guessing the solution is to just move those debug statements out of debugging. The M2MConnectionSecurityPimpl class already uses the nanostack event loop for this purpose, so either these debug statements can be moved there, or just dropped if they're not needed.

It looks like the source of this issue came from here ARMmbed/mbed-client-classic#65. It seems like a pretty innocent mistake.

@anttiylitokola, but this leaves me a bit confused, because the delay patch doesn't seem related at all. With a clean checkout and your patch applied, I can't reproduce your issue with the NUCLEO_F429ZI+GCC_ARM or K64F+GCC_ARM on any of the build profiles. Am I missing something? Do you still run into the issue with ARMmbed/mbed-os#4274 applied and the tr_debug lines commented out?

jupe · 2017-05-09T04:09:27Z

yeah its not safe to call traces from interrupt context because it uses puts which takes awhile. Hmm could we detect somehow when call traces from irq and add some irq prefix for those prints to avoid "mysterious" problems in future?

adustm · 2017-05-09T09:48:25Z

Hi @geky : my status is in ARMmbed/mbed-os-example-client#223

anttiylitokola · 2017-05-09T10:38:35Z

@geky, I'm still hitting the same problem. Just made a fresh clone of mbed-cloud-client-example-source-internal(TAG: RR1.2.0-EA) repo and the problem still occurs on my end. Firmware is updated, tr_debug removed from the interrupt context and no luck. Sometimes the bootstrap flow goes through but then it fails to mDS registration...

What makes this weird is when I'm running mbed-client-testapp binary on the same device using exact same release and the server environment it works :)

geky · 2017-05-09T14:37:56Z

@jupe, ooh, that's a good idea, we were thinking about just asserting if in interrupt context, but didn't want to prevent actually debugging traces in irq.

@anttiylitokola, that seems to be a different repository that I don't have access to, does https://github.com/ARMmbed/mbed-cloud-client-example also show the issue?

JanneKiiskila · 2017-05-10T10:19:23Z

@geky - you can use the repo you mentioned and use that same tag, you'll get the same content.

JanneKiiskila · 2017-05-10T10:39:24Z

I would vote for that assert, if it gives a clear enough error note! @jupe

geky · 2017-05-11T15:44:53Z

Thanks @JanneKiiskila, @teetak01 walked me through a setup that reproduces the issue, so I can at least see the issue happening. Still trying to track down exactly where it goes wrong.

In regards to the assert, we can't simple assert on interrupts being enabled, since the serial callback still allows other interrupts to occur. But, we can assert on the state of the $sp register.

geky · 2017-05-18T16:21:15Z

It looks like this patch solved this issue:
https://github.com/ARMmbed/mbed-cloud-client-example-internal/pull/12/commits/7bcf1f5cc393e68e2c848b38fab5d3b68fea8c62

I can't take credit for finding this issue, but it seems that the mbed-cloud-client-example had a busy-loop that didn't leave enough processor time for the esp8266 to handle serial communication. An alternative solution may be to give the example thread a lower priority, but I'm not exactly sure what the mbed-cloud-client threading model is.

Let me know if I'm missing anything, but if there aren't any other issues with the esp8266, I'll go ahead and slap on a new release tag and close this issue.

JanneKiiskila · 2017-05-19T06:39:06Z

Yes, indeed that patch did solve the issue. I think we now need to think if
a) having a busy loop that is a bug in the app, and thus app fix is the right way to solve OR
b) the system (mbed OS) should be resilient enough to handle situations like that?

But, personally I feel this could be closed now.

kjbracey · 2017-05-19T08:05:03Z

A busy loop shouldn't be the end of the world, if there's nothing in a lower-priority thread than the thing busy looping.

In an RTOS like this where you have only hard thread priorities, you need to have some sort of system view of "highest thread priority in which bulk processing is permitted", because that will totally starve lower-priority threads.

We don't really have any sort of system view on that I'm aware of, but in practice most code runs at Normal, and hence that could include unhelpful app spinning like this, or long-running crypto operations. So anything running below Normal could potentially get 100% starved.

If that busy loop was at Normal, I would expect the ESP8266+serial drivers to cope, even if they use Normal priority. I'm slightly surprised that timeslicing between Normal isn't sufficient - it might need to move something up to higher than normal. I would normally expect core interrupt work to be deferred to RealTime.

There might be some links to discussions about waking here too - lack of proper serial wake could be impacting performance.

geky · 2017-05-19T20:44:40Z

You're still going to have to end up putting the network data somewhere.

Even if we give the ESP8266 its own high-priority thread, you will still have to deal with data arriving faster than the application can process it. You're going to actually need to move whatever component handles network data (client?) to a higher priority thread to avoid starvation.

Now, the real solution if we want the network layer to handle starvation would be to rely on TCP flow control. Unfortunately, the ESP8266 doesn't provide software flow control and relies on the RTS/CTS serial lines, which aren't exposed on most of the ESP8266 packages. I'm open to other suggestions, also note UDP does not have this issue.

geky · 2017-07-06T15:30:48Z

Time moves forward, I'm going to go ahead and close this. There a few improvements coming in, so here is the version with the current state of things: v.1.3

This was referenced May 2, 2017

[ST - NUCLEO_F429 + wifi] : not able to connect to device connector when trace is on ARMmbed/mbed-os-example-client#223

Closed

[NUCLEO_L476RG] Unable to connect when using Wifi (ESP8266) + TCP (default) ARMmbed/mbed-os-example-client#226

Closed

yogpan01 added the bug label May 2, 2017

yogpan01 assigned geky May 2, 2017

geky mentioned this issue May 9, 2017

Fix usage of tracing in callback handler ARMmbed/mbed-client-classic#74

Merged

geky closed this as completed Jul 6, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ST - NUCLEO_F429 + wifi] : not able to connect to device connector when trace is on #28

[ST - NUCLEO_F429 + wifi] : not able to connect to device connector when trace is on #28

anttiylitokola commented May 2, 2017 •

edited by JanneKiiskila

Loading

yogpan01 commented May 2, 2017

geky commented May 5, 2017 •

edited

Loading

anttiylitokola commented May 6, 2017

geky commented May 8, 2017 •

edited

Loading

geky commented May 8, 2017

jupe commented May 9, 2017

adustm commented May 9, 2017

anttiylitokola commented May 9, 2017

geky commented May 9, 2017

JanneKiiskila commented May 10, 2017

JanneKiiskila commented May 10, 2017

geky commented May 11, 2017

geky commented May 18, 2017

JanneKiiskila commented May 19, 2017 •

edited

Loading

kjbracey commented May 19, 2017

geky commented May 19, 2017 •

edited

Loading

geky commented Jul 6, 2017 •

edited

Loading

[ST - NUCLEO_F429 + wifi] : not able to connect to device connector when trace is on #28

[ST - NUCLEO_F429 + wifi] : not able to connect to device connector when trace is on #28

Comments

anttiylitokola commented May 2, 2017 • edited by JanneKiiskila Loading

yogpan01 commented May 2, 2017

geky commented May 5, 2017 • edited Loading

anttiylitokola commented May 6, 2017

geky commented May 8, 2017 • edited Loading

geky commented May 8, 2017

jupe commented May 9, 2017

adustm commented May 9, 2017

anttiylitokola commented May 9, 2017

geky commented May 9, 2017

JanneKiiskila commented May 10, 2017

JanneKiiskila commented May 10, 2017

geky commented May 11, 2017

geky commented May 18, 2017

JanneKiiskila commented May 19, 2017 • edited Loading

kjbracey commented May 19, 2017

geky commented May 19, 2017 • edited Loading

geky commented Jul 6, 2017 • edited Loading

anttiylitokola commented May 2, 2017 •

edited by JanneKiiskila

Loading

geky commented May 5, 2017 •

edited

Loading

geky commented May 8, 2017 •

edited

Loading

JanneKiiskila commented May 19, 2017 •

edited

Loading

geky commented May 19, 2017 •

edited

Loading

geky commented Jul 6, 2017 •

edited

Loading