Network problems with 7 or more clients connected to the mbed Device Connector #266
ARM Internal Ref: IOTCLT-1854
@rspelta are you using the same mbed Device Connector (mDC) account for all devices?
I'm seeing enough "MAC TX fail" and "Source route error" messages in that log to make me think you have some sort of radio driver or hardware problem. Communication between the nodes does not seem to be as reliable as I'd expect. "MAC TX fail" means "I tried to transmit to a neighbour, but didn't get an Ack after multiple attempts". "Source route error A->B" means "LoWPAN router A is telling me (the border router) that they had a MAC TX error (see above) when forwarding to B". I'd run a simple ping test from outside the mesh to the nodes to see what sort of average packet loss you're getting - both with minimum size packets and larger ones (eg 500 or 1000 bytes). The aim should be to have only a couple of percent loss at minimum size, and try to get under 10% at 1000 bytes.
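To make those targets concrete, here is a small illustrative helper (a sketch only - the function names are not part of any mbed tooling) that evaluates ping results against the thresholds suggested above:

```cpp
#include <cassert>

// Illustrative helper: percentage of pings lost out of those sent.
double packet_loss_percent(int sent, int received) {
    return 100.0 * (sent - received) / sent;
}

// Thresholds from the comment above: aim for only a couple of percent
// loss at minimum packet size, and under 10% with 1000-byte packets.
bool mesh_link_acceptable(double loss_min_size_pct, double loss_1000b_pct) {
    return loss_min_size_pct <= 2.0 && loss_1000b_pct < 10.0;
}
```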
I've just spent a little while reviewing the Spirit driver code, just to see if I can see any obvious flaws. (Hard to say much without knowing the hardware, but I can look for general issues.) I'm a bit wary about the software ack handling - can be tricky. There is one specific problem that could be affecting performance now - it seems to me the acks are sent with a common send() routine that enables hardware CSMA-CA. An ack should be sent 192us after transmission completion, without CSMA. Backing off the ack will greatly reduce the chance of packets being successfully acknowledged. Other notes on ack reception - you're calling TX_DONE whenever tx_sequence == seq_number, whether you were expecting an ack or not. This could cause stack confusion in various ways (eg if you were backing off while someone else used the same sequence number). You should only process an ACK when you actually expect one (TX completed, and the AR bit was set in it). Also, while expecting an ack, it can be beneficial to report TX_FAIL and stop expecting when you receive anything other than an ack with the expected sequence number. The stack will eventually time out if it doesn't get TX_DONE, but receipt of anything else (including a wrongly-numbered ack) indicates a lack of acknowledgment, which can be reported immediately.
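The ack-reception rules above can be sketched as a small decision function. This is a hedged illustration, not the actual Spirit1 driver API - the names (PendingTx, on_frame_received, TxEvent) are invented for the example:

```cpp
#include <cassert>
#include <cstdint>

enum class TxEvent { NONE, TX_DONE, TX_FAIL };

struct PendingTx {
    bool    waiting_for_ack; // TX completed and the AR bit was set
    uint8_t seq_number;      // sequence number of the frame we sent
};

// Called for every received frame while a transmission may be pending.
TxEvent on_frame_received(PendingTx &tx, bool is_ack, uint8_t rx_seq) {
    if (!tx.waiting_for_ack)
        return TxEvent::NONE;        // not expecting an ack: ignore
    tx.waiting_for_ack = false;
    if (is_ack && rx_seq == tx.seq_number)
        return TxEvent::TX_DONE;     // the ack we were waiting for
    // Anything else (a data frame, or a wrongly-numbered ack) means our
    // frame was not acknowledged: report failure immediately rather
    // than waiting for the stack to time out.
    return TxEvent::TX_FAIL;
}
```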
Thanks a lot @kjbracey-arm! I have immediately tried to implement all your indications. Before pushing them, could you please be so kind as to review them? Above all, please let me know whether I got your points correctly, whether I am missing something, and whether you think the patch fixes the issues you have listed.
Hi, I used the last patch from @betzw. I did the tests with 9 sensor-node boards, with one mbed Device Connector (mDC) account for all devices. Now it seems more stable: all 9 clients are listed on the website and I didn't see any client disappear from the list. However I still read these messages in the log:
You can see the log of the border router (trace set to "info" level):
And the log of one client (trace set to "info" level):
@MarceloSalazar next test will be with more clients connected, so I will try your advice in order to have more info.
Hi all, I would conclude that this issue is NOT an mbed-client issue; it's just that it can't tolerate flaky networks (the TLS handshakes especially are a very, very touchy area). @kjbracey-arm knows the gory details, and somehow I feel there would be some options for making it more fault-tolerant (an issue towards mbedtls perhaps - https://github.com/ARMmbed/mbedtls) to make the situation better. The client can't do much about that TLS handshake thing - we just try to do it, and if it fails -> we can't connect to the server.
You will always see some "MCPS Data fail" messages - quite a few if the network is busy. Once a 6LoWPAN network accessing the mbed server has stabilised, I wouldn't expect to see many fails, as it shouldn't be that busy. Looking at the logs, my feeling is that performance is still a bit below par, but I'd need to see some proper stats (eg ping tests, as suggested above). Looking at those logs, one thing occurred to me, so I just checked the driver. It doesn't actually report "CCA_FAIL", so we can't tell the difference between "channel busy" (225) and "no ack" (233). The driver should be enhanced to distinguish the cases - handling IRQ_MAX_BO_CCA_REACH. It will help diagnostics, and may help performance (Nanostack will respond differently). I also wonder if your NumBackoffs parameter is set too low. The default for IEEE 802.15.4 is 4, and we tend to use 5. You've got it set to 3. Might be worth checking the other numbers. The driver should tell Nanostack the number of CSMA retries it did. Maybe you can't get that number out of the driver on success, so saying 0 would be okay, but you should say "4" on a CCA_FAIL if that's your NumBackoffs setting (ie 4 retries, 5 attempts total).
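The distinction being asked for can be sketched as follows. The numbers 225 (0xE1) and 233 (0xE9) are the IEEE 802.15.4 MAC status codes CHANNEL_ACCESS_FAILURE and NO_ACK; the callback shape and the function names here are illustrative, not the actual Spirit1 driver or Nanostack PHY API:

```cpp
#include <cassert>
#include <cstdint>

const uint8_t MAC_SUCCESS                = 0x00;
const uint8_t MAC_CHANNEL_ACCESS_FAILURE = 0xE1; // 225: CCA never saw a clear channel
const uint8_t MAC_NO_ACK                 = 0xE9; // 233: transmitted, but no ack received

struct TxReport {
    uint8_t status;
    uint8_t cca_retries; // CSMA retries the driver already performed
};

// max_backoff_cca_reached would come from the radio's
// IRQ_MAX_BO_CCA_REACH interrupt; num_backoffs is the driver's
// NumBackoffs setting. Reporting num_backoffs on a CCA failure tells
// Nanostack the driver already retried enough, so it should not
// immediately retry on its own.
TxReport classify_tx(bool max_backoff_cca_reached, bool ack_received,
                     uint8_t num_backoffs) {
    if (max_backoff_cca_reached)
        return { MAC_CHANNEL_ACCESS_FAILURE, num_backoffs };
    if (!ack_received)
        return { MAC_NO_ACK, 0 };
    return { MAC_SUCCESS, 0 };
}
```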
The number of backoffs can easily be changed here.
I have reflected a bit about CCA & backoff and would like to ask @kjbracey-arm how Nanostack interfaces to persistent CCA (i.e. how to integrate an RF chip supporting persistent CCA into Nanostack)?
Persistent CCA is not a thing I've ever come across before reading that Spirit1 datasheet. We assume backoff as per the 802.15.4 spec - a limited number of short-duration CCA attempts at random intervals. If I'm understanding correctly, that Spirit1 mode continuously monitors and transmits as soon as it's quiet? If so, it seems like it would just tend to cause collisions to me, as multiple nodes could tend to go simultaneously at the end of a transmission. Nanostack supports both drivers that don't have their own automatic backoff and those that do - if you report "CCA_FAIL" with 0 retries, then Nanostack will back off and retry itself, counting its attempts. If you report "CCA_FAIL" with sufficient retries, then Nanostack will not retry. (This mechanism is a bit wonky - eg Nanostack can be configured for 4 retries, but there's no way of it telling the driver how many it wants. If you did 3, Nanostack would decide that's not enough, so it would call you again, so presumably you'd do another 3 for a total of 6. Also the combination of Nanostack's random backoff plus whatever random backoff you have probably isn't ideal.)
I guess if you did want to try that persistent CCA, you'd just enable it, and then you would presumably only report CCA_FAIL (with an artificially high "retries" to stop Nanostack trying again) if you decided to time out in the driver.
Well, currently persistent CCA is enabled in the Spirit1 driver.
At first impressions persistent CCA sounds like a bad idea to me in a busy network, because the PHY has no collision detection, only collision avoidance. And the collision avoidance only works if nodes don't tend to transmit simultaneously. Persistent CCA seems like it would encourage simultaneous transmissions. It seems like a fudge to try to get higher bandwidth and lower latency, at the cost of higher power and working less well with multiple nodes. It's not standard 802.15.4 and I would suggest disabling it.
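For contrast, the standard unslotted CSMA-CA from IEEE 802.15.4 (the behaviour being recommended instead of persistent CCA) can be sketched like this. The constants are the 802.15.4 defaults; channel_clear and wait_backoff_periods are stand-ins for radio primitives, not real driver calls:

```cpp
#include <cassert>
#include <cstdlib>
#include <algorithm>

const int macMinBE = 3;
const int macMaxBE = 5;
const int macMaxCSMABackoffs = 4; // 5 CCA attempts in total

// Returns true if the channel was found clear (transmit now),
// false if all attempts failed (report CCA_FAIL to the stack).
template <typename CcaFn, typename WaitFn>
bool csma_ca_transmit(CcaFn channel_clear, WaitFn wait_backoff_periods) {
    int be = macMinBE;
    for (int nb = 0; nb <= macMaxCSMABackoffs; ++nb) {
        // Random delay of 0..(2^BE - 1) unit backoff periods
        // (a unit backoff period is 20 symbols).
        wait_backoff_periods(rand() % (1 << be));
        if (channel_clear())
            return true;
        be = std::min(be + 1, macMaxBE); // widen the backoff window
    }
    return false;
}
```

The random, widening delay is exactly what de-synchronises nodes that all want to transmit at the end of a busy period - the property persistent CCA gives up.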
I tried the ping test. I started with only one client connected to the mesh. I have seen that even with one node sometimes there is a problem. My test is:
What I have seen is that when all works correctly I have a 32B ping reply in 30ms and a 1000B ping reply in 330ms with no packets lost. If the timing rises to 2-3 seconds then the client starts to lose packets, reaching 25%-35% packet loss. If I reset the border router (not the client) then it returns to working well, with no packet loss and 30ms-330ms timings. Here is the worst case:
Hello Roberto,
I used the spirit driver commit with "5 max nr of backoffs" (140e6470983229aebac2de7256616c3c13f37c4b). When I start both (border router + client) all is OK. If I ping the client the response time is immediate.
Reply from 2002:4f15:d31f:e472:9b99:9999:9999:9999: time=610ms
After a while (a random time) the client starts losing about 25%-35% of packets. If I reset the client this doesn't change the situation. If I reset the border router the normal condition is restored.
note: the time is random, but if there are more clients connected then this issue happens sooner.
Just to observe, seeing your comment about the commit "5 max nr of backoffs" - I don't think the number of backoffs parameter is used when in persistent mode - there are no backoffs, so I wouldn't expect that particular patch to change anything. I still think the most important thing to try here is getting back to normal non-persistent 802.15.4 with backoffs, unless there is some reason to believe that doesn't work with this chipset. But with reference to the test above - to pin it down can you do the same test pinging the border router itself? Both when it's the only device, and when there are clients attached. Also, after the client pinging starts going weird, what do pings to the border router do?
when it's the only router:
C:\Users\Roberto>ping -6 -t -l 32 2002:5709:48f6:e472:280:e1ff:fe23:39
Pinging 2002:5709:48f6:e472:280:e1ff:fe23:39 with 32 bytes of data:
Ping statistics for 2002:5709:48f6:e472:280:e1ff:fe23:39:
C:\Users\Roberto>ping -6 -t -l 1000 2002:5709:48f6:e472:280:e1ff:fe23:39
Pinging 2002:5709:48f6:e472:280:e1ff:fe23:39 with 1000 bytes of data:
Ping statistics for 2002:5709:48f6:e472:280:e1ff:fe23:39:
with 1 client attached:
Pinging 2002:5709:48f6:e472:9b99:9999:9999:9999 with 32 bytes of data:
Ping statistics for 2002:5709:48f6:e472:9b99:9999:9999:9999:
C:\Users\Roberto>ping -6 -t -l 1000 2002:5709:48f6:e472:9b99:9999:9999:9999
Pinging 2002:5709:48f6:e472:9b99:9999:9999:9999 with 1000 bytes of data:
Ping statistics for 2002:5709:48f6:e472:9b99:9999:9999:9999:
BORDER ROUTER: When I connected the second client, pings start going weird:
CLIENT: after 10 minutes
Pinging 2002:5709:48f6:e472:280:e1ff:fe23:39 with 1000 bytes of data:
Ping statistics for 2002:5709:48f6:e472:280:e1ff:fe23:39:
CLIENT
Pinging 2002:5709:48f6:e472:9b99:9999:9999:9999 with 32 bytes of data:
Ping statistics for 2002:5709:48f6:e472:9b99:9999:9999:9999:
TRACE BORDER ROUTER
Even with this problem on the mbed Device Connector I can see both clients connected
PS. BORDER-ROUTER: CLIENT: Both clients re-connected to the mbed connector successfully
Weird. What if you turn off both clients after it's gone funny? Does the border router ever go back to normal?
If I turn off both clients the border router doesn't go back to normal.
Reply from 2002:4f33:7063:e472:0:ff:fe00:face: time<1ms
Can you intrusively debug the router - stop it to see where it's spending its time? Might not work if it's sleeping rather than busy. Alternatively, as I'm guessing whatever's taking the time is probably doing it in Nanostack's event loop, you could add trace to it, to try to see which event handler is consuming time. Add trace to eventOS_scheduler_dispatch_event(), to print out event ID and type before calling the function pointer, and see if you can visually spot any big delays. (There should be a regular event 10 times a second, so any excessively long-running handlers should obviously block that.) Or maybe flag up automatically if the function pointer takes an unusually long time - measure time with eventOS_event_timer_ticks(), and report any handlers that take more than 500ms (50 ticks).
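The suggested instrumentation could look roughly like this. This is a sketch, not the real eventOS code: the handler type, get_ticks parameter, and demo tick sources are invented stand-ins (Nanostack's eventOS_event_timer_ticks() uses 10ms ticks, so 50 ticks is 500ms):

```cpp
#include <cassert>
#include <cstdio>
#include <cstdint>

typedef void (*event_handler_t)(int event_id);

// Fake tick sources for demonstration only: each call advances the
// clock by a fixed number of ticks, simulating fast and slow handlers.
static uint32_t fast_t = 0, slow_t = 0;
uint32_t demo_fast_ticks(void) { return fast_t += 1; }
uint32_t demo_slow_ticks(void) { return slow_t += 60; }

// Wrap each handler call, measure elapsed ticks, and warn when a
// handler exceeds the 50-tick (500ms) budget. Returns true if the
// budget was exceeded, so callers/tests can check; real code would
// just trace the warning.
bool dispatch_with_trace(event_handler_t handler, int event_id,
                         uint32_t (*get_ticks)(void)) {
    uint32_t start = get_ticks();
    handler(event_id);
    uint32_t elapsed = get_ticks() - start;
    if (elapsed > 50) {
        printf("WARN: event %d handler took %lu ticks\n",
               event_id, (unsigned long)elapsed);
        return true;
    }
    return false;
}
```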
@kjbracey-arm @hasnainvirk - should/could the event loop actually have a WARN printout, if it spots any event that uses more than an acceptable amount of time?
@JanneKiiskila That is a kind of general problem that is already under discussion here. We could add it, but it requires that the common event loop is accepted into mbed OS first.
Maybe I have found a quite deterministic way to reproduce increasing values for ping6. Setup:
Procedure:
After four/five re-starts of the client I get the following
Another interesting observation I made this morning, after a long overnight run of the above-described setup, was that the client was no longer listed by the mbed connector under Connected Devices, while its output trace didn't show any suspicious output; on the contrary, the output continued as usual with:
Any ideas? cc @nikapov-ST
@betzw Maybe I have found a quite deterministic way to reproduce increasing values for ping6. @kjbracey-arm Does PR#4768 totally resolve the weird border router delay issues? Does this problem also happen with Atmel? Would it be a good idea for someone to try with Atmel, to see whether this problem happens there? I don't have the ATMEL AT233 15.4 shield here.
Hello, I sniffed the packets on the ethernet cable between the border router and the router connected to the internet (not via SIM card but via cable). When the border router is weird (ping problem) the log is this one:
I have the client trace of the picture above; I found out:
Just for my understanding:
2002:5236:726c:e472:9b99:9999:9999:9999 -> client
These packets are sniffed between the internet router and the border router.
Analyzing the trace, to me it seems as if the client trace is missing sniff entries from no. 77 up to no. 143 (both inclusive). Do you have any information about whether, and what, other traffic might have been going on in this period?
No, they aren't missing. Here is the list of the packets from the trace and the sniffed packets. [no.] [input/output] [tls message (copied from the client trace)]
update: |
@rspelta please let us know whether the issue is fixed now.
Hi all, today I downloaded the latest version of the "nanostack-border-router" with the latest commits from the sal-nanostack-driver-stm32-eth and stm-spirit1-rf-driver repositories.
Results:
Note for @betzw:
Great success! 🥇 Should that modification be passed on as a PR to sal-nanostack-driver-stm32-eth (and merged in)?
Is it possible that issue #298 and the necessity for large ring buffer sizes are two different faces of the same issue?
Reading the packets from the ethernet, I don't think so. Issue #298 happens when the client sends all the right packets but the mbed server doesn't reply (in this case the border router is innocent). About the size of the buffer, what I see from the sniffing is that after the "Server Hello Done" message from the server, the client sends 5 packets very quickly, and the border router must read them very quickly in order to send them to the mbed server. So if there are a lot of clients each sending 5 packets, it may be that the border router fails to read all of them if the buffer size is too small.
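The sizing argument above amounts to a simple back-of-the-envelope check (a sketch; the function name and the burst size of 5 frames are taken from the observation above, not from any driver code): the RX DMA ring must be able to hold the whole burst if several clients handshake at once, or frames get dropped.

```cpp
#include <cassert>

// Worst case: every client bursts frames_per_burst frames at once
// (e.g. the 5 handshake records after "Server Hello Done"), so the
// ring needs at least clients * frames_per_burst entries.
bool ring_can_absorb_burst(int ring_entries, int clients,
                           int frames_per_burst) {
    return ring_entries >= clients * frames_per_burst;
}
```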
@rspelta not sure I understood your point above. In the meantime I have made the number of TX/RX DMA buffer elements configurable (via the mbed config system).
Issue #298 is, for me, a problem with the mbed server, not with the border router, because I don't see the reply from the mbed server.
Does this mean that once the mbed server issue gets resolved you think that you might downsize the ring buffer again?
I think it can work better. The configurable TX/RX DMA buffer is a perfect solution for my case.
All, do you think we can close this issue?
For me it's OK. Thank you all!
Any remaining problems visible on the radio side at present? I don't know if all my long-ago comments on radio configuration were addressed - if not, remember to keep them in mind if you see more problems as node count increases.
Not at the moment, but certainly if any problems arise, I will check them. Thank you
Closing as there are no more problems to solve.
Hello, guys. Thanks
@jWladimir this repository is only for the Client application. Please check the border-router repository here: https://github.com/ARMmbed/nanostack-border-router Also, please open a new issue if something does not work.
The border router application already has an example config for F429ZI and the Spirit RF module: https://github.com/ARMmbed/nanostack-border-router/blob/master/configs/6lowpan_Spirit1_RF.json Also, the example configuration for 6LoWPAN already has the Spirit RF module enabled: https://github.com/ARMmbed/mbed-os-example-client/blob/master/configs/mesh_6lowpan_subg.json#L9 For debugging connectivity problems, please first use the mbed-os-example-mesh-minimal application. Once the connectivity is proven to work, continue with the Client example.
Description
With seven or more clients connected to the mbed Device Connector, some of them are temporarily not listed, or disappear until I reset the client.
We are using mbed-os-example-client for 10 boards; 3 of them are NUCLEO_F429ZI+X-NUCLEO-IDS01A4 boards and the others are sensor-node boards (it's a new kind of board we are developing; @MarceloSalazar knows it).
The border router uses the nanostack-border-router repository and is a NUCLEO_F429ZI+X-NUCLEO-IDS01A4; it uses an ethernet cable connected to the internet router (with 6to4 tunneling enabled). The mesh is via Spirit; we have worked with Wolfgang (@betzw) to make the spirit driver stable.
The NUCLEO_F429ZI boards use the latest commit of the repositories; the sensor-node boards come from these versions:

mbed-os-sensor-node (5f27acc)
|- easy-connect (6fb5842becae)
|  |- atmel-rf-driver (57f22763f4d3)
|  |- esp8266-driver (4ed87bf7fe37)
|  |  |- ESP8266\ATParser (269f14532b98)
|  |- mcr20a-rf-driver (d8810e105d7d)
|  |- stm-spirit1-rf-driver (ac7a4f477222)
|- mbed-client (f8f0fc8b9321)
|  |- mbed-client-c (c739b8cbcc57)
|  |- mbed-client-classic (f673b8b60779)
|  |- mbed-client-mbed-tls (7e1b6d815038)
|- mbed-os (ed4febefdede)
|- pal (4e46c0ea8706)

Every client has its own security.h file and a different MAC address.
We have done this test:
Here you can read the trace of the border router:
https://pastebin.com/ti8U3HgS
From the sniffer it seems that the lost clients (I mean the ones that disappear from the mbed connector list) continue to communicate with the border router. We don't have information about how the Nanostack works, so we have problems getting a better idea of what is going on.
To try to understand what happens we have:
Our goal is to have more than 10 boards connected to the mbed Device Connector.
What can we do?