Skip to content
This repository was archived by the owner on Apr 24, 2019. It is now read-only.

Network problems with 7 or more clients connected to the mbed Device Connector #266

Closed
rspelta opened this issue Jun 30, 2017 · 64 comments
Closed

Comments

@rspelta
Copy link

rspelta commented Jun 30, 2017

Description

  • Type: Bug?

With seven or more clients connected to the mbed Device Connector some of them are not listed temporally or disappear until I reset the client.

We are using mbed-os-example-client for 10 boards, 3 of them are NUCLEO_F429ZI+X-NUCLEO-IDS01A4 boards and the others are sensor-node boards (it's a new kind of board, we are developing on it, @MarceloSalazar knows it).
The border router uses the nanostack-border-route repository and it is a NUCLEO_F429ZI+X-NUCLEO-IDS01A4, it uses a ethernet cable connected to the internet router (with enabled 6to4 tunneling). The mesh is via Spirit, we have worked with Wolfgang (@betzw) in order to have the spirit driver stable.

The NUCLEO_F429ZI boards use the lastest commit of the repositories, the sensor-node board come from the versions:

mbed-os-sensor-node (5f27acc)
|- easy-connect (6fb5842becae)
| |- atmel-rf-driver (57f22763f4d3)
| |- esp8266-driver (4ed87bf7fe37)
| | - ESP8266\ATParser (269f14532b98) | |- mcr20a-rf-driver (d8810e105d7d) | - stm-spirit1-rf-driver (ac7a4f477222)
|- mbed-client (f8f0fc8b9321)
| |- mbed-client-c (c739b8cbcc57)
| |- mbed-client-classic (f673b8b60779)
| - mbed-client-mbed-tls (7e1b6d815038) |- mbed-os (ed4febefdede) - pal (4e46c0ea8706)

Every client has its security.h file and a different MAC address.
We have done this test:

  1. turn on the border router and wait until it is ready
  2. turn on one client
  3. wait to see it is connected to the mbed connector website and try to request information via API console
  4. go to point 2

here you can read the trace of the border router:
https://pastebin.com/ti8U3HgS

From the sniffer seems that the lost clients (I mean who disappear from the mbed connector list) continue to communicate with the border router. We don't have information about how the nanostack works so we have problem to get a better idea what is going on.

To try to understand what happen we have:

  • a "sniffer" node in order to see the "frames on the air"
  • the console of the border router (attacched the log of the border, the trace is minimal we didn't turn it off completely)
  • the console of the nodes (just to know if it is registered to the mbed connector the first time)

Our goal is to have more than 10 boards connected to the mbed Device Connector.
What can we do?

@ciarmcom
Copy link
Member

ARM Internal Ref: IOTCLT-1854

@betzw
Copy link
Contributor

betzw commented Jun 30, 2017

@MarceloSalazar
Copy link
Contributor

MarceloSalazar commented Jul 2, 2017

@rspelta are you using the same mbed Device Connector (mDC) account for all devices?
If so, would you please to generate the credentials for multiple accounts? (lets say 2 or 3). I'd like to understand whether the problem is caused by a limitation on mDC or the Border Router not properly forwarding packets.
It would be useful to have a capture on the packets (tcpdump) sent between the BR and the backhaul network (tunnel).

@kjbracey
Copy link
Contributor

kjbracey commented Jul 3, 2017

I'm seeing enough "MAC TX fail" and "Source route error" messages in that log to make me think you have some sort of radio driver or hardware problem. Communication between the nodes does not seem to be as reliable as I'd expect.

"MAC TX fail" means "I tried to transmit to a neighbour, but didn't get an Ack after multiple attempts".

"Source route error A->B" means "LoWPAN router A is telling me (the border router) that they had a MAC TX error (see above) when forwarding to B".

I'd run a simple ping test from outside the mesh to the nodes to see what sort of average packet loss you're getting - both with minimum size packets and larger ones (eg 500 or 1000 bytes). The aim should be to have only a couple of percent loss at minimum size, and try to get under 10% at 1000 bytes.

@kjbracey
Copy link
Contributor

kjbracey commented Jul 3, 2017

I've just spent a little while reviewing the Spirit driver code, just to see if I can see any obvious flaws. (Hard to say much without knowing the hardware, but I can look for general issues.)

I'm a bit wary about the software ack handling - can be tricky.

There is one specific problem that could be affecting performance now - it seems to me the acks are sent with a common send() routine that enables hardware CSMA-CD. An ack should be being sent 192us after transmission completion, without CSMA. Backing off the ack will greatly reduce the chance of packets being successfully acknowledged.

Other notes on ack reception - you're calling TX_DONE whenever tx_sequence == seq_number, whether you were expecting an ack or not. This could cause stack confusion in various ways (eg if you were backing off while someone else used the same sequence number). You should only process an ACK when you actually expect one (TX completed, and AR bit was set in it)

Also, while expecting an ack, it can be beneficial to report TX_FAIL and stop expecting when you receive anything other then an ack with the expected sequence number. The stack will eventually time out if it doesn't get TX_DONE, but receipt of anything else (including a wrongly-numbered ack) indicates a lack of acknowledgment, which can be reported immediately.

betzw added a commit to betzw/stm-spirit1-rf-driver that referenced this issue Jul 3, 2017
Date: Mon, 3 Jul 2017 10:14:23 +0000
From: Kevin Bracey <notifications@github.com>

ARMmbed/mbed-os-example-client#266 (comment)

I've just spent a little while reviewing the Spirit driver code, just to see if I can see any obvious flaws. (Hard to say much without knowing the hardware, but I can look for general issues.)

I'm a bit wary about the software ack handling - can be tricky.

There is one specific problem that could be affecting performance now - it seems to me the acks are sent with a common send() routine that enables hardware CSMA-CD. An ack should be being sent 192us after transmission completion, without CSMA. Backing off the ack will greatly reduce the chance of packets being successfully acknowledged.

Other notes on ack reception - you're calling TX_DONE whenever tx_sequence == seq_number, whether you were expecting an ack or not. This could cause stack confusion in various ways (eg if you were backing off while someone else used the same sequence number). You should only process an ACK when you actually expect one (TX completed, and AR bit was set in it)

Also, while expecting an ack, it can be beneficial to report TX_FAIL and stop expecting when you receive anything other then an ack with the expected sequence number. The stack will eventually time out if it doesn't get TX_DONE, but receipt of anything else (including a wrongly-numbered ack) indicates a lack of acknowledgment, which can be reported immediately.
@betzw
Copy link
Contributor

betzw commented Jul 3, 2017

Thanks a lot @kjbracey-arm!

I have immediately tried to implement all your indications. Before pushing them, could you pls. be so kind and review these. Above all, pls. let me know if I got your points correctly, am missing something, and if you think the patch fixes the issues you have listed.

@rspelta
Copy link
Author

rspelta commented Jul 4, 2017

Hi, I used the last patch from @betzw. I done the tests with 9 sensor-node boards. One mbed Device Connector (mDC) account for all devices. Now it seems more stable, all the 9 clients are listed on the website and I didn't see any client disappear from the list.

However I still read these messages in the log:
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233

You can see the log of the border router (trace setted on "info level"):
https://pastebin.com/d01djmcZ

And the log of one client (trace setted on "info level"):
https://pastebin.com/z3yfHAgn

@MarceloSalazar next test will be with more clients connected so I will try your advices in order to have more info.
Thanks you all!

@JanneKiiskila
Copy link
Contributor

JanneKiiskila commented Jul 5, 2017

Hi all,

I would conclude that this issue is NOT an mbed-client issue, it's just the fact it can't tolerate (especially the TLS handshakes are a very, very touch area) flaky networks. @kjbracey-arm knows the gory details and somehow I feel there would be some options for making it more fault-tolerant (issue towards mbedtls perhaps - https://github.com/ARMmbed/mbedtls) to make the situation better, client can't do much about that TLS handshake thing - we just try to do it and if it fails -> we can't connect to the server.

@kjbracey
Copy link
Contributor

kjbracey commented Jul 5, 2017

You will always see some "MCPS Data fail" messages - quite a few if the network is busy.

Once a 6LoWPAN network accessing the mbed server has stabilised, I wouldn't expect to see many fails, as it shouldn't be that busy. Looking at the logs, my feeling is that performance is still a bit below par, but I'd need to see some proper stats (eg ping tests, as suggested above).

Looking at those logs, one thing occurred to me, so I just checked the driver. It doesn't actually report "CCA_FAIL", so we can't tell the difference between "channel busy" (225) and "no ack" (233). The driver should be enhanced to distinguish the cases - handling IRQ_MAX_BO_CCA_REACH. Will help diagnostics, and may help performance (Nanostack will respond differently).

I also wonder if your NumBackoffs parameter is set too low. The default for IEEE 802.15.4 is 4, and we tend to use 5. You've got it set to 3. Might be worth checking the other numbers.

The driver should tell Nanostack the number of CSMA retries it did. Maybe you can't get that number out of the driver on success, so saying 0 would be okay, but you should say "4" on a CCA_FAIL if that's your NumBackoffs setting. (ie 4 retries, 5 attempts total).

@betzw
Copy link
Contributor

betzw commented Jul 5, 2017

The number of backoffs can easily be changed here.

@betzw
Copy link
Contributor

betzw commented Jul 6, 2017

I have reflected a bit about CCA & backoff and would like to ask @kjbracey-arm how the Nanostack interfaces to persistent CCA (i.e how to implement integration of a RF supporting persistent CCA into the Nanostack)?

@kjbracey
Copy link
Contributor

kjbracey commented Jul 6, 2017

Persistent CCA is not a thing I've ever come across before reading that Spirit1 datasheet. We assume backoff as per the 802.15.4 spec - a limited number of short-duration CCA attempts at random intervals. If I'm understanding correctly, that Spirit1 mode continuously monitors and transmits as soon as it's quiet? If so seems like it would just tend to cause collisions to me, as multiple nodes could tend to go simultaneously at the end of a transmission.

Nanostack supports both drivers that don't have their own automatic backoff and those that do - if you report "CCA_FAIL" with 0 retries, then Nanostack will backoff and retry itself, counting its attempts. If you report "CCA_FAIL" with sufficient retries, then Nanostack will not retry.

(This mechanism is a bit wonky - eg Nanostack can be configured for 4 retries, but there's no way of it telling the driver how many it wants. If you did 3, Nanostack would decide that's not enough, so it will call you again, so presumably you'd do another 3 for a total of 6. Also the combination of Nanostack's random backoff plus whatever random backoff you have probably isn't ideal).

@kjbracey
Copy link
Contributor

kjbracey commented Jul 6, 2017

I guess if you did want to try that persistent CCA, you'd just enable it, and then you would presumably only report CCA_FAIL (with an artificially high "retries" to stop Nanostack trying again) if you decided to time out in the driver.

@betzw
Copy link
Contributor

betzw commented Jul 6, 2017

Well, currently persistent CCA is enabled in the Spirti1 driver.
Furthermore, it never reports a CCA_FAIL ... which might be a problem?!?

@kjbracey
Copy link
Contributor

kjbracey commented Jul 6, 2017

At first impressions persistent CCA sounds like a bad idea to me in a busy network, because the PHY has no collision detection, only collision avoidance. And the collision avoidance only works if nodes don't tend to transmit simultaneously. Persistent CCA seems like it would encourage simultaneous transmissions.

It seems like a fudge to try to get higher bandwidth and lower latency, at the cost of higher power and working less well with multiple nodes.

It's not standard 802.15.4 and I would suggest disabling it.

@rspelta
Copy link
Author

rspelta commented Jul 6, 2017

I tried the ping test. I started with only one client connected to the mesh. I have seen that even with one node sometimes there is a problem. My test is:

  1. turn on the border-router, then the client
  2. once the client is connected to the mbed connector I start to pinging the client continuely. 10 minutes with 32B lenght,10 minutes with 1000B lenght and other 10 minutes with 32B lenght.

What I have seen is that when all works correctly I have a ping 32B in 30ms and a ping 1000B reply in 330ms with no packets lost. If the timing rise up to 2-3 seconds then the client will starts to lose packets reaching 25%-35% packets lost. If I reset the border router (not the client) then it return to work well with no packet lost and timing 30ms-330ms.

here the worst-case:
ping: https://pastebin.com/9LTGNUQQ
router: https://pastebin.com/9EyXra9A
wireshark logs via dropbox

@betzw
Copy link
Contributor

betzw commented Jul 7, 2017

Ciao Roberto,
which version(s) of the Spirti1 driver did you use?
And could you pls. elaborate a bit about what you mean when saying:

the timing rise up to 2-3 seconds

@rspelta
Copy link
Author

rspelta commented Jul 7, 2017

I used the spirit driver commit with "5 max nr of backoffs" (140e6470983229aebac2de7256616c3c13f37c4b).
My PC for pinging is connected via eth cable to the same router with the border router.

When I start both (border router + client) all is ok. If I ping the client the response time is immediate.
packets of 32B -> time response: 17-30ms
packets of 1000B -> time response: 320-330ms
If I continue to ping the client, after a while (time is random) the time response become strange: Every pings have differents times responses like these:

Risposta da 2002:4f15:d31f:e472:9b99:9999:9999:9999: durata=610ms
Risposta da 2002:4f15:d31f:e472:9b99:9999:9999:9999: durata=2002ms
Risposta da 2002:4f15:d31f:e472:9b99:9999:9999:9999: durata=3005ms
Risposta da 2002:4f15:d31f:e472:9b99:9999:9999:9999: durata=699ms
Risposta da 2002:4f15:d31f:e472:9b99:9999:9999:9999: durata=17ms
Risposta da 2002:4f15:d31f:e472:9b99:9999:9999:9999: durata=1002ms
Risposta da 2002:4f15:d31f:e472:9b99:9999:9999:9999: durata=2405ms
Risposta da 2002:4f15:d31f:e472:9b99:9999:9999:9999: durata=597ms

After a while (time random) the client will start to losing packets about 25%-35%. If I reset the client this doesn't change the situation. If I reset the border router the normal condition is restored.

note: time is random but if there are more clients connected then this issue happen before.

@kjbracey
Copy link
Contributor

kjbracey commented Jul 7, 2017

Just to observe, seeing your comment about the commit "5 max nr of backoffs" - I don't think the number of backoffs parameter is used if in persistent mode - there are no backoffs, so I wouldn't expect that particular patch to change anything.

I still think the most important thing to try here is getting back to normal non-persistent 802.15.4 with backoffs, unless there is some reason to believe that doesn't work with this chipset.

But with reference to the test above - to pin it down can you do the same test pinging the border router itself? Both when it's the only device, and when there are clients attached. Also, after the client pinging starts going weird, what do pings to the border router do?

@rspelta
Copy link
Author

rspelta commented Jul 7, 2017

when it's the only router:

C:\Users\Roberto>ping -6 -t -l 32 2002:5709:48f6:e472:280:e1ff:fe23:39

Esecuzione di Ping 2002:5709:48f6:e472:280:e1ff:fe23:39 con 32 byte di dati:
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms

Statistiche Ping per 2002:5709:48f6:e472:280:e1ff:fe23:39:
Pacchetti: Trasmessi = 28, Ricevuti = 28,
Persi = 0 (0% persi),
Tempo approssimativo percorsi andata/ritorno in millisecondi:
Minimo = 0ms, Massimo = 0ms, Medio = 0ms`

C:\Users\Roberto>ping -6 -t -l 1000 2002:5709:48f6:e472:280:e1ff:fe23:39

Esecuzione di Ping 2002:5709:48f6:e472:280:e1ff:fe23:39 con 1000 byte di dati:
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms

Statistiche Ping per 2002:5709:48f6:e472:280:e1ff:fe23:39:
Pacchetti: Trasmessi = 15, Ricevuti = 15,
Persi = 0 (0% persi),
Tempo approssimativo percorsi andata/ritorno in millisecondi:
Minimo = 0ms, Massimo = 0ms, Medio = 0ms

with 1 client attached:
CLIENT
C:\Users\Roberto>ping -6 -t -l 32 2002:5709:48f6:e472:9b99:9999:9999:9999

Esecuzione di Ping 2002:5709:48f6:e472:9b99:9999:9999:9999 con 32 byte di dati:
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=21ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=20ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=19ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=17ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=19ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=17ms

Statistiche Ping per 2002:5709:48f6:e472:9b99:9999:9999:9999:
Pacchetti: Trasmessi = 17, Ricevuti = 17,
Persi = 0 (0% persi),
Tempo approssimativo percorsi andata/ritorno in millisecondi:
Minimo = 17ms, Massimo = 21ms, Medio = 18ms
Control-C

C:\Users\Roberto>ping -6 -t -l 1000 2002:5709:48f6:e472:9b99:9999:9999:9999

Esecuzione di Ping 2002:5709:48f6:e472:9b99:9999:9999:9999 con 1000 byte di dati:
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=334ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=334ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=342ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=335ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=341ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=335ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=338ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=332ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=337ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=340ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=336ms

Statistiche Ping per 2002:5709:48f6:e472:9b99:9999:9999:9999:
Pacchetti: Trasmessi = 11, Ricevuti = 11,
Persi = 0 (0% persi),
Tempo approssimativo percorsi andata/ritorno in millisecondi:
Minimo = 332ms, Massimo = 342ms, Medio = 336ms

BORDER ROUTER:
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms

When I connected the second client, ping starts going weird:
BORDER-ROUTER:
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=46ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=10ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=165ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=42ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=142ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=468ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=38ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=79ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=91ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=33ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=409ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=760ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=149ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=406ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=412ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=140ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=716ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=238ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=34ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=52ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=343ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=135ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=137ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=123ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=29ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=288ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=343ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=343ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=173ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=338ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=344ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=795ms

CLIENT:
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=668ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=544ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=528ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=856ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=1006ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=890ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=999ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=1007ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=384ms
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=587ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=3038ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=1644ms

after 10 minutes
BORDER-ROUTER:
C:\Users\Roberto>ping -6 -t -l 1000 2002:5709:48f6:e472:280:e1ff:fe23:39

Esecuzione di Ping 2002:5709:48f6:e472:280:e1ff:fe23:39 con 1000 byte di dati:
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=1555ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=1252ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=3423ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=987ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=1002ms
Richiesta scaduta.
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=1337ms
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata=3270ms

Statistiche Ping per 2002:5709:48f6:e472:280:e1ff:fe23:39:
Pacchetti: Trasmessi = 11, Ricevuti = 7,
Persi = 4 (36% persi),
Tempo approssimativo percorsi andata/ritorno in millisecondi:
Minimo = 987ms, Massimo = 3423ms, Medio = 1832ms

CLIENT
C:\Users\Roberto>ping -6 -t -l 32 2002:5709:48f6:e472:9b99:9999:9999:9999

Esecuzione di Ping 2002:5709:48f6:e472:9b99:9999:9999:9999 con 32 byte di dati:
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=3067ms
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=245ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=1798ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=2345ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=2398ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=1252ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=3422ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=752ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=762ms
Richiesta scaduta.
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=1346ms
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=3268ms
Richiesta scaduta.
Richiesta scaduta.
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=3304ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=3004ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=1252ms
Richiesta scaduta.
Richiesta scaduta.
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=3851ms
Richiesta scaduta.

Statistiche Ping per 2002:5709:48f6:e472:9b99:9999:9999:9999:
Pacchetti: Trasmessi = 25, Ricevuti = 15,
Persi = 10 (40% persi),
Tempo approssimativo percorsi andata/ritorno in millisecondi:
Minimo = 245ms, Massimo = 3851ms, Medio = 2137ms

TRACE BORDER ROUTER
[INFO][app ]: Starting NanoStack Border Router...
[INFO][app ]: Build date: Jul 6 2017 11:42:03
[INFO][app ]: Using ETH backhaul driver...
[ERR ][brro]: Backhaul interface down failed
[INFO][Eth ]: Ethernet cable connected.
[INFO][addr]: Tentative Address added to IF 2: fe80::280:e1ff:fe23:39
[INFO][addr]: DAD passed on IF 2: fe80::280:e1ff:fe23:39
[INFO][addr]: Tentative Address added to IF 2: 2002:5709:48f6:e472:280:e1ff:fe23:39
[INFO][icmp]: Route: 0:0:0:e472::/64 Lifetime: 1800 Pref: 0
[INFO][addr]: DAD passed on IF 2:
[INFO][brro]: Backhaul bootstrap ready, IPv6 = 2002:5709:48f6:e472:280:e1ff:fe23:39
[INFO][brro]: Backhaul interface addresses:
[INFO][brro]: [0] fe80::280:e1ff:fe23:39
[INFO][brro]: [1] 2002:5709:48f6:e472:280:e1ff:fe23:39
[INFO][brro]: RF channel: 1
[INFO][br ]: BR nwk base ready for start
[INFO][addr]: Address added to IF 1: fe80::ff:fe00:face
[INFO][br ]: Refresh Contexts
[INFO][br ]: Refresh Prefixs
[INFO][addr]: Address ad
[INFO][addr]: Address added to IF 1: fe80::f2f1:f2f3:f4f5:f6f7
[WARN][ip6r]: LL addr of 2002:5709:48f6:e472:9b99:9999:9999:9999 not found
[WARN][ip6r]: LL addr of not found
[WARN][ip6r]: LL addr of not found
[INFO][brro]: RF bootstrap ready, IPv6 = 2002:5709:48f6:e472:0:ff:fe00:face
[INFO][brro]: RF interface addresses:
[INFO][brro]: [0] fe80::ff:fe00:face
[INFO][brro]: [1] fe80::f2f1:f2f3:f4f5:f6f7
[INFO][brro]: [2] 2002:5709:48f6:e472:0:ff:fe00:face
[INFO][brro]: 6LoWPAN Border Router Bootstrap Complete.
[WARN][ip6r]: LL addr of not found
------------------no problem, no client, now I'm turning on the first client ----------------------
[INFO][rplc]: DAO from
[INFO][rplc]: Transmit DAO-ACK to: 2002:5709:48f6:e472:9b99:9999:9999:9999
[INFO][rplc]: DIO from fe80::9b99:9999:9999:9999, rank 100
[INFO][rplc]: DIO from , rank 100
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][rplc]: DIO from , rank 100
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][rplc]: DIO from , rank 100
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][rplc]: DIO from , rank 100
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][icmp]: Route: Lifetime: 1800 Pref: 0
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][rplc]: DIO from , rank 100
[INFO][rplc]: DAO from
----------- now I turning on the second client ----------------------------
[INFO][rplc]: Transmit DAO-ACK to: 2002:5709:48f6:e472:3133:3333:3333:3333
[INFO][rplc]: DIO from fe80::9b99:9999:9999:9999, rank 100
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[INFO][rplc]: DIO from fe80::3133:3333:3333:3333, rank 100
[INFO][rplc]: DIO from , rank 100
[INFO][rplc]: DIO from fe80::9b99:9999:9999:9999, rank 100
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][rplc]: DIO from fe80::9b99:9999:9999:9999, rank 102
[INFO][rplc]: DIO from fe80::3133:3333:3333:3333, rank 100
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[ERR ][mMCp]: MAC tx fail
[ERR ][6lAd]: MCPS Data fail by status 233
[INFO][rplc]: DIO from , rank 17b
[INFO][rplc]: DIO from fe80::3133:3333:3333:3333, rank 117
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][rplc]: DIO from fe80::9b99:9999:9999:9999, rank 100
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][rplc]: DIO from fe80::3133:3333:3333:3333, rank 118
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][rplc]: DIO from fe80::9b99:9999:9999:9999, rank 100
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][rplc]: DIO from fe80::3133:3333:3333:3333, rank 10b
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][icmp]: Route: 0:0:0:e472::/64 Lifetime: 1800 Pref: 0
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][br ]: Release Context
[INFO][br ]: Refresh Contexts
[INFO][br ]: Refresh Prefixs
[INFO][br ]: ABRO Update and NVM operation OK
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][rplc]: DIO from fe80::9b99:9999:9999:9999, rank 100
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][rplc]: DIO from , rank 100
----------- after 10 minutes --------------------
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333
[INFO][6Bo ]: Received MLE Advertisement from fe80::9b99:9999:9999:9999
[INFO][6Bo ]: Received MLE Advertisement from fe80::3133:3333:3333:3333

Even with this problem on the mbed device connector I can see both clients connected

@rspelta
Copy link
Author

rspelta commented Jul 7, 2017

PS.
Resetting the router all returns back to normal:

BORDER-ROUTER:
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
Risposta da 2002:5709:48f6:e472:280:e1ff:fe23:39: durata<1ms
...

CLIENT
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=21ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms
Risposta da 2002:5709:48f6:e472:9b99:9999:9999:9999: durata=18ms

Both clients re-connected to the mbed connector successfully

@kjbracey
Copy link
Contributor

Weird. What if you turn off both clients after it's gone funny? Does the border router ever go back to normal?

@rspelta
Copy link
Author

rspelta commented Jul 12, 2017

If I turn off both clients the border router doesn't go back to normal

Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata<1ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=2360ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=2416ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=582ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=2613ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=3002ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=3002ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=1267ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata<1ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=726ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=33ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=1704ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=27ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=2012ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=12ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=963ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=950ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=1316ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=1665ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=3021ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=3002ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=744ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=42ms
Risposta da 2002:4f33:7063:e472:0:ff:fe00:face: durata=992ms

@kjbracey
Copy link
Contributor

Can you intrusively debug the router - stop it to see where it's spending its time? Might not work if it's sleeping rather than busy.

Alternatively, as I'm guessing whoever's taking time is probably doing it in Nanostack's event loop, you could add trace to it, to try to see which event handler is consuming time. Add trace to eventOS_scheduler_dispatch_event(), to print out event ID and type before calling the function pointer, and see if you can visually spot any big delays. (There should be a regular event 10 times a second, so any excessively long-running handlers should obviously block that).

Or maybe flag up automatically if the function pointer takes an unusually long time - measure time with eventOS_event_timer_ticks(), and report any handlers that take more than 500ms (50 ticks).

@JanneKiiskila
Copy link
Contributor

@kjbracey-arm @hasnainvirk - should/could the eventloop actually have a WARN printout, if it spots any event that uses more than an acceptable amount of time?

@SeppoTakalo
Copy link
Contributor

@JanneKiiskila That is kind of general problem that is already under discussion here. We could add it, but requires that the common event loops will be accepted into the mbed OS first.

@betzw
Copy link
Contributor

betzw commented Jul 19, 2017

Maybe I have found a quite deterministic way to reproduce increasing values for ping6.

Setup:

  • BR & Client: NUCLEO_F429ZI + X-NUCLEO-IDS01A4
  • IPv6 addr BR: 2001:470:1f13:280:0:ff:fe00:face
  • IPv6 add Client: 2001:470:1f13:280:201:203:405:607

Procedure:

  1. Run ping6 -s 1016 2001:470:1f13:280:0:ff:fe00:face -i 10 & ping6 -s 1016 2001:470:1f13:280:201:203:405:607 -i 10 in two different shells
  2. Start (plug in USB cable into PC) BR & wait for startup complete (6LoWPAN interface ready)

  1. Start (plug in USB cable into PC) Client & wait for object registered
  2. Plug off USB cable of Client from PC & wait ~30secs
  3. Goto 3.

After four/five re-starts of the client I get the following ping6 times:

  • BR: ~2800ms
  • Client: ~3500ms

See also attached screenshot:
screenshot from 2017-07-19 08-48-12

@betzw
Copy link
Contributor

betzw commented Jul 19, 2017

Another interesting observation I made this morning after a long run of the above describe setup during last night was that this morning the client was no longer listed by the mbed connector under Connected Devices, while its output trace didn't show up any suspicious output, on the contrary the output continued as usual with:

...

simulate button_click, new value of counter is 1134
[INFO][6Bo ]: Received MLE Advertisement from 
simulate button_click, new value of counter is 1135
simulate button_click, new value of counter is 1136
simulate button_click, new value of counter is 1137
[INFO][6Bo ]: Received MLE Advertisement from 
simulate button_click, new value of counter is 1138

Any ideas?

cc @nikapov-ST

@architech-boards
Copy link

@betzw Maybe I have found a quite deterministic way to reproduce increasing values for ping6.
I tryed your way and it works. I have the problem also in this way.

@kjbracey-arm Does PR#4768 totally resolve the weird border router delay issues?
No, I can compile without --profile=debug but this issue is not resolved

With Atmel is there this problem? Is it a good idea if someone could try with Atmel in order to see if this problem happens? I don't have the ATMEL AT233 15.4 shield here.

@rspelta
Copy link
Author

rspelta commented Jul 25, 2017

Hello, I sniffed the packets from the ethernet cable between the border router and the router connected to the internet (not by SIM card but via cable).
When the border router is working well and the ping is fast and stable we have this situation when the client connects to the mbed connector:
immagine
The connection is always perfect.

When the border router is weird (ping problem) the log is this one:
immagine
In the end the client is connected temporally to the mbed connector.

I have the client trace of the picture above, I found out:

  1. The client didn't received 4 times the packet "SERVER HELLO DONE" (lenght 87bytes) even if you can see them in the picture above. See the trace (it is commented in order to explain the names of the packets): here. So there are many retries because the packets are lost.
  2. Sometimes packets received are not in the right order. For example "Encrypted Handshake Message" is received before of "Change Cipher Spec" packet. But I think this problem is due to the UDP protocol.

@betzw
Copy link
Contributor

betzw commented Jul 26, 2017

Just for my understanding:

  • which IPv6 address belongs to the "6LoWPAN BR" and which to the "Internet Router"?
  • is the trace you sent the trace of the node/client or the BR?

@rspelta
Copy link
Author

rspelta commented Jul 26, 2017

2002:5236:726c:e472:9b999:9999:9999:9999 -> client
2607:f0d0:2601:52:20 -> mbed connector (server)
This is the DTLS protocol, the packets exchanged form the client to the mbed connector during the connection phase:

fig-2-dtls-key-agreement-with-fixed-diffie-hellman-over-elliptic-curves

These packets are sniffed between the internet router and the border router.
The trace is only for the client.

@betzw
Copy link
Contributor

betzw commented Jul 26, 2017

Analyzig the trace, to me it seems as if the client misses sniff entries from no. 77 up to no. 143 (both included). Do you have any information if and what other traffic might be going on in this period?

@rspelta
Copy link
Author

rspelta commented Jul 26, 2017

No, they aren't missed. Here the list of the packets from the trace and the sniffed packets.
The missed packets are "Server Hello Done"
'>' client output
'<' client input

[no.] [input/outuput] [tls message (copied from the client trace)]
63 > CLIENT HELLO START
64 < HELLO VERIFY REQUEST
72 > CLIENT HELLO
73 < SERVER HELLO
74 < CERTIFICATE
75 < SERVER KEY EXCHANGE
76 < CERTIFICATE REQUEST
77 LOST
130 < SERVER HELLO
132 < SERVER KEY EXCHANGE
143 > CLIENT HELLO
133 < CERTIFICATE
135 < CERTIFICATE REQUEST
136 LOST
146 < SERVER HELLO
147 < SERVER KEY EXCHANGE
148 < CERTIFICATE
149 LOST
150 < CERTIFICATE REQUEST
194 > CLIENT HELLO
195 < SERVER HELLO
196 < SERVER KEY EXCHANGE
197 < CERTIFICATE
198 < CERTIFICATE REQUEST
199 LOST
212 < SERVER HELLO
213 < SERVER KEY EXCHANGE
214 < CERTIFICATE
215 < SERVER HELLO DONE
216 < CERTIFICATE REQUEST
218 < SERVER HELLO
219 < SERVER KEY EXCHANGE
220 < CERTIFICATE
221 < SERVER HELLO DONE
242 > CERTIFICATE
222 < CERTIFICATE REQUEST
245 > CLIENT KEY EXCHANGE
246 > CERTIFICATE VERIFY
247 > CHANGE CHIPER SPEC
248 > ENCRYPTED HANDSHAKE
251 > CERTIFICATE
255 > CLIENT KEY EXCHANGE
256 > CERTIFICATE VERIFY
257 > CHANGE CHIPER SPEC
259 > ENCRYPTED HANDSHAKE
252 < ENCRYPTED HANDSHAKE
253 < CHANGE CHIPER SPEC
260 < ENCRYPTED HANDSHAKE
264 > APPLICATION DATA
261 < CHANGE CHIPER SPEC
265 < APPLICATION DATA

@rspelta
Copy link
Author

rspelta commented Jul 28, 2017

update:
We resoved the problem of the packets "Server Hello Done" missed. There will be a patch from ST.
There is still the problem of the ping.

@betzw
Copy link
Contributor

betzw commented Aug 7, 2017

You can find an updated version of the Ethernet Nanostack driver for NUCLEO_F429ZI here.

@rspelta could you pls. test this version of the driver on your side?

@MarceloSalazar
Copy link
Contributor

@rspelta please let us know whether the issue is fixed now.

@rspelta
Copy link
Author

rspelta commented Aug 23, 2017

Hi all, today I downloaded the last version of the "nanostack-border-router" with the last commit from the sal-nanostack-driver-stm32-eth and stm-spirit1-rf-driver repositories.
It worked only with the option "--profile=debug" (old known problem)
My test (repetuted 5 times) was:

  1. turn on border router connected via eth cable with internet router, waiting the boot is finished
  2. turn on contemporanely 20 clients
  3. after 10 minutes check on mbed device connector if all 20 clients are connected
  4. use API console via mbed device connector
  5. check ping timing of the border router
  6. turn off the clients and wait they disappear from the mbed device connector
  7. go to step 2 (without resetting the border router)

Results:

  • the ping problem is disappared, the timing is always under 1ms and the border router replies always, 0 packets lost.
  • The clients connected successfully (after 10 minutes from the power on) during the 5 tries were: 19, 20, 19, 17, 20
  • who was connected has replied quickly to the API console

Note for @betzw:
After this test I changed the source code of the sal-nanostack-driver-stm32-eth, I changed the size of the ring buffer TX and RX of the ethernet to 10. With this modify all 20 clients can connect always to the mbed device connector and with lesser time than before (about half time).

@JanneKiiskila
Copy link
Contributor

JanneKiiskila commented Aug 23, 2017

Great success! 🥇

Should that modification be passed on as a PR to sal-nanostack-driver-stm32-eth (and merged in)?

@betzw
Copy link
Contributor

betzw commented Aug 28, 2017

Is it possible that issue #298 and the necessity for high ring buffer sizes are two different faces of the same issue?

@rspelta
Copy link
Author

rspelta commented Aug 28, 2017

Reading the packets from the ethernet I don't think so. The issue #298 happen when the client sends all the right packets but the mbed server doesn't replies (in this case the border router is innocent).
However the issue #298 force the client to send more packets than the usual in order to connect to the mbed server. So with many clients, this issue doesn't help if the size of the ring buffer is too short.

About the size of the buffer, what I see from the sniffing is that after the "Server Hello Done" message from the server, the client sends 5 packets very quickly and the border router must read them very quickly in order to send them to the mbed server. So if there are a lot of clients that send 5 packets may be that the border router fails to read all of them with a buffer size too short.
This for me could exaplain why there are more fails with a short buffer rather than a bigger buffer.
Sounds good to you?

@betzw
Copy link
Contributor

betzw commented Aug 28, 2017

@rspelta not sure I understood your point above.

In the meanwhile I have made the number of TX/RX DMA buffer elements configurable (via mbed config system).

@rspelta
Copy link
Author

rspelta commented Aug 29, 2017

The issue 298 for me is a problem from the mbed server not from the border router because I don't see the reply from the mbed server.
But yes the high ring buffer size is needed also for that issue.

@betzw
Copy link
Contributor

betzw commented Aug 30, 2017

Does this mean that once the mbed server issue gets resolved you think that you might downsize the ring buffer again?

@rspelta
Copy link
Author

rspelta commented Aug 30, 2017

I think it can work better. The configurable TX/RX DMA buffer is a perfect solution for my case.
I will downsize the ring buffer only if there will be memory problem or if you suggest to do so.

@betzw
Copy link
Contributor

betzw commented Aug 30, 2017

All, do you think we can close this issue?

@rspelta
Copy link
Author

rspelta commented Aug 30, 2017

For me ok. Thanks you all!

@kjbracey
Copy link
Contributor

Any remaining problems visible on the radio side at present?

I don't know if all my long-ago comments on radio configuration were addressed - if not, remember to keep them in mind if you see more problems as node count increases.

@rspelta
Copy link
Author

rspelta commented Aug 31, 2017

Not at the moment, but surely if there would be any problems that arise, I will check them. Thank you

@SeppoTakalo
Copy link
Contributor

Closing as there is no more problems to solve.
Please raise a new issue ticket if anything new appears.

@jWladimir
Copy link

Hello, guys
I am have a problem.
I can not connect a node and a border router with a spirit1 radio.
I have a nucleo boards with stm F411, L152(for nodes) and F429(for BorderRouter)
But I can not find the working configurations for connecting any this modules.
On the NUCLEO_F429ZI with NUCLEO Spirit1 I made changes with resistors R4 to R7 and SB121 to SB122.
I do not understand what am I doing wrong?
Can you send me really working .json files for BorderRouter on the NUCLEO_F429ZI and nodes on the any NUCLEO_L152 or NUCLEO_F411 boards? (6LoWPAN or thread configures)

Tanks

@teetak01
Copy link
Contributor

@jWladimir this repository is only for the Client-application. Please check the border-router repository here: https://github.com/ARMmbed/nanostack-border-router

Also, please open a new issue if something does not work.

@SeppoTakalo
Copy link
Contributor

Border router application already has example config for F429ZI and Spirit RF module. https://github.com/ARMmbed/nanostack-border-router/blob/master/configs/6lowpan_Spirit1_RF.json

Also the example configuration for 6LoWPAN already has Spirit RF module enabled: https://github.com/ARMmbed/mbed-os-example-client/blob/master/configs/mesh_6lowpan_subg.json#L9

For debugging connectivity problems, please first use mbed-os-example-mesh-minimal application. Once the connectivity is proven to work, continue to use the Client example.
In any issues, please raise a new ticket. Thanks.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests