[rest] /diagnostics service returns partial results, purges any diagnostics data older than 3 seconds. #1773
Comments
@ekobres Thanks for reporting this issue! I agree it's too short to collect diagnostic info in 3 seconds. Enlarge
Would you be able to contribute option 1 or 2? |
Correct me if I am wrong, but the timeout is only the maximum time. If diagnostics reporting completes in less time, the call will return sooner. Obviously 20 seconds seems too long for a healthy mesh, but it would normally return much faster. Also, hitting the timeout could return a timeout error with partial data so the UI can hint that diagnostics are slow.
I would be happy to contribute, but I have never built OTBR. Let me first try getting an OTBR environment set up with an nRF52840 dongle. I have one of those and an RPi3b+, so maybe I can figure it out.
|
The diagnostic request is multicast to all router devices in the mesh, so there will be multiple unicast responses, and we have no mechanism to determine whether we have received responses from all routers since we don't know how many devices there are. So OTBR has to wait for the timeout to try to receive all those responses. |
Maybe I am misunderstanding something then, because it seems the /diagnostics API is getting the number of routers from somewhere. The leader knows the roles of every node in the mesh, especially the routers. Nevertheless - the number of routers is right there in the diagnostics JSON:
|
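The two positions above can be reconciled in a small simulation: if the expected router count is taken from the node information, collection can stop as soon as that many distinct responders have answered, and the timeout only acts as an upper bound. This is a minimal sketch, not OTBR code. `CollectResult`, `Collect`, and the replayed arrival list are hypothetical names, and it assumes the reported router count matches the set of devices that actually answer the multicast request.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Millis = std::chrono::milliseconds;

// Hypothetical result type: who answered, how long we waited, and whether
// the window expired before every expected router was heard from.
struct CollectResult
{
    std::set<std::string> responders;
    Millis                elapsed;
    bool                  timedOut;
};

// Replays (arrival time, extended address) pairs, sorted by arrival time.
// Stops early once `expectedRouters` distinct responders have been seen.
CollectResult Collect(const std::vector<std::pair<Millis, std::string>> &arrivals,
                      std::size_t                                         expectedRouters,
                      Millis                                              maxWait)
{
    assert(maxWait.count() >= 0);

    CollectResult result{{}, Millis(0), false};

    if (expectedRouters == 0)
    {
        return result; // nothing to wait for
    }

    for (const auto &arrival : arrivals)
    {
        if (arrival.first > maxWait)
        {
            break; // this response would arrive after the deadline
        }

        result.responders.insert(arrival.second);

        if (result.responders.size() >= expectedRouters)
        {
            result.elapsed = arrival.first; // completed early
            return result;
        }
    }

    // Deadline reached with an incomplete set: return partial data and
    // flag it so a UI could hint that diagnostics are slow.
    result.elapsed  = maxWait;
    result.timedOut = true;
    return result;
}
```

With this shape, a generous maximum such as 20 seconds costs nothing on a healthy mesh, because the call returns at the arrival time of the last expected response.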
The recently added This adds new APIs which use the underlying net-diag TMF commands to make it easier to discover topology. |
I have forked and tested some new values which provide a more reliable experience with the Topology page. I settled on 120 seconds for the kDiagResetTimeout and 10 seconds for the kDiagCollectTimeout. With these values I am able to get all of my mesh with 20 routers to populate with one or two reloads. Previously I was never able to get more than 11 routers to populate. |
@ekobres I missed the |
@wgtdkp Wow. It's fast, too. Thanks for pointing this out! |
@ekobres coming around on this issue because I have realized the existing web UI topology tool is giving me the same problems as originally described here, with sometimes terribly disconnected graphs. Is it possible we can use the "new" mesh-diag path in the web UI, such that the topology view would hit a new endpoint similar to |
It should be possible to leverage the newer Thread Diagnostics capabilities. Contributions are welcome! :D |
I wish I could contribute on this, but sadly I do not have an OTBR device to deploy code to and test against. :( Is there any doc on getting a full dev environment set up with a minimal (hopefully inexpensive) device? |
@philipflesher, you can try the OTBR Codelab, which builds on a Raspberry Pi |
Looks straightforward. I might get on this. I realize, however, that if I'm going to make changes, I would need to simulate at least a medium-sized network, probably with some delays and failures. Is there any dev path for creating simulated networks that include routers and end devices? |
Describe the bug
The Topology view in the Web GUI frequently only loads partial router data, showing an incomplete graph, even in a well-functioning network.
Inspection of the underlying `/diagnostics` service shows that there are 2 timeouts associated with returning and cleaning diagnostics information:
ot-br-posix/src/rest/resource.cpp
Lines 78 to 82 in 172dcc1
As I read it, this means a `/diagnostics` API request will terminate after 2 seconds (fine), and at the callback, any diagnostics data older than 3 seconds will be purged from the list (maybe not fine).
Perhaps the design intent was 20 seconds and 30 seconds? There are no notes or docs to clarify, but 2 and 3 seconds seems very optimistic.
Even a `networkdiagnostic get ff02::1 0 1 6` call takes 6 seconds to complete on my system.
The actual behavior on my network with 21 nodes results in the Topology map showing a subset of actual network nodes, with as few as zero and as many as 13.
Looking at the code and at the behavior, it seems there is no way for any information older than 3 seconds to make it back from a call to `/diagnostics`.
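To illustrate the effect of the purge, here is a minimal sketch of an age-based cleanup as I understand the behavior. It is not the code from `resource.cpp`: `PurgeStale` and the timestamp map are hypothetical, and times are simplified to milliseconds since the request started.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>

using Millis = std::chrono::milliseconds;

// Removes every entry that is older than `resetTimeout` at time `now`.
// `receivedAt` maps a router's extended address to the time its
// diagnostic response arrived. Returns the number of entries removed.
std::size_t PurgeStale(std::map<std::string, Millis> &receivedAt, Millis now, Millis resetTimeout)
{
    assert(resetTimeout.count() >= 0);

    std::size_t removed = 0;

    for (auto it = receivedAt.begin(); it != receivedAt.end();)
    {
        if (now - it->second > resetTimeout)
        {
            it = receivedAt.erase(it); // erase returns the next valid iterator
            ++removed;
        }
        else
        {
            ++it;
        }
    }

    return removed;
}
```

If responses trickle in at 0.5, 2, and 5.5 seconds and the purge runs at 6 seconds with a 3 second limit, only the last response survives. With a 30 second limit, all three would be kept.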
To Reproduce
Run the /diagnostics API on any Thread network where the following command routinely takes longer than 3 seconds to complete:
networkdiagnostic get ff02::1 0 1 6
Then run:
curl http://<your-otbr-instance>:8081/diagnostics
Inspect the ExtAddress entities and note that there are missing items.
or:
Open the topology viewer in the Web GUI and note whether all network nodes are displayed.
Expected behavior
`/diagnostics` should enumerate available information to match the true network topology. Instead, it times out very quickly and returns partial results on anything larger than a trivially small network.
Console/log output
Node API information - note leader and number of routers.
Topology from same network:
Additional context
This is presumably only a real problem for the OTBR Web GUI Topology UI, as there is no published documentation for the REST API.
Recommended solution - test larger default values (e.g. a 30 second `kDiagResetTimeout` and a 20 second `kDiagCollectTimeout`) and consider adding timeout parameters to the API so apps can determine the response time and data freshness.
Alternative solution - provide configuration parameters for the OTBR Web GUI to adjust these timeouts.
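As a rough idea of what a timeout parameter could look like, the sketch below turns a raw query value (for example from a hypothetical `?collectTimeout=20`) into a bounded timeout. Nothing here exists in the REST API today. `ParseTimeoutSeconds`, the parameter name, and the limits are assumptions for illustration only.

```cpp
#include <cassert>
#include <chrono>
#include <string>

using Millis = std::chrono::milliseconds;

// Converts a query value given in whole seconds into a timeout.
// Empty, zero, or non-numeric input falls back to the default, and
// anything at or above `maxAllowed` is clamped to `maxAllowed`.
Millis ParseTimeoutSeconds(const std::string &value, Millis fallback, Millis maxAllowed)
{
    assert(fallback <= maxAllowed);

    if (value.empty())
    {
        return fallback;
    }

    long long seconds = 0;

    for (char c : value)
    {
        if (c < '0' || c > '9')
        {
            return fallback; // reject signs, decimals, and other junk
        }

        seconds = seconds * 10 + (c - '0');

        // Clamp as soon as the limit is reached, which also keeps
        // `seconds` small enough that it cannot overflow.
        if (Millis(seconds * 1000) >= maxAllowed)
        {
            return maxAllowed;
        }
    }

    return (seconds == 0) ? fallback : Millis(seconds * 1000);
}
```

Clamping on the server side keeps a client from tying up the border router with an arbitrarily long collection window, while still letting apps trade response time against completeness.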