[rest] /diagnostics service returns partial results, purges any diagnostics data older than 3 seconds. #1773

ekobres · 2023-03-01T15:41:10Z

Describe the bug
The Topology view in the Web GUI frequently only loads partial router data, showing an incomplete graph, even in a well-functioning network.

Inspection of the underlying /diagnostics service shows that there are 2 timeouts associated with returning and cleaning diagnostics information:

ot-br-posix/src/rest/resource.cpp

Lines 78 to 82 in 172dcc1

    // Timeout (in Microseconds) for deleting outdated diagnostics
    static const uint32_t kDiagResetTimeout = 3000000;
    // Timeout (in Microseconds) for collecting diagnostics
    static const uint32_t kDiagCollectTimeout = 2000000;

As I read it, this means a /diagnostics API request will terminate after 2 seconds (fine), and at the callback, any diagnostics data older than 3 seconds will be purged from the list (maybe not fine.)
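To make that reading concrete, here is a minimal sketch of the two timeouts as described. It is not the code in resource.cpp: `DiagEntry`, `ArrivedInCollectWindow` and `PurgeOutdated` are hypothetical names, and timestamps are plain microsecond counters supplied by the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Values quoted from src/rest/resource.cpp (both in microseconds).
static const uint32_t kDiagResetTimeout   = 3000000;
static const uint32_t kDiagCollectTimeout = 2000000;

// Hypothetical stand-in for one router's stored diagnostic response.
struct DiagEntry
{
    std::string mExtAddress;   // router extended address
    uint64_t    mReceivedAtUs; // arrival time in microseconds
};

// True if a response arrived before the request that triggered it timed out.
bool ArrivedInCollectWindow(uint64_t aRequestStartUs, uint64_t aReceivedAtUs)
{
    assert(aReceivedAtUs >= aRequestStartUs); // a response cannot precede its request
    return aReceivedAtUs - aRequestStartUs <= kDiagCollectTimeout;
}

// Returns the entries that survive a purge at aNowUs; entries older than
// kDiagResetTimeout are dropped.
std::vector<DiagEntry> PurgeOutdated(const std::vector<DiagEntry> &aEntries, uint64_t aNowUs)
{
    std::vector<DiagEntry> kept;

    for (const DiagEntry &entry : aEntries)
    {
        // Treat an entry stamped in the future as having age zero.
        uint64_t age = (aNowUs > entry.mReceivedAtUs) ? aNowUs - entry.mReceivedAtUs : 0;

        if (age <= kDiagResetTimeout)
        {
            kept.push_back(entry);
        }
    }
    return kept;
}
```

Under these assumptions, a response that arrives more than 2 seconds after its request misses that request, and it is only returned by a later request if the purge runs within 3 seconds of its arrival.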

Perhaps the design intent was for 20 seconds and 30 seconds? No notes or docs to clarify - but 2 & 3 seconds seems very optimistic.

Even a networkdiagnostic get ff02::1 0 1 6 call takes 6 seconds to complete on my system.

The actual behavior on my network with 21 nodes results in the Topology map showing a subset of actual network nodes, with as few as zero and as many as 13.

Looking at the code, and at the behavior, it seems that there's no way for any information older than 3 seconds to make it back from a call to /diagnostics.

To Reproduce

Run /diagnostics API on any thread network that routinely takes longer than 3 seconds to complete:

networkdiagnostic get ff02::1 0 1 6

run:
curl http://<your-otbr-instance>:8081/diagnostics

Inspect the ExtAddress entities and note that there are missing items.

or:

Open the topology viewer in the Web GUI and note whether all network nodes are displayed.

Git commit id: Any - source is the same since initial check-in.
IEEE 802.15.4 hardware platform: SI Labs EFR32 (HA SkyConnect dongle)
Build steps: ARM64
Network topology: Single OTBR on Raspberry Pi 4b + 20 Nanoleaf A9 smart bulbs.

Expected behavior

/diagnostics should enumerate available information to match the true network topology. Instead, it times out very quickly and returns partial results on anything larger than a trivially small network.

Console/log output

Node API information - note leader and number of routers.

curl http://x.x.x.x:8081/node

{
    "State":    2,
    "NumOfRouter":    21,
    "RlocAddress":    "fd7d:xxxx:xxxx:xxxx:0:ff:fe00:3c18",
    "ExtAddress":    "52954E48XXXXXXXX",
    "NetworkName":    "home-assistant",
    "Rloc16":    15384,
    "LeaderData":    {
        "PartitionId":    1937964234,
        "Weighting":    64,
        "DataVersion":    251,
        "StableDataVersion":    182,
        "LeaderRouterId":    18
    },
    "ExtPanId":    "A683XXXXXXXXXXXX"
}

Topology from same network:

Additional context

This is presumably only a real problem for the OTBR Web GUI Topology UI, as there is no published documentation for the REST API.

Recommended solution - test larger default values (e.g. a 30 second kDiagResetTimeout and a 20 second kDiagCollectTimeout) and consider adding timeout parameters to the API so apps can determine the response time and data freshness.
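One way the timeout parameters could look, as a minimal sketch only: the request carries the timeouts in its query string, and the server falls back to a default and clamps to a maximum. The parameter names (`collect`, `reset`), the seconds unit and the `TimeoutFromQuery` helper are all assumptions for illustration; nothing like this exists in the current REST API.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

static const uint32_t kUsPerSecond = 1000000;

// Looks up aKey in a query string such as "collect=10&reset=120" and returns
// its value converted from seconds to microseconds. Falls back to aDefaultUs
// when the key is missing, empty, zero or not a number, and clamps to aMaxUs.
uint32_t TimeoutFromQuery(const std::string &aQuery, const std::string &aKey, uint32_t aDefaultUs, uint32_t aMaxUs)
{
    assert(aMaxUs >= kUsPerSecond); // the cap must allow at least one second

    std::string::size_type pos = 0;

    while (pos <= aQuery.size())
    {
        std::string::size_type end = aQuery.find('&', pos);
        if (end == std::string::npos)
        {
            end = aQuery.size();
        }

        std::string            pair = aQuery.substr(pos, end - pos);
        std::string::size_type eq   = pair.find('=');

        if (eq != std::string::npos && pair.compare(0, eq, aKey) == 0)
        {
            std::string value   = pair.substr(eq + 1);
            uint64_t    seconds = 0;

            if (value.empty())
            {
                return aDefaultUs;
            }
            for (char c : value)
            {
                if (c < '0' || c > '9')
                {
                    return aDefaultUs; // malformed number
                }
                seconds = seconds * 10 + static_cast<uint64_t>(c - '0');
                if (seconds > aMaxUs / kUsPerSecond)
                {
                    return aMaxUs; // clamp early, which also avoids overflow
                }
            }
            return (seconds == 0) ? aDefaultUs : static_cast<uint32_t>(seconds * kUsPerSecond);
        }
        pos = end + 1;
    }
    return aDefaultUs;
}
```

A request such as `/diagnostics?collect=10&reset=120` would then yield 10000000 and 120000000 microseconds, while a request without parameters keeps today's defaults.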

Alternative solution - provide configuration parameters for the OTBR Web GUI to adjust these timeouts.


wgtdkp · 2023-03-14T07:55:31Z

@ekobres Thanks for reporting this issue! I agree it's too short to collect diagnostic info in 3 seconds. Enlarging kDiagCollectTimeout to 20 seconds will result in a /diagnostics request taking 20 seconds to respond, which is probably not acceptable in general use cases.

Adding an argument to the API to specify the timeouts sounds like a reasonable solution.
The best solution is probably to use a persistent HTTP connection to stream the diagnostic info back to the client, but it requires significantly more effort to support in OTBR Web.

Will you be able to contribute option 1 or 2?

ekobres · 2023-03-14T14:15:03Z

Correct me if I am wrong, but the timeout is only the maximum time. If diagnostics reporting is completed in less time, then the call will complete sooner. Obviously 20 seconds seems too long for a healthy mesh, but it would return much faster normally. Also, hitting timeout could return a timeout error with partial data so the UI can hint diagnostics are slow. I would be happy to contribute but I have never built OTBR. But let me play around with first getting an OTBR environment set up with nRF52840 dongle - I have one of those and an RPi3b+, so maybe I can figure it out.

wgtdkp · 2023-03-15T04:02:42Z

> Correct me if I am wrong, but the timeout is only the maximum time. If diagnostics reporting is completed in less time, then the call will complete sooner

The diagnostic request is multicast to all router devices in the mesh, so there will be multiple unicast responses, and we have no mechanism to determine whether we have received responses from all routers since we don't know how many devices there are. So OTBR has to wait for the timeout to try to receive all those responses.

ekobres · 2023-03-15T19:15:59Z

Maybe I am misunderstanding something then, because it seems the /diagnostics API is getting the number of routers from somewhere. The leader knows the roles of every node in the mesh, especially the routers. Nevertheless - the number of routers is right there in the diagnostics JSON:

"Connectivity":	{
			"ParentPriority":	0,
			"LinkQuality3":	1,
			"LinkQuality2":	2,
			"LinkQuality1":	1,
			"LeaderCost":	1,
			"IdSequence":	115,
			"ActiveRouters":	20,
			"SedBufferSize":	1280,
			"SedDatagramCount":	1
		}
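A minimal sketch of how a known router count could allow early completion: collection finishes as soon as the number of distinct responders reaches the expected count, and otherwise ends at the timeout with a "partial" state the UI could surface. `DiagCollector` and `CollectState` are hypothetical names, and it is an assumption that the expected count (for example the `ActiveRouters` value above) is available before collection ends.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>

enum class CollectState
{
    kWaiting,         // still inside the collect window, responses missing
    kComplete,        // every expected router has answered
    kTimedOutPartial, // window elapsed with responses still missing
};

// Hypothetical collector tracking which routers have answered one request.
class DiagCollector
{
public:
    DiagCollector(size_t aExpectedRouters, uint64_t aStartUs, uint32_t aCollectTimeoutUs)
        : mExpectedRouters(aExpectedRouters)
        , mStartUs(aStartUs)
        , mCollectTimeoutUs(aCollectTimeoutUs)
    {
        assert(aCollectTimeoutUs > 0); // a zero window could never collect anything
    }

    // Records a response; duplicates from the same router count once.
    void OnResponse(const std::string &aExtAddress) { mSeen.insert(aExtAddress); }

    // Reports whether the request can be answered at time aNowUs.
    CollectState Poll(uint64_t aNowUs) const
    {
        if (mExpectedRouters > 0 && mSeen.size() >= mExpectedRouters)
        {
            return CollectState::kComplete;
        }
        if (aNowUs >= mStartUs && aNowUs - mStartUs >= mCollectTimeoutUs)
        {
            return CollectState::kTimedOutPartial;
        }
        return CollectState::kWaiting;
    }

    size_t ResponseCount(void) const { return mSeen.size(); }

private:
    size_t                mExpectedRouters;
    uint64_t              mStartUs;
    uint32_t              mCollectTimeoutUs;
    std::set<std::string> mSeen;
};
```

With this shape, a larger collect timeout only costs time when routers fail to answer, which is the case where a partial-result hint is useful anyway.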

abtink · 2023-03-15T19:22:20Z

The recently added mesh-diag APIs and CLI commands can help here:

[mesh-diag] new module and new API to discover network topology openthread#8682.

This adds new APIs which use the underlying net-diag TMF commands to make it easier to discover topology.
openthread/openthread#8460 tracks new features (related PRs).

ekobres · 2023-03-15T21:00:35Z

I have forked and tested some new values which provide a more reliable experience with the Topology page.
There is a fair amount of work that could be done to improve this - but with new timeout values we at least have a GUI that can capture all of the routers in a non-trivial thread mesh fairly consistently.

I settled on 120 seconds for the kDiagResetTimeout and 10 seconds for the kDiagCollectTimeout. With these values I am able to get all of my mesh with 20 routers to populate with one or two reloads. Previously I was never able to get more than 11 routers to populate.

wgtdkp · 2023-03-16T03:26:08Z

@ekobres I missed the ActiveRouters! But we still need to wait for the kDiagCollectTimeout period before receiving any responses.

wgtdkp · 2023-03-16T03:33:51Z

@abtink Yes the new APIs should be more useful, but it probably doesn't help this issue if a RESTful API is required (@ekobres you may want to try Abtin's API if RESTful API isn't mandatory).

ekobres · 2023-03-16T17:28:59Z

> The recently added mesh-diag APIs and CLI commands can help here:

@wgtdkp Wow. It's fast, too. Thanks for pointing this out!

philipflesher · 2024-09-03T13:52:22Z

@ekobres coming around on this issue because I have realized the existing web UI topology tool is giving me the same problems as originally described here, with sometimes terribly disconnected graphs.

Is it possible we can use the "new" mesh-diag path in the web UI, such that the topology view would hit a new endpoint similar to /diagnostics, which would call the new path and then construct an appropriate graph?

jwhui · 2024-09-03T21:04:32Z

> Is it possible we can use the "new" mesh-diag path in the web UI, such that the topology view would hit a new endpoint similar to /diagnostics, which would call the new path and then construct an appropriate graph?

It should be possible to leverage the newer Thread Diagnostics capabilities. Contributions are welcome! :D

philipflesher · 2024-09-03T21:55:26Z

I wish I could contribute on this, but sadly I do not have an OTBR device to deploy code to and test against. :(

Is there any doc on getting a full dev environment set up with a minimal (hopefully inexpensive) device?

jwhui · 2024-09-03T22:03:48Z

@philipflesher , you can try the OTBR Codelab, which builds on Raspberry Pi

https://openthread.io/codelabs/openthread-border-router

philipflesher · 2024-09-04T14:25:46Z

Looks straightforward. I might get on this.

Realizing if I'm going to make changes, however, that I would need to simulate at least a medium-sized network, probably with some delays and failures. Is there any dev path for creating simulated networks that include routers and end devices?

ekobres changed the title ~~[rest] /Diagnostics service returns partial results, purges any diagnostics data older than 3 seconds.~~ [rest] /diagnostics service returns partial results, purges any diagnostics data older than 3 seconds. Mar 1, 2023
