[rest] /diagnostics service returns partial results, purges any diagnostics data older than 3 seconds. #1773

Open
ekobres opened this issue Mar 1, 2023 · 14 comments

Comments

@ekobres

ekobres commented Mar 1, 2023

Describe the bug
The Topology view in the Web GUI frequently only loads partial router data, showing an incomplete graph, even in a well-functioning network.

Inspection of the underlying /diagnostics service shows that there are 2 timeouts associated with returning and cleaning diagnostics information:

// Timeout (in Microseconds) for deleting outdated diagnostics
static const uint32_t kDiagResetTimeout = 3000000;
// Timeout (in Microseconds) for collecting diagnostics
static const uint32_t kDiagCollectTimeout = 2000000;

As I read it, this means a /diagnostics API request will terminate after 2 seconds (fine), and at the callback, any diagnostics data older than 3 seconds will be purged from the list (maybe not fine.)
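To make that reading concrete, here is a minimal sketch of the purge step as I understand it. It is not the OTBR implementation: `DiagEntry` and `PurgeOutdated` are hypothetical names, and only the `kDiagResetTimeout` value is taken from the source.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of one cached diagnostics response (not an OTBR type).
struct DiagEntry
{
    std::string mExtAddress;
    uint64_t    mReceivedUs; // arrival time, in microseconds
};

// Value quoted from the source, in microseconds.
static const uint32_t kDiagResetTimeout = 3000000;

// Assumed behavior: drop every entry whose age exceeds the reset timeout.
inline void PurgeOutdated(std::vector<DiagEntry> &aEntries,
                          uint64_t                aNowUs,
                          uint32_t                aResetTimeoutUs = kDiagResetTimeout)
{
    aEntries.erase(std::remove_if(aEntries.begin(), aEntries.end(),
                                  [=](const DiagEntry &aEntry) {
                                      // Guard against entries stamped in the future.
                                      return aNowUs > aEntry.mReceivedUs &&
                                             aNowUs - aEntry.mReceivedUs > aResetTimeoutUs;
                                  }),
                   aEntries.end());
}
```

Under this model, a response that arrived at t = 0 s is gone by the time a 6-second collection finishes, which would explain the missing routers.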

Perhaps the design intent was for 20 seconds and 30 seconds? There are no notes or docs to clarify, but 2 and 3 seconds seem very optimistic.

Even a networkdiagnostic get ff02::1 0 1 6 call takes 6 seconds to complete on my system.

The actual behavior on my network with 21 nodes results in the Topology map showing a subset of actual network nodes, with as few as zero and as many as 13.

Looking at the code, and at the behavior, it seems that there's no way for any information older than 3 seconds to make it back from a call to /diagnostics.

To Reproduce

Call the /diagnostics API on any Thread network where the following CLI command routinely takes longer than 3 seconds to complete:

networkdiagnostic get ff02::1 0 1 6

Then run:
curl http://<your-otbr-instance>:8081/diagnostics

Inspect the ExtAddress entities and note that there are missing items.

or:

Open the topology viewer in the Web GUI and note whether all network nodes are displayed.

  1. Git commit id: Any - source is the same since initial check-in.
  2. IEEE 802.15.4 hardware platform: Silicon Labs EFR32 (HA SkyConnect dongle)
  3. Build steps: ARM64
  4. Network topology: Single OTBR on Raspberry Pi 4b + 20 Nanoleaf A9 smart bulbs.

Expected behavior

/diagnostics should enumerate available information to match the true network topology. Instead, it times out very quickly and returns partial results on anything larger than a trivially small network.

Console/log output

Node API information - note leader and number of routers.

curl http://x.x.x.x:8081/node

{
    "State":    2,
    "NumOfRouter":    21,
    "RlocAddress":    "fd7d:xxxx:xxxx:xxxx:0:ff:fe00:3c18",
    "ExtAddress":    "52954E48XXXXXXXX",
    "NetworkName":    "home-assistant",
    "Rloc16":    15384,
    "LeaderData":    {
        "PartitionId":    1937964234,
        "Weighting":    64,
        "DataVersion":    251,
        "StableDataVersion":    182,
        "LeaderRouterId":    18
    },
    "ExtPanId":    "A683XXXXXXXXXXXX"
}

Topology from same network:

[Screenshot of the topology view, taken 2023-03-01 at 10:37:49 AM; image not included]

Additional context

This is presumably only a real problem for the OTBR Web GUI Topology UI, as there is no published documentation for the REST API.

Recommended solution - test larger default values (e.g. a 30-second kDiagResetTimeout and a 20-second kDiagCollectTimeout), and consider adding timeout parameters to the API so apps can determine the response time and data freshness.
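A rough sketch of what a per-request timeout could look like, assuming a hypothetical `timeout` query parameter given in whole seconds. The parameter name, the helper `ResolveCollectTimeoutUs`, and the 1 to 30 second bounds are all assumptions for illustration; only the 2-second default comes from the current code.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Default taken from the current code (microseconds).
static const uint32_t kDefaultCollectTimeoutUs = 2000000;
// Assumed bounds, in seconds, so a client cannot stall the server indefinitely.
static const uint64_t kMinCollectTimeoutSec = 1;
static const uint64_t kMaxCollectTimeoutSec = 30;

// Convert the raw value of a hypothetical "timeout" query parameter (seconds)
// into microseconds. Empty, overlong, or non-numeric input falls back to the default.
inline uint32_t ResolveCollectTimeoutUs(const std::string &aParam)
{
    if (aParam.empty() || aParam.size() > 6)
    {
        return kDefaultCollectTimeoutUs;
    }

    uint64_t seconds = 0;
    for (char c : aParam)
    {
        if (c < '0' || c > '9')
        {
            return kDefaultCollectTimeoutUs;
        }
        seconds = seconds * 10 + static_cast<uint64_t>(c - '0');
    }

    // Clamp into the allowed range before converting to microseconds.
    seconds = std::max(kMinCollectTimeoutSec, std::min(seconds, kMaxCollectTimeoutSec));
    return static_cast<uint32_t>(seconds * 1000000);
}
```

Clamping rather than rejecting out-of-range values keeps the endpoint forgiving for existing callers, though returning an HTTP error would be an equally reasonable choice.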

Alternative solution - provide configuration parameters for the OTBR Web GUI to adjust these timeouts.

@ekobres changed the title from "[rest] /Diagnostics service returns partial results, purges any diagnostics data older than 3 seconds." to "[rest] /diagnostics service returns partial results, purges any diagnostics data older than 3 seconds." on Mar 1, 2023
@wgtdkp
Member

wgtdkp commented Mar 14, 2023

@ekobres Thanks for reporting this issue! I agree that 3 seconds is too short to collect diagnostic info. Enlarging kDiagCollectTimeout to 20 seconds will result in /diagnostics taking 20 seconds to respond, which is probably not acceptable in general use cases.

  1. Adding an argument to the API to specify the timeouts sounds like a reasonable solution.
  2. The best solution is probably to use a persistent HTTP connection to stream the diagnostic info back to the client, but it requires significantly more effort to support in OTBR Web.

Will you be able to contribute option 1 or 2?

@ekobres
Author

ekobres commented Mar 14, 2023 via email

@wgtdkp
Member

wgtdkp commented Mar 15, 2023

Correct me if I am wrong, but the timeout is only the maximum time. If diagnostics reporting is completed in less time, then the call will complete sooner

The diagnostic request is multicast to all router devices in the mesh, so there will be multiple unicast responses, and we have no mechanism to determine whether we have received responses from all routers since we don't know how many devices there are. So OTBR has to wait for the timeout to try to receive all those responses.

@ekobres
Author

ekobres commented Mar 15, 2023

Maybe I am misunderstanding something then, because it seems the /diagnostics API is getting the number of routers from somewhere. The leader knows the roles of every node in the mesh, especially the routers. Nevertheless - the number of routers is right there in the diagnostics JSON:

"Connectivity":	{
			"ParentPriority":	0,
			"LinkQuality3":	1,
			"LinkQuality2":	2,
			"LinkQuality1":	1,
			"LeaderCost":	1,
			"IdSequence":	115,
			"ActiveRouters":	20,
			"SedBufferSize":	1280,
			"SedDatagramCount":	1
		}
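If the router count can be trusted, collection could finish early instead of always waiting out the timeout. The following is only a sketch of that idea: `DiagCollector` is a hypothetical class, and it assumes the expected count is known before collection starts (for example from NumOfRouter in the /node response).

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>

// Hypothetical collector: done when every expected router has answered,
// or when the collect timeout expires, whichever comes first.
class DiagCollector
{
public:
    DiagCollector(size_t aExpectedRouters, uint64_t aStartUs, uint32_t aCollectTimeoutUs)
        : mExpectedRouters(aExpectedRouters)
        , mStartUs(aStartUs)
        , mCollectTimeoutUs(aCollectTimeoutUs)
    {
    }

    // Record a response; duplicates from the same ExtAddress count once.
    void OnResponse(const std::string &aExtAddress) { mSeen.insert(aExtAddress); }

    bool IsDone(uint64_t aNowUs) const
    {
        bool allAnswered = mSeen.size() >= mExpectedRouters;
        bool timedOut    = aNowUs >= mStartUs && aNowUs - mStartUs >= mCollectTimeoutUs;
        return allAnswered || timedOut;
    }

private:
    size_t                mExpectedRouters;
    uint64_t              mStartUs;
    uint32_t              mCollectTimeoutUs;
    std::set<std::string> mSeen;
};
```

With something like this, a longer timeout would only cost time when a router actually fails to respond.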

@abtink
Member

abtink commented Mar 15, 2023

The recently added mesh-diag APIs and CLI commands can help here:

This adds new APIs which use the underlying net-diag TMF commands to make it easier to discover topology.
openthread/openthread#8460 tracks new features (related PRs).

@ekobres
Author

ekobres commented Mar 15, 2023

I have forked and tested some new values which provide a more reliable experience with the Topology page.
There is a fair amount of work that could be done to improve this, but with new timeout values we at least have a GUI that can capture all of the routers in a non-trivial Thread mesh fairly consistently.

I settled on 120 seconds for the kDiagResetTimeout and 10 seconds for the kDiagCollectTimeout. With these values I am able to get all of my mesh with 20 routers to populate with one or two reloads. Previously I was never able to get more than 11 routers to populate.
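Expressed against the two constants quoted at the top of this issue, the change would look roughly like this. The `kUsPerSecond` helper is my own addition for readability, not something in the source.

```cpp
#include <cstdint>

// Helper constant added for readability (not in the original source).
static const uint32_t kUsPerSecond = 1000000;

// Timeout (in microseconds) for deleting outdated diagnostics: 120 seconds
static const uint32_t kDiagResetTimeout = 120 * kUsPerSecond;
// Timeout (in microseconds) for collecting diagnostics: 10 seconds
static const uint32_t kDiagCollectTimeout = 10 * kUsPerSecond;
```

Both values still fit comfortably in a uint32_t, which tops out at roughly 4294 seconds when counting microseconds.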

@wgtdkp
Member

wgtdkp commented Mar 16, 2023

@ekobres I missed the ActiveRouters! But we still need to wait for time of kDiagCollectTimeout before receiving any responses.

@wgtdkp
Member

wgtdkp commented Mar 16, 2023

@abtink Yes, the new APIs should be more useful, but they probably don't help with this issue if a RESTful API is required (@ekobres you may want to try Abtin's API if a RESTful API isn't mandatory).

@ekobres
Author

ekobres commented Mar 16, 2023

The recently added mesh-diag APIs and CLI commands can help here:

@wgtdkp Wow. It's fast, too. Thanks for pointing this out!

@philipflesher

@ekobres coming around on this issue because I have realized the existing web UI topology tool is giving me the same problems as originally described here, with sometimes terribly disconnected graphs.

Is it possible we can use the "new" mesh-diag path in the web UI, such that the topology view would hit a new endpoint similar to /diagnostics, which would call the new path and then construct an appropriate graph?

@jwhui
Member

jwhui commented Sep 3, 2024

Is it possible we can use the "new" mesh-diag path in the web UI, such that the topology view would hit a new endpoint similar to /diagnostics, which would call the new path and then construct an appropriate graph?

It should be possible to leverage the newer Thread Diagnostics capabilities. Contributions are welcome! :D

@philipflesher

I wish I could contribute on this, but sadly I do not have an OTBR device to deploy code to and test against. :(

Is there any doc on getting a full dev environment set up with a minimal (hopefully inexpensive) device?

@jwhui
Member

jwhui commented Sep 3, 2024

@philipflesher, you can try the OTBR Codelab, which builds on Raspberry Pi

https://openthread.io/codelabs/openthread-border-router

@philipflesher

Looks straightforward. I might get on this.

I realize, however, that if I'm going to make changes, I would need to simulate at least a medium-sized network, probably with some delays and failures. Is there any dev path for creating simulated networks that include routers and end devices?
