Query core-command device taking long time on first execution #2526

jinlinGuan · 2020-04-28T08:23:16Z

🐞 Bug Report

Affected Services

The issue is located in:

core-command

Description and Minimal Reproduction

curl http://localhost:48082/api/v1/device was stuck(#2399), and that issue is fixed. Now it can query properly, but it's taking long time(10+s) on first run, which always causes blackbox-testing failure because the response time is expected to be less than 800ms.

🔥 Exception or Error

13:07:49 ❏ device
13:07:49 ↳ http://localhost:48082/api/v1/device
13:07:50   GET http://edgex-core-command:48082/api/v1/device [200 OK, 13.47KB, 15.4s]
13:08:05   ✓  Status code is 200
13:08:05   1. Response time is less than 800
13:08:05   ✓  Has Content-Type
13:08:05   ✓  Content-Type is application/json

🌍 Your Environment

Deployment Environment:
ARM64, AMD64
EdgeX Version:
Master

Anything else relevant?

The text was updated successfully, but these errors were encountered:

AnthonyMBonafide · 2020-04-28T14:50:11Z

I think this is a known complex issue with how we are bootstrapping the system. We touched on this issue in PR #2450. The fix for this is something we would possibly want to tackle in the future since a true fix would require some rework in a couple of modules.

@brandonforster can you verify this is the same underlying issue we ran into in the past?

brandonforster · 2020-04-28T14:57:11Z

Yes, that's indeed the problem.

Accordingly, @jinlinGuan, the correct fix for blackbox tests is to adjust the test timeout relative to what we expect initialization time to be. In this case, EdgeX has its client timeout configured to be 45000 ms, or 45 seconds. 800 ms is entirely too short and I expect this test will keep failing.

AnthonyMBonafide · 2020-04-29T15:39:42Z

Yes, that's indeed the problem.

Accordingly, @jinlinGuan, the correct fix for blackbox tests is to adjust the test timeout relative to what we expect initialization time to be. In this case, EdgeX has its client timeout configured to be 45000 ms, or 45 seconds. 800 ms is entirely too short and I expect this test will keep failing.

@jinlinGuan , is this solution acceptable and if so would you like this introduced in the Blackbox tests or is this something you or the QA team will handle?

jinlinGuan · 2020-04-30T07:07:01Z

Adjusting the timeout is meaningless for the performance testing, so let the testcase keep failing until the fix is ready.

brandonforster · 2020-04-30T15:21:15Z

Adjusting the timeout is meaningless for the performance testing, so let the testcase keep failing until the fix is ready.

This is the fix. We expect the first request to a service to take significantly longer than all the rest, and this is what we are going to ship in Geneva.

AnthonyMBonafide · 2020-04-30T18:42:59Z

The true fix for this is to either:

Add functionality within go-mod-bootstrap to allow the di container to create entities both lazily and eagerly. By leveraging eager initialization, it will result in the Consul lookup, which is what is causing the issue you are running into, to happen during startup and by the time the application is ready to accept requests we would have the client initialized.
Update the functionality in go-mod-bootstrap's registry.Client to get the URL's from Consul at startup and perform updates in the background as the application is running. This will result in the first request using the already obtained URL and alleviating the issue.

As a short term fix, we can simple perform an eager initialization within the applications. This solution was originally explored and provided a quick fix to the problem. However, this is something that @michaelestrin highly advice against as it does not address the root issue. See the comments in PR #2450

michaelestrin · 2020-04-30T19:01:01Z

Some thoughts.

The DI implementation already provides for both eager and lazy initialization. In this particular use case, eager isn't possible unless you know the array of clients you'll be communicating with. #2450

Test failure appears to be on ARM. ARM is s--l---o----w. Expecting to interject a Consul lookup in the first call on ARM will likely ALWAYS exceed the 800ms threshold.

Performance testing ARM? Pffft. (I guess you probably do want to quantify how slowly it runs.)

Symptom fix -- which I don't recommend -- would be to prime the pump in tests by forcing communication with each client (which would incur the Consul round-trip overhead) prior to executing the actual tests.

Root cause fix would be to do the Consul round-trip during service bootstrap.

"Good night and good luck."

cloudxxx8 · 2020-05-02T03:37:51Z

Thanks for all the investigations and explanation. I suggest to do the following things to resolve this issue:

Make the test of this API twice. The first call is with 45 seconds timeout as @brandonforster suggested, and the second call uses the original 800 ms timeout. With this adjustment, we can make sure the behavior is as expected.
Add a statement to the document of core-command to remind the users about this situation, because most users and applications won’t expect a rest call would take so long.
Is it acceptable?

brandonforster · 2020-05-02T03:41:13Z

@cloudxxx8 Yes, with one caveat: this is not just command, this is all services. The first call to any service should be expected to take a significantly longer amount of time than subsequent calls due to the Consul round trip.

cloudxxx8 · 2020-05-02T04:08:19Z

I see. In this case, may we add this statement into the general service page?
@lenny-intel do you notice this issue and have plan to improve this part in the next release?

lenny-goodell · 2020-06-08T18:17:49Z

@cloudxxx8 , I dug into the Client handling and now see the issue with lazy initialization of the URL.

In Fuji, if I recall correctly, it used the URL created here initially:
https://github.com/edgexfoundry/edgex-go/blob/master/internal/core/command/init.go#L74
And then periodically updated it. Now this URL is never used, thus the wait for first initialization here:
https://github.com/edgexfoundry/go-mod-core-contracts/blob/master/clients/urlclient/retry/client.go#L74

The issue is that we are throwing away that initial URL (which we have configured correctly).
I propose using the URL here so the client is initialized from the start as it was in Fuji.
https://github.com/edgexfoundry/go-mod-core-contracts/blob/master/clients/urlclient/retry/client.go#L42
Obviously the URL needs to be passed to this function.

@tsconn23, @jpwhitemn, we should discuss this more in this week's Core WG.

lenny-goodell · 2020-06-08T18:30:19Z

Ok, here is what Fuji is doing.
https://github.com/edgexfoundry/go-mod-core-contracts/blob/v0.1.33/clients/metadata/device.go#L91
It is fetching the URL immediately so that it is available for use w/o waiting for it to be initialized.

In both Fuji and Geneva the initial URL is only used when not monitoring, i.e. not using the Registry.

cloudxxx8 · 2020-07-06T09:50:51Z

fixed by #2595

jinlinGuan added the bug Something isn't working label Apr 28, 2020

jinlinGuan mentioned this issue May 4, 2020

Add testcase with logner timeout to core-command edgexfoundry/blackbox-testing#446

Merged

chr1shung mentioned this issue Jun 5, 2020

[metadata] If EnableValueDescriptorManagment=true, error is returned on first device profile creation #2580

Closed

cloudxxx8 closed this as completed Jul 6, 2020

jinlinGuan mentioned this issue Jul 6, 2020

Remove core-command the first call testcase edgexfoundry/blackbox-testing#462

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Query core-command device taking long time on first execution #2526

Query core-command device taking long time on first execution #2526

jinlinGuan commented Apr 28, 2020

AnthonyMBonafide commented Apr 28, 2020

brandonforster commented Apr 28, 2020 •

edited

Loading

AnthonyMBonafide commented Apr 29, 2020

jinlinGuan commented Apr 30, 2020

brandonforster commented Apr 30, 2020

AnthonyMBonafide commented Apr 30, 2020

michaelestrin commented Apr 30, 2020

cloudxxx8 commented May 2, 2020

brandonforster commented May 2, 2020

cloudxxx8 commented May 2, 2020

lenny-goodell commented Jun 8, 2020

lenny-goodell commented Jun 8, 2020

cloudxxx8 commented Jul 6, 2020

Query core-command device taking long time on first execution #2526

Query core-command device taking long time on first execution #2526

Comments

jinlinGuan commented Apr 28, 2020

🐞 Bug Report

Affected Services

Description and Minimal Reproduction

🔥 Exception or Error

🌍 Your Environment

AnthonyMBonafide commented Apr 28, 2020

brandonforster commented Apr 28, 2020 • edited Loading

AnthonyMBonafide commented Apr 29, 2020

jinlinGuan commented Apr 30, 2020

brandonforster commented Apr 30, 2020

AnthonyMBonafide commented Apr 30, 2020

michaelestrin commented Apr 30, 2020

cloudxxx8 commented May 2, 2020

brandonforster commented May 2, 2020

cloudxxx8 commented May 2, 2020

lenny-goodell commented Jun 8, 2020

lenny-goodell commented Jun 8, 2020

cloudxxx8 commented Jul 6, 2020

brandonforster commented Apr 28, 2020 •

edited

Loading