Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Query core-command device taking long time on first execution #2526

Closed
jinlinGuan opened this issue Apr 28, 2020 · 13 comments
Closed

Query core-command device taking long time on first execution #2526

jinlinGuan opened this issue Apr 28, 2020 · 13 comments
Labels
bug Something isn't working

Comments

@jinlinGuan
Copy link
Contributor

🐞 Bug Report

Affected Services

The issue is located in:

core-command

Description and Minimal Reproduction

curl http://localhost:48082/api/v1/device was stuck(#2399), and that issue is fixed. Now it can query properly, but it's taking long time(10+s) on first run, which always causes blackbox-testing failure because the response time is expected to be less than 800ms.

🔥 Exception or Error

13:07:49 ❏ device
13:07:49 ↳ http://localhost:48082/api/v1/device
13:07:50   GET http://edgex-core-command:48082/api/v1/device [200 OK, 13.47KB, 15.4s]
13:08:05   ✓  Status code is 200
13:08:05   1. Response time is less than 800
13:08:05   ✓  Has Content-Type
13:08:05   ✓  Content-Type is application/json

🌍 Your Environment

Deployment Environment:
ARM64, AMD64
EdgeX Version:
Master

Anything else relevant?

@jinlinGuan jinlinGuan added the bug Something isn't working label Apr 28, 2020
@AnthonyMBonafide
Copy link
Contributor

I think this is a known complex issue with how we are bootstrapping the system. We touched on this issue in PR #2450. The fix for this is something we would possibly want to tackle in the future since a true fix would require some rework in a couple of modules.

@brandonforster can you verify this is the same underlying issue we ran into in the past?

@brandonforster
Copy link
Contributor

brandonforster commented Apr 28, 2020

Yes, that's indeed the problem.

Accordingly, @jinlinGuan, the correct fix for blackbox tests is to adjust the test timeout relative to what we expect initialization time to be. In this case, EdgeX has its client timeout configured to be 45000 ms, or 45 seconds. 800 ms is entirely too short and I expect this test will keep failing.

@AnthonyMBonafide
Copy link
Contributor

Yes, that's indeed the problem.

Accordingly, @jinlinGuan, the correct fix for blackbox tests is to adjust the test timeout relative to what we expect initialization time to be. In this case, EdgeX has its client timeout configured to be 45000 ms, or 45 seconds. 800 ms is entirely too short and I expect this test will keep failing.

@jinlinGuan , is this solution acceptable and if so would you like this introduced in the Blackbox tests or is this something you or the QA team will handle?

@jinlinGuan
Copy link
Contributor Author

Adjusting the timeout is meaningless for the performance testing, so let the testcase keep failing until the fix is ready.

@brandonforster
Copy link
Contributor

Adjusting the timeout is meaningless for the performance testing, so let the testcase keep failing until the fix is ready.

This is the fix. We expect the first request to a service to take significantly longer than all the rest, and this is what we are going to ship in Geneva.

@AnthonyMBonafide
Copy link
Contributor

The true fix for this is to either:

  1. Add functionality within go-mod-bootstrap to allow the di container to create entities both lazily and eagerly. By leveraging eager initialization, it will result in the Consul lookup, which is what is causing the issue you are running into, to happen during startup and by the time the application is ready to accept requests we would have the client initialized.

  2. Update the functionality in go-mod-bootstrap's registry.Client to get the URL's from Consul at startup and perform updates in the background as the application is running. This will result in the first request using the already obtained URL and alleviating the issue.

As a short term fix, we can simple perform an eager initialization within the applications. This solution was originally explored and provided a quick fix to the problem. However, this is something that @michaelestrin highly advice against as it does not address the root issue. See the comments in PR #2450

@michaelestrin
Copy link
Member

Some thoughts.

The DI implementation already provides for both eager and lazy initialization. In this particular use case, eager isn't possible unless you know the array of clients you'll be communicating with. #2450

Test failure appears to be on ARM. ARM is s--l---o----w. Expecting to interject a Consul lookup in the first call on ARM will likely ALWAYS exceed the 800ms threshold.

Performance testing ARM? Pffft. (I guess you probably do want to quantify how slowly it runs.)

Symptom fix -- which I don't recommend -- would be to prime the pump in tests by forcing communication with each client (which would incur the Consul round-trip overhead) prior to executing the actual tests.

Root cause fix would be to do the Consul round-trip during service bootstrap.

"Good night and good luck."

@cloudxxx8
Copy link
Member

Thanks for all the investigations and explanation. I suggest to do the following things to resolve this issue:

  1. Make the test of this API twice. The first call is with 45 seconds timeout as @brandonforster suggested, and the second call uses the original 800 ms timeout. With this adjustment, we can make sure the behavior is as expected.
  2. Add a statement to the document of core-command to remind the users about this situation, because most users and applications won’t expect a rest call would take so long.
    Is it acceptable?

@brandonforster
Copy link
Contributor

@cloudxxx8 Yes, with one caveat: this is not just command, this is all services. The first call to any service should be expected to take a significantly longer amount of time than subsequent calls due to the Consul round trip.

@cloudxxx8
Copy link
Member

I see. In this case, may we add this statement into the general service page?
@lenny-intel do you notice this issue and have plan to improve this part in the next release?

@lenny-goodell
Copy link
Member

@cloudxxx8 , I dug into the Client handling and now see the issue with lazy initialization of the URL.

In Fuji, if I recall correctly, it used the URL created here initially:
https://github.com/edgexfoundry/edgex-go/blob/master/internal/core/command/init.go#L74
And then periodically updated it. Now this URL is never used, thus the wait for first initialization here:
https://github.com/edgexfoundry/go-mod-core-contracts/blob/master/clients/urlclient/retry/client.go#L74

The issue is that we are throwing away that initial URL (which we have configured correctly).
I propose using the URL here so the client is initialized from the start as it was in Fuji.
https://github.com/edgexfoundry/go-mod-core-contracts/blob/master/clients/urlclient/retry/client.go#L42
Obviously the URL needs to be passed to this function.

@tsconn23, @jpwhitemn, we should discuss this more in this week's Core WG.

@lenny-goodell
Copy link
Member

Ok, here is what Fuji is doing.
https://github.com/edgexfoundry/go-mod-core-contracts/blob/v0.1.33/clients/metadata/device.go#L91
It is fetching the URL immediately so that it is available for use w/o waiting for it to be initialized.

In both Fuji and Geneva the initial URL is only used when not monitoring, i.e. not using the Registry.

@cloudxxx8
Copy link
Member

fixed by #2595

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants