Race condition when initializing external clients using bootstrap #2450

Merged: 1 commit, Apr 1, 2020

Conversation

@brandonforster (Contributor) commented Mar 24, 2020

PR Checklist

Please check if your PR fulfills the following requirements:

PR Type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update (formatting, local variables)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Other... Please describe:

What is the current behavior?

A race occurs during bootstrap external client initialization, between the bootstrap handler and the bootstrap registry component. If the external client is initialized just in time (JIT), the registry variable no longer contains valid data when it is read. On subsequent calls, the registry variable has been updated and execution proceeds as normal.

Issue Number: Fixes #2421

What is the new behavior?

The external client is initialized before the bootstrap update code runs, and the race does not occur.
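
As a rough illustration (hypothetical names, not the actual EdgeX bootstrap code), the fix amounts to building the client eagerly, before the bootstrap update goroutine starts, rather than just in time inside the first request handler:

```go
package main

import (
	"fmt"
	"sync"
)

// registry stands in for the bootstrap registry component whose endpoint
// value the bootstrap handler updates in the background.
type registry struct {
	mu       sync.Mutex
	endpoint string
}

func (r *registry) Set(e string) { r.mu.Lock(); r.endpoint = e; r.mu.Unlock() }

func (r *registry) Get() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.endpoint
}

// metadataClient stands in for the external client used by the endpoint handlers.
type metadataClient struct{ url string }

func main() {
	reg := &registry{endpoint: "http://localhost:48081"} // seed value

	// The change described above: build the client before the bootstrap
	// update code runs, instead of just in time inside the first request.
	client := &metadataClient{url: reg.Get()}

	// The bootstrap update goroutine no longer races the first-request
	// initialization path, because that path no longer exists.
	go reg.Set("http://core-metadata:48081")

	fmt.Println("client initialized with", client.url)
}
```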

Does this PR introduce a breaking change?

  • Yes
  • No

Are there any specific instructions or things that should be known prior to reviewing?

To verify:

Reproduce the bug on master:
Run (in order) consul, redis, config-seed, core-data, core-metadata, core-command. Ensure the core services are configured for consul usage and not local configuration. Run the following curl command: curl http://localhost:48082/api/v1/device

Check out this PR's code, follow the steps above, and ensure that device data is returned.

Other information

The system management agent contains code that is very similar to the problematic code this PR identifies in command and metadata. I do not have enough familiarity with the SMA to say whether it will be an issue there.

@michaelestrin (Member) left a comment

Strongly recommend not to approve this PR. It addresses symptoms, not the root cause. Pushing code around to avoid a race condition indicates a lack of understanding of the actual root cause and does nothing to prevent the issue from cropping up elsewhere in the code.

You will likely have similar issues in the system management agent. Unfortunately, this "solution" isn't going to work there as the agent must do just-in-time initialization of clients.

I can't help but feel @brandonforster and I have discussed this exact issue in the past (perhaps in a slightly different context). Find and address the root cause. Don't move code around to avoid a race to be expeditious.

@tsconn23 @AnthonyMBonafide

@brandonforster (Contributor, Author)

@michaelestrin We absolutely have, and it's a fair point, but then should we change core-data (which is set up like this already) to do JIT right now? I believe we should harmonize all the services to do JIT, but I think committing code that we know will fail black-box tests is irresponsible. To that end, I think the best course of action here is to merge this PR, acknowledge there is a race (but one we will always win) and work on rebuilding the registry client to work the same way the URL client does to address the underlying race. I do not think that leaving black box tests broken for any period of time (like we did for the URL client) should be the course of action here. I think the master branch should always be passing blackbox tests, even if it requires a bandaid that we need to replace with something permanent.

@michaelestrin (Member)

You're pushing code around and don't understand (or can't adequately describe) the root cause. If you don't understand the root cause, you can't be sure you'll always win the race.

This appears to be a result of a shortcoming with recent registry client changes. Maybe find the root cause and fix it?

I stand by my prior statement.

@michaelestrin (Member)

FWIW, I'm not attacking you @brandonforster nor am I advocating for irresponsibility and broken black box tests. I'm not saying you caused the problem or are somehow responsible for the current state of the code. I'm simply saying the underlying root cause should be identified and addressed to prevent future similar issues from cropping up. There are places in the code -- specifically in the agent -- where JIT can't be avoided.

@brandonforster (Contributor, Author)

@michaelestrin I know you're not making a personal attack on me and I know you're not attacking my credentials or anything like that. You were correct about the best way to attack this problem from a root cause perspective on the URL client, and the code is now much better for it. All those points you made for the URL client apply 1:1 here and I have no doubt that undertaking that work you are pushing for would have the same dramatic improvement.

However, that work takes time, and I am just very concerned that we have not been passing black box tests on master since the bug that started the URL client work. After we added the URL client, there was a bug in the URL client code, and now there is this bug in the registry where a root cause solution looks like it could be just as time consuming. I am not advocating for skipping the root cause work; au contraire, I would love to spearhead it. I just want the system to pass tests and not have these weird hiccups, and I don't think a //TODO redo this in service of passing our API tests is a bad thing.

@AnthonyMBonafide (Contributor) commented Mar 25, 2020

I briefly looked into this and I think we should focus our efforts on finding the root cause and fixing it. To address @brandonforster's concern about the time it would take to implement a long term fix, how about we start looking into correcting the issue in this PR and add new commits which address the long term issue? That way, if we get to a point where we want the quick fix, we can simply take the original commit(s); and if we can come up with a long term fix within a reasonable time frame, we can squash all the commits.

This would require some additional organization on your part as the author. If you decide to take that path, you can squash your existing 2 commits into one and update the commit message to describe that it is a quick fix. Then for the following commits you can add to that, and we can work from there once we need to make a decision as to what to do.

I think we can get this done within a reasonable time frame. Race conditions and concurrency problems are horrible to find and fix; if we get a handle on this one now, it would be nice to remove one more complex issue from the system. I hope that all makes sense.

@AnthonyMBonafide (Contributor)

Also, if you need any assistance with tracking down the issue and/or coming up with a solution please feel free to reach out. I am confident we can get this worked out.

@brandonforster (Contributor, Author)

@AnthonyMBonafide Sounds like a good approach. If the scope starts to look like the URL client refactor, we can reassess, but you're right to not worry about complexity we don't necessarily need yet.

@codecov-io commented Mar 25, 2020

Codecov Report

Merging #2450 into master will not change coverage.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2450   +/-   ##
=======================================
  Coverage   37.73%   37.73%           
=======================================
  Files         156      156           
  Lines       12826    12826           
=======================================
  Hits         4840     4840           
  Misses       7728     7728           
  Partials      258      258           
Impacted Files Coverage Δ
internal/core/command/init.go 0.00% <ø> (ø)
internal/core/command/rest_device.go 0.00% <0.00%> (ø)
internal/core/metadata/init.go 0.00% <ø> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update acc4df9...e9df57a.

@AnthonyMBonafide (Contributor) left a comment

Since we have a path forward, when you are ready for the next review please re-request me. Thank you

@brandonforster (Contributor, Author) commented Mar 27, 2020

I spoke with @AnthonyMBonafide, and it seems like the lazy initialization is causing a huge spin up penalty. The registry client has to be initialized, the URL client has to be initialized, we have to wait for the first URL poll, and we have to fetch that URL update on a successful poll and report it back down, all while the first-time request is still in flight.

I've found that the underlying problem here isn't with the registry client, the bootstrap, the URL client, or any of our code; it's that all the work that has to be done takes too long to accomplish while that first-time request is in flight, and the http.ResponseWriter times out while it's waiting for our code to finish. Observing the current function in the debugger, everything returns successfully and with valid device data, as expected, but the endpoint function falls over on the call to json.NewEncoder(w).Encode(devices) and returns the following error on my machine:

write tcp 127.0.0.1:48082->127.0.0.1:52314: i/o timeout

This causes the "empty reply from server" error, which goes away after the first call because all the clients have already been initialized. You can get a similar effect with local configuration (which doesn't use any of the URL client retry code) by inserting a time.Sleep(time.Second*30) call into the endpoint function and observing that the Encode still times out.
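
A minimal sketch of that reproduction, assuming a plain net/http server rather than the actual EdgeX code (the 30-second sleep stands in for the first-call client initialization, the 5-second WriteTimeout for the socket timeout, and the payload is made large enough to force a network write from inside the handler):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/device", func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for the registry/URL client initialization and polling
		// that happens while the first request is in flight.
		time.Sleep(30 * time.Second)

		// Hypothetical payload, made large enough that Encode has to write
		// to the connection instead of just filling the response buffer.
		devices := make([]string, 4096)
		for i := range devices {
			devices[i] = "example-device"
		}

		if err := json.NewEncoder(w).Encode(devices); err != nil {
			// With the write deadline long expired, this logs something like:
			// write tcp 127.0.0.1:48082->127.0.0.1:52314: i/o timeout
			log.Println(err)
		}
	})

	srv := &http.Server{
		Addr:         ":48082",
		Handler:      mux,
		WriteTimeout: 5 * time.Second, // the deadline the response write must beat
	}
	log.Fatal(srv.ListenAndServe())
}
```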

I believe that the best course of action is to remove lazy initialization of all clients, and take the time hit when it's least noticeable, during startup. This way, when a REST call is actually made, the HTTP stack is going to be the limiting factor in timing and not any of our business code, as it is today.

@michaelestrin, I'd really like your thoughts here since I know we were both pushing for lazy initialization.

@brandonforster (Contributor, Author)

Note that as part of the current changeset, I have renamed restDevice.go to rest_device.go to bring it in line with the other core services' naming of similar files.

@michaelestrin (Member)

I believe that the best course of action is to remove lazy initialization of all clients, and take the time hit when it's least noticeable, during startup.

As I said before, this may work with core and support services but will not work with the agent. It's also limiting in that it prevents dynamic service discovery, which will be a future issue when high-availability support is implemented (which seems to be gathering some steam).

@akramtexas ran into a similar issue recently with a new blackbox test he wrote taking longer than the socket timeout. This was ultimately solved by a default configuration change to lengthen the timeout. You might look into that as the long-term solution. Not sure that a 5 second timeout lengthened to 6..9 seconds would be a big deal.

I also wonder whether this is a universal problem or whether it's being compounded by the work done by the particular use case you're invoking, which in turn makes a client call.

Anyway, that's my 2 cents @brandonforster.

@AnthonyMBonafide (Contributor)

@michaelestrin, I think I may be missing some important piece here. With @brandonforster's proposed solution, I am not seeing how this will cause an issue for dynamic service discovery. As I see it, he is suggesting that we still use the dependency injection container to pass around the pointer to the underlying client, which can still be updated in the background when a URL update happens within Consul. However, we would kick off the lookup for the URL during startup of the application so we are not doing it at the last possible moment, which is while responding to the first request that indirectly needs the client. To me that seems like something we would want to do anyway, since getting the client(s) ready for use is something that should be taken care of before the service has started. I hope my understanding is correct.
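
A rough sketch of that approach, with hypothetical names rather than the actual EdgeX dependency injection container API: the client is built once during startup, handlers share the same pointer, and a background goroutine keeps the endpoint current as registry updates arrive:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// deviceClient stands in for an external service client whose base URL can
// change at runtime when the registry (Consul) publishes an update.
type deviceClient struct {
	baseURL atomic.Value // holds a string
}

func newDeviceClient(initialURL string) *deviceClient {
	c := &deviceClient{}
	c.baseURL.Store(initialURL)
	return c
}

// URL returns the current endpoint; handlers read this on every request.
func (c *deviceClient) URL() string { return c.baseURL.Load().(string) }

// watchUpdates applies registry URL changes in the background so callers
// always read a current value without re-initializing the client.
func (c *deviceClient) watchUpdates(updates <-chan string) {
	for u := range updates {
		c.baseURL.Store(u)
	}
}

func main() {
	updates := make(chan string)

	// Startup: perform the registry lookup once, before serving requests,
	// and register the resulting pointer in the DI container.
	client := newDeviceClient("http://localhost:48081") // value a Consul lookup would return

	done := make(chan struct{})
	go func() {
		client.watchUpdates(updates)
		close(done)
	}()

	// Later, a registry update arrives and is applied in the background;
	// handlers holding the same *deviceClient see the new URL.
	updates <- "http://core-metadata:48081"
	close(updates)
	<-done

	fmt.Println("handlers would now call", client.URL())
}
```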

As for the system management agent, is there a reason we are not using the same initialization and update strategy as the other core services?

@brandonforster (Contributor, Author) commented Mar 30, 2020

Updated with the changes @michaelestrin, @AnthonyMBonafide, and I discussed on a call.

@AnthonyMBonafide (Contributor) left a comment

LGTM!

Based on the conversation between @brandonforster and @michaelestrin, I think Michael's suggestion to apply the fix in this fashion is the best option.

… in the comments.

Renamed restDevice to rest_device to bring it in line with the rest of the services.

Signed-off-by: Brandon Forster <me@brandonforster.com>
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A), 0 Security Hotspots to review
Code Smells: 2 (rating A)

No coverage information
Duplication: 3.2%

@brandonforster (Contributor, Author)

Sorry about all the pushes. I updated the timeout value to 45,000 ms because I realized that the URLClient polling interval is currently defined as 15,000 ms, and we should allow for a few polling updates (three 15,000 ms intervals fit within 45,000 ms) before timing out.

@AnthonyMBonafide (Contributor) left a comment

LGTM!

Development

Successfully merging this pull request may close these issues.

Empty reply from server when getting Core-Command device