Race condition when initializing external clients using bootstrap #2450

Merged: 1 commit, Apr 1, 2020

Conversation

@brandonforster (Contributor) commented Mar 24, 2020

PR Checklist

Please check if your PR fulfills the following requirements:

PR Type

What kind of change does this PR introduce?

  • Bugfix
  • Feature
  • Code style update (formatting, local variables)
  • Refactoring (no functional changes, no api changes)
  • Build related changes
  • CI related changes
  • Documentation content changes
  • Other... Please describe:

What is the current behavior?

A race occurs during bootstrap external client initialization, between the bootstrap handler and the bootstrap registry component. If the external client is initialized just in time (JIT), the registry variable no longer contains valid data when it is read. On subsequent calls, the registry variable has been updated and execution proceeds as normal.

Issue Number: Fixes #2421

What is the new behavior?

The external client is initialized before the bootstrap update code runs, and the race does not occur.
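
As a rough illustration (hypothetical names, not the actual EdgeX bootstrap code), the fix amounts to building the client eagerly, before the bootstrap update goroutine starts, rather than just in time inside the first request handler:

```go
package main

import (
	"fmt"
	"sync"
)

// registry stands in for the bootstrap registry component whose endpoint
// value the bootstrap handler updates in the background.
type registry struct {
	mu       sync.Mutex
	endpoint string
}

func (r *registry) Set(e string) { r.mu.Lock(); r.endpoint = e; r.mu.Unlock() }

func (r *registry) Get() string {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.endpoint
}

// metadataClient stands in for the external client used by the endpoint handlers.
type metadataClient struct{ url string }

func main() {
	reg := &registry{endpoint: "http://localhost:48081"} // seed value

	// The change described above: build the client before the bootstrap
	// update code runs, instead of just in time inside the first request.
	client := &metadataClient{url: reg.Get()}

	// The bootstrap update goroutine no longer races the first-request
	// initialization path, because that path no longer exists.
	go reg.Set("http://core-metadata:48081")

	fmt.Println("client initialized with", client.url)
}
```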

Does this PR introduce a breaking change?

  • Yes
  • No

Are there any specific instructions or things that should be known prior to reviewing?

To verify:

Reproduce the bug on master:
Run (in order) consul, redis, config-seed, core-data, core-metadata, core-command. Ensure the core services are configured for consul usage and not local configuration. Run the following curl command: curl http://localhost:48082/api/v1/device

Check out this PR's code, follow the steps above, and ensure that device data is returned.

Other information

The system management agent contains code that is very similar to the problematic code this PR identifies in command and metadata. I do not have enough familiarity with the SMA to say whether it will be an issue there.

@michaelestrin (Member) left a comment

Strongly recommend not to approve this PR. It addresses symptoms, not the root cause. Pushing code around to avoid a race condition indicates a lack of understanding of the actual root cause and does nothing to prevent the issue from cropping up elsewhere in the code.

You will likely have similar issues in the system management agent. Unfortunately, this "solution" isn't going to work there as the agent must do just-in-time initialization of clients.

I can't help but feel @brandonforster and I have discussed this exact issue in the past (perhaps in a slightly different context). Find and address the root cause. Don't move code around to avoid a race to be expeditious.

@tsconn23 @AnthonyMBonafide

@brandonforster (Contributor, Author)

@michaelestrin We absolutely have, and it's a fair point, but then should we change core-data (which is set up like this already) to do JIT right now? I believe we should harmonize all the services to do JIT, but I think committing code that we know will fail black-box tests is irresponsible. To that end, I think the best course of action here is to merge this PR, acknowledge there is a race (but one we will always win) and work on rebuilding the registry client to work the same way the URL client does to address the underlying race. I do not think that leaving black box tests broken for any period of time (like we did for the URL client) should be the course of action here. I think the master branch should always be passing blackbox tests, even if it requires a bandaid that we need to replace with something permanent.

@michaelestrin (Member)

You're pushing code around and don't understand (or can't adequately describe) the root cause. If you don't understand the root cause, you can't be sure you'll always win the race.

This appears to be a result of a shortcoming with recent registry client changes. Maybe find the root cause and fix it?

I stand by my prior statement.

@michaelestrin (Member)

FWIW, I'm not attacking you @brandonforster nor am I advocating for irresponsibility and broken black box tests. I'm not saying you caused the problem or are somehow responsible for the current state of the code. I'm simply saying the underlying root cause should be identified and addressed to prevent future similar issues from cropping up. There are places in the code -- specifically in the agent -- where JIT can't be avoided.

@brandonforster (Contributor, Author)

@michaelestrin I know you're not making a personal attack on me and I know you're not attacking my credentials or anything like that. You were correct about the best way to attack this problem from a root cause perspective on the URL client, and the code is now much better for it. All those points you made for the URL client apply 1:1 here and I have no doubt that undertaking that work you are pushing for would have the same dramatic improvement.

However, that work takes time, and I am just very concerned that we have not been passing black box tests on master since the bug that started the URL client work. After we added the URL client, there was a bug in the URL client code, and now there is this bug in the registry where a root cause solution looks like it could be just as time consuming. I am not advocating for skipping the root cause work; au contraire, I would love to spearhead it. I just want the system to pass tests and not have these weird hiccups, and I don't think a //TODO redo this in service of passing our API tests is a bad thing.

@AnthonyMBonafide (Contributor) commented Mar 25, 2020

I briefly looked into this and I think we should focus our efforts on finding the root cause and fixing it. To address @brandonforster's concern about the time it would take to implement a long term fix, how about we start looking into correcting the issue in this PR and add new commits which address the long term issue? That way, if we get to a point where we want the quick fix, we can simply take the original commit(s); and if we can come up with a long term fix within a reasonable time frame, we can squash all the commits.

This would require some additional organization on your part as the author. If you decide to take that path, you can squash your existing 2 commits into one and update the commit message to describe that it is a quick fix. Then for the following commits you can add to that, and we can work from there once we need to make a decision as to what to do.

I think we can get this done within a reasonable time frame. Race conditions and concurrency problems are horrible to find and fix; if we get a handle on this one now, it would be nice to remove one more complex issue from the system. I hope that all makes sense.

@AnthonyMBonafide (Contributor)

Also, if you need any assistance with tracking down the issue and/or coming up with a solution please feel free to reach out. I am confident we can get this worked out.

@brandonforster (Contributor, Author)

@AnthonyMBonafide Sounds like a good approach. If the scope starts to look like the URL client refactor, we can reassess, but you're right to not worry about complexity we don't necessarily need yet.

@codecov-io commented Mar 25, 2020

Codecov Report

Merging #2450 into master will not change coverage.
The diff coverage is 0.00%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #2450   +/-   ##
=======================================
  Coverage   37.73%   37.73%           
=======================================
  Files         156      156           
  Lines       12826    12826           
=======================================
  Hits         4840     4840           
  Misses       7728     7728           
  Partials      258      258           
Impacted Files Coverage Δ
internal/core/command/init.go 0.00% <ø> (ø)
internal/core/command/rest_device.go 0.00% <0.00%> (ø)
internal/core/metadata/init.go 0.00% <ø> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update acc4df9...e9df57a.

@AnthonyMBonafide (Contributor) left a comment

Since we have a path forward, when you are ready for the next review please re-request me. Thank you

@brandonforster (Contributor, Author) commented Mar 27, 2020

I spoke with @AnthonyMBonafide, and it seems like the lazy initialization is causing a huge spin up penalty. The registry client has to be initialized, the URL client has to be initialized, we have to wait for the first URL poll, and we have to fetch that URL update on a successful poll and report it back down, all while the first-time request is still in flight.

I've found that the underlying problem here isn't with the registry client, the bootstrap, the URL client, or any of our code; it's that all the work that has to be done takes too long to accomplish while that first-time request is in flight, and the http.ResponseWriter times out while it's waiting for our code to finish. Observing the current function in the debugger, everything returns successfully and with valid device data, as expected, but the endpoint function falls over on the call to json.NewEncoder(w).Encode(devices) and returns the following error on my machine:

write tcp 127.0.0.1:48082->127.0.0.1:52314: i/o timeout

This causes the "empty reply from server" error, which goes away after the first call because all the clients have already been initialized. You can get a similar effect with local configuration (which doesn't use any of the URL client retry code) by inserting a time.Sleep(time.Second*30) call into the endpoint function and observing that the Encode still times out.
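
A minimal sketch of that reproduction, assuming a plain net/http server rather than the actual EdgeX code (the 30-second sleep stands in for the first-call client initialization, the 5-second WriteTimeout for the socket timeout, and the payload is made large enough to force a network write from inside the handler):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/device", func(w http.ResponseWriter, r *http.Request) {
		// Stand-in for the registry/URL client initialization and polling
		// that happens while the first request is in flight.
		time.Sleep(30 * time.Second)

		// Hypothetical payload, made large enough that Encode has to write
		// to the connection instead of just filling the response buffer.
		devices := make([]string, 4096)
		for i := range devices {
			devices[i] = "example-device"
		}

		if err := json.NewEncoder(w).Encode(devices); err != nil {
			// With the write deadline long expired, this logs something like:
			// write tcp 127.0.0.1:48082->127.0.0.1:52314: i/o timeout
			log.Println(err)
		}
	})

	srv := &http.Server{
		Addr:         ":48082",
		Handler:      mux,
		WriteTimeout: 5 * time.Second, // the deadline the response write must beat
	}
	log.Fatal(srv.ListenAndServe())
}
```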

I believe that the best course of action is to remove lazy initialization of all clients, and take the time hit when it's least noticeable, during startup. This way, when a REST call is actually made, the HTTP stack is going to be the limiting factor in timing and not any of our business code, as it is today.

@michaelestrin, I'd really like your thoughts here since I know we were both pushing for lazy initialization.

@brandonforster (Contributor, Author)

Note that as part of the current changeset, I have renamed restDevice.go to rest_device.go to bring it in line with the other core services' naming of similar files.

@michaelestrin (Member)

I believe that the best course of action is to remove lazy initialization of all clients, and take the time hit when it's least noticeable, during startup.

As I said before, this may work with core and support services but will not work with the agent. It's also limiting in that it prevents dynamic service discovery, which will be a future issue when high-availability support is implemented (which seems to be gathering some steam).

@akramtexas ran into a similar issue recently with a new blackbox test he wrote taking longer than the socket timeout. This was ultimately solved by a default configuration change to lengthen the timeout. You might look into that as the long-term solution. Not sure that a 5 second timeout lengthened to 6..9 seconds would be a big deal.

I also wonder whether this is a universal problem or whether it's being compounded by the work done by the particular use case you're invoking, which in turn makes a client call.

Anyway, that's my 2 cents @brandonforster.

@AnthonyMBonafide (Contributor)

@michaelestrin, I think I may be missing some important piece here. With @brandonforster's proposed solution, I am not seeing how this will cause an issue for dynamic service discovery. As I see it, he is suggesting that we still use the dependency injection container to pass around the pointer to the underlying client, which can still be updated in the background when a URL update happens within Consul. However, we would kick off the lookup for the URL during startup of the application so we are not doing it at the last possible moment, which is while responding to the first request that indirectly needs the client. To me that seems like something we would want to do anyway, since getting the client(s) ready for use is something that should be taken care of before the service has started. I hope my understanding is correct.
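
A rough sketch of that approach, with hypothetical names rather than the actual EdgeX dependency injection container API: the client is built once during startup, handlers share the same pointer, and a background goroutine keeps the endpoint current as registry updates arrive:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// deviceClient stands in for an external service client whose base URL can
// change at runtime when the registry (Consul) publishes an update.
type deviceClient struct {
	baseURL atomic.Value // holds a string
}

func newDeviceClient(initialURL string) *deviceClient {
	c := &deviceClient{}
	c.baseURL.Store(initialURL)
	return c
}

// URL returns the current endpoint; handlers read this on every request.
func (c *deviceClient) URL() string { return c.baseURL.Load().(string) }

// watchUpdates applies registry URL changes in the background so callers
// always read a current value without re-initializing the client.
func (c *deviceClient) watchUpdates(updates <-chan string) {
	for u := range updates {
		c.baseURL.Store(u)
	}
}

func main() {
	updates := make(chan string)

	// Startup: perform the registry lookup once, before serving requests,
	// and register the resulting pointer in the DI container.
	client := newDeviceClient("http://localhost:48081") // value a Consul lookup would return

	done := make(chan struct{})
	go func() {
		client.watchUpdates(updates)
		close(done)
	}()

	// Later, a registry update arrives and is applied in the background;
	// handlers holding the same *deviceClient see the new URL.
	updates <- "http://core-metadata:48081"
	close(updates)
	<-done

	fmt.Println("handlers would now call", client.URL())
}
```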

As for the system management agent, is there a reason we are not using the same initialization and update strategy as the other core services?

@brandonforster (Contributor, Author) commented Mar 30, 2020

Updated with the changes @michaelestrin, @AnthonyMBonafide, and I discussed on a call.

@AnthonyMBonafide (Contributor) left a comment

LGTM!

Based on the conversation between @brandonforster and @michaelestrin, I think Michael's suggestion to apply the fix in this fashion is the best option.

… in the comments.

Renamed restDevice to rest_device to bring it in line with the rest of the services.

Signed-off-by: Brandon Forster <me@brandonforster.com>
@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A), 0 Security Hotspots to review
Code Smells: 2 (rating A)

No coverage information
Duplication: 3.2%

@brandonforster (Contributor, Author)

Sorry about all the pushes. I updated the timeout value to 45,000 ms because I realized that the URLClient polling interval is currently defined as 15,000 ms, and we should allow for a few polling updates (three 15,000 ms intervals fit within 45,000 ms) before timing out.

@AnthonyMBonafide (Contributor) left a comment

LGTM!

Development

Successfully merging this pull request may close these issues.

Empty reply from server when getting Core-Command device