Race condition when initializing external clients using bootstrap #2450
Conversation
I strongly recommend not approving this PR. It addresses symptoms, not the root cause. Pushing code around to avoid a race condition indicates a lack of understanding of the actual root cause and does nothing to prevent the issue from cropping up elsewhere in the code.
You will likely have similar issues in the system management agent. Unfortunately, this "solution" isn't going to work there as the agent must do just-in-time initialization of clients.
I can't help but feel @brandonforster and I have discussed this exact issue in the past (perhaps in a slightly different context). Find and address the root cause. Don't move code around to avoid a race to be expeditious.
@michaelestrin We absolutely have, and it's a fair point, but then should we change core-data (which is set up like this already) to do JIT right now? I believe we should harmonize all the services to do JIT, but I think committing code that we know will fail black-box tests is irresponsible. To that end, I think the best course of action here is to merge this PR, acknowledge there is a race (but one we will always win), and work on rebuilding the registry client to work the same way the URL client does, which addresses the underlying race. I do not think that leaving black-box tests broken for any period of time (like we did for the URL client) should be the course of action here. I think the master branch should always be passing black-box tests, even if it requires a band-aid that we need to replace with something permanent.
You're pushing code around and don't understand (or can't adequately describe) the root cause. If you don't understand the root cause, you can't be sure you'll always win the race. This appears to be a result of a shortcoming with recent registry client changes. Maybe find the root cause and fix it? I stand by my prior statement.
FWIW, I'm not attacking you @brandonforster nor am I advocating for irresponsibility and broken black-box tests. I'm not saying you caused the problem or are somehow responsible for the current state of the code. I'm simply saying the underlying root cause should be identified and addressed to prevent similar issues from cropping up in the future. There are places in the code -- specifically in the agent -- where JIT can't be avoided.
@michaelestrin I know you're not making a personal attack on me, and I know you're not attacking my credentials or anything like that. You were correct about the best way to attack this problem from a root-cause perspective on the URL client, and the code is now much better for it. All the points you made for the URL client apply 1:1 here, and I have no doubt that undertaking the work you are pushing for would have the same dramatic improvement. However, that work takes time. I am just very concerned that we have not been passing black-box tests on master since the bug that started the URL client work. After we added the URL client, there was a bug in the URL client code, and now there is this bug in the registry where a root-cause solution looks like it could be just as time consuming. I am not advocating against doing the root-cause work; au contraire, I would love to spearhead it. I just want the system to pass tests and not have these weird hiccups, and I don't think a
I briefly looked into this, and I think we should focus our efforts on finding the root cause and fixing it. To address @brandonforster's concern regarding the time it would take to implement a long-term fix, how about we start by correcting the issue in this PR and then add new commits which address the long-term issue? That way, if we reach a point where we want the quick fix, we can simply take the original commit(s); and if we can come up with a long-term fix within a reasonable time frame, we can squash all the commits.

This would require some additional organization on your part as the author. If you decide to take that path, you can squash your existing two commits into one and update the commit message to describe that it is a quick fix. Then for the following commits you can add to that, and we can work from there once we need to make a decision as to what to do.

I think we can get this done within a reasonable time frame. Race conditions and concurrency problems are horrible to find and fix; if we get a handle on this one now, it would be nice to remove one more complex issue from the system. I hope that all makes sense.
Also, if you need any assistance with tracking down the issue and/or coming up with a solution, please feel free to reach out. I am confident we can get this worked out.
@AnthonyMBonafide Sounds like a good approach. If the scope starts to look like the URL client refactor, we can reassess, but you're right to not worry about complexity we don't necessarily need yet. |
Codecov Report

```
@@           Coverage Diff           @@
##           master    #2450   +/-   ##
=======================================
  Coverage   37.73%   37.73%
=======================================
  Files         156      156
  Lines       12826    12826
=======================================
  Hits         4840     4840
  Misses       7728     7728
  Partials      258      258
=======================================
```

Continue to review the full report at Codecov.
Since we have a path forward, please re-request my review when you are ready for the next one. Thank you.
I spoke with @AnthonyMBonafide, and it seems like the lazy initialization is causing a huge spin-up penalty. The registry client has to be initialized, the URL client has to be initialized, we have to wait for the first URL poll, and we have to fetch that URL update on a successful poll and report it back down, all while the first-time request is still in flight. I've found that the underlying problem here isn't with the registry client, the bootstrap, the URL client, or any of our code; it's that all the work that has to be done takes too long to accomplish while that first-time request is in flight, and the HTTPResponseWriter times out while it's waiting for our code to finish. Observing the current function in the debugger, everything returns successfully and with valid device data, as expected, but the endpoint function falls over on the call to the already timed-out HTTPResponseWriter, and that is what causes the failure.

I believe that the best course of action is to remove lazy initialization of all clients and take the time hit when it's least noticeable: during startup. This way, when a REST call is actually made, the HTTP stack will be the limiting factor in timing, not any of our business code as it is today. @michaelestrin, I'd really like your thoughts here since I know we were both pushing for lazy initialization.
Note that as part of the current changeset, I have renamed restDevice.go to rest_device.go to bring it in line with the other core services' naming of similar files.
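For illustration, here is a minimal Go sketch of the trade-off described in the comment above; `resolveURL`, `lazyClient`, and the two-second delay are hypothetical stand-ins, not the actual edgex-go code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// resolveURL stands in for the full spin-up cost described above:
// registry client init, URL client init, and waiting out the first URL poll.
func resolveURL() string {
	time.Sleep(2 * time.Second) // hypothetical cost
	return "http://localhost:48081/api/v1/device"
}

type client struct{ url string }

var (
	once   sync.Once
	shared *client
)

// lazyClient is the JIT pattern: the first request pays the whole
// resolution cost while its response writer is already ticking toward
// the HTTP timeout.
func lazyClient() *client {
	once.Do(func() { shared = &client{url: resolveURL()} })
	return shared
}

func main() {
	// Eager alternative: pay the cost once during bootstrap, before the
	// service starts accepting requests, so no handler ever waits on it.
	eager := &client{url: resolveURL()}
	fmt.Println("ready; client targets", eager.url)
	_ = lazyClient
}
```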
As I said before, this may work with core and support services but will not work with the agent. It's also limiting in that it prevents dynamic service discovery, which will become an issue when high-availability support is implemented (an effort that seems to be gathering some steam). @akramtexas ran into a similar issue recently with a new black-box test he wrote that took longer than the socket timeout. That was ultimately solved by a default configuration change to lengthen the timeout. You might look into that as the long-term solution; I'm not sure that a 5 second timeout lengthened to 6 to 9 seconds would be a big deal. I also wonder whether this is a universal problem or is being aggravated by the work done by the particular use case you're invoking that makes the client call. Anyway, that's my 2 cents @brandonforster.
@michaelestrin, I think I may be missing some important piece here. With @brandonforster's proposed solution, I am not seeing how this will cause an issue for dynamic service discovery. As I see it, he is suggesting that we still use the dependency injection container to pass around the pointer to the underlying client, which can still be updated in the background when a URL update happens within Consul. However, we would kick off the lookup for the URL during startup of the application, so we are not doing it at the last possible moment, which is while responding to the first request that indirectly needs the client. To me that seems like something we would want to do anyway, since getting the client(s) ready for use is something that should be taken care of before the service has started. I hope my understanding is correct. As for the system management agent, is there a reason we are not using the same initialization and update strategy as the other core services?
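A rough sketch of that strategy in Go (hypothetical names, not the actual edgex-go types): the container hands out one long-lived client, a background goroutine swaps in the URL when the registry publishes an update, and handlers only ever read it.

```go
package main

import (
	"fmt"
	"sync"
)

// urlClient holds the endpoint URL for an external service. The URL is
// resolved once at startup; later registry updates are applied in the
// background, so request handlers never block on a lookup.
type urlClient struct {
	mu  sync.RWMutex
	url string
}

func (c *urlClient) URL() string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.url
}

// newURLClient performs the initial lookup during bootstrap and then
// watches for updates; the updates channel stands in for a Consul watch.
func newURLClient(initial string, updates <-chan string) *urlClient {
	c := &urlClient{url: initial}
	go func() {
		for u := range updates {
			c.mu.Lock()
			c.url = u
			c.mu.Unlock()
		}
	}()
	return c
}

func main() {
	updates := make(chan string, 1)
	c := newURLClient("http://localhost:48081", updates)
	fmt.Println("handler reads", c.URL()) // never blocks on a registry call
}
```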
Updated with changes @michaelestrin and @AnthonyMBonafide and I discussed on a call. |
LGTM!
Based on the conversation between @brandonforster and @michaelestrin, I think Michael's suggestion to apply the fix in this fashion is the best option.
… in the comments. Renamed restDevice to rest_device to bring it in line with the rest of the services. Signed-off-by: Brandon Forster <me@brandonforster.com>
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
Sorry about all the pushes. I updated the timeout value to 45,000 ms because I realized that the URLClient polling interval is currently defined as 15,000 ms, and we should allow for a few polling updates before timing out.
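Expressed as Go constants (the names are mine; only the two values come from the comment above), the relationship being enforced is:

```go
package main

import (
	"fmt"
	"time"
)

const (
	pollInterval   = 15000 * time.Millisecond // current URLClient polling interval
	requestTimeout = 45000 * time.Millisecond // allows a few polls before giving up
)

func main() {
	// The timeout spans three polling cycles, so the first request can
	// survive a missed update or two before timing out.
	fmt.Println("polls before timeout:", int64(requestTimeout/pollInterval))
}
```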
LGTM!
PR Checklist
Please check if your PR fulfills the following requirements:
PR Type
What kind of change does this PR introduce?
What is the current behavior?
A race occurs during bootstrap external-client initialization, between the bootstrap handler and the bootstrap registry component. If the external client is initialized JIT, the registry variable does not yet contain valid data. On subsequent calls the registry variable has been updated, and execution proceeds as normal.
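A stripped-down illustration of that race in Go (hypothetical names; the real interaction is between the bootstrap handler and the registry component). The read is deliberately unsynchronized to show the hazard:

```go
package main

import "fmt"

// registryURL is populated asynchronously by a bootstrap goroutine;
// nothing orders that write before the first request is handled.
var registryURL *string

func bootstrap(done chan struct{}) {
	go func() {
		u := "http://localhost:48081"
		registryURL = &u
		close(done)
	}()
}

// handleFirstRequest initializes the external client just in time; if it
// runs before the bootstrap goroutine's write, the registry data is missing.
func handleFirstRequest() {
	if registryURL == nil {
		fmt.Println("race lost: registry not yet populated")
		return
	}
	fmt.Println("client initialized against", *registryURL)
}

func main() {
	done := make(chan struct{})
	bootstrap(done)
	handleFirstRequest() // may run before or after the write above
	<-done
}
```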
Issue Number: Fixes #2421
What is the new behavior?
The external client is initialized before the bootstrap update code and the race does not occur.
Does this PR introduce a breaking change?
Are there any specific instructions or things that should be known prior to reviewing?
To verify:
Repro the bug on master:
Run (in order) consul, redis, config-seed, core-data, core-metadata, and core-command. Ensure the core services are configured for Consul usage, not local configuration. Then make the following curl request:
curl http://localhost:48082/api/v1/device
Check out this PR code, follow the steps above, and ensure that device data is returned.
Other information
The system management agent contains code that is very similar to the problematic code this PR identifies in command and metadata. I am not familiar enough with the SMA to say whether it will be an issue there.