fix: Add retry on locks #4997
I don't see that this change is adding much value or helping on any of the issues you have linked. The only scenario where it will help is in the example you have given, i.e. running two …
There are a lot of people using terragrunt, atmos and other tools that can run many projects/plans at once, so I see how this helps those users.
I use Terragrunt extensively with Atlantis running parallel plans/applies in a PR, and don't have any locking issues that this change would make any difference to.
I truly believe we still have much to do to make sure that the locking issue is dealt with. I'm not so sure that this code is only applicable to simultaneous commands, though; I've only used that scenario to showcase the issue since it is the same mechanism (… But I would argue that the UX will be even better for most cases. The current behavior is:
This PR will change this to:
As I stated here, most users are suffering from seeing a message that just says "try again later". I'm automating this step so users only see this message if we are more or less sure that there is a real issue with the plan taking too long (the threshold can be configured by the timeout setting).
I used to work at a company that had a pretty big Atlantis install. Unfortunately I don't anymore, so I can't really test this at large scale; I invite anyone who can test this PR to give it a try. Two areas of improvement I can already see:
I've seen this once or twice (for example if you're impatient and run …
I believe a lot of other things would break before this. AFAIK tickers are pretty lightweight, so I don't think they would consume a lot of resources. If the system is too slow, the tickers might fail and we might skip a few checks, but that's not a big issue for our logic. To make this clearer and more resilient, we could add another return value in this line representing the status of the lock acquire process (… This would mean that each caller would need to implement either a queue system or a retry mechanism like the one I've added in this PR. We could also inject an interface with functions to send the messages. I'm fine with either, and I'm also open to other suggestions.
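To make the "status as an extra return value" idea concrete, here is a minimal sketch. Everything in it is hypothetical: `LockStatus`, `pathLocker` and `tryLock` are illustrative names, not the existing Atlantis API, and the in-memory map only stands in for the real locker.

```go
package main

import (
	"fmt"
	"sync"
)

// LockStatus is a hypothetical result code for a lock attempt.
type LockStatus int

const (
	LockAcquired LockStatus = iota // the caller now owns the lock
	LockHeld                       // another command currently holds the lock
)

func (s LockStatus) String() string {
	switch s {
	case LockAcquired:
		return "acquired"
	case LockHeld:
		return "held"
	default:
		return "unknown"
	}
}

// pathLocker is a toy in-memory locker keyed by workspace path.
type pathLocker struct {
	mu   sync.Mutex
	held map[string]string // path -> description of the current holder
}

func newPathLocker() *pathLocker {
	return &pathLocker{held: make(map[string]string)}
}

// tryLock never blocks. It returns an unlock function, the status of the
// attempt, and (when the lock is held) who currently holds it, so the
// caller can decide whether to retry, queue, or report to the user.
func (l *pathLocker) tryLock(path, holder string) (unlock func(), status LockStatus, heldBy string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if current, ok := l.held[path]; ok {
		return func() {}, LockHeld, current
	}
	l.held[path] = holder
	return func() {
		l.mu.Lock()
		defer l.mu.Unlock()
		delete(l.held, path)
	}, LockAcquired, ""
}

func main() {
	l := newPathLocker()
	unlock, status, _ := l.tryLock("repo/project/default", "John")
	fmt.Println("John:", status)

	_, status, by := l.tryLock("repo/project/default", "Mary")
	fmt.Println("Mary:", status, "by", by)

	unlock()
	_, status, _ = l.tryLock("repo/project/default", "Mary")
	fmt.Println("Mary after unlock:", status)
}
```

With a status like this, a caller that wants the current fail-fast behavior can report `heldBy` immediately, while a caller that prefers waiting can wrap the same call in a retry loop.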
I don't know; I'm still worried about the complexity. 15m is a long time to wait for something to happen after I type "atlantis apply". For example, in your scenario, when Mary goes to run, if she immediately gets an error saying that another PR holds the lock, she could look and see it's John, then reach out to John, who might have completely lost track of the PR and say "yeah go ahead and close that", at which point Mary can apply her PR. This happens a lot in my workflows, so adding an additional 15m delay before Mary even knows what the problem is would actually slow this process down for me. My preference would be for this to be disabled by default (timeout = 0) and configurable. As for having the TryLock return a more specific message, I think that could be useful; it would especially help with the "it's spinning and I don't know why" problem from above. As you said, though, that affects the callers as well, so it might be worth doing as a separate follow-up.
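As a rough illustration of "disabled by default (timeout = 0) and configurable", the sketch below parses two flags with the standard `flag` package. The flag names `lock-retry-timeout` and `lock-retry-interval`, the 5s default interval, and the `lockRetryConfig` type are all assumptions made up for this example; they are not existing Atlantis options.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// lockRetryConfig holds hypothetical retry settings.
type lockRetryConfig struct {
	Timeout  time.Duration // 0 means: do not retry, fail immediately
	Interval time.Duration // pause between attempts when retrying
}

// enabled reports whether retrying is switched on at all.
func (c lockRetryConfig) enabled() bool { return c.Timeout > 0 }

// maxAttempts returns the total number of lock attempts a command makes:
// the initial one plus one per interval that fits into the timeout.
func (c lockRetryConfig) maxAttempts() int {
	if !c.enabled() || c.Interval <= 0 {
		return 1
	}
	return 1 + int(c.Timeout/c.Interval)
}

// parseConfig reads the hypothetical flags; retry is off unless requested.
func parseConfig(args []string) (lockRetryConfig, error) {
	fs := flag.NewFlagSet("lock-retry-sketch", flag.ContinueOnError)
	var c lockRetryConfig
	fs.DurationVar(&c.Timeout, "lock-retry-timeout", 0, "how long to keep retrying a locked workspace (0 disables retry)")
	fs.DurationVar(&c.Interval, "lock-retry-interval", 5*time.Second, "pause between lock attempts")
	err := fs.Parse(args)
	return c, err
}

func main() {
	def, _ := parseConfig(nil)
	fmt.Println("default enabled:", def.enabled(), "attempts:", def.maxAttempts())

	opt, _ := parseConfig([]string{"-lock-retry-timeout=15m"})
	fmt.Println("opt-in enabled:", opt.enabled(), "attempts:", opt.maxAttempts())
}
```

With the zero default, Mary still gets the immediate "locked by another PR" error unless an operator explicitly opts in to waiting.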
Nice feature. I would suggest making it optional via repo config, analogous to …
what
I'm opening this as a draft to receive feedback early. I don't expect this to break anything, but I believe it could be hidden behind a flag and given better default values for timeout and retries (maybe exponential retry?).
This adds retry logic to the lock mechanism to mitigate the issue described here and also in this ADR.
The locking issue itself is more complex and requires much more work. This is just a small step so users don't have to see the error anymore: the code effectively waits instead of asking the user to retry.
why
Currently the user has to rerun any operation that fails because a certain workspace path is locked. This just tries to automate that process.
tests
Will add if this approach receives support.
references
Did my best to try to understand which issues this would affect.
Relates to #3345
Relates to #2921
Relates to #2882
Relates to #4489
Relates to #305
Relates to #4829
Relates to #1847
Relates to #4566
Closes #1618
Closes #2200
Closes #3785
Closes #4489
Closes #4368