[WIP] Yet More Graceful #8874
If we use a persistable queue, these two should be easy to do; I think that each individual operation is not lengthy. If we shut down the queue, we could afford to wait for the indexer to finish the task at hand. In any case, unless required for other reasons, persistable queues should be persisted only on graceful restart (for performance reasons), and we must be careful when restarting a different version of Gitea in case any formats have changed.
@guillep2k yeah, in general my feeling is that we should just let things finish the tasks they're currently doing rather than abort. Repo indexing, however, is one thing that could take some time, so that might need an abort. We do need to shut down their indices gracefully though: it is possible to corrupt the index by aborting mid-write, and then Gitea will not restart. The queues for indexing are configurable afaics, so if they are using persistent queues then we're home and dry, although migration problems are still an issue. If not, dealing with the channel queues is a bit more difficult:
In terms of repo/issue indexing I need to look again at the populate index functions because it may be that we just run through all of these on restart anyway - in which case we're done: stop doing work, close the index and set the channel read to a no-op. (Although having to run through every repo at startup may be a terrible idea...) The main difficulties will come with dealing with the other places where we're using channels as worker queues. These will need to be formalised and decisions made about whether the queue needs persistence or whether the loss of data is tolerable. As an aside, it's worth noting that the single TestPullRequests goroutine worker appears to run every pull request at startup (on every Gitea server) - this is fine if you're running a small server, but if you have just restarted a clustered server with say >1000 PRs (most of which haven't changed), that server isn't going to update any PRs until all that (likely unnecessary) work is done. I think this one likely needs to have an option to be a persistent queue or at least some way of shortcutting... It also probably needs a (configurable?) number of workers. The other benefit we could see by allowing persistent queues is the ability to share this load out.
I think you've just found the cause for #7947 !!! 😱
modules/queue/queue_channel.go
```go
		}
	case <-time.After(time.Millisecond * 100):
		i++
		if i >= 3 && len(datas) > 0 {
```
3 -> batchLength ?
It's from your original code 😛
My interpretation is:
- 300ms after the queue starts, send any unsent data even if the batch is not yet full, and then again every 100ms after that.
I've restructured to be a bit clearer - the timeout variable changes once it's had its first wait.
I've split out the windows component from this so that that can be merged without this being |
Codecov Report
@@ Coverage Diff @@
## master #8874 +/- ##
==========================================
+ Coverage 41.26% 41.46% +0.19%
==========================================
Files 561 565 +4
Lines 73439 76063 +2624
==========================================
+ Hits 30307 31538 +1231
- Misses 39338 40596 +1258
- Partials 3794 3929 +135
d9922ec to 711fd7c
So, validating this is either a leap of faith or we could create some environment to specifically check some of the concepts addressed here. I don't mean anything to be added to our standard CI script, but a bunch of scripts somewhere (maybe under
Then a push operation would be forced to wait one minute, and we'd have a window of opportunity to manually test the effects of a graceful shutdown. We could also have a similar bash script (replacing the The delayed git hook will test both macaron and cmd shutdowns. The indexer is more difficult to test, as it will need a special token filter to introduce the delay, but you get the idea.
Tbh the best test I've found is just killall -1 gitea and watching the thing come back up. If I get the code wrong it won't come back or it will take too long and will get hammered! That and raw analysis of the code. That's clearly not very good though. So I've been doing some other work:
In terms of the RunWithShutdownFns API I probably need to adjust these callbacks. They currently are simple:

```go
terminateWG.Add(1)
go func() {
	// Block until terminate is signalled, then run the callback.
	<-IsTerminate()
	runThis()
	terminateWG.Done()
}()
```

This means there's no way to cancel the I think these may have to be changed to pass in a done channel that could be closed at the end of run, so instead the atTerminate looks like:

```go
terminateWG.Add(1)
go func() {
	select {
	case <-IsTerminate():
		// Terminate was signalled: run the callback.
		runThis()
	case <-done:
		// run finished first: the callback is no longer needed.
	}
	terminateWG.Done()
}()
```

In terms of the RunWithShutdownChan - We can't just pass in the terminate channel because terminating functions need to be added to a wait group to give them a chance to terminate cleanly before the main goroutine closes. It may be that with a bit more thought we could use appropriately created
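The done-channel variant can be tried out in isolation. The sketch below is a self-contained toy, not the graceful manager from this PR: the `manager` type, `newManager` and `runAtTerminate` are hypothetical names, and the terminate signal is modelled as a plain channel that gets closed.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for the graceful manager.
type manager struct {
	terminate   chan struct{} // closed when the process is terminating
	terminateWG sync.WaitGroup
}

func newManager() *manager {
	return &manager{terminate: make(chan struct{})}
}

// runAtTerminate registers fn to run when terminate is closed. If done is
// closed first, the registration is dropped without calling fn.
func (m *manager) runAtTerminate(done <-chan struct{}, fn func()) {
	m.terminateWG.Add(1)
	go func() {
		defer m.terminateWG.Done()
		select {
		case <-m.terminate:
			fn()
		case <-done:
		}
	}()
}

func main() {
	// Case 1: the run finishes before terminate, so the callback is dropped.
	m1 := newManager()
	done := make(chan struct{})
	called1 := false
	m1.runAtTerminate(done, func() { called1 = true })
	close(done)
	m1.terminateWG.Wait()
	fmt.Println("callback ran after done:", called1) // prints "callback ran after done: false"

	// Case 2: terminate arrives while the run is still active.
	m2 := newManager()
	called2 := false
	m2.runAtTerminate(make(chan struct{}), func() { called2 = true })
	close(m2.terminate)
	m2.terminateWG.Wait()
	fmt.Println("callback ran after terminate:", called2) // prints "callback ran after terminate: true"
}
```

Note that if `done` and the terminate channel are both already closed when the goroutine reaches the `select`, Go picks either case at random, so callers that need a strict ordering would have to add their own check.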
@guillep2k btw although in general I don't tend to force push to Gitea PRs, in this case in an attempt to make reviewing this easier in future I'm trying to create understandable individual patches/commits. Therefore I'm rewriting history as I refactor and restructure. (As that messes up the PR history it will likely lead to me having to close and reopen this PR once it moves out of WIP.) If you're closely watching this though and would prefer me to stop force pushing please say. |
hmm... I wonder if I need to be using something like this... |
Thanks. It's #8964 the one I'm reviewing "in small bites" because of its complexity (I know this one is even more complex); my comment above was something I thought about while reviewing that. As long as you mark this with WIP, I don't mind you rewriting. But I do use the GitHub tool to explore deltas in order to check what has been changed and try to understand the reasoning behind the changes. Especially for large files, where you may change just a couple of lines. |
Ok I've made the move to switch to a more context-based approach, as it looks like more natural Go and is simpler in general. I likely need to reconsider how the queues work because they could actually be simplified to take a context. They will still need a terminate hook, so they can't just run within a simple context; shutdown, however, could easily be managed this way. I've also made git commands run in the HammerTime context by default, but added a function to allow a git command to run in a different context.
AHA! I can avoid a whole lot of hackery and make HammerTime more powerful by setting the We can then pass in the |
agh! leveldb doesn't allow multiple processes to access the db - that means that both queue_disk and queue_disk_channel need to have a bit more thought. Done |
Sigh those last few commits have broken the tests... I'll have to fix them and try again... |
5c12c33 to ec01c00
Remove the old now unused specific queues
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs during the next 2 months. Thank you for your contributions. |
Almost all of this has been merged as other PRs, therefore I will close this.
This PR implements:
- Graceful shutdown is now supported for Windows running as a service - MERGED
- Graceful shutdown is now supported for FCGI and unix sockets - MERGED
- The TestPullRequests goroutine now shuts down gracefully (as we always retest everything on restart this one is very easy...) - broken out to Graceful: Xorm, RepoIndexer, Cron and Others #9282 & MERGED
- RepoIndexer and IssueIndexer need to be gracefulised - RepoIndexer broken out to Graceful: Xorm, RepoIndexer, Cron and Others #9282 & MERGED
- Cron tasks should run a graceful wrapper - broken out to Graceful: Xorm, RepoIndexer, Cron and Others #9282 & MERGED
- All git commands run through git.Command will now time out at Hammer Time, similarly for processes - broken out to Graceful: Cancel Process on monitor pages & HammerTime #9213 & MERGED
- All git commands are now run through git.Command - broken out to Graceful: Cancel Process on monitor pages & HammerTime #9213 & MERGED
- Processes can now be cancelled from the monitor page - broken out to Graceful: Cancel Process on monitor pages & HammerTime #9213 & MERGED
- modules/migrations/migrate.go - Migration can take a very long time - need to be able to abort
- modules/notification/notification.go - Needs a queue and to be gracefulised
- modules/notification/base/null.go and modules/notification/base/queue.go are now autogenerated from the base.go
- modules/notification/ui/ui.go also uses a queue
- services/mailer/mailer.go - Needs a queue and to be gracefulised
- services/mirror/mirror.go - Now gracefully shuts down but probably needs a persistent queue - might need a 2 part queue
- services/pull/pull.go - AddTestPullRequestTask may need to become a proper queue and InvalidateCodeComments should take place within a graceful run.

Contains #9304