Queues not terminating gracefully #23050

Closed
brechtvl opened this issue Feb 21, 2023 · 19 comments · Fixed by #23054 or #23154

@brechtvl
Contributor

Description

We are running into an issue where, if a Gitea restart happens during merge conflict checking, PRs remain stuck in the conflict-checking state forever. I think there are two distinct issues here: one is that these don't get unstuck on new pushes, and perhaps that's best left for another report.

However, the reason things get into this state in the first place seems to be a problem where shutdown terminates all the queues immediately instead of respecting the hammer time.

When I start ./gitea web and then run killall gitea, I see the following in the log:

2023/02/21 19:43:26 ...eful/manager_unix.go:149:handleSignals() [W] [63f510cd-4] PID 1500309. Received SIGTERM. Shutting down...
2023/02/21 19:43:26 cmd/web.go:271:listen() [I] [63f510cd-48] HTTP Listener: 0.0.0.0:3000 Closed
2023/02/21 19:43:26 ...eue/queue_channel.go:127:func1() [W] ChannelQueue: task-channel Terminated before completed flushing
2023/02/21 19:43:26 .../graceful/manager.go:205:doHammerTime() [W] Setting Hammer condition
2023/02/21 19:43:26 ...eue/queue_channel.go:127:func1() [W] ChannelQueue: notification-service-channel Terminated before completed flushing
2023/02/21 19:43:26 ...eue/queue_channel.go:127:func1() [W] ChannelQueue: push_update-channel Terminated before completed flushing
2023/02/21 19:43:26 ...eful/server_hooks.go:46:doShutdown() [I] [63f510cd-48] PID: 1500309 Listener ([::]:3000) closed.
2023/02/21 19:43:27 .../graceful/manager.go:224:doTerminate() [W] Terminating
2023/02/21 19:43:27 ...er/issues/indexer.go:201:2() [I] PID: 1500309 Issue Indexer closed
2023/02/21 19:43:27 ...eful/manager_unix.go:157:handleSignals() [W] PID: 1500309. Background context for manager closed - context canceled - Shutting down...
2023/02/21 19:43:27 cmd/web.go:183:runWeb() [I] PID: 1500309 Gitea Web Finished

This log is from my local test instance with configuration set to defaults as much as possible.

On our production instance, that "Terminated before completed flushing" is leading to a lot of different errors, as workers get terminated in the middle of what they're doing.

I can provide more detailed logs if needed, but this may be easy to reproduce on any instance.

Gitea Version

main (43405c3)

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

No response

Git Version

No response

Operating System

No response

How are you running Gitea?

Own build from main.

Database

None

@techknowlogick
Member

techknowlogick commented Feb 21, 2023

For debugging purposes: [W] [63f40381] cannot get cache context when getting data: &{context.Background.WithCancel.WithCancel.WithValue(type pprof.labelContextKey, val {"graceful-lifecycle":"with-hammer"}) was also found in the logs, which traces back to the following function:
https://sourcegraph.com/github.com/go-gitea/gitea/-/blob/modules/cache/context.go?L58

Edit: this is also showing up on gitea.com, although not during shutdowns.

@brechtvl
Contributor Author

I'm also not seeing "graceful-lifecycle":"with-hammer" messages when simply starting and stopping Gitea. But when I, for example, do a push that causes conflict-checking workers to run for PRs, then they start showing up.

So I guess there is some problem with cache context in the workers?

@zeripath
Contributor

If I had to guess, this is going to be a misreport: there's a non-deterministic hierarchy of cancelled contexts, the wrong one of which is being picked and reported. I'd need to spend a good few hours looking through the code and thinking about the concurrency again, which I won't have time to do until the weekend.

@lunny
Member

lunny commented Feb 21, 2023

The context cache is currently meant to be used at the HTTP request level. Do you have a Redis cache server?
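
For illustration, a minimal sketch of the request-scoped cache-in-context pattern being described (the names and API here are made up, not Gitea's actual modules/cache code):

package main

import (
	"context"
	"fmt"
)

type cacheKeyType struct{}

var cacheKey cacheKeyType

// WithCacheContext attaches a per-request cache map to the context.
func WithCacheContext(ctx context.Context) context.Context {
	return context.WithValue(ctx, cacheKey, map[string]any{})
}

// SetCached stores a value only if a cache context is attached.
func SetCached(ctx context.Context, key string, value any) {
	if m, ok := ctx.Value(cacheKey).(map[string]any); ok {
		m[key] = value
	}
}

// GetCached reports a miss when no cache context is attached,
// e.g. in a background worker that never went through an HTTP request.
func GetCached(ctx context.Context, key string) (any, bool) {
	m, ok := ctx.Value(cacheKey).(map[string]any)
	if !ok {
		return nil, false
	}
	v, ok := m[key]
	return v, ok
}

func main() {
	reqCtx := WithCacheContext(context.Background())
	SetCached(reqCtx, "user:1", "brechtvl")

	v, ok := GetCached(reqCtx, "user:1")
	fmt.Println(v, ok) // brechtvl true

	_, ok = GetCached(context.Background(), "user:1")
	fmt.Println(ok) // false - no cache attached outside the request
}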

@techknowlogick
Member

@lunny they are using memcached for cache

@lunny
Member

lunny commented Feb 21, 2023

@lunny they are using memcached for cache

This may occur when Gitea starts up.

@brechtvl
Contributor Author

brechtvl commented Feb 21, 2023

I see these Terminated before completed flushing warning messages also in v1.16.9 (had some problems testing earlier). So I'm not sure if this is a problem that has existed for a while, or if these messages are not as serious as they sound and the problem is elsewhere.

Here is the part of the log from the production server where it got the shutdown signal and then ran into errors, if it helps:
https://gist.github.com/brechtvl/e4778293c1c919a4ec896a31e418677c

@techknowlogick
Member

@lunny the same ctx error can be seen in Gitea.com logs and occurs at times other than startup and shutdown

@lunny
Member

lunny commented Feb 22, 2023

I have sent PR #23054, but I don't think it will actually fix this problem; it only removes the unnecessary warning logs.

@lunny
Member

lunny commented Feb 22, 2023

I see these Terminated before completed flushing warning messages also in v1.16.9 (had some problems testing earlier). So I'm not sure if this is a problem that has existed for a while, or if these messages are not as serious as they sound and the problem is elsewhere.

Here is the part of the log from the production server where it got the shutdown signal and then ran into errors, if it helps: https://gist.github.com/brechtvl/e4778293c1c919a4ec896a31e418677c

From the logs, it looks like a normal termination process.

@brechtvl
Contributor Author

Everything seems to shut down within 1s, while GRACEFUL_HAMMER_TIME was left at the default 60s. I was expecting it to wait up to 60 seconds for any ongoing work to complete. Or is that only meant for requests, not queues?

If it's not meant for queues, is the mechanism that work gets added back to the queue on such errors, to be redone after restart? It wasn't doing that as far as I could tell, but I may well have missed something.

@lunny
Member

lunny commented Feb 22, 2023

I think the logic should be: if all workers shut down in less than 60s, the process exits immediately. If some workers time out while shutting down (exceeding 60s), then the hammer shutdown is executed and the program exits at that point.
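
For illustration, a minimal generic sketch of that shutdown-then-hammer pattern (not Gitea's actual graceful manager code; the names here are made up):

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const hammerTime = 60 * time.Second // stands in for GRACEFUL_HAMMER_TIME

	var wg sync.WaitGroup
	// workers would wrap their in-flight tasks with wg.Add(1) / defer wg.Done()

	done := make(chan struct{})
	go func() {
		wg.Wait() // all workers have finished flushing
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("all workers shut down cleanly - exit immediately")
	case <-time.After(hammerTime):
		fmt.Println("hammer time reached - terminate whatever is still running")
	}
}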

brechtvl added a commit to blender/gitea that referenced this issue Feb 22, 2023
@brechtvl
Contributor Author

I found more details (and a very hacky workaround) for pull requests getting stuck in conflict checking.

  • After shutdown in the middle of pr_patch_checker work, a persistent queue of remaining PRs to be checked is saved to disk.
  • On restart these are loaded into the persistent queue of prPatchCheckerQueue, but this seemingly does not cause any work to start.
  • Also on restart, InitializePullRequests will put any PRs in need of checking in the channel queue of prPatchCheckerQueue and start working on them.
  • However checkAndUpdateStatus never updates the PR status because prPatchCheckerQueue.Has returns true, due to this same PR still being in the persistent queue of prPatchCheckerQueue.

The persistent queue in this case seems unnecessary as InitializePullRequests can already restart conflict checking for any PRs that need it.
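
A simplified sketch of the Has interaction described above (illustrative only; the helper names are made up and this is not the actual Gitea code):

// If the same PR is still sitting in the restored persistent queue, Has()
// returns true and the status update is skipped, so the PR stays stuck in
// the "checking" state.
type uniqueQueue interface {
	Has(data string) (bool, error)
}

func checkAndUpdateStatusSketch(q uniqueQueue, prID string) {
	if has, _ := q.Has(prID); has {
		// the PR still appears to be queued for another check - skip the update
		return
	}
	// ... only here would the PR's conflict-checking status be persisted ...
}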

@brechtvl
Contributor Author

brechtvl commented Feb 22, 2023

The steps to reproduce this are:

  • Run Gitea with the default configuration
  • Migrate the Blender repository from https://github.com/blender/blender
  • Create about 10 pull requests (e.g. by editing a single file in the UI and creating a PR)
  • Edit another file through the UI and commit to main directly
  • Run killall gitea quickly while conflict checking is going on
  • After restart, some pull requests will remain stuck in conflict checking, no matter whether you do more commits on main or the PR head branch

@wxiaoguang
Contributor

wxiaoguang commented Feb 23, 2023

This is a test for this problem

The results are quite surprising.

TestPullRequest_QueueStuck completed1=1, executed2=0, has2=20, executed3=100, has3=19
TestPullRequest_QueueStuck completed1=4, executed2=0, has2=20, executed3=100, has3=19
TestPullRequest_QueueStuck completed1=4, executed2=0, has2=20, executed3=101, has3=19
func TestPullRequest_QueueStuck(t *testing.T) {
	setting_module.AppWorkPath = "/tmp"
	_ = util.RemoveAll("/tmp/data")
	unittest.PrepareTestEnv(t)
	setting_module.InitProviderAndLoadCommonSettingsForTest()
	setting_module.LoadQueueSettings()

	q1 := func() (completedTasks []string) {
		startWhen100Ready := make(chan struct{}) // only start consuming data once all 100 tasks have been pushed into the queue
		stopAt20Shutdown := make(chan struct{})  // stop and shutdown at the 20th item

		testHandler := func(data ...queue.Data) []queue.Data {
			<-startWhen100Ready
			time.Sleep(100 * time.Millisecond)
			for _, datum := range data {
				s := datum.(string)
				completedTasks = append(completedTasks, s)
				if s == "task-20" {
					close(stopAt20Shutdown)
					return nil
				}
			}
			return nil
		}

		q := queue.CreateUniqueQueue("pr_patch_checker_test", testHandler, "")
		q.Run(func(atShutdown func()) { go func() { <-stopAt20Shutdown; atShutdown() }() }, func(atTerminate func()) {})

		// add 100 tasks to the queue
		for i := 0; i < 100; i++ {
			_ = q.Push("task-" + strconv.Itoa(i))
		}

		close(startWhen100Ready)
		<-stopAt20Shutdown
		return
	}

	q2 := func() (executedTasks []string, hasTasks []string) {
		stop := make(chan struct{})

		// collect the tasks that have been executed
		testHandler := func(data ...queue.Data) []queue.Data {
			for _, datum := range data {
				executedTasks = append(executedTasks, datum.(string))
			}
			return nil
		}

		q := queue.CreateUniqueQueue("pr_patch_checker_test", testHandler, "")
		q.Run(func(atShutdown func()) { go func() { <-stop; atShutdown() }() }, func(atTerminate func()) {})

		// wait for a while to see whether there are tasks to get executed.
		time.Sleep(1 * time.Second)

		// check whether the tasks are still in the queue
		for i := 0; i < 100; i++ {
			if has, _ := q.Has("task-" + strconv.Itoa(i)); has {
				hasTasks = append(hasTasks, "task-"+strconv.Itoa(i))
			}
		}
		close(stop)
		return
	}

	q3 := func() (executedTasks []string, hasTasks []string) {
		stop := make(chan struct{})

		testHandler := func(data ...queue.Data) []queue.Data {
			for _, datum := range data {
				executedTasks = append(executedTasks, datum.(string))
			}
			return nil
		}

		q := queue.CreateUniqueQueue("pr_patch_checker_test", testHandler, "")
		q.Run(func(atShutdown func()) { go func() { <-stop; atShutdown() }() }, func(atTerminate func()) {})

		// re-run all tasks
		for i := 0; i < 100; i++ {
			_ = q.Push("task-" + strconv.Itoa(i))
		}

		// wait for a while
		time.Sleep(1 * time.Second)

		// check whether the tasks are still in the queue
		for i := 0; i < 100; i++ {
			if has, _ := q.Has("task-" + strconv.Itoa(i)); has {
				hasTasks = append(hasTasks, "task-"+strconv.Itoa(i))
			}
		}
		close(stop)
		return
	}

	completedTasks1 := q1() // run some tasks and shutdown at an intermediate point
	time.Sleep(time.Second)
	executedTasks2, hasTasks2 := q2() // restart the queue to check the tasks in it
	time.Sleep(time.Second)
	executedTasks3, hasTasks3 := q3() // try to re-run all tasks

	log.Error("TestPullRequest_QueueStuck completed1=%v, executed2=%v, has2=%v, executed3=%v, has3=%v",
		len(completedTasks1), len(executedTasks2), len(hasTasks2), len(executedTasks3), len(hasTasks3))
}

lunny added a commit that referenced this issue Feb 24, 2023
Partially fix #23050

After #22294 merged, it always has a warning log like `cannot get
context cache` when starting up. This should not affect any real life
but it's annoying. This PR will fix the problem. That means when
starting up, getting the system settings will not try from the cache but
will read from the database directly.

---------

Co-authored-by: Lauris BH <lauris@nix.lv>
yardenshoham pushed a commit to yardenshoham/gitea that referenced this issue Feb 24, 2023
Partially fix go-gitea#23050

After go-gitea#22294 merged, it always has a warning log like `cannot get
context cache` when starting up. This should not affect any real life
but it's annoying. This PR will fix the problem. That means when
starting up, getting the system settings will not try from the cache but
will read from the database directly.

---------

Co-authored-by: Lauris BH <lauris@nix.lv>
lunny reopened this Feb 24, 2023
techknowlogick pushed a commit that referenced this issue Feb 24, 2023
Backport #23054

Partially fix #23050

After #22294 merged, it always has a warning log like `cannot get
context cache` when starting up. This should not affect any real life
but it's annoying. This PR will fix the problem. That means when
starting up, getting the system settings will not try from the cache but
will read from the database directly.

Co-authored-by: Lunny Xiao <xiaolunwen@gmail.com>
Co-authored-by: Lauris BH <lauris@nix.lv>
@zeripath
Contributor

zeripath commented Feb 25, 2023

@wxiaoguang Thanks for your testcase. It's not quite right, but it has pointed me in the correct direction.

Queues are extremely concurrent, and the handler in q1 is going to drop a lot of data - you cannot expect that, as soon as you tell a queue to shut down, no more data will be handled. In fact, you can almost guarantee that at least a few more items will be handled.

I'm working through it now, trying to make it properly concurrency-safe, and I'll place it in modules/queue once I've slightly rewritten it. (I'm not sure what q3 is supposed to reliably do.)


Now to address the Warning - if you switch on Trace logging for modules/queue, either via the CLI:

gitea manager logging add console --name traceconsole --level TRACE --expression modules/queue

or in app.ini:

[log]
MODE = ..., traceconsole
...

[log.traceconsole]
LEVEL=trace
MODE=console
EXPRESSION=modules/queue

you would see that the internal channel queue is explicitly drained at Terminate. There should be no loss of data - the warning is not the cause of the issue you're investigating.
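
As a rough illustration of that drain-at-terminate behaviour (a generic sketch, not the actual ChannelQueue implementation):

// Pull any remaining items off the channel and handle them before the
// queue finally goes away, so nothing still buffered is silently dropped.
func drainAtTerminate(ch chan string, handle func(string)) {
	for {
		select {
		case item := <-ch:
			handle(item)
		default:
			return // channel is empty - nothing left to lose
		}
	}
}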


I assume you're actually investigating things not being emptied on startup - and that's a real bug.

If you look at:

	if lq, ok := q.internal.(*LevelQueue); ok && lq.byteFIFO.Len(lq.shutdownCtx) != 0 {
		// Just run the level queue - we shut it down once it's flushed
		go q.internal.Run(func(_ func()) {}, func(_ func()) {})
		go func() {
			for !q.IsEmpty() {
				_ = q.internal.Flush(0)
				select {
				case <-time.After(100 * time.Millisecond):
				case <-q.internal.(*LevelQueue).shutdownCtx.Done():
					log.Warn("LevelQueue: %s shut down before completely flushed", q.internal.(*LevelQueue).Name())
					return
				}
			}
			log.Debug("LevelQueue: %s flushed so shutting down", q.internal.(*LevelQueue).Name())
			q.internal.(*LevelQueue).Shutdown()
			GetManager().Remove(q.internal.(*LevelQueue).qid)
		}()
	} else {
		log.Debug("PersistableChannelQueue: %s Skipping running the empty level queue", q.delayedStarter.name)
		q.internal.(*LevelQueue).Shutdown()
		GetManager().Remove(q.internal.(*LevelQueue).qid)
	}
}

and compare with:

	if luq, ok := q.internal.(*LevelUniqueQueue); ok && luq.ByteFIFOUniqueQueue.byteFIFO.Len(luq.shutdownCtx) != 0 {
		// Just run the level queue - we shut it down once it's flushed
		go q.internal.Run(func(_ func()) {}, func(_ func()) {})
		go func() {
			_ = q.internal.Flush(0)
			log.Debug("LevelUniqueQueue: %s flushed so shutting down", q.internal.(*LevelUniqueQueue).Name())
			q.internal.(*LevelUniqueQueue).Shutdown()
			GetManager().Remove(q.internal.(*LevelUniqueQueue).qid)
		}()
	} else {
		log.Debug("PersistableChannelUniqueQueue: %s Skipping running the empty level queue", q.delayedStarter.name)
		q.internal.(*LevelUniqueQueue).Shutdown()
		GetManager().Remove(q.internal.(*LevelUniqueQueue).qid)
	}
}

I think you'll see the real bug.
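
In other words, the unique variant flushes once and shuts down immediately, while the non-unique variant keeps retrying until the restored level queue is actually empty. A rough sketch of what mirroring that retry loop in the unique branch might look like (illustrative only, not the actual patch in #23154):

go func() {
	// keep flushing until the restored level queue reports empty,
	// instead of flushing once and shutting down immediately
	for !q.IsEmpty() {
		_ = q.internal.Flush(0)
		select {
		case <-time.After(100 * time.Millisecond):
		case <-q.internal.(*LevelUniqueQueue).shutdownCtx.Done():
			log.Warn("LevelUniqueQueue: %s shut down before completely flushed", q.internal.(*LevelUniqueQueue).Name())
			return
		}
	}
	log.Debug("LevelUniqueQueue: %s flushed so shutting down", q.internal.(*LevelUniqueQueue).Name())
	q.internal.(*LevelUniqueQueue).Shutdown()
	GetManager().Remove(q.internal.(*LevelUniqueQueue).qid)
}()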

@wxiaoguang
Contributor

Actually, I do not care whether the queue is reliable or not, and in most cases the queue doesn't need to be reliable.

But has3=19 seems to be what causes the bug described by this issue: those tasks are just stuck in the queue.

@zeripath
Contributor

zeripath commented Feb 25, 2023

As I said, your testcase is somewhat incorrect in form. The reason you're seeing somewhat inconsistent behaviour is partially because of this, but also because of the bug that I mentioned above.

Thank you for the testcase though, because whilst it's not quite right, it made me realise that the bug is the one I mentioned above - which I've fixed in the attached PR. The testcase has also been fixed.

zeripath added a commit to zeripath/gitea that referenced this issue Feb 25, 2023
There have been a number of reports of blocked PRs being checked which have been difficult to
debug. In investigating go-gitea#23050 I have realised that whilst the Warn there is somewhat of a
miscall there was a real bug in the way that the LevelUniqueQueue was being restored on
start-up of the PersistableChannelUniqueQueue.

This PR fixes this bug and adds a testcase.

Fix go-gitea#23050
and others

Signed-off-by: Andrew Thornton <art27@cantab.net>
@wxiaoguang
Contributor

wxiaoguang commented Feb 26, 2023

After applying #23154, I can still see some strange results (just strange - I don't know whether they are right or wrong):

  • q2 always outputs executed2=0, has2=20. Does that mean that if there is no Push after startup, the tasks already in the queue will never be executed?
  • If q3 uses Has to limit each task to being executed only once (the code in the details below, maybe like checkAndUpdateStatus in the real code?), the output is executed3=107, has3=0, which means many tasks are still re-executed even though there is a Has check. I know that there is concurrency involved, but is that clear to future developers? Does the existing code that relies on Has, like checkAndUpdateStatus, work as expected in all cases?

Are these two behaviors by design? If yes, I think some comments would be very useful. If not, should they be fixed?

Feel free to ignore my comment if you think they are unrelated.

P.S. I didn't get the point about what you mean by "incorrect" or "not quite right". That test code is just used to test the queue's behavior; there is no assertion in it, and I didn't say anything about "correct" or "right" before.

		// q3 test handler: use `Has` so each task is executed only once
		testHandler := func(data ...queue.Data) []queue.Data {
			for _, datum := range data {
				s := datum.(string)
				if has, _ := q.Has(s); !has {
					executedTasks = append(executedTasks, s)
				}
			}
			return nil
		}

techknowlogick added a commit that referenced this issue Feb 28, 2023
There have been a number of reports of PRs being blocked whilst being
checked which have been difficult to debug. In investigating #23050 I
have realised that whilst the Warn there is somewhat of a miscall there
was a real bug in the way that the LevelUniqueQueue was being restored
on start-up of the PersistableChannelUniqueQueue.

Next there is a conflict in the setting of the internal leveldb queue
name - This wasn't being set so it was being overridden by other unique
queues.

This PR fixes these bugs and adds a testcase.

Thanks to @brechtvl  for noticing the second issue.

Fix #23050
and others

---------

Signed-off-by: Andrew Thornton <art27@cantab.net>
Co-authored-by: techknowlogick <techknowlogick@gitea.io>
GiteaBot pushed a commit to GiteaBot/gitea that referenced this issue Feb 28, 2023
There have been a number of reports of PRs being blocked whilst being
checked which have been difficult to debug. In investigating go-gitea#23050 I
have realised that whilst the Warn there is somewhat of a miscall there
was a real bug in the way that the LevelUniqueQueue was being restored
on start-up of the PersistableChannelUniqueQueue.

Next there is a conflict in the setting of the internal leveldb queue
name - This wasn't being set so it was being overridden by other unique
queues.

This PR fixes these bugs and adds a testcase.

Thanks to @brechtvl  for noticing the second issue.

Fix go-gitea#23050
and others

---------

Signed-off-by: Andrew Thornton <art27@cantab.net>
Co-authored-by: techknowlogick <techknowlogick@gitea.io>
lunny pushed a commit that referenced this issue Mar 6, 2023
Backport #23154

There have been a number of reports of PRs being blocked whilst being
checked which have been difficult to debug. In investigating #23050 I
have realised that whilst the Warn there is somewhat of a miscall there
was a real bug in the way that the LevelUniqueQueue was being restored
on start-up of the PersistableChannelUniqueQueue.

Next there is a conflict in the setting of the internal leveldb queue
name - This wasn't being set so it was being overridden by other unique
queues.

This PR fixes these bugs and adds a testcase.

Thanks to @brechtvl  for noticing the second issue.

Fix #23050
and others

Signed-off-by: Andrew Thornton <art27@cantab.net>
Co-authored-by: zeripath <art27@cantab.net>
Co-authored-by: techknowlogick <techknowlogick@gitea.io>
Co-authored-by: delvh <leon@kske.dev>
go-gitea locked and limited conversation to collaborators May 3, 2023