Recurring freezes on BSD #18180
Comments
How did you compile Gitea yourself?
It's built by poudriere, which is a bulk package builder for FreeBSD. Actual disk usage I sadly don't have data for, but here are graphs for the other metrics as well as gitea and nginx (the green one is gitea, or rather the transparent proxy for it).
I should probably also mention that no other services seem to be affected by this, so it looks like it's not hogging any system resources (or rather not any used by anything else) when it freezes – it just stops doing stuff.
This sounds like a deadlock happening somewhere, but postgres isn't usually a DB that suffers from them (SQLite is usually the way we detect these). Apart from DB deadlocks I'm not sure there's any other obvious thing that would cause a deadlock. One problem in 1.15.7 and below was that … If so, a few thoughts:
The logs appear somewhat confusing:
This implies that the problem has occurred somewhere in: Lines 551 to 558 in b25a571
But the problem is that there isn't really any place for a deadlock to occur in there. My thoughts for progressing this further are to apply the following:
diff --git a/modules/git/command.go b/modules/git/command.go
index d83c42fdc..273bea632 100644
--- a/modules/git/command.go
+++ b/modules/git/command.go
@@ -120,6 +120,8 @@ func (c *Command) RunInDirTimeoutEnvFullPipelineFunc(env []string, timeout time.
log.Debug("%s: %v", dir, c)
}
+ defer log.Debug("Done %s: %v", dir, c)
+
ctx, cancel := context.WithTimeout(c.parentContext, timeout)
defer cancel()
diff --git a/modules/git/repo_branch_nogogit.go b/modules/git/repo_branch_nogogit.go
index 666ca81c1..b9a7a483f 100644
--- a/modules/git/repo_branch_nogogit.go
+++ b/modules/git/repo_branch_nogogit.go
@@ -68,6 +68,9 @@ func (repo *Repository) GetBranches(skip, limit int) ([]string, int, error) {
// callShowRef return refs, if limit = 0 it will not limit
func callShowRef(repoPath, prefix, arg string, skip, limit int) (branchNames []string, countAll int, err error) {
+ log.Debug("callShowRef %s %s %s %d %d", repoPath, prefix, arg, skip, limit)
+ defer log.Debug("done: callShowRef %s %s %s %d %d", repoPath, prefix, arg, skip, limit)
+
stdoutReader, stdoutWriter := io.Pipe()
defer func() {
_ = stdoutReader.Close()
diff --git a/modules/process/manager.go b/modules/process/manager.go
index e42e38a0f..7ea6a73e6 100644
--- a/modules/process/manager.go
+++ b/modules/process/manager.go
@@ -14,6 +14,8 @@ import (
"sort"
"sync"
"time"
+
+ "code.gitea.io/gitea/modules/log"
)
// TODO: This packages still uses a singleton for the Manager.
@@ -56,6 +58,7 @@ func GetManager() *Manager {
// Add a process to the ProcessManager and returns its PID.
func (pm *Manager) Add(description string, cancel context.CancelFunc) int64 {
+ log.Debug("Add(%s)", description)
pm.mutex.Lock()
pid := pm.counter + 1
pm.processes[pid] = &Process{
@@ -67,6 +70,7 @@ func (pm *Manager) Add(description string, cancel context.CancelFunc) int64 {
pm.counter = pid
pm.mutex.Unlock()
+ log.Debug("Done Add(%s) PID: %d", description, pid)
return pid
}
That would help us to see if there is a deadlock somewhere. One final random thought is that maybe the problem is in log/file.go - I guess it's worth another review.
Have you seen any further freezes?
Yes and no. The bug seems to have triggered again, but this time the process actually managed to die.
OK, so somewhat relievingly the problem is not in process.GetManager().Add(), and the fact that the error has occurred in the same place suggests that logging is not to blame. The problem lies somewhere in: Lines 123 to 155 in 6cb5069
The only places left therefore are:
- the context.WithTimeout call
- the exec.CommandContext call
- cmd.Start()
None of these are looking greatly soluble; I guess adding some logging to this section of code is the only thing to do to move closer to working out what the possible reason is.
diff --git a/modules/git/command.go b/modules/git/command.go
index d83c42fdc..71c398aae 100644
--- a/modules/git/command.go
+++ b/modules/git/command.go
@@ -120,9 +120,13 @@ func (c *Command) RunInDirTimeoutEnvFullPipelineFunc(env []string, timeout time.
log.Debug("%s: %v", dir, c)
}
+ defer log.Debug("Done %s: %v", dir, c)
+
ctx, cancel := context.WithTimeout(c.parentContext, timeout)
defer cancel()
+ log.Debug("%s: %v created context", dir, c)
+
cmd := exec.CommandContext(ctx, c.name, c.args...)
if env == nil {
cmd.Env = os.Environ()
@@ -130,6 +134,8 @@ func (c *Command) RunInDirTimeoutEnvFullPipelineFunc(env []string, timeout time.
cmd.Env = env
}
+ log.Debug("%s: %v created CommandContext", dir, c)
+
cmd.Env = append(
cmd.Env,
fmt.Sprintf("LC_ALL=%s", DefaultLocale),
@@ -149,6 +155,8 @@ func (c *Command) RunInDirTimeoutEnvFullPipelineFunc(env []string, timeout time.
return err
}
+ log.Debug("%s: %v started", dir, c)
+
desc := c.desc
if desc == "" {
desc = fmt.Sprintf("%s %s %s [repo_path: %s]", GitExecutable, c.name, strings.Join(c.args, " "), dir)
diff --git a/modules/git/repo_branch_nogogit.go b/modules/git/repo_branch_nogogit.go
index 666ca81c1..b9a7a483f 100644
--- a/modules/git/repo_branch_nogogit.go
+++ b/modules/git/repo_branch_nogogit.go
@@ -68,6 +68,9 @@ func (repo *Repository) GetBranches(skip, limit int) ([]string, int, error) {
// callShowRef return refs, if limit = 0 it will not limit
func callShowRef(repoPath, prefix, arg string, skip, limit int) (branchNames []string, countAll int, err error) {
+ log.Debug("callShowRef %s %s %s %d %d", repoPath, prefix, arg, skip, limit)
+ defer log.Debug("done: callShowRef %s %s %s %d %d", repoPath, prefix, arg, skip, limit)
+
stdoutReader, stdoutWriter := io.Pipe()
defer func() {
_ = stdoutReader.Close()
diff --git a/modules/process/manager.go b/modules/process/manager.go
index e42e38a0f..7ea6a73e6 100644
--- a/modules/process/manager.go
+++ b/modules/process/manager.go
@@ -14,6 +14,8 @@ import (
"sort"
"sync"
"time"
+
+ "code.gitea.io/gitea/modules/log"
)
// TODO: This packages still uses a singleton for the Manager.
@@ -56,6 +58,7 @@ func GetManager() *Manager {
// Add a process to the ProcessManager and returns its PID.
func (pm *Manager) Add(description string, cancel context.CancelFunc) int64 {
+ log.Debug("Add(%s)", description)
pm.mutex.Lock()
pid := pm.counter + 1
pm.processes[pid] = &Process{
@@ -67,6 +70,7 @@ func (pm *Manager) Add(description string, cancel context.CancelFunc) int64 {
pm.counter = pid
pm.mutex.Unlock()
+ log.Debug("Done Add(%s) PID: %d", description, pid)
return pid
}
(Remember GitHub likes to pretend the final empty line doesn't exist, so if you copy this add a terminal empty line.)
Patched, deployed and will get back next time the bug strikes. Thanks for the assistance. :)
@phryk thanks for posting this bug report. I'm having the same issue running Gitea in Bastille, running postgres in the same container. I'm currently updating to the latest release but did notice that the hang happens the same way as for you, and I have to fully restart the server when I need to update a setting or two (still getting Gitea moved over and configured properly from a Linux server). Definitely seeing this issue with 1.15.10 on my end. The easiest way to get it to trigger for me was to just spam
@bedwardly-down as I said above this isn't looking like a Gitea bug per se. The three points of possible deadlock are all deep in go std library code and likely at system calls. My greatest suspicion is unfortunately falling on the context.WithTimeout call. If it's there then that's a serious problem and working around it will not be easy (although we could simply drop the WithTimeout, assuming WithCancel is unaffected).
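To make that last idea concrete, here is a minimal, hypothetical sketch (not Gitea code; the function name and structure are invented for illustration) of enforcing a timeout with context.WithCancel plus a timer instead of context.WithTimeout. One behavioural difference to keep in mind: on timeout the context error becomes context.Canceled rather than context.DeadlineExceeded.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
	"time"
)

// runWithManualTimeout enforces the timeout itself, via a timer that cancels
// a WithCancel context, so context.WithTimeout is never involved.
func runWithManualTimeout(parent context.Context, timeout time.Duration, name string, args ...string) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	// Cancel the context once the timeout elapses; stop the timer if we
	// return earlier than that.
	timer := time.AfterFunc(timeout, cancel)
	defer timer.Stop()

	cmd := exec.CommandContext(ctx, name, args...)
	return cmd.Run()
}

func main() {
	if err := runWithManualTimeout(context.Background(), 5*time.Second, "git", "version"); err != nil {
		fmt.Println("error:", err)
	}
}
```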
@zeripath thanks for responding. I'm waiting on the FreeBSD port maintainer to get back to me on it, but go apps having this kind of issue is not uncommon on FreeBSD, according to other maintainers I've been interacting with in the IRC and official Discord channels. I ran a test yesterday that kind of goes along with your assumptions too. I'm primarily involved with the Node.js ecosystem right now, so I use the PM2 process manager and Nodemon pretty regularly. Running Gitea through PM2 and having that strapped to FreeBSD's init system showed that the port was attempting to start and stop at the exact same time in various places, causing it to deadlock. I had a similar occurrence with Caddy and a few other go apps on my old Gentoo server (it uses a modified version of FreeBSD's init system, with most of the core functionality being exactly the same between the two). Running Gitea and Caddy directly, outside of init, seemed to work fine in several of my tests.
Bug hit again. Logs will shortly be on the way to @zeripath, last 3 lines from
OK well we've established that the problem is in starting the process: Line 148 in 6cb5069
I think this is likely to be an OS/jail problem - @bedwardly-down's comment suggests that perhaps the issue might be some deadlock in PM2 with processes being created at exactly the same time. If so, there's nothing we as Gitea can do. You could try to use the gogit variant - as this will create a lot fewer calls to git - which might reduce the issue?
To further clarify, PM2 was not how I normally ran Gitea. It was a test to see what's actually happening behind the scenes with a tool that has a built-in monitor that prints some basic but useful information. It also doesn't seem to have any issues with SQLite3.
@phryk how dependent are you on getting your git up and running? A large chunk of my daily needs are built around git and version control, so this definitely was an inconvenience for me. I hope it gets fixed pretty quickly upstream.
Have you caught the issue with pprof enabled? If so, could you upload the pprof report?
I haven't tried or heard of that. I'll have to tinker with that later.
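For reference, pprof is toggled through the [server] section of app.ini; a minimal sketch (the localhost:6060 listener is the documented default as far as I know – treat the details as assumptions and check the config cheat sheet):

```ini
[server]
; Starts a pprof HTTP listener (assumed default: 127.0.0.1:6060) for the web command.
ENABLE_PPROF = true
```

Once enabled, fetching /debug/pprof/goroutine?debug=2 from that listener is usually the most useful capture for a hang – assuming the pprof listener itself still responds.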
@lunny I've tried that before, but with the process frozen, pprof doesn't answer anymore either. It would be easier if the info collected by pprof also went into the openmetrics output, as that's already being polled. :P @bedwardly-down I'm running a private gitea instance which I use for all my projects, so it is a bit of an inconvenience for me personally, but I don't have anyone else depending on the service.
I'm really not sure pprof is going to help much. The problem appears to be in
@zeripath Thanks for looking into this as much as you have. For mine:
@phryk I just had a thought: are you running postgres in the same jail as Gitea instead of separately and interacting with it through a port? To get it to work, it needed some modifications to its main config file to allow it to use more RAM and other system resources than most BSD jails initially allocate. I haven't tried postgres in a separate jail yet, since many online sources recommended encapsulating the DB and its software in a single unit to allow easier transfer and backup. I wonder if the deadlock is caused by that.
The issue wasn't solved by moving postgres to a separate container for me. Even running postgres through a port and connecting that way caused gitea to hang. Removing the app.ini and letting it try to generate a new one also caused it to hang, even with the /usr/local/etc/gitea/conf directory permissions set to 755 or 777. So far, it looks like SQLite is the definitive way to run it right now. Can I get anyone else to try it with SQLite? So far, that works in my tests but isn't ideal if you have big repositories.
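For anyone wanting to reproduce that comparison, a hedged sketch of the relevant app.ini section for an SQLite-backed instance (the PATH shown is a made-up example – point it at your own data directory):

```ini
[database]
DB_TYPE = sqlite3
; Hypothetical location of the SQLite database file.
PATH    = /var/db/gitea/gitea.db
```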
@bedwardly-down are you sure that you have the same problem as @phryk? It would be useful to double-check where your hangs are happening. I cannot see why SQLite would be better for a problem relating to forking. If you're finding that SQLite is better, then could you try connecting to postgres over a unix socket instead of a TCP port, as it might be that you're suffering TCP port exhaustion instead. Thinking again about this blocking problem relating to fork, I wonder if the block is happening because a page fault is occurring in fork and the signal cannot be handled? @phryk what version of go are you building Gitea with? Please ensure that go is the most recent version. It might help to
@zeripath honestly, it may be different. There's not enough info from the original poster to get a clear picture, and the only common thread we have is postgres; I'm grasping at straws using my own limited knowledge of how gitea and its database support work.
@bedwardly-down could you apply the patch in #18180 (comment) and check your logs for when the deadlock occurs to see if it's at the same place. If you find that the final log line is the same as in @phryk's logs, you're hitting the same problem. If not - well, then we get to find another bug. If switching to SQLite helps then that means port exhaustion is more likely. Similarly, if using a unix socket for postgres helps then it's far more likely to be a port exhaustion problem.
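As a rough sketch of what the unix-socket variant could look like in app.ini (the socket directory is an assumption – on FreeBSD postgres often puts its socket under /tmp, so check unix_socket_directories in postgresql.conf):

```ini
[database]
DB_TYPE = postgres
; Hypothetical socket directory; a path here makes Gitea connect via the
; unix socket instead of TCP.
HOST    = /tmp/
NAME    = gitea
USER    = gitea
PASSWD  =
```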
Didn't apply the patch. Running it straight through the unix socket inside the same container finally solved my issue. Looking at the system logs, there was no indication, when previously running through a port (either from a separate container or locally), that my permissions were wrong within Postgres. I fixed them and can now run Gitea with no issues. So, I'll have to agree that mine was probably port exhaustion as you suggested.
If you're using an http/https proxy like nginx you should also be able to make gitea run as http+unix, which would then prevent port exhaustion (except through migrations) as a cause for problems. (Well, I say "prevent" - I mean stop Gitea from being the cause; the port exhaustion could still happen in nginx.) Actually, it looks like Go is now setting SO_REUSEADDR, so our server listening shouldn't be affected by port exhaustion - perhaps the DB connection sockets aren't set with SO_REUSEADDR?
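A hedged sketch of the http+unix setup mentioned above; the socket path and permission value are assumptions, not a tested configuration:

```ini
[server]
PROTOCOL               = http+unix
; Hypothetical socket path – the reverse proxy then talks to this socket
; instead of a TCP port.
HTTP_ADDR              = /var/run/gitea/gitea.sock
UNIX_SOCKET_PERMISSION = 666
```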
I'm using Caddy in a separate container and reverse proxying straight from the local container ports. I don't believe Bastille supports unix sockets right now, but that's possibly in the works. I have under 10 containers so far and Caddy is only currently serving 3, while the rest are completely internal. That's not good. I'll have to research a bit on how to wrangle the containers into a single socket. Thanks.
I have been using gitea for more than a year and have never had such problems. My configuration:
$ uname -srm
Jails were created with make distribution and make installworld from /usr/src; actually I am using one jail as a template and then only tar -x to the specific location. The main system (host) is built from source; after the host is built, the jails are updated with the same binaries: I am using a custom script which does mergemaster, make installworld and make delete-old(libs) to the specific jail location with the -D and DESTDIR parameters. Packages are built in a different jail (I am not using poudriere) and are available to other jails through a custom pkg repository. From pf (packet filter) I am making a redirect to the jail with nginx:
In the jail the nginx config looks like:
proxy-headers.conf:
ssl-params.conf:
Nginx makes a proxy to the jail where gitea is installed together with postgresql.
My jails config /etc/jail.conf:
Make sure you have the sysv* parameters set in your config if you are using postgresql (a rough sketch follows at the end of this comment). My limits for the gitea jail:
and the average resource use:
In fact, there are three separate instances of gitea in the jail - each with a custom startup script. But the startup script is based on the standard /usr/local/etc/rc.d/gitea, so this should not be the problem. Someone said above that this error can be provoked by a service restart, so I have tried:
but nothing wrong has happened. Maybe this depends on the kind of git repository. If you wish, I can prepare an empty gitea instance for you and you can then interact with it to see if this problem occurs.
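As referenced above, a rough sketch of the sysv* parameters in /etc/jail.conf (the jail name is hypothetical and the choice of "new" namespaces is an assumption; older setups may use allow.sysvipc instead):

```
gitea {
    # Give the jail its own SysV IPC namespaces so postgresql can allocate
    # shared memory and semaphores.
    sysvmsg = new;
    sysvsem = new;
    sysvshm = new;
}
```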
For completeness: I am not using ZFS but UFS with soft-updates:
In the past I used soft update journaling, but there was a bug in soft update journaling which caused the file system to be locked (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224292), so I have changed the filesystem configuration so that UFS only uses soft-updates now.
My setup looks a bit different. I'm not really sure what
Set up on this is a custom FreeBSD (12.2) install with redundant ZFS pools.
All currently deployed packages are built by a poudriere run on the host OS.
I'm running a custom thinjail setup using
All jails are on an extra loopback interface; I use nginx as reverse proxy, with gitea running on TCP port 3000 in the same
I think I already said this to @zeripath on IRC, but it might be worth reiterating,
I also use the nullfs trick to mount postgres, mysql, and syslogd sockets into jails but have never encountered issues with it, FWIW.
edit: what is your SSD? Some have small buffers that cause the entire device to choke if you fill it completely with writes.
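For reference, the nullfs trick is simply a nullfs mount of the host directory containing the socket into the jail's filesystem; a sketch with made-up paths:

```
# Example per-jail fstab entry (paths are hypothetical).
/var/run/postgresql  /usr/local/jails/gitea/var/run/postgresql  nullfs  rw  0  0
```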
30 days ago, I did a (minor) FreeBSD update which contained a new kernel.
I'm closing this issue for now – feel free to reopen if you think this isn't warranted. I would of course prefer to know what the underlying issue was, but it
Gitea Version
1.15.8
Git Version
2.34.1
Operating System
FreeBSD
How are you running Gitea?
Installed from a custom package repository built by a local poudriere.
The same problem occurred with versions from the official FreeBSD pkg repo.
Gitea is running within a jail and has an nginx running in the same jail in front of it as a transparent proxy.
Database
PostgreSQL
Can you reproduce the bug on the Gitea demo site?
No
Log Gist
I'm not sure what that even means. Do you want a paste?
Description
I've been experiencing recurring freezes where only
killall -9 gitea
helps, for about a year now. This happens from within a couple of hours to a couple of weeks after the service is started.
I cannot reproduce this issue wilfully, but if I wait long enough it always shows up.
I have previously tried setting the log level to Debug, but haven't seen anything that
would tell me what the actual issue is. As this happened last summer, I don't have
those particular logs anymore, but have re-escalated my log level to Debug to be able
to attach a log when this bug next hits.
My latest attempt at figuring out what's going on was to enable gitea's metrics and have
them graphed in Grafana, but every single metric just shows a straight line up to the point
when gitea freezes, which is when data stops coming in.
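For completeness, the metrics endpoint being scraped here is enabled through app.ini; a sketch (the token is optional and the value shown is a placeholder):

```ini
[metrics]
ENABLED = true
; Optional bearer token that the scraper must present when requesting /metrics.
TOKEN   = changeme
```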
In the meantime, until I can attach a log: any hunches on what might trigger this behavior?
Any other data I can supply to help triage this issue?
Best wishes,
phryk
Screenshots
No response