
Recurring freezes on BSD #18180

Closed
phryk opened this issue Jan 4, 2022 · 34 comments
Labels
issue/needs-feedback For bugs, we need more details. For features, the feature must be described in more detail type/upstream This is an issue in one of Gitea's dependencies and should be reported there

Comments

@phryk

phryk commented Jan 4, 2022

Gitea Version

1.15.8

Git Version

2.34.1

Operating System

FreeBSD

How are you running Gitea?

Installed from a custom package repository built by a local poudriere.
The same problem occurred with versions from the official FreeBSD pkg repo.

Gitea is running within a jail, with nginx running in the same jail in front of it as a transparent proxy.

Database

PostgreSQL

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

I'm not sure what that even means. Do you want a paste?

Description

For about a year now, I've been experiencing recurring freezes where only killall -9 gitea helps.
This happens anywhere from a couple of hours to a couple of weeks after the service is started.

I cannot reproduce this issue at will, but if I wait long enough it always shows up.

I have previously tried setting the log level to Debug, but haven't seen anything that
would tell me what the actual issue is. As this happened last summer, I don't have
those particular logs anymore, but have re-escalated my log level to Debug to be able
to attach a log when this bug next hits.

My latest attempt at figuring out what's going on was to enable gitea's metrics and have
them graphed in Grafana, but every single metric just shows a straight line up to the point
when gitea freezes, which is when data stops coming in.

In the meantime, until I can attach a log: any hunches on what might trigger this behavior?
Any other data I can supply to help triage this issue?

Best wishes,

phryk

Screenshots

No response

@lunny
Member

lunny commented Jan 4, 2022

How did you compile Gitea yourself?
And could you capture the CPU, disk, and memory usage when Gitea freezes?

@phryk
Author

phryk commented Jan 4, 2022

It's built by poudriere, which is a bulk package builder for FreeBSD that
uses FreeBSD's ports system to do the actual building.
You can find out more about it here: https://github.com/freebsd/poudriere/wiki

I sadly don't have data for actual disk usage, but here are graphs for the other metrics as well as gitea and nginx (the green one is gitea, or rather the transparent proxy for it):

[image: combined resource-usage graphs]

@phryk
Author

phryk commented Jan 4, 2022

I should probably also mention that no other services seem to be affected by this, so it looks like it's not hogging any system resources (or rather not any used by anything else) when it freezes – it just stops doing stuff.

@zeripath
Contributor

zeripath commented Jan 4, 2022

This sounds like a deadlock happening somewhere but postgres isn't usually a DB that suffers from them. (SQLite is usually the way we detect these.) Apart from DB deadlocks I'm not sure there's any other obvious thing that would cause a deadlock.

One problem in 1.15.7 and earlier was that git cat-file --batch wouldn't fail fast if the git repo was broken, but #17992 should prevent that - is this definitely still happening on 1.15.8?

If so, a few thoughts:

  • Do you have pprof enabled? (Be careful with this: on 1.15 pprof will write files to PPROF_DATA_PATH every time people ssh to your server.) It may be helpful to poll the cpuprofile to see if we can catch the problem as it happens (see the sketch just after this list).
  • Have you been able to rule out a network port exhaustion problem? Or open files problem?
  • In the router logs have you seen any common requests that aren't complete?
  • I worry that what we're seeing might be a bug in go on FreeBSD - but I just don't know how to prove that.
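
For anyone wanting to follow the cpuprofile-polling suggestion in the first bullet: below is a minimal, hypothetical helper (not part of Gitea) that periodically fetches a CPU profile from the standard net/http/pprof endpoint and keeps timestamped copies, so the last few profiles before a freeze survive. It assumes pprof is reachable on 127.0.0.1:6060, which should be where Gitea's pprof listener sits when ENABLE_PPROF is on - adjust the address and output directory to your setup.

// pollpprof.go - hypothetical helper, not part of Gitea: repeatedly captures a
// 30-second CPU profile from a running net/http/pprof endpoint and keeps
// timestamped copies so the last few profiles before a freeze survive.
package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "os"
    "time"
)

// capture downloads one CPU profile from url and writes it to a timestamped file in dir.
func capture(url, dir string) error {
    resp, err := http.Get(url) // blocks for the duration of the profile (30s here)
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("unexpected status %s", resp.Status)
    }

    name := fmt.Sprintf("%s/cpu-%s.pprof", dir, time.Now().Format("20060102-150405"))
    f, err := os.Create(name)
    if err != nil {
        return err
    }
    defer f.Close()

    _, err = io.Copy(f, resp.Body)
    return err
}

func main() {
    // Assumed address: Gitea's pprof listener should be on localhost:6060 when ENABLE_PPROF is on.
    const url = "http://127.0.0.1:6060/debug/pprof/profile?seconds=30"
    for {
        if err := capture(url, "."); err != nil {
            log.Printf("capture failed: %v", err)
        }
        time.Sleep(30 * time.Second) // pause between captures; tune to taste
    }
}

Once a freeze hits, a goroutine dump fetched the same way from /debug/pprof/goroutine?debug=2 would show where every goroutine is blocked, which is probably the single most useful artefact for this kind of hang.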

@phryk
Author

phryk commented Jan 4, 2022

  • Other services using PostgreSQL (with other databases) are unaffected, not sure if that means a DB deadlock can be excluded.
  • Yes, definitely happening on 1.15.8
  • Well, now I have pprof enabled, will report back when the bug hits again
  • I'm not quite sure what you mean by router logs or how I would tell that a request didn't complete
    • grep -i error gitea.log shows nothing of interest, the only legitimate errors are from gitea being unable
      to update a mirror whose source repo got DMCA'd…

@zeripath
Contributor

zeripath commented Jan 7, 2022

The logs appear somewhat confusing:

2022/01/04 23:11:01 ...s/context/context.go:185:HTML() [D] Template: repo/commits
2022/01/04 23:11:01 models/repo.go:504:ComposeMetas() [I] [SQL] SELECT team.lower_name FROM "team_repo" INNER JOIN "team" ON team.id = team_repo.team_id WHERE (team_repo.repo_id = $1) ORDER BY team.lower_name [1] - 268.794µs
2022/01/04 23:11:02 ...s/context/context.go:755:1() [D] Session ID: <crawler>
2022/01/04 23:11:02 ...s/context/context.go:756:1() [D] CSRF Token: <crawler>
2022/01/04 23:11:02 ...rm.io/xorm/engine.go:1139:Get() [I] [SQL] SELECT "id", "lower_name", "name", "full_name", "email", "keep_email_private", "email_notifications_preference", "passwd", "passwd_hash_algo", "must_change_password", "login_type", "login_source", "login_name", "type", "location", "website", "rands", "salt", "language", "description", "created_unix", "updated_unix", "last_login_unix", "last_repo_visibility", "max_repo_creation", "is_active", "is_admin", "is_restricted", "allow_git_hook", "allow_import_local", "allow_create_organization", "prohibit_login", "avatar", "avatar_email", "use_custom_avatar", "num_followers", "num_following", "num_stars", "num_repos", "num_teams", "num_members", "visibility", "repo_admin_change_team_access", "diff_view_style", "theme", "keep_activity_private" FROM "user" WHERE "lower_name"=$1 LIMIT 1 [owner] - 463.192µs
2022/01/04 23:11:02 ...rm.io/xorm/engine.go:1139:Get() [I] [SQL] SELECT "id", "owner_id", "owner_name", "lower_name", "name", "description", "website", "original_service_type", "original_url", "default_branch", "num_watches", "num_stars", "num_forks", "num_issues", "num_closed_issues", "num_pulls", "num_closed_pulls", "num_milestones", "num_closed_milestones", "num_projects", "num_closed_projects", "is_private", "is_empty", "is_archived", "is_mirror", "status", "is_fork", "fork_id", "is_template", "template_id", "size", "is_fsck_enabled", "close_issues_via_commit_in_any_branch", "topics", "trust_model", "avatar", "created_unix", "updated_unix" FROM "repository" WHERE "owner_id"=$1 AND "lower_name"=$2 LIMIT 1 [2 repo] - 413.763µs
2022/01/04 23:11:02 models/repo_unit.go:210:getUnitsByRepoID() [I] [SQL] SELECT "id", "repo_id", "type", "config", "created_unix" FROM "repo_unit" WHERE (repo_id = $1) [1] - 183.034µs
2022/01/04 23:11:02 ...s/repo_pushmirror.go:97:GetPushMirrorsByRepoID() [I] [SQL] SELECT "id", "repo_id", "remote_name", "interval", "created_unix", "last_update", "last_error" FROM "push_mirror" WHERE (repo_id=$1) [1] - 162.754µs
2022/01/04 23:11:02 ...ules/context/repo.go:470:RepoAssignment() [I] [SQL] SELECT count(*) FROM "release" WHERE repo_id=$1 AND is_draft=$2 [1 false] - 194.067µs
2022/01/04 23:11:02 ...ules/context/repo.go:477:RepoAssignment() [I] [SQL] SELECT count(*) FROM "release" WHERE repo_id=$1 AND is_draft=$2 AND is_tag=$3 [1 false false] - 196.112µs
2022/01/04 23:11:02 ...dules/git/command.go:120:RunInDirTimeoutEnvFullPipelineFunc() [D] /var/db/gitea/gitea-repositories/owner/repo.git: /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-parse
2022/01/04 23:11:02 ...dules/git/command.go:120:RunInDirTimeoutEnvFullPipelineFunc() [D] /var/db/gitea/gitea-repositories/owner/repo.git: /usr/local/bin/git -c credential.helper= -c protocol.version=2 show-ref --tags
2022/01/04 23:11:02 ...dules/git/command.go:120:RunInDirTimeoutEnvFullPipelineFunc() [D] /var/db/gitea/gitea-repositories/owner/repo.git: /usr/local/bin/git -c credential.helper= -c protocol.version=2 cat-file --batch
2022/01/04 23:11:02 ...dules/git/command.go:120:RunInDirTimeoutEnvFullPipelineFunc() [D] /var/db/gitea/gitea-repositories/owner/repo.git: /usr/local/bin/git -c credential.helper= -c protocol.version=2 cat-file --batch-check

This implies that the problem has occurred somewhere in:

tags, err := ctx.Repo.GitRepo.GetTags()
if err != nil {
    ctx.ServerError("GetTags", err)
    return
}
ctx.Data["Tags"] = tags
brs, _, err := ctx.Repo.GitRepo.GetBranches(0, 0)

But the problem is that there isn't really any place for a deadlock to occur in there.

My thoughts for progressing this further are to apply the following:

diff --git a/modules/git/command.go b/modules/git/command.go
index d83c42fdc..273bea632 100644
--- a/modules/git/command.go
+++ b/modules/git/command.go
@@ -120,6 +120,8 @@ func (c *Command) RunInDirTimeoutEnvFullPipelineFunc(env []string, timeout time.
 		log.Debug("%s: %v", dir, c)
 	}
 
+	defer log.Debug("Done %s: %v", dir, c)
+
 	ctx, cancel := context.WithTimeout(c.parentContext, timeout)
 	defer cancel()
 
diff --git a/modules/git/repo_branch_nogogit.go b/modules/git/repo_branch_nogogit.go
index 666ca81c1..b9a7a483f 100644
--- a/modules/git/repo_branch_nogogit.go
+++ b/modules/git/repo_branch_nogogit.go
@@ -68,6 +68,9 @@ func (repo *Repository) GetBranches(skip, limit int) ([]string, int, error) {
 
 // callShowRef return refs, if limit = 0 it will not limit
 func callShowRef(repoPath, prefix, arg string, skip, limit int) (branchNames []string, countAll int, err error) {
+	log.Debug("callShowRef %s %s %s %d %d", repoPath, prefix, arg, skip, limit)
+	defer log.Debug("done: callShowRef %s %s %s %d %d", repoPath, prefix, arg, skip, limit)
+
 	stdoutReader, stdoutWriter := io.Pipe()
 	defer func() {
 		_ = stdoutReader.Close()
diff --git a/modules/process/manager.go b/modules/process/manager.go
index e42e38a0f..7ea6a73e6 100644
--- a/modules/process/manager.go
+++ b/modules/process/manager.go
@@ -14,6 +14,8 @@ import (
 	"sort"
 	"sync"
 	"time"
+
+	"code.gitea.io/gitea/modules/log"
 )
 
 // TODO: This packages still uses a singleton for the Manager.
@@ -56,6 +58,7 @@ func GetManager() *Manager {
 
 // Add a process to the ProcessManager and returns its PID.
 func (pm *Manager) Add(description string, cancel context.CancelFunc) int64 {
+	log.Debug("Add(%s)", description)
 	pm.mutex.Lock()
 	pid := pm.counter + 1
 	pm.processes[pid] = &Process{
@@ -67,6 +70,7 @@ func (pm *Manager) Add(description string, cancel context.CancelFunc) int64 {
 	pm.counter = pid
 	pm.mutex.Unlock()
 
+	log.Debug("Done Add(%s) PID: %d", description, pid)
 	return pid
 }
 

That would help us to see if there is a deadlock somewhere.

One final random thought is maybe the problem is in log/file.go - I guess it's worth another review.

@zeripath
Contributor

Have you seen any further freezes?

@phryk
Author

phryk commented Jan 16, 2022

Yes and no. The bug seems to have triggered again, but this time the process actually managed to die.
Logs are on their way to you.

@zeripath
Contributor

OK, so somewhat relievingly, the problem is not in process.GetManager().Add(), and the fact that the error has occurred in the same place suggests that logging is not to blame.

The problem lies somewhere in:

ctx, cancel := context.WithTimeout(c.parentContext, timeout)
defer cancel()

cmd := exec.CommandContext(ctx, c.name, c.args...)
if env == nil {
    cmd.Env = os.Environ()
} else {
    cmd.Env = env
}

cmd.Env = append(
    cmd.Env,
    fmt.Sprintf("LC_ALL=%s", DefaultLocale),
    // avoid prompting for credentials interactively, supported since git v2.3
    "GIT_TERMINAL_PROMPT=0",
)

// TODO: verify if this is still needed in golang 1.15
if goVersionLessThan115 {
    cmd.Env = append(cmd.Env, "GODEBUG=asyncpreemptoff=1")
}

cmd.Dir = dir
cmd.Stdout = stdout
cmd.Stderr = stderr
cmd.Stdin = stdin
if err := cmd.Start(); err != nil {
    return err
}

desc := c.desc
if desc == "" {
    desc = fmt.Sprintf("%s %s %s [repo_path: %s]", GitExecutable, c.name, strings.Join(c.args, " "), dir)
}

The only places left therefore are:

  1. line 123: ctx, cancel := context.WithTimeout(c.parentContext, timeout) is this deadlocking somehow?
  2. line 126: cmd := exec.CommandContext(ctx, c.name, c.args...) this requires stat-ing the filesystem. Is the filesystem locking for some reason?
  3. line 148: if err := cmd.Start(); err != nil { this involves starting the process itself, so maybe there is some issue with BSD process creation here.

None of these look greatly soluble; I guess adding some logging to this section of code is the only thing to do to move closer to working out what the possible reason is.


diff --git a/modules/git/command.go b/modules/git/command.go
index d83c42fdc..71c398aae 100644
--- a/modules/git/command.go
+++ b/modules/git/command.go
@@ -120,9 +120,13 @@ func (c *Command) RunInDirTimeoutEnvFullPipelineFunc(env []string, timeout time.
 		log.Debug("%s: %v", dir, c)
 	}
 
+	defer log.Debug("Done %s: %v", dir, c)
+
 	ctx, cancel := context.WithTimeout(c.parentContext, timeout)
 	defer cancel()
 
+	log.Debug("%s: %v created context", dir, c)
+
 	cmd := exec.CommandContext(ctx, c.name, c.args...)
 	if env == nil {
 		cmd.Env = os.Environ()
@@ -130,6 +134,8 @@ func (c *Command) RunInDirTimeoutEnvFullPipelineFunc(env []string, timeout time.
 		cmd.Env = env
 	}
 
+	log.Debug("%s: %v created CommandContext", dir, c)
+
 	cmd.Env = append(
 		cmd.Env,
 		fmt.Sprintf("LC_ALL=%s", DefaultLocale),
@@ -149,6 +155,8 @@ func (c *Command) RunInDirTimeoutEnvFullPipelineFunc(env []string, timeout time.
 		return err
 	}
 
+	log.Debug("%s: %v started", dir, c)
+
 	desc := c.desc
 	if desc == "" {
 		desc = fmt.Sprintf("%s %s %s [repo_path: %s]", GitExecutable, c.name, strings.Join(c.args, " "), dir)
diff --git a/modules/git/repo_branch_nogogit.go b/modules/git/repo_branch_nogogit.go
index 666ca81c1..b9a7a483f 100644
--- a/modules/git/repo_branch_nogogit.go
+++ b/modules/git/repo_branch_nogogit.go
@@ -68,6 +68,9 @@ func (repo *Repository) GetBranches(skip, limit int) ([]string, int, error) {
 
 // callShowRef return refs, if limit = 0 it will not limit
 func callShowRef(repoPath, prefix, arg string, skip, limit int) (branchNames []string, countAll int, err error) {
+	log.Debug("callShowRef %s %s %s %d %d", repoPath, prefix, arg, skip, limit)
+	defer log.Debug("done: callShowRef %s %s %s %d %d", repoPath, prefix, arg, skip, limit)
+
 	stdoutReader, stdoutWriter := io.Pipe()
 	defer func() {
 		_ = stdoutReader.Close()
diff --git a/modules/process/manager.go b/modules/process/manager.go
index e42e38a0f..7ea6a73e6 100644
--- a/modules/process/manager.go
+++ b/modules/process/manager.go
@@ -14,6 +14,8 @@ import (
 	"sort"
 	"sync"
 	"time"
+
+	"code.gitea.io/gitea/modules/log"
 )
 
 // TODO: This packages still uses a singleton for the Manager.
@@ -56,6 +58,7 @@ func GetManager() *Manager {
 
 // Add a process to the ProcessManager and returns its PID.
 func (pm *Manager) Add(description string, cancel context.CancelFunc) int64 {
+	log.Debug("Add(%s)", description)
 	pm.mutex.Lock()
 	pid := pm.counter + 1
 	pm.processes[pid] = &Process{
@@ -67,6 +70,7 @@ func (pm *Manager) Add(description string, cancel context.CancelFunc) int64 {
 	pm.counter = pid
 	pm.mutex.Unlock()
 
+	log.Debug("Done Add(%s) PID: %d", description, pid)
 	return pid
 }
 

(remember github likes to pretend the final empty line doesn't exist so if you copy this add a terminal empty line.)

@phryk
Author

phryk commented Jan 16, 2022

Patched, deployed and will get back next time the bug strikes. Thanks for the assistance. :)

@zeripath zeripath added the issue/needs-feedback For bugs, we need more details. For features, the feature must be described in more detail label Jan 17, 2022
@thearchivalone

thearchivalone commented Jan 20, 2022

@phryk thanks for posting this bug report. I'm having the same issue running Gitea in Bastille, with postgres running in the same container. I'm currently updating to the latest release, but I did notice that the hang happens the same way as yours, and I have to fully restart the server when I need to update a setting or two (I'm still getting Gitea moved over from a Linux server and configured properly).

Definitely seeing this issue with 1.15.10 on my end.

The easiest way to get it to trigger for me was to just spam service gitea restart until it freezes at the "Waiting for PIDS" message. I'm sending this to the port maintainer.

@zeripath
Contributor

@bedwardly-down as I said above this isn't looking like a Gitea bug per se. The three points of possible deadlock are all deep in go std library code and likely at system calls.

My greatest suspicion unfortunately falls on the context.WithTimeout call. If it's there then that's a serious problem and working around it will not be easy (although we could simply drop the WithTimeout, assuming WithCancel is unaffected).
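
Purely as an illustration of that parenthetical idea - and only relevant if WithTimeout itself turned out to be the culprit, which is not established - the timeout could in principle be rebuilt from context.WithCancel plus a timer. A hypothetical sketch, not a change anyone has verified to help on FreeBSD:

// Hypothetical sketch only: withTimeout rebuilds context.WithTimeout out of
// context.WithCancel plus a timer, avoiding WithTimeout's own deadline machinery.
package main

import (
    "context"
    "fmt"
    "time"
)

func withTimeout(parent context.Context, d time.Duration) (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithCancel(parent)
    timer := time.AfterFunc(d, cancel) // fire cancel once the deadline passes
    return ctx, func() {
        timer.Stop() // release the timer if the caller cancels early
        cancel()
    }
}

func main() {
    ctx, cancel := withTimeout(context.Background(), 2*time.Second)
    defer cancel()
    <-ctx.Done()
    fmt.Println(ctx.Err()) // context.Canceled here, not context.DeadlineExceeded
}

One behavioural difference: callers would see context.Canceled after the deadline rather than context.DeadlineExceeded, so any error handling keyed on the latter would need adjusting.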

@thearchivalone

thearchivalone commented Jan 20, 2022

@zeripath thanks for responding. I'm waiting on the FreeBSD port maintainer to get back to me on it, but Go apps having this kind of issue is not uncommon on FreeBSD, according to other maintainers I've been interacting with in the IRC and official Discord channels. I ran a test yesterday that kind of goes along with your assumptions too.

I'm primarily involved with the Node.js ecosystem right now, so I use the PM2 process manager and Nodemon pretty regularly. Running Gitea through PM2 and having that strapped to FreeBSD's init system showed that the port was attempting to start and stop at the exact same time in various places, causing it to deadlock. I had a similar occurrence with Caddy and a few other Go apps on my old Gentoo server (it uses a modified version of FreeBSD's init system with most of the core functionality being exactly the same between the two). Running Gitea and Caddy directly instead of through init seemed to work fine in several of my tests.

@phryk
Author

phryk commented Jan 20, 2022

Bug hit again. Logs will shortly be on their way to @zeripath; the last 3 lines from gitea.log look like this:

2022/01/20 03:24:11 ...dules/git/command.go:120:RunInDirTimeoutEnvFullPipelineFunc() [D] /var/db/gitea/gitea-repositories/mirror/foo.git: /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-list --count <somehash>
2022/01/20 03:24:11 ...dules/git/command.go:128:RunInDirTimeoutEnvFullPipelineFunc() [D] /var/db/gitea/gitea-repositories/mirror/foo.git: /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-list --count <somehash> created context
2022/01/20 03:24:11 ...dules/git/command.go:137:RunInDirTimeoutEnvFullPipelineFunc() [D] /var/db/gitea/gitea-repositories/mirror/foo.git: /usr/local/bin/git -c credential.helper= -c protocol.version=2 rev-list --count <somehash> created CommandContext

@zeripath
Contributor

OK well we've established that the problem is in starting the process:

if err := cmd.Start(); err != nil {

I think this is likely to be an os/jail problem - @bedwardly-down's comment suggests that perhaps the issue might be some deadlock in PM2 with processes being created at exactly the same time. If so, there's nothing we as gitea can do.

You could try to use the gogit variant - as this will make a lot fewer calls to git - which might reduce the issue?

@zeripath zeripath added the type/upstream This is an issue in one of Gitea's dependencies and should be reported there label Jan 20, 2022
@zeripath zeripath changed the title Recurring freezes Recurring freezes on BSD Jan 20, 2022
@thearchivalone

OK well we've established that the problem is in starting the process:

if err := cmd.Start(); err != nil {

I think this is likely to be an os/jail problem - @bedwardly-down's comment suggests that perhaps the issue might be some deadlock in PM2 with processes being created at exactly the same time. If so, there's nothing we as gitea can do.

You could try to use the gogit variant - as this will make a lot fewer calls to git - which might reduce the issue?

To further clarify, PM2 was not how I normally ran Gitea. It was a test to see what's actually happening behind the scenes with a tool that has a built-in monitor that prints some basic but useful information. It also doesn't seem to have any issues with SQLite3.

@thearchivalone

@phryk how dependent are you on getting your git up and running? A large chunk of my daily needs are built around git and version control, so this definitely was an inconvenience for me. I hope it gets fixed pretty quickly upstream.

@lunny
Member

lunny commented Jan 23, 2022

Have you caught the issue with pprof enabled? If so, could you upload the pprof report?

@thearchivalone

thearchivalone commented Jan 23, 2022

I haven’t tried or heard of that. I’ll have to tinker with that later.

@phryk
Author

phryk commented Jan 23, 2022

@lunny I've tried that before, but with the process frozen, pprof doesn't answer anymore either.
@zeripath recommended periodically polling pprof and saving the last couple of results. I might get to that in a couple of days.

It would be easier if the info collected by pprof also went into the openmetrics output, as that's already being polled. :P

@bedwardly-down I'm running a private gitea instance which I use for all my projects, so it is a bit of an inconvenience for me personally, but I don't have anyone else depending on the service.

@zeripath
Contributor

I'm really not sure pprof is going to help much. The problem appears to be in os/exec/exec.go within Go's std library itself.

os/exec/exec.go:(*Cmd).Start()

Walking through the code of Start in there points to:

	c.Process, err = os.StartProcess(c.Path, c.argv(), &os.ProcAttr{
		Dir:   c.Dir,
		Files: c.childFiles,
		Env:   addCriticalEnv(dedupEnv(envv)),
		Sys:   c.SysProcAttr,
	})

as the place where the problem is.

syscall/exec_unix.go:os.StartProcess(...)

This, on BSD, Unix, and Linux, calls forkExec in syscall/exec_unix.go. A cursory glance at this code shows:

...
	// Acquire the fork lock so that no other threads
	// create new fds that are not yet close-on-exec
	// before we fork.
	ForkLock.Lock()

	// Allocate child status pipe close on exec.
	if err = forkExecPipe(p[:]); err != nil {
		ForkLock.Unlock()
		return 0, err
	}

	// Kick off child.
	pid, err1 = forkAndExecInChild(argv0p, argvp, envvp, chroot, dir, attr, sys, p[1])
	if err1 != 0 {
		Close(p[0])
		Close(p[1])
		ForkLock.Unlock()
		return 0, Errno(err1)
	}
	ForkLock.Unlock()
...

Now, if there is a panic in there, the ForkLock could end up being left locked, but the panic should be seen in our logs and I see no evidence of this.

Which leads me to think that either forkExecPipe or forkAndExecInChild are blocking.

forkExecPipe

Now, forkExecPipe on Linux is somewhat more complex than on BSD:

BSD (syscall/forkpipe2.go):

func forkExecPipe(p []int) error {
	return Pipe2(p, O_CLOEXEC)
}

Linux (exec_linux.go):

// Try to open a pipe with O_CLOEXEC set on both file descriptors.
func forkExecPipe(p []int) (err error) {
	err = Pipe2(p, O_CLOEXEC)
	// pipe2 was added in 2.6.27 and our minimum requirement is 2.6.23, so it
	// might not be implemented.
	if err == ENOSYS {
		if err = Pipe(p); err != nil {
			return
		}
		if _, err = fcntl(p[0], F_SETFD, FD_CLOEXEC); err != nil {
			return
		}
		_, err = fcntl(p[1], F_SETFD, FD_CLOEXEC)
	}
	return
}

But the complexity here is simply falling back. So if there is a deadlock here it's in the syscall pipe2. https://www.freebsd.org/cgi/man.cgi?query=pipe2&sektion=2&format=html

The man page does not indicate that this could block but dtrace and ktrace will capture these calls.

forkAndExecInChild

The implementations on Linux and BSD are substantially different here, and this gets into the deep systems-programming level of starting processes, which I'm afraid I know little to nothing about. The file is syscall/exec_bsd.go:

	// About to call fork.
	// No more allocation or calls of non-assembly functions.
	runtime_BeforeFork()
	r1, _, err1 = RawSyscall(SYS_FORK, 0, 0, 0)
	if err1 != 0 {
		runtime_AfterFork()
		return 0, err1
	}

	if r1 != 0 {
		// parent; return PID
		runtime_AfterFork()
		return int(r1), 0
	}

Fundamentally there is an assembly call into runtime_BeforeFork, which calls systemstack(beforeFork), which tells the system stack to block signals. (Could the blocking of signals be causing a problem? Does your jail send up a signal if Fork is blocked?)

Then fork(2) is called (https://www.freebsd.org/cgi/man.cgi?fork(2)), and we check the pid returned - if we're the parent or an error occurs, it returns the pid or error after running runtime_AfterFork(), which calls systemstack(afterFork), reversing the changes made before the fork.

Remaining part of forkExec after ForkLock.Unlock()

	// Read child error status from pipe.
	Close(p[1])
	for {
		n, err = readlen(p[0], (*byte)(unsafe.Pointer(&err1)), int(unsafe.Sizeof(err1)))
		if err != EINTR {
			break
		}
	}
	Close(p[0])
	if err != nil || n != 0 {
		if n == int(unsafe.Sizeof(err1)) {
			err = Errno(err1)
		}
		if err == nil {
			err = EPIPE
		}

		// Child failed; wait for it to exit, to make sure
		// the zombies don't accumulate.
		_, err1 := Wait4(pid, &wstatus, 0, nil)
		for err1 == EINTR {
			_, err1 = Wait4(pid, &wstatus, 0, nil)
		}
		return 0, err
	}

	// Read got EOF, so pipe closed on exec, so exec succeeded.
	return pid, nil

I guess this could block - but I don't understand why it would take down the whole system.

Summary

  • The blocking has been traced to os/exec/exec.go:(*Cmd).Start()
  • It's likely to be in syscall/exec_unix.go:os.StartProcess(...) and hence forkExec in syscall/exec_unix.go.
  • The reports of the entire go process halting make me suspicious about something happening around forkAndExecInChild in syscall/exec_bsd.go.
  • In particular around the call to fork(2) either at that call or at the systemstack calls
  • This should be dtraceable or ktraceable.
  • Alternatively something happening around pipe2(2) in forkExecPipe could be blocking but again this should be dtraceable/ktraceable.
  • After the ForkLock has unlocked there is some code around reading the error from the child pipe but this shouldn't kill the whole go process.

So we're likely looking at a bug in Go's runtime, either due to some weird resource-limit handling in the jail and the way it reports issues to Go, or - even more difficult to fix - some bug in the OS around creating processes after some limit has been reached.

Maybe this golang/go#43873 is related?

You've never told us the version of FreeBSD you're running or given us any information about how you have set up the jail. Could the parameters of your jail be responsible?

@thearchivalone

thearchivalone commented Jan 23, 2022

@zeripath Thanks for looking into this as much as you have. For mine:

  • FreeBSD 13.0 p6.0 (latest stable patches and kernel)
  • Jail - BastilleBSD latest 0.9 patches
  • Postgres 13.5
  • Intel Xeon 3cpu 2.6GHz per core
  • 4 GB DDR4
  • 75GB NVMe SSD

@phryk I just had a thought: are you running postgres in the same jail as Gitea instead of separately and interacting with it through a port? To get postgres to work, it needed some modifications to its main config file to allow it to use more RAM and other system resources than most BSD jails initially allocate. I haven't tried postgres in a separate jail yet, since many online sources recommended encapsulating the DB and its software in a single unit for easier transfer and backup. I wonder if the deadlock is caused by that.

@thearchivalone

thearchivalone commented Jan 23, 2022

The issue wasn't solved by moving postgres to a separate container for me. Even running postgres through a port and connecting that way caused gitea to hang. Removing the app.ini and letting it try to generate a new one also caused it to hang, even with the /usr/local/etc/gitea/conf directory permissions set to 755 or 777. So far, it looks like SQLite is the definitive way to run it right now.

Can I get anyone else to try it with SQLite? So far, that works in my tests, but it isn't ideal if you have big repositories.

@zeripath
Contributor

@bedwardly-down are you sure that you have the same problem as @phryk? It would be useful to double-check where your hangs are happening. I cannot see why SQLite would be better for a problem relating to forking.

If you're finding that SQLite is better, could you try connecting to postgres over a unix socket instead of a TCP port, as it might be that you're suffering TCP port exhaustion instead.
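
For reference, a sketch of what a unix-socket database connection looks like in app.ini; the socket directory below is an example path, not taken from either setup, and the key names should be double-checked against the Gitea config cheat sheet for your version:

; app.ini [database] - talk to PostgreSQL over a unix socket instead of TCP (example paths)
[database]
DB_TYPE = postgres
; directory containing the PostgreSQL socket, instead of host:port
HOST    = /var/run/postgresql/
NAME    = gitea
USER    = gitea
; can stay empty when the socket uses peer/trust authentication
PASSWD  =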


Thinking again about this blocking problem relating to fork, I wonder if the block is happening because a page fault is occurring in fork and the signal cannot be handled?

@phryk what version of Go are you building Gitea with? Please ensure that Go is the most recent version. It might help to set allow.mlock in the jail.

@thearchivalone

@zeripath honestly, it may be different. There's not enough info from the original poster to get a clear picture, the only common thread we have is postgres, and I'm grasping at straws using my own limited knowledge of how gitea and its database support works.

@zeripath
Contributor

@bedwardly-down could you apply the patch in #18180 (comment) and check your logs for when the deadlock occurs, to see if it's at the same place.

If you find that the final log line is the same as in @phryk's logs, you're hitting the same problem. If not - well, then we get to find another bug.

If switching to SQLite helps, then port exhaustion is more likely. Similarly, if using a unix socket for postgres helps, then it's far more likely to be a port-exhaustion problem.

@thearchivalone

thearchivalone commented Jan 23, 2022

I didn't apply the patch. Running it straight through the unix socket inside the same container finally solved my issue. Looking at the system logs from when I was connecting through a port (whether from a separate container or locally), there was no indication that my permissions within Postgres were wrong. I fixed them and can now run Gitea with no issues. So I'll have to agree that mine was probably port exhaustion, as you suggested.

@zeripath
Contributor

zeripath commented Jan 23, 2022

If you're using an http/https proxy like nginx, you should also be able to make Gitea run as http+unix, which would then eliminate port exhaustion (except through migrations) as a cause for problems. (Well, I say "eliminate" - I mean stop Gitea from being the cause; the port exhaustion could still happen in nginx.)
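
For reference, a sketch of the [server] side of that in app.ini; the socket path is an example, and the setting names should be verified against the config cheat sheet for your Gitea version. nginx would then proxy_pass to something like http://unix:/var/run/gitea/gitea.sock instead of a TCP address.

; app.ini [server] - listen on a unix socket instead of a TCP port (example path)
[server]
PROTOCOL               = http+unix
HTTP_ADDR              = /var/run/gitea/gitea.sock
; socket file mode, so the reverse-proxy user can reach it
UNIX_SOCKET_PERMISSION = 666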


Actually it looks like Go is now setting SO_REUSEADDR so our server listening shouldn't be affected by port exhaustion - perhaps the DB connection sockets aren't set with SO_REUSEADDR?

@thearchivalone

thearchivalone commented Jan 23, 2022

I’m using Caddy in a separate container and reverse proxying straight from the local container ports. I don't believe Bastille supports unix sockets right now but that's possibly in the works. I have under 10 containers so far and caddy is only currently serving 3 while the rest are completely internal. That’s not good. I’ll have to research a bit on how to wrangle the containers into a single socket. Thanks.

@tsowa

tsowa commented Jan 25, 2022

I have been using gitea for more than a year and have never had such problems. My configuration:

$ uname -srm
FreeBSD 13.0-STABLE amd64

Jails were created with make distribution and make installworld from /usr/src; actually I am using one jail as a template and then only tar -x to the specific location. The main system (host) is built from source; after the host is built, the jails are updated with the same binaries: I am using a custom script which does mergemaster, make installworld, and make delete-old(libs) to the specific jail location with the -D and DESTDIR parameters. Packages are built in a different jail (I am not using poudriere) and are available to other jails through a custom pkg repository.

From pf (packet filter) I am making a redirect to a jail with nginx:

rdr on $ext_if proto tcp from any to {$main_ext_ip, $main_ext_ip2} port {80, 443} tag X51REDIRECT -> $jail_nginx_ip
rdr on lo0 proto tcp from {$main_ext_ip, $main_ext_ip2, 192.168.1.0/24} to {$main_ext_ip, $main_ext_ip2} port {80, 443} tag X51REDIRECT -> $jail_nginx_ip

pass in quick on $ext_if proto tcp from any to $jail_nginx_ip port { 80, 443 } flags S/SA keep state (if-bound) queue (www_out, ack_out)
pass in quick on lo0 proto tcp from {$main_ext_ip, $main_ext_ip2} to $jail_nginx_ip port { 80, 443 } flags S/SA keep state (if-bound) queue (www_out, ack_out)
pass out quick on lo0 proto tcp from 192.168.1.0/24 to {$main_ext_ip, $main_ext_ip2} port { 80, 443 } flags S/SA keep state (if-bound)
pass in  quick on lo0 proto tcp from 192.168.1.0/24 to $jail_nginx_ip port { 80, 443 } flags S/SA keep state (if-bound) queue (www_out, ack_out)

In the jail the nginx config looks like:

   server {
        listen      192.168.1.61:80;
        listen      192.168.1.61:443 ssl;

        server_name gitea.ttmath.org;
        client_max_body_size 300M;

        ssl_certificate     /etc/ssl/ttmath.org/certs.pem;
        ssl_certificate_key /etc/ssl/ttmath.org/privatekey.pem;
        ssl_dhparam         /etc/ssl/ttmath.org/dhparam.pem;

        location / {
            proxy_pass       http://192.168.1.93:3002;
            proxy_read_timeout 360s;
            include          proxy-headers.conf;

            if ($scheme = http) {
               return 301 https://$host$request_uri;
            }
        }

        include location-well-known.conf;
        include ssl-params.conf;
    }

proxy-headers.conf:

proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;

ssl-params.conf:

ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
ssl_prefer_server_ciphers on;
ssl_session_cache    shared:SSL:1m;
ssl_session_timeout  5m;
ssl_ciphers  HIGH:!aNULL:!MD5;

Nginx makes a proxy to the jail where gitea is installed together with postgresql.

$ pkg info -x gitea postgresql
gitea-1.15.10
postgresql13-client-13.3
postgresql13-server-13.4_1

My jails config /etc/jail.conf:

nginx {
        exec.start = "/bin/sh /etc/rc";
        exec.stop = "/bin/sh /etc/rc.shutdown";
        exec.clean;
        mount.devfs;
        host.hostname = "nginx.ttmath.org";
        
        path = "/home/jails/nginx";
        ip4.addr = "192.168.1.61";
        allow.mount = "true";
        allow.raw_sockets="true";
        allow.sysvipc="true";

        linux = "new";
        sysvmsg = "new";
        sysvsem = "new";
        sysvshm = "new";
}

gitea {
        exec.start = "/bin/sh /etc/rc";
        exec.stop = "/bin/sh /etc/rc.shutdown";
        exec.clean;
        mount.devfs;

        host.hostname = "gitea.ttmath.org";
        
        path = "/2/jails/gitea";
        ip4.addr = "192.168.1.93";
        allow.mount = "true";
        allow.raw_sockets="true";
        allow.sysvipc="true";
        allow.chflags="true";

        linux = "new";
        sysvmsg = "new";
        sysvsem = "new";
        sysvshm = "new";
}

Make sure you have sysv* set in your config if you are using postgresql.

My limits for the gitea jail:

# rctl -h jail:gitea
jail:gitea:maxproc:deny=500
jail:gitea:openfiles:deny=49K
jail:gitea:memoryuse:deny=6144M

and the average resource use:

# rctl -h -u jail:gitea
cputime=39K
datasize=13M
stacksize=0
coredumpsize=0
memoryuse=929M
memorylocked=0
maxproc=28
openfiles=8968
vmemoryuse=5357M
pseudoterminals=1
swapuse=0
nthr=111
msgqqueued=0
msgqsize=0
nmsgq=0
nsem=0
nsemop=0
nshm=1
shmsize=4096
wallclock=21M
pcpu=0
readbps=0
writebps=0
readiops=0
writeiops=0

In fact, there are three separate instances of gitea in the jail - each with a custom startup script. But the startup scripts are based on the standard /usr/local/etc/rc.d/gitea so this should not be the problem.

Someone said above that this error can be provoked by a service restart, so I tried:

sh -c 'n=1000; while [ $n -ne 0 ] ; do service gitea3 restart ; n=`expr $n - 1` ; done'

but nothing went wrong. Maybe this depends on the kind of git repository. If you wish, I can prepare an empty gitea instance for you and you can then interact with it to see if this problem occurs.

@tsowa

tsowa commented Jan 25, 2022

For completeness: I am not using ZFS but UFS with soft-updates:

$ mount | grep gm1s1a
/dev/mirror/gm1s1a on /2 (ufs, local, noatime, soft-updates)

In the past I used soft-update journaling, but there was a bug in it which caused the file system to be locked (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=224292), so I have changed the filesystem configuration so that UFS only uses soft-updates now.

@phryk
Author

phryk commented Jan 27, 2022

My setup looks a bit different. I'm not really sure what
(if anything) of this is useful to know, but here you go:

  • Intel Core i7-2600
  • 16G non-ECC RAM
  • 2x 3TB HDD
  • 128GB SSD
  • some RealTek NIC

Set up on this is a custom FreeBSD (12.2) install with redundant ZFS pools
and a hybrid HDD/SSD UFS mirror (which is where database-specific data lives).
All drives use 256 bit AES-XTS encryption via geom_eli, accelerated with AESNI.

All currently deployed packages are built by a poudriere run on the host OS,
but this is a relatively recent change on this setup and the bug already happened
way before it.

I'm running a custom thinjail setup using /etc/jail.conf in which I have just one
base-system (/jail/base) that's being mounted read-only into all jail roots with
nullfs and another read-write nullfs mount of /jail/rw/<jailname> for most
other stuff on top – the Postgres data directory being a notable exception, as
that's yet another read-write nullfs mount.

All jails are on an extra loopback interface, lo1, so communications to and
from them can't reach the internet without going through the firewall (pf).

I use nginx as a reverse proxy with gitea running on TCP port 3000 in the same
jail (http). PostgreSQL however runs in a separate jail (database) and is
only exposed through a UNIX socket in a directory that is nullfs-mounted into
http so services in that jail can access the database (this offers pretty neat and
granular access control).

I think I already said this to @zeripath on IRC, but it might be worth reiterating,
that I kind of get the feeling that the root cause of my problem might be buried
somewhere in nullfs.

@feld

feld commented Jan 27, 2022

I also use the nullfs trick to mount postgres, mysql, and syslogd sockets into jails but have never encountered issues with it FWIW

edit: what is your SSD? Some have small buffers that cause the entire device to choke if you fill it completely with writes

@phryk
Author

phryk commented Mar 29, 2022

30 days ago, I did a (minor) FreeBSD update which contained a new kernel
version and updated gitea to 1.15.10. Go might also have been updated,
though I'm not sure about that – it's now at 1.17.6. I have seen no freezes since.

I'm closing this issue for now – feel free to reopen if you think this isn't warranted.

I would of course prefer to know what the underlying issue was, but it
sounds like that would require investing a huge amount of time. :P

@phryk phryk closed this as completed Mar 29, 2022
@go-gitea go-gitea locked and limited conversation to collaborators Apr 28, 2022