Skip to content

Conversation

@jianghao65536
Copy link
Contributor

Before version 254, systemd used the glibc macro SOMAXCONN to set the backlog. Prior to Linux 5.4, SOMAXCONN was set to 128. If the number of instantaneous connections exceeds this value, newConnection will return EAGAIN, which can cause runc to fail when creating containers.

@jianghao65536 jianghao65536 force-pushed the eagain-retry branch 2 times, most recently from 350f6fc to c391dec Compare October 9, 2025 17:25
Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the test from here (without the fix, just the test case) on three different distros:

  • Fedora 42: kernel 6.16.10-200.fc42.x86_64, systemd 257.9-2.fc42
  • Alma 9: kernel 5.14.0-570.37.1.el9_6.x86_64, systemd 252-51.el9.alma.1
  • Alma 8: kernel 4.18.0-553.16.1.el8_10.x86_64, systemd 239-82.el8_10.1

and it does not fail.

This might be because Red Hat has backported systemd fix, so what distro do you recommend to try so we can see the test fails before the fix?

@jianghao65536
Copy link
Contributor Author

@kolyshkin I can reproduce the issue using Alma 8. The reproducible Vagrantfile is as follows:

Vagrant.configure("2") do |config|
  config.vm.box = "almalinux/8"
  config.vm.hostname = "alma8-vm"

  config.vm.provider "virtualbox" do |vb|
    vb.name = "AlmaLinux8_Vagrant"
    vb.memory = "2048"
    vb.cpus = 2
  end

  config.vm.provision "shell", inline: <<-SHELL
    sudo dnf install -y go
    go version
  SHELL
end

The test result is as follows:
image

I haven’t tested it on many kernel versions, but based on the commit messages of systemd and glibc, it seems that systemd compiled with glibc versions earlier than 2.31 may encounter the EAGAIN issue.
Systemd related link: systemd/systemd#28018
Glibc related link: bminor/glibc@96958e2

@AkihiroSuda AkihiroSuda added this to the 0.0.6 milestone Oct 28, 2025
systemd/dbus.go Outdated
Comment on lines 93 to 94
// despite retries, still EAGAIN
return nil, fmt.Errorf("dbus connection failed after 5 retries: %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is flipped in a slightly confusing way, I'm going to push a more straightforward version.

systemd/dbus.go Outdated
}
if retry < 4 {
// Exponential backoff with jitter: base delay doubles each retry, plus random jitter up to 1 second
baseDelay := time.Second << retry // 1s, 2s, 4s, 8s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

15 seconds with the first step being 1s seems a little aggressive -- I would expect the first retry to be more like 100ms or something. runc generally runs on the order of 100ms so I would not expect the base delay to be this long.

I think that 100ms * 2^retry + 12.5% jitter is more reasonable. I'm not even sure the jitter is necessary, but I've kept it.

"testing"
)

func TestParallelConnection(t *testing.T) {
Copy link
Member

@cyphar cyphar Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Funnily enough, this test actually crashes Firefox on my machine if you run it in rootless mode as your user:

[1522754.178417] [   T3250] wireplumber[3250]: segfault at 8 ip 00007f8b0097315d sp 00007ffd1e267fe8 error 4 in libgobject-2.0.so.0.8600.0[3d15d,7f8b00946000+36000] likely on CPU 14 (core 7, socket 0)
[1522754.178426] [   T3250] Code: 0f 1f 44 00 00 48 c1 ed 02 48 8d 05 5d 62 02 00 48 8b 34 e8 e9 48 ff ff ff 0f 1f 40 00 48 85 ff 74 4b 48 8b 07 48 85 c0 74 43 <48> 8b 00 48 3d fc 03 00 00 77 28 48 8d 15 31 62 02 00 48 c1 e8 02

Signed-off-by: jianghao65536 <jianghao65536@gmail.com>
[cyphar: gofumpt systemd/dbus_test.go]
[cyphar: simplify retry loop to return from inside loop]
[cyphar: improve exponential backoff to be less aggressive]
[cyphar: improve parallel test]
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Comment on lines 21 to 43
var wg sync.WaitGroup
errChan := make(chan error, len(dms))
for _, dm := range dms {
wg.Add(1)
go func(dm *dbusConnManager) {
defer wg.Done()
select {
case <-ctx.Done():
return
default:
_, err := dm.newConnection()
if err != nil {
errChan <- err
cancel()
}
}
}(dm)
}
wg.Wait()
close(errChan)
for err := range errChan {
t.Fatal(err)
}
Copy link
Member

@cyphar cyphar Oct 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't actually test all connections being created at the same time, and the usage of errChan is a little suspect. I will push a different version.

@cyphar cyphar dismissed lifubang’s stale review October 28, 2025 06:08

Suggestion implemented.

@cyphar
Copy link
Member

cyphar commented Oct 28, 2025

@kolyshkin let me know if you're happy with the new backoff setup I pushed.

Copy link
Contributor

@kolyshkin kolyshkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@kolyshkin kolyshkin merged commit c63eee3 into opencontainers:main Oct 29, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants