-
Notifications
You must be signed in to change notification settings - Fork 25
systemd: retry when the dbus connection returns EAGAIN #45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
350f6fc to
c391dec
Compare
kolyshkin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the test from here (without the fix, just the test case) on three different distros:
- Fedora 42: kernel 6.16.10-200.fc42.x86_64, systemd 257.9-2.fc42
- Alma 9: kernel 5.14.0-570.37.1.el9_6.x86_64, systemd 252-51.el9.alma.1
- Alma 8: kernel 4.18.0-553.16.1.el8_10.x86_64, systemd 239-82.el8_10.1
and it does not fail.
This might be because Red Hat has backported systemd fix, so what distro do you recommend to try so we can see the test fails before the fix?
c391dec to
6db5fc0
Compare
|
@kolyshkin I can reproduce the issue using Alma 8. The reproducible Vagrantfile is as follows: The test result is as follows: I haven’t tested it on many kernel versions, but based on the commit messages of systemd and glibc, it seems that systemd compiled with glibc versions earlier than 2.31 may encounter the EAGAIN issue. |
systemd/dbus.go
Outdated
| // despite retries, still EAGAIN | ||
| return nil, fmt.Errorf("dbus connection failed after 5 retries: %w", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic is flipped in a slightly confusing way, I'm going to push a more straightforward version.
systemd/dbus.go
Outdated
| } | ||
| if retry < 4 { | ||
| // Exponential backoff with jitter: base delay doubles each retry, plus random jitter up to 1 second | ||
| baseDelay := time.Second << retry // 1s, 2s, 4s, 8s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
15 seconds with the first step being 1s seems a little aggressive -- I would expect the first retry to be more like 100ms or something. runc generally runs on the order of 100ms so I would not expect the base delay to be this long.
I think that 100ms * 2^retry + 12.5% jitter is more reasonable. I'm not even sure the jitter is necessary, but I've kept it.
| "testing" | ||
| ) | ||
|
|
||
| func TestParallelConnection(t *testing.T) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funnily enough, this test actually crashes Firefox on my machine if you run it in rootless mode as your user:
[1522754.178417] [ T3250] wireplumber[3250]: segfault at 8 ip 00007f8b0097315d sp 00007ffd1e267fe8 error 4 in libgobject-2.0.so.0.8600.0[3d15d,7f8b00946000+36000] likely on CPU 14 (core 7, socket 0)
[1522754.178426] [ T3250] Code: 0f 1f 44 00 00 48 c1 ed 02 48 8d 05 5d 62 02 00 48 8b 34 e8 e9 48 ff ff ff 0f 1f 40 00 48 85 ff 74 4b 48 8b 07 48 85 c0 74 43 <48> 8b 00 48 3d fc 03 00 00 77 28 48 8d 15 31 62 02 00 48 c1 e8 02
Signed-off-by: jianghao65536 <jianghao65536@gmail.com> [cyphar: gofumpt systemd/dbus_test.go] [cyphar: simplify retry loop to return from inside loop] [cyphar: improve exponential backoff to be less aggressive] [cyphar: improve parallel test] Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
systemd/dbus_test.go
Outdated
| var wg sync.WaitGroup | ||
| errChan := make(chan error, len(dms)) | ||
| for _, dm := range dms { | ||
| wg.Add(1) | ||
| go func(dm *dbusConnManager) { | ||
| defer wg.Done() | ||
| select { | ||
| case <-ctx.Done(): | ||
| return | ||
| default: | ||
| _, err := dm.newConnection() | ||
| if err != nil { | ||
| errChan <- err | ||
| cancel() | ||
| } | ||
| } | ||
| }(dm) | ||
| } | ||
| wg.Wait() | ||
| close(errChan) | ||
| for err := range errChan { | ||
| t.Fatal(err) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't actually test all connections being created at the same time, and the usage of errChan is a little suspect. I will push a different version.
|
@kolyshkin let me know if you're happy with the new backoff setup I pushed. |
kolyshkin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM

Before version 254, systemd used the glibc macro SOMAXCONN to set the backlog. Prior to Linux 5.4, SOMAXCONN was set to 128. If the number of instantaneous connections exceeds this value, newConnection will return EAGAIN, which can cause runc to fail when creating containers.