Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[4.7] libct/cg/sd: reconnect and retry on dbus connection error #8

Closed
wants to merge 11 commits into from

Conversation

kolyshkin
Copy link

@kolyshkin kolyshkin commented May 1, 2021

This is a backport of upstream PRs

to release-4.7 branch, to be consumed by cri-o 1.20 in order to fix https://bugzilla.redhat.com/show_bug.cgi?id=1941456.

This also includes backports of

Original description follows


n case the dbus daemon is ever restarted, the connection is no longer valid and every operation
fails. This is a minor concern for short-lived runc, but much more of a issue in case there is
a long-running daemon (e.g. cri-o) is using runc's libcontainer, as the connection is never
retried and the only remedy is to restart the daemon.

The solution to the above is to check the errors returned for dbus: connection closed by user
error, and try to re-connect on that. This is what PR opencontainers#2862 does.

This is a carry of opencontainers#2862, implementing the idea of retry-in-place (first described
at opencontainers#2862 (comment) and opencontainers#2862 (comment)) on top of what it does.

For more info, see commit messages as well as opencontainers#2862.

Fixes:

kolyshkin and others added 7 commits April 30, 2021 18:00
In case it takes more than 1 second for systemd to create a unit,
startUnit() times out with a warning and then runc proceeds
(to create cgroups using fs manager and so on).

Now runc and systemd are racing, and multiple scenarios are possible.

In one such scenario, by the time runc calls systemd manager's Apply()
the unit is not yet created, the dbusConnection.SetUnitProperties()
call fails with "unit xxx.scope not found", and the whole container
start also fails.

To eliminate the race, we need to return an error in case the timeout is
hit.

To reduce the chance to fail, increase the timeout from 1 to 30 seconds,
to not error out too early on a busy/slow system (and times like 3-5
seconds are not unrealistic).

While at it, as the timeout is quite long now, make sure to not leave
a stray timer.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 3844789)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As the caller of this function just logs the error, it does not make
sense to pass it. Instead, log it (once) and return -1.

This is a preparation for the second user.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit eee425f)

[Minor merge conflict due to missing "return" removed by a hunk from
commit 978fa6e]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
[@kolyshkin: documentation nits]

Signed-off-by: Shiming Zhang <wzshiming@foxmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit cdbed6f)

[minor merge conflict due to missing upstream commit 73f22e7]

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Generalize isUnitExists as isDbusError, and use errors.As while at it
(which can handle wrapped errors as well).

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit bacfc2c)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
[@kolyshkin: doc nits, use dbus.ErrClosed and isDbusError]

Signed-off-by: Shiming Zhang <wzshiming@foxmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 15fee98)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Shiming Zhang <wzshiming@foxmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 6122bc8)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Instead of reconnecting to dbus after some failed operations, and
returning an error (so a caller has to retry), reconnect AND retry
in place for all such operations.

This should fix issues caused by a stale dbus connection after e.g.
a dbus daemon restart.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 47ef9a1)

[Minor merge conflicts due to missing upstream commits
52390d6 and af521ed.]
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@openshift-ci-robot
Copy link

@kolyshkin: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[4.7] libct/cg/sd: reconnect and retry on dbus connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment
@openshift-ci-robot
Copy link

@kolyshkin: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[4.7] libct/cg/sd: reconnect and retry on dbus connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kolyshkin kolyshkin changed the title [4.7] libct/cg/sd: reconnect and retry on dbus connection error bug 1941456: [4.7] libct/cg/sd: reconnect and retry on dbus connection error May 3, 2021
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 3, 2021
@openshift-ci-robot
Copy link

@kolyshkin: This pull request references Bugzilla bug 1941456, which is invalid:

  • expected the bug to target the "4.7.z" release, but it targets "4.8.0" instead
  • expected Bugzilla bug 1941456 to depend on a bug targeting a release in 4.8.0 and in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE), but no dependents were found

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

bug 1941456: [4.7] libct/cg/sd: reconnect and retry on dbus connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kolyshkin
Copy link
Author

/bugzilla refresh

@openshift-ci-robot
Copy link

@kolyshkin: This pull request references Bugzilla bug 1941456, which is invalid:

  • expected Bugzilla bug 1941456 to depend on a bug targeting a release in 4.8.0 and in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE), but no dependents were found

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kolyshkin kolyshkin changed the title bug 1941456: [4.7] libct/cg/sd: reconnect and retry on dbus connection error [4.7] libct/cg/sd: reconnect and retry on dbus connection error May 3, 2021
@openshift-ci-robot
Copy link

@kolyshkin: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[4.7] libct/cg/sd: reconnect and retry on dbus connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 3, 2021
@kolyshkin kolyshkin marked this pull request as draft May 5, 2021 19:17
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 5, 2021
@kolyshkin
Copy link
Author

This also needs opencontainers#2936 / opencontainers#2937 backported.

kolyshkin added 2 commits May 21, 2021 12:08
Commit 47ef9a1 forgot to wrap GetManagerProperty("ControlGroup")
into retryOnDisconnect. Since there's one other user of
GetManagerProperty, add getManagerProperty wrapper and use it.

Fixes: 47ef9a1

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit 99c5c50)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Using per cgroup manager dbus connection instances means
that every cgroup manager instance gets a new connection,
and those connections are never closed, ultimately resulting
in file descriptors limit being hit.

Revert back to using a single global dbus connection for everything,
without changing the callers.

NOTE that it is assumed a runtime can't use both root and rootless
dbus at the same time. If this happens, we panic.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit c7f847e)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@kolyshkin
Copy link
Author

Added a backport of opencontainers#2937 (modulo test)

@openshift-ci
Copy link

openshift-ci bot commented May 21, 2021

@kolyshkin: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[4.7] libct/cg/sd: reconnect and retry on dbus connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kolyshkin kolyshkin marked this pull request as ready for review May 21, 2021 19:13
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 21, 2021
@openshift-ci
Copy link

openshift-ci bot commented May 21, 2021

@kolyshkin: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[4.7] libct/cg/sd: reconnect and retry on dbus connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment
@openshift-ci
Copy link

openshift-ci bot commented May 21, 2021

@kolyshkin: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[4.7] libct/cg/sd: reconnect and retry on dbus connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kolyshkin
Copy link
Author

This also needs opencontainers#2997

kolyshkin added 2 commits June 7, 2021 12:58
This fixes isDbusError function, introduced by commit bacfc2c. Due to a
type error it was not working at all.

This also fixes the whole "retry on dbus disconnect" logic.

This also fixes a regression in startUnit (and cgroupManager.Apply()),
which should never return "unit already exists" error but it did.

Fixes: bacfc2c
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit b2d28c5, modulo test)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Noticed that the check of trying to use both rootful and rootless
in NewDbusConnManager never worked, as we never set dbusInited to true.

Do that. While at it, protect this with the mutex (against the
case of two goroutines simultaneously calling NewDbusConnManager).
This is a rare call, so taking read-only then read-write mutex does not
make sense.

Fixes: c7f847e

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
(cherry picked from commit d06bda6)
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
@openshift-ci
Copy link

openshift-ci bot commented Jun 7, 2021

@kolyshkin: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[4.7] libct/cg/sd: reconnect and retry on dbus connection error

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Copy link

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 5, 2021
@openshift-bot
Copy link

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 5, 2021
@openshift-bot
Copy link

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Nov 4, 2021
@openshift-ci
Copy link

openshift-ci bot commented Nov 4, 2021

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants