Rook + Ceph clusters do not work on OKD releases 4.12.0-0.okd-2023-02-04-212953 and 4.11.0-0.okd-2023-01-14-152430 #1505

Closed
SriRamanujam opened this issue Feb 12, 2023 · 19 comments

Comments

@SriRamanujam

Describe the bug

Rook + Ceph clusters stop functioning or greatly degrade in performance on the OKD releases 4.12.0-0.okd-2023-02-04-212953 and 4.11.0-0.okd-2023-01-14-152430. I'm opening this ticket to serve as a tracking issue for OKD specifically, as it seems others have opened several tickets and discussions elsewhere and I couldn't find one here.

Version
4.12.0-0.okd-2023-02-04-212953 and 4.11.0-0.okd-2023-01-14-152430

How reproducible
Pretty much 100% of the time. Symptoms include many or all PGs going inactive, slow I/O on a cluster that was previously performing fine, and components like RGW and CSI mounts no longer functioning, probably among other things.

Current workaround

As of 2023-02-11, it seems the only workaround is to downgrade the cluster to a previous version, which seems to fix things.

Related issues and discussions

@solacelost

That's the result of https://bugzilla.redhat.com/show_bug.cgi?id=2159066

see also:
coreos/fedora-coreos-tracker#1393
https://utcc.utoronto.ca/~cks/space/blog/linux/KernelBindBugIn6016
#1463 (comment)

The fix is already in okd-machine-os:
openshift/okd-machine-os#526

Just need a new OKD build that includes it (4.12.0-0.okd-2023-02-11-023427 or newer) to pass CI and make it to stable for a long-term fix. I'm on 4.12.0-0.okd-2023-01-21-055900 with kernel 6.0.15-300.fc37.x86_64 in the meantime.

@msteenhu

Thanks for this issue and the root cause!

Then it would be really nice if there were a 4.12 release with a DIRECT upgrade path from the latest 4.11 release on which Ceph still works, which is the one before last if I am not mistaken.

@solacelost

I believe there is a direct upgrade path to 4.12.0-0.okd-2023-01-21-055900 - or at least there was when I took it, lol. Has that edge since been blocked? I imagine that will be ironed out for the next release that should land this weekend.

@vrutkovs
Member

This should be resolved in https://github.com/okd-project/okd/releases/tag/4.12.0-0.okd-2023-02-18-033438

Since we now use layering, you could have built a custom OS image with an updated kernel - see https://github.com/vrutkovs/custom-okd-os/blob/main/drbd/Dockerfile for instance

@solacelost

https://docs.okd.io/4.12/post_installation_configuration/coreos-layering.html

So this I wasn't aware of. I'm going to give this a shot tonight before updating to the new release, just to see how it works!
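
For reference, the layering docs linked above describe pointing a machine config pool at a custom OS image through a MachineConfig whose `spec.osImageURL` names that image. Below is a minimal sketch that only renders such a manifest. The function name `render_layered_mc`, the MachineConfig name `99-<role>-custom-kernel`, and the image reference are hypothetical placeholders, not anything taken from this thread.

```shell
# render_layered_mc ROLE IMAGE
# Print a MachineConfig that points the given pool at a custom OS image.
# The metadata name below is a hypothetical choice for illustration.
render_layered_mc() {
  role=$1
  image=$2
  cat <<EOF
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-${role}-custom-kernel
  labels:
    machineconfiguration.openshift.io/role: ${role}
spec:
  osImageURL: ${image}
EOF
}

# Hypothetical image reference; substitute the digest of the image you built.
render_layered_mc worker "quay.io/example/custom-okd-os@sha256:<digest>"
```

The output could then be reviewed and piped to `oc apply -f -`. Applying it makes the machine-config operator roll the new image out to every node in that pool, so it is worth trying on a single test pool first.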

@ibotty

ibotty commented Feb 19, 2023

@vrutkovs Are you sure it's included in https://github.com/okd-project/okd/releases/tag/4.12.0-0.okd-2023-02-18-033438? This release uses osImageURL=ostree-unverified-registry:quay.io/openshift/okd-content@sha256:6ccff52c50e1ef975931242dc1941617431d45fbd3e425b8016d2cc62aa543d8 afaict and this is based on 37.20230110.3.1 and uses kernel 6.0.18-300.fc37, which is not fixed.

Am I missing something?

@solacelost

I see in fedora-coreos-config stable branch that we should be on 6.1.6 here

I see in the submodule's HEAD commit that we should be on 6.1.10 here

However....

$ podman run --rm -it --entrypoint rpm $(curl -sL https://github.com/okd-project/okd/releases/download/4.12.0-0.okd-2023-02-18-033438/release.txt | awk '/machine-os-content/ {print $2}') -qi kernel | grep Version
Version     : 6.0.18

@vrutkovs
Member

vrutkovs commented Feb 20, 2023

Oh, sorry, we're still using the FCOS from January (a bad commit sneaked in - openshift/okd-machine-os@e83e32a). openshift/okd-machine-os#532 would fix it

@PiotrKlimczak

+1 for prioritizing this, as it has a serious impact on us: it breaks Ceph completely. The community should also consider either:

  1. Allowing an upgrade from the last working version directly to the first fixed version (without having to go through the broken version)
  2. Providing a workaround that can be applied BEFORE upgrading to the first broken version, so the upgraded cluster stays fully functional

Unless it is somehow possible for Rook/Ceph to make a change on their end?

Otherwise everybody using Rook + Ceph will be completely stuck and unable to upgrade, as there will be no stable upgrade path.

I am sure you guys know this, but it still feels right to highlight it.

Thanks all for working on this.

@vrutkovs
Member

vrutkovs commented Mar 2, 2023

New OKD 4.12 nightly should be based on FCOS 37.20230205.3.0 and have kernel 6.1.9-200.fc37.x86_64 with the fix. I'll add upgrade edges from 4.11 to the next 4.12 stable

As for a workaround which can be applied before the upgrade - I don't know if it's possible. This is a kernel issue, so it's not easy to work around.
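
To check where a node stands, the kernel versions reported in this thread can be turned into a rough classifier. This is only a sketch under stated assumptions: 6.0.15 was reported as working, 6.0.18 as broken, and 6.1.9 as carrying the fix, so anything in between is flagged as suspect rather than known bad. The function names are hypothetical, and `sort -V` assumes GNU coreutils.

```shell
# version_le A B: succeed when A <= B under version ordering (GNU sort -V).
version_le() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# kernel_status RELEASE: classify a kernel release string such as the
# output of `uname -r`. Thresholds come only from reports in this thread.
kernel_status() {
  # Drop the package release suffix: "6.0.18-300.fc37.x86_64" -> "6.0.18".
  ver=${1%%-*}
  if version_le "$ver" "6.0.15"; then
    echo "reported-ok"
  elif version_le "6.1.9" "$ver"; then
    echo "reported-fixed"
  else
    echo "suspect"
  fi
}

kernel_status "6.0.18-300.fc37.x86_64"   # prints "suspect"
```

On a live cluster, `oc get nodes -o wide` lists a KERNEL-VERSION column per node, and each value can be passed to `kernel_status`.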

@mbuchholz

@vrutkovs Thanks for letting us know.

I can confirm that upgrading to version 4.12.0-0.okd-2023-03-03-055825 fixed all the issues with the Rook Ceph cluster, and volumes are mounting again.

I used the following command to upgrade directly from 4.11.0-0.okd-2023-01-14-152430:

$ oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image registry.ci.openshift.org/origin/release@sha256:a2e94c433f8747ad9899bd4c9c6654982934595a42f5af2c9257b351786fd379
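
Because `--allow-explicit-upgrade` bypasses the recommended-update graph, it can help to confirm that the target is pinned by digest before running the command. The sketch below only prints the command for review and refuses tag-based references. Both helper names are hypothetical, and the check is a simple pattern match, not a verification that the image exists or is signed.

```shell
# is_digest_ref IMAGE: succeed only if IMAGE ends in a full sha256 digest.
is_digest_ref() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# print_upgrade_cmd IMAGE: print (do not run) the explicit upgrade command,
# refusing references that are not pinned by digest.
print_upgrade_cmd() {
  if is_digest_ref "$1"; then
    echo "oc adm upgrade --allow-explicit-upgrade --allow-upgrade-with-warnings --to-image $1"
  else
    echo "refusing: $1 is not pinned by digest" >&2
    return 1
  fi
}

print_upgrade_cmd "registry.ci.openshift.org/origin/release@sha256:a2e94c433f8747ad9899bd4c9c6654982934595a42f5af2c9257b351786fd379"
```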

@vrutkovs
Member

vrutkovs commented Mar 3, 2023

Perfect, thank you. We'll release a new stable over the weekend then

@ibotty

ibotty commented Mar 5, 2023

I could successfully upgrade the affected cluster to the released version (with some machine-config-daemon hand-holding) and have a stable ceph cluster. I guess this can be closed.

@vrutkovs vrutkovs closed this as completed Mar 6, 2023
@joyartoun

Hello!

I just tried installing rook-ceph version 1.11.0 on OKD 4.12.0-0.okd-2023-03-05-022504 with Fedora CoreOS 37.20230205.3.0 and still see the errors.

Kernel version on the storage nodes is 6.0.18-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Sat Jan 7 17:10:00 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

@joyartoun

My bad, I will try with 37.20230303.2.0 instead

@pomland-94

Is there any news on this? Can I use Rook on OpenShift/OKD 4.12?

@schuemann

I have no problems with the current OKD 4.12 version.

@peterroth
Contributor

Hi everybody,
Out of curiosity, has the fix been backported to OKD 4.11?
I'm asking because I also stepped into the 4.11.0-0.okd-2023-01-14-152430 trap and was looking for a way out of it without going up a minor version, but I couldn't find a version that ships the fix.
