
[Help] Rook-Ceph's rook-ceph-detect-version keeps restarting, so disk scanning and initialization cannot proceed #21890

Open
chenjacken opened this issue Dec 24, 2024 · 3 comments
Labels
question Further information is requested state/awaiting processing

Comments

@chenjacken

chenjacken commented Dec 24, 2024

I deployed a Ceph cluster by following the official documentation at https://www.cloudpods.org/blog/rook-ceph-with-cloudpods.
The host OS is CentOS 7.9 (kernel 5.4.130-1.yn20230805.el7.x86_64).

After rook-ceph-operator is restarted, it should detect changes to cluster.yaml, run the initialization work, and deploy the OSDs. Currently, however, it is stuck at rook-ceph-detect-version, which keeps restarting, so none of the subsequent tasks can proceed.

kubectl delete pod rook-ceph-operator-6994bc6ccc-f6khh -n rook-ceph
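
Deleting the pod makes its Deployment recreate it and start a fresh reconcile. For reference, the operator's progress can then be followed with something like the following (deployment name inferred from the pod name above):

kubectl -n rook-ceph logs -f deploy/rook-ceph-operator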

After that, the rook-ceph-detect-version pods just keep restarting:

rook-discover-xlhkh                                 1/1     Running     11         278d    10.40.142.28    node17    <none>           <none>
rook-discover-zrrqz                                 1/1     Running     9          277d    10.40.167.214   node19    <none>           <none>
rook-ceph-detect-version-67skv                      0/1     Pending     0          0s      <none>          <none>    <none>           <none>
rook-ceph-detect-version-67skv                      0/1     Pending     0          0s      <none>          master2   <none>           <none>
rook-ceph-detect-version-67skv                      0/1     Init:0/1    0          0s      <none>          master2   <none>           <none>
rook-ceph-detect-version-67skv                      0/1     Init:0/1    0          0s      <none>          master2   <none>           <none>
rook-ceph-detect-version-67skv                      0/1     PodInitializing   0          1s      10.40.180.61    master2   <none>           <none>
rook-ceph-detect-version-67skv                      0/1     Completed         0          2s      10.40.180.61    master2   <none>           <none>
rook-ceph-detect-version-67skv                      0/1     Terminating       0          2s      10.40.180.61    master2   <none>           <none>
rook-ceph-detect-version-67skv                      0/1     Terminating       0          2s      10.40.180.61    master2   <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Pending           0          0s      <none>          <none>    <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Pending           0          0s      <none>          node6     <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Init:0/1          0          0s      <none>          node6     <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Init:0/1          0          1s      <none>          node6     <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     PodInitializing   0          2s      10.40.139.20    node6     <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Terminating       0          2s      10.40.139.20    node6     <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Terminating       0          3s      10.40.139.20    node6     <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Terminating       0          3s      10.40.139.20    node6     <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Terminating       0          4s      10.40.139.20    node6     <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Terminating       0          4s      10.40.139.20    node6     <none>           <none>
rook-ceph-csi-detect-version-6bj7v                  0/1     Terminating       0          4s      10.40.139.20    node6     <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Pending           0          0s      <none>          <none>    <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Pending           0          0s      <none>          master2   <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Init:0/1          0          0s      <none>          master2   <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Init:0/1          0          0s      <none>          master2   <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     PodInitializing   0          1s      10.40.180.3     master2   <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Terminating       0          2s      10.40.180.3     master2   <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Terminating       0          2s      10.40.180.3     master2   <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Terminating       0          2s      10.40.180.3     master2   <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Terminating       0          3s      10.40.180.3     master2   <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Terminating       0          3s      10.40.180.3     master2   <none>           <none>
rook-ceph-detect-version-qxzfg                      0/1     Terminating       0          3s      10.40.180.3     master2   <none>           <none>

rook-ceph-detect-version-cwxnv                      0/1     Pending           0          0s      <none>          <none>    <none>           <none>
rook-ceph-detect-version-cwxnv                      0/1     Pending           0          0s      <none>          master2   <none>           <none>
rook-ceph-detect-version-cwxnv                      0/1     Init:0/1          0          0s      <none>          master2   <none>           <none>
rook-ceph-detect-version-cwxnv                      0/1     Init:0/1          0          1s      <none>          master2   <none>           <none>
rook-ceph-detect-version-cwxnv                      0/1     PodInitializing   0          1s      10.40.180.13    master2   <none>           <none>
rook-ceph-detect-version-cwxnv                      0/1     Terminating       0          2s      10.40.180.13    master2   <none>           <none>
rook-ceph-detect-version-cwxnv                      0/1     Terminating       0          2s      10.40.180.13    master2   <none>           <none>
rook-ceph-detect-version-cwxnv                      0/1     Terminating       0          2s      10.40.180.13    master2   <none>           <none>
rook-ceph-detect-version-cwxnv                      0/1     Terminating       0          3s      10.40.180.13    master2   <none>           <none>

I looked for related material online, for example:
rook/rook#12950 (Detect-version job pod got stuck in terminating status after adding osd to the cluster)

rook/rook#5896 (rook-ceph-detect-version never completes)

https://zhuanlan.zhihu.com/p/138991974

but none of it contained a working solution.
How should I investigate and resolve this?
Thanks!

@chenjacken chenjacken added the question Further information is requested label Dec 24, 2024
@zexi
Member

zexi commented Jan 6, 2025

Could you check the Init pod's logs?

@chenjacken
Author

chenjacken commented Jan 6, 2025

Could you check the Init pod's logs?

Could you give more specific guidance? Which command should I use to view the Init pod's logs?
Thanks!

Init:0/1 flips to Terminating almost immediately, so I don't know how to capture the logs in time.
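
(A rough sketch of one way to catch such short-lived pods: poll for them and dump every container's logs the moment they appear. The namespace and the detect-version name pattern are assumptions based on the output above.)

NS=rook-ceph
while true; do
  # pick up the transient detect-version pod, if one exists right now
  POD=$(kubectl -n "$NS" get pods -o name | grep detect-version | head -n 1)
  # dump logs from init containers and the main container before the pod is removed
  [ -n "$POD" ] && kubectl -n "$NS" logs "$POD" --all-containers --ignore-errors --prefix
  sleep 1
done

kubectl -n rook-ceph describe pod <pod-name> and kubectl -n rook-ceph get events --sort-by=.lastTimestamp can also show why the pods are being terminated.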

@chenjacken
Author

chenjacken commented Jan 12, 2025

The rook-ceph-operator logs:

2025-01-12 04:12:06.698371 I | op-mgr: start running mgr
2025-01-12 04:12:06.698390 I | cephclient: getting or creating ceph auth key "mgr.a"
2025-01-12 04:12:07.160038 I | op-mgr: deployment for mgr rook-ceph-mgr-a already exists. updating if needed
2025-01-12 04:12:07.167828 I | op-k8sutil: deployment "rook-ceph-mgr-a" did not change, nothing to update
2025-01-12 04:12:07.167847 I | cephclient: getting or creating ceph auth key "mgr.b"
2025-01-12 04:12:07.628359 I | op-mgr: deployment for mgr rook-ceph-mgr-b already exists. updating if needed
2025-01-12 04:12:07.634736 I | op-k8sutil: deployment "rook-ceph-mgr-b" did not change, nothing to update
2025-01-12 04:12:08.036604 I | op-mgr: setting services to point to mgr "b"
W0112 04:12:08.235510       7 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2025-01-12 04:12:08.250519 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: failed to start ceph mgr: failed to enable mgr services: failed to enable service monitor: service monitor could not be enabled: failed to retrieve servicemonitor. servicemonitors.monitoring.coreos.com "rook-ceph-mgr" is forbidden: User "system:serviceaccount:rook-ceph:rook-ceph-system" cannot get resource "servicemonitors" in API group "monitoring.coreos.com" in the namespace "rook-ceph"
2025-01-12 04:12:08.260777 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2025-01-12 04:12:08.265848 I | op-mon: parsing mon endpoints: o=172.16.1.211:6789,r=172.16.1.212:6789,m=172.16.1.213:6789
2025-01-12 04:12:08.377355 I | ceph-cluster-controller: detecting the ceph image version for image registry.cn-beijing.aliyuncs.com/yunionio/ceph:v16.2.14...
2025-01-12 04:12:08.882842 E | ceph-spec: failed to update cluster condition to {Type:Ready Status:True Reason:ClusterCreated Message:Cluster created successfully LastHeartbeatTime:2025-01-12 04:12:08.570817401 +0000 UTC m=+65.243521561 LastTransitionTime:2025-01-12 04:01:44 +0000 UTC}. failed to update object "rook-ceph/rook-ceph" status: Operation cannot be fulfilled on cephclusters.ceph.rook.io "rook-ceph": the object has been modified; please apply your changes to the latest version and try again
2025-01-12 04:12:10.670269 I | ceph-cluster-controller: detected ceph image version: "16.2.14-0 pacific"
2025-01-12 04:12:10.670295 I | ceph-cluster-controller: validating ceph version from provided image
2025-01-12 04:12:10.674725 I | op-mon: parsing mon endpoints: o=172.16.1.211:6789,r=172.16.1.212:6789,m=172.16.1.213:6789
2025-01-12 04:12:10.677060 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2025-01-12 04:12:10.677182 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
.
.
.
2025-01-12 04:13:16.798404 I | op-config: deleting "mon_mds_skip_sanity" option from the mon configuration database
2025-01-12 04:13:17.156479 I | op-config: successfully deleted "mon_mds_skip_sanity" option from the mon configuration database
2025-01-12 04:13:17.156512 I | cephclient: create rbd-mirror bootstrap peer token "client.rbd-mirror-peer"
2025-01-12 04:13:17.156523 I | cephclient: getting or creating ceph auth key "client.rbd-mirror-peer"
2025-01-12 04:13:17.595874 I | cephclient: successfully created rbd-mirror bootstrap peer token for cluster "rook-ceph"
2025-01-12 04:13:17.661335 I | op-mgr: start running mgr
2025-01-12 04:13:17.661359 I | cephclient: getting or creating ceph auth key "mgr.a"
2025-01-12 04:13:18.121599 I | op-mgr: deployment for mgr rook-ceph-mgr-a already exists. updating if needed
2025-01-12 04:13:18.128107 I | op-k8sutil: deployment "rook-ceph-mgr-a" did not change, nothing to update
2025-01-12 04:13:18.128125 I | cephclient: getting or creating ceph auth key "mgr.b"
2025-01-12 04:13:18.593329 I | op-mgr: deployment for mgr rook-ceph-mgr-b already exists. updating if needed
2025-01-12 04:13:18.599630 I | op-k8sutil: deployment "rook-ceph-mgr-b" did not change, nothing to update
2025-01-12 04:13:19.008410 I | op-mgr: setting services to point to mgr "b"
W0112 04:13:19.202833       7 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2025-01-12 04:13:19.220445 E | ceph-cluster-controller: failed to reconcile CephCluster "rook-ceph/rook-ceph". failed to reconcile cluster "rook-ceph": failed to configure local ceph cluster: failed to create cluster: failed to start ceph mgr: failed to enable mgr services: failed to enable service monitor: service monitor could not be enabled: failed to retrieve servicemonitor. servicemonitors.monitoring.coreos.com "rook-ceph-mgr" is forbidden: User "system:serviceaccount:rook-ceph:rook-ceph-system" cannot get resource "servicemonitors" in API group "monitoring.coreos.com" in the namespace "rook-ceph"
2025-01-12 04:13:19.300746 I | ceph-cluster-controller: reconciling ceph cluster in namespace "rook-ceph"
2025-01-12 04:13:19.305724 I | op-mon: parsing mon endpoints: o=172.16.1.211:6789,r=172.16.1.212:6789,m=172.16.1.213:6789
2025-01-12 04:13:19.344232 I | ceph-cluster-controller: detecting the ceph image version for image registry.cn-beijing.aliyuncs.com/yunionio/ceph:v16.2.14...
2025-01-12 04:13:21.219222 I | ceph-cluster-controller: detected ceph image version: "16.2.14-0 pacific"
2025-01-12 04:13:21.219249 I | ceph-cluster-controller: validating ceph version from provided image
2025-01-12 04:13:21.223817 I | op-mon: parsing mon endpoints: o=172.16.1.211:6789,r=172.16.1.212:6789,m=172.16.1.213:6789
2025-01-12 04:13:21.226741 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2025-01-12 04:13:21.226875 I | cephclient: generated admin config in /var/lib/rook/rook-ceph

2025-01-12 04:13:21.662064 I | ceph-cluster-controller: cluster "rook-ceph": version "16.2.14-0 pacific" detected for image "registry.cn-beijing.aliyuncs.com/yunionio/ceph:v16.2.14"
2025-01-12 04:13:21.715424 I | op-mon: start running mons
2025-01-12 04:13:21.719787 I | op-mon: parsing mon endpoints: o=172.16.1.211:6789,r=172.16.1.212:6789,m=172.16.1.213:6789
2025-01-12 04:13:21.904352 I | op-mon: saved mon endpoints to config map map[csi-cluster-config-json:[{"clusterID":"rook-ceph","monitors":["172.16.1.211:6789","172.16.1.212:6789","172.16.1.213:6789"]}] data:m=172.16.1.213:6789,o=172.16.1.211:6789,r=172.16.1.212:6789 mapping:{"node":{"m":{"Name":"master3","Hostname":"master3","Address":"172.16.1.213"},"o":{"Name":"master1","Hostname":"master1","Address":"172.16.1.211"},"r":{"Name":"master2","Hostname":"master2","Address":"172.16.1.212"}}} maxMonId:17]
2025-01-12 04:13:22.502993 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2025-01-12 04:13:22.503226 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2025-01-12 04:13:23.703442 I | op-mon: targeting the mon count 3

Is the following the key piece of information?

2025-01-12 04:12:08.882842 E | ceph-spec: failed to update cluster condition to {Type:Ready Status:True Reason:ClusterCreated Message:Cluster created successfully LastHeartbeatTime:2025-01-12 04:12:08.570817401 +0000 UTC m=+65.243521561 LastTransitionTime:2025-01-12 04:01:44 +0000 UTC}. failed to update object "rook-ceph/rook-ceph" status: Operation cannot be fulfilled on cephclusters.ceph.rook.io "rook-ceph": the object has been modified; please apply your changes to the latest version and try again
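
(The other error that repeats in the log is the RBAC denial on servicemonitors. If that is what aborts the reconcile, a minimal sketch of a Role/RoleBinding granting the missing access might look like the following; the Role name is an assumption, while the ServiceAccount name is taken directly from the "forbidden" error. Alternatively, setting spec.monitoring.enabled: false in cluster.yaml should stop the operator from touching servicemonitors at all.)

kubectl apply -f - <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: rook-ceph-servicemonitor-access   # assumed name
  namespace: rook-ceph
rules:
- apiGroups: ["monitoring.coreos.com"]
  resources: ["servicemonitors"]
  verbs: ["get", "list", "watch", "create", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: rook-ceph-servicemonitor-access   # assumed name
  namespace: rook-ceph
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: rook-ceph-servicemonitor-access
subjects:
- kind: ServiceAccount
  name: rook-ceph-system    # from the "forbidden" error above
  namespace: rook-ceph
EOF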
