I have also experienced unstable cluster bootstrap. I have fully recreated a cluster multiple times, and periodically the cluster got stuck bootstrapping the second node.
Eventually, I found a correlation between this issue and a recreation of the bootstrap pod.
We are using Karpenter, and sometimes during the bootstrap process it decides to move the bootstrap pod to another node. When that happens, cluster creation gets stuck with this error:
opensearch [2024-10-29T06:38:10,310][WARN ][o.o.c.c.Coordinator ] [opensearch-primary-bootstrap-0] failed to validate incoming join request from node [{opensearch-primary-nodes-0}{9zZmg5EGRpidHf_0OwLUyA}{kV9e6qUTSsmvj1lUP-2QjA}{opensearch-primary-nodes-0}{10.152.42.19:9300}{dm}{shard_indexing_pressure_enabled=true}]
opensearch org.opensearch.transport.RemoteTransportException: [opensearch-primary-nodes-0][10.152.42.19:9300][internal:cluster/coordination/join/validate_compressed]
opensearch Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid 7QJiU55FRcWvBidZD_MF6A than local cluster uuid U4a0ix4hTwCvij0JF9qoEw, rejecting
I believe it is caused by the fact that the bootstrap pod does not use a persistent disk, so if it is restarted it gets a new cluster UUID, which is different from the UUID on node-0.
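As a stop-gap on our side (not operator code, and the exact annotation key depends on the Karpenter version), the bootstrap pod can be marked so Karpenter avoids disrupting it while the cluster is still forming. A minimal sketch of such a hypothetical mitigation:

```go
package bootstrap

import corev1 "k8s.io/api/core/v1"

// markDoNotDisrupt is a hypothetical mitigation, not existing operator code: it
// annotates the bootstrap pod so Karpenter does not consolidate it away while
// the cluster is still forming. Older Karpenter releases used the
// "karpenter.sh/do-not-evict" annotation instead.
func markDoNotDisrupt(pod *corev1.Pod) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations["karpenter.sh/do-not-disrupt"] = "true"
}
```

That only avoids the disruption, though; it does not fix the underlying lack of persistence.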
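For illustration, a rough sketch of what that could look like inside the operator, using plain core/v1 types. All names, sizes, and paths here are assumptions for the sketch, not existing operator fields:

```go
package bootstrap

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// bootstrapPersistence is a hypothetical helper: it builds a PVC for the
// bootstrap pod plus the volume and mount that put the OpenSearch data path on
// it, so the pod keeps its cluster UUID if it is rescheduled to another node.
func bootstrapPersistence(clusterName, namespace string) (corev1.PersistentVolumeClaim, corev1.Volume, corev1.VolumeMount) {
	claimName := clusterName + "-bootstrap-data" // assumed naming convention

	pvc := corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{Name: claimName, Namespace: namespace},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes: []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			// On k8s.io/api >= v0.29 this field is VolumeResourceRequirements;
			// older versions use ResourceRequirements.
			Resources: corev1.VolumeResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("1Gi"),
				},
			},
		},
	}

	volume := corev1.Volume{
		Name: "data",
		VolumeSource: corev1.VolumeSource{
			PersistentVolumeClaim: &corev1.PersistentVolumeClaimVolumeSource{ClaimName: claimName},
		},
	}

	mount := corev1.VolumeMount{
		Name:      "data",
		MountPath: "/usr/share/opensearch/data", // default data path in the OpenSearch image
	}

	return pvc, volume, mount
}
```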
@evheniyt Fine with me. I think the bootstrap pod being restarted was never a scenario we considered, as the pod only runs for a few minutes. IMO there is no reason against having a PV for the pod, but it should be cleaned up afterwards.
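For illustration only, a minimal sketch of what that cleanup could look like, assuming a controller-runtime client and the claim naming from the sketch above (none of this is existing operator code):

```go
package bootstrap

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteBootstrapPVC is a hypothetical cleanup step: once the cluster has
// formed and the bootstrap pod is removed, delete its claim so the disk is not
// left behind. The claim name matches the assumed naming used above.
func deleteBootstrapPVC(ctx context.Context, c client.Client, clusterName, namespace string) error {
	pvc := &corev1.PersistentVolumeClaim{
		ObjectMeta: metav1.ObjectMeta{
			Name:      clusterName + "-bootstrap-data",
			Namespace: namespace,
		},
	}
	return client.IgnoreNotFound(c.Delete(ctx, pvc))
}
```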
swoehrl-mw changed the title from "[BUG] Missing PersistenceVolume settings for bootstrap pod" to "Missing PersistenceVolume settings for bootstrap pod" on Nov 12, 2024
Originally posted by @evheniyt in #811 (comment)