Missing PersistenceVolume settings for bootstrap pod #897

Open
evheniyt opened this issue Nov 8, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@evheniyt
Contributor

evheniyt commented Nov 8, 2024

I have also experienced unstable cluster bootstrap. I have fully recreated a cluster multiple times, and periodically the cluster got stuck bootstrapping the second node.

Eventually, I found a correlation between this issue and the bootstrap pod being recreated.
We are using Karpenter, and sometimes during the bootstrap process it decides to move the bootstrap pod to another node. When that happens, cluster creation gets stuck with this error:

opensearch [2024-10-29T06:38:10,310][WARN ][o.o.c.c.Coordinator      ] [opensearch-primary-bootstrap-0] failed to validate incoming join request from node [{opensearch-primary-nodes-0}{9zZmg5EGRpidHf_0OwLUyA}{kV9e6qUTSsmvj1lUP-2QjA}{opensearch-primary-nodes-0}{10.152.42.19:9300}{dm}{shard_indexing_pressure_enabled=true}]
opensearch org.opensearch.transport.RemoteTransportException: [opensearch-primary-nodes-0][10.152.42.19:9300][internal:cluster/coordination/join/validate_compressed]
opensearch Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: join validation on cluster state with a different cluster uuid 7QJiU55FRcWvBidZD_MF6A than local cluster uuid U4a0ix4hTwCvij0JF9qoEw, rejecting
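
For anyone trying to confirm they are hitting the same failure, the two UUIDs can be pulled out of the rejection message with a small script. This is only a sketch: the function name `find_uuid_mismatch` and the group names are made up for illustration, and the pattern is written against the exact wording of the message in the log above, so it may need adjusting for other OpenSearch versions.

```python
import re

# Pattern written against the wording of the rejection message quoted above.
# The group names "incoming" and "local" are illustrative, not OpenSearch terms.
UUID_MISMATCH = re.compile(
    r"different cluster uuid (?P<incoming>\S+) "
    r"than local cluster uuid (?P<local>[^\s,]+)"
)


def find_uuid_mismatch(log_text):
    """Return (incoming_uuid, local_uuid) if the log contains the rejection, else None."""
    match = UUID_MISMATCH.search(log_text)
    if match is None:
        return None
    return match.group("incoming"), match.group("local")


# The "Caused by" line from the log above, joined onto one line.
sample = (
    "Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: "
    "join validation on cluster state with a different cluster uuid 7QJiU55FRcWvBidZD_MF6A "
    "than local cluster uuid U4a0ix4hTwCvij0JF9qoEw, rejecting"
)

print(find_uuid_mismatch(sample))
```

If the function returns a pair of different UUIDs, the joining node and the node that sent the cluster state disagree about which cluster they belong to.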

I believe this is caused by the bootstrap pod not using a persistent disk: if it is restarted, it gets a new cluster UUID that does not match the UUID on node-0.
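
The suspected mechanism can be illustrated with a toy model. This is not OpenSearch or operator code: `ToyNode`, `restart_bootstrap`, and the dict standing in for a data directory are all hypothetical, and the model assumes the simplification that a node reuses a UUID found on disk and generates a fresh one when the disk is empty.

```python
import uuid


class ToyNode:
    """Toy stand-in for a cluster-manager node; not OpenSearch code."""

    def __init__(self, disk):
        # `disk` is a dict standing in for the node's data directory.
        self.disk = disk

    def start(self):
        # Assumed simplification: reuse the UUID stored on disk,
        # or generate a new one if the disk is empty.
        if "cluster_uuid" not in self.disk:
            self.disk["cluster_uuid"] = uuid.uuid4().hex
        return self.disk["cluster_uuid"]


def restart_bootstrap(persistent):
    """Return (uuid_before, uuid_after) across one recreation of the pod."""
    disk = {}
    before = ToyNode(disk).start()
    # Without a persistent volume the recreated pod starts from an empty disk.
    after = ToyNode(disk if persistent else {}).start()
    return before, after


before, after = restart_bootstrap(persistent=False)
print(before == after)  # False: the recreated pod has a different UUID

before, after = restart_bootstrap(persistent=True)
print(before == after)  # True: the UUID survives the recreation
```

In the first case, any node that already joined under the old UUID would reject the recreated bootstrap pod, which matches the error above.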

Originally posted by @evheniyt in #811 (comment)

@github-actions github-actions bot added the untriaged Issues that have not yet been triaged label Nov 8, 2024
@evheniyt
Contributor Author

evheniyt commented Nov 8, 2024

@swoehrl-mw @prudhvigodithi
I want to add support of PV for the bootstrap pod, WDYT?

@swoehrl-mw
Collaborator

I want to add support of PV for the bootstrap pod, WDYT?

@evheniyt Fine for me. I think the bootstrap pod being restarted was not a scenario ever considered as it is only running for a few minutes. IMO there are no reasons against having a PV for the pod, but it should be cleaned up afterwards.

@swoehrl-mw swoehrl-mw added bug Something isn't working and removed untriaged Issues that have not yet been triaged labels Nov 12, 2024
@swoehrl-mw swoehrl-mw changed the title from "[BUG] Missing PersistenceVolume settings for bootstrap pod" to "Missing PersistenceVolume settings for bootstrap pod" Nov 12, 2024
@prudhvigodithi
Member

Hey @evheniyt, fine with me as well. Here are some PRs, https://github.com/opensearch-project/opensearch-k8s-operator/pulls?q=bootstrap, where additional configurations were added to the bootstrap pod. Please take a look and let us know if you are interested in adding PersistenceVolume settings for the bootstrap pod.

Projects
Status: 📦 Backlog