
Upgrading ECK to 2.6.0 and ES to 8.6.0 causes ES to fail to bootstrap/form a cluster #6303

Closed
SpencerLN opened this issue Jan 11, 2023 · 4 comments
Labels
>bug Something isn't working

Comments

@SpencerLN

Bug Report

What did you do?

Upgraded ECK to 2.6.0 and ES to 8.6.0

What did you expect to see?
Elasticsearch to perform a rolling upgrade and continue functioning.

What did you see instead? Under which circumstances?
The master nodes failed to successfully form a cluster after ECK performed the rolling upgrade.

{"@timestamp":"2023-01-11T01:36:15.170Z", "log.level": "WARN", "message":"master not discovered or elected yet, an election requires at least 2 nodes with ids from [GDLzi1FtQsmIynZWAABPrw, vnn4kgHxQYWJHGADX67xsQ, zfwXTetqR-uSfUkuUo7LKQ], have discovered possible quorum [{quickstart-es-default-1}{zfwXTetqR-uSfUkuUo7LKQ}{trTZbnD4Rzms0uusHaOknA}{quickstart-es-default-1}{10.64.1.8}{10.64.1.8:9300}{cdfhilmrstw}, {quickstart-es-default-2}{vnn4kgHxQYWJHGADX67xsQ}{bAcNneIsRtOoaII1s2JkCg}{quickstart-es-default-2}{10.64.0.9}{10.64.0.9:9300}{cdfhilmrstw}, {quickstart-es-default-0}{GDLzi1FtQsmIynZWAABPrw}{kJ-auB7nTyyvRumjk183ng}{quickstart-es-default-0}{10.64.2.4}{10.64.2.4:9300}{cdfhilmrstw}]; discovery will continue using [10.64.0.9:9300, 10.64.2.4:9300] from hosts providers and [{quickstart-es-default-1}{zfwXTetqR-uSfUkuUo7LKQ}{trTZbnD4Rzms0uusHaOknA}{quickstart-es-default-1}{10.64.1.8}{10.64.1.8:9300}{cdfhilmrstw}] from last-known cluster state; node term 3, last-accepted version 105 in term 2; joining [{quickstart-es-default-2}{vnn4kgHxQYWJHGADX67xsQ}{bAcNneIsRtOoaII1s2JkCg}{quickstart-es-default-2}{10.64.0.9}{10.64.0.9:9300}{cdfhilmrstw}] in term [3] has status [waiting for response] after [9.9m/599788ms]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[quickstart-es-default-1][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"quickstart-es-default-1","elasticsearch.cluster.name":"quickstart"}

Environment

  • ECK version:

2.5.0 → 2.6.0

  • Kubernetes information:
    GKE
% kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.7-gke.900", GitCommit:"e35c4457f66187eff006dda6d2c0fe12144ef2ec", GitTreeState:"clean", BuildDate:"2022-10-26T09:25:34Z", GoVersion:"go1.18.7b7", Compiler:"gc", Platform:"linux/amd64"}
  • Resource definition:
kubectl create -f https://download.elastic.co/downloads/eck/2.5.0/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.5.0/operator.yaml
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.5.3
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
EOF

kubectl apply -f https://download.elastic.co/downloads/eck/2.6.0/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.6.0/operator.yaml
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.6.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
EOF

Upgrading to 8.6.0 without upgrading ECK to 2.6.0 does not seem to cause the same problem.

@botelastic botelastic bot added the triage label Jan 11, 2023
@barkbay barkbay added >bug Something isn't working and removed triage labels Jan 11, 2023
@barkbay
Contributor

barkbay commented Jan 11, 2023

This seems to be related to the file-based settings feature (#6148); if I manually disable it in the code, the upgrade runs fine.

From one of the nodes that cannot join the cluster:

{
	"@timestamp": "2023-01-11T07:53:06.100Z",
	"log.level": "WARN",
	"message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [YpB5rhdaSuOEMb_fG3npaA, ROL_DH0rQ4W_6oxuL-an8Q, Hgd4D5KsT8SgE5jRF0rlbg], have discovered possible quorum [{elasticsearch-sample-es-default-0}{ROL_DH0rQ4W_6oxuL-an8Q}{CXv0SmFDSNeUjeXS2lwqQw}{elasticsearch-sample-es-default-0}{10.92.146.16}{10.92.146.16:9300}{dilm}, {elasticsearch-sample-es-default-2}{YpB5rhdaSuOEMb_fG3npaA}{6n90WvzbRriBvtG6Ne8s7Q}{elasticsearch-sample-es-default-2}{10.92.144.22}{10.92.144.22:9300}{dilm}, {elasticsearch-sample-es-default-1}{Hgd4D5KsT8SgE5jRF0rlbg}{Rxoftt8STh-veVmP7VKTxQ}{elasticsearch-sample-es-default-1}{10.92.145.15}{10.92.145.15:9300}{dilm}]; discovery will continue using [10.92.144.22:9300, 10.92.145.15:9300] from hosts providers and [{elasticsearch-sample-es-default-0}{ROL_DH0rQ4W_6oxuL-an8Q}{CXv0SmFDSNeUjeXS2lwqQw}{elasticsearch-sample-es-default-0}{10.92.146.16}{10.92.146.16:9300}{dilm}] from last-known cluster state; node term 14, last-accepted version 168 in term 13; joining [{elasticsearch-sample-es-default-2}{YpB5rhdaSuOEMb_fG3npaA}{6n90WvzbRriBvtG6Ne8s7Q}{elasticsearch-sample-es-default-2}{10.92.144.22}{10.92.144.22:9300}{dilm}] in term [14] has status [waiting for response] after [6.6m/398570ms]",
	"ecs.version": "1.2.0",
	"service.name": "ES_ECS",
	"event.dataset": "elasticsearch.server",
	"process.thread.name": "elasticsearch[elasticsearch-sample-es-default-0][cluster_coordination][T#1]",
	"log.logger": "org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper",
	"elasticsearch.node.name": "elasticsearch-sample-es-default-0",
	"elasticsearch.cluster.name": "elasticsearch-sample"
}

The above IP addresses are correct:

  • elasticsearch-sample-es-default-2 has 10.92.144.22
  • elasticsearch-sample-es-default-0 has 10.92.146.16

On the other hand, elasticsearch-sample-es-default-2 seems to be using an old IP:

{
	"@timestamp": "2023-01-11T07:59:15.273Z",
	"log.level": "WARN",
	"message": "failed to retrieve stats for node [ROL_DH0rQ4W_6oxuL-an8Q]",
	"ecs.version": "1.2.0",
	"service.name": "ES_ECS",
	"event.dataset": "elasticsearch.server",
	"process.thread.name": "elasticsearch[elasticsearch-sample-es-default-2][generic][T#4]",
	"log.logger": "org.elasticsearch.cluster.InternalClusterInfoService",
	"elasticsearch.cluster.uuid": "WluJfwzIRnex8xc5DV8yKQ",
	"elasticsearch.node.id": "YpB5rhdaSuOEMb_fG3npaA",
	"elasticsearch.node.name": "elasticsearch-sample-es-default-2",
	"elasticsearch.cluster.name": "elasticsearch-sample",
	"error.type": "org.elasticsearch.transport.NodeNotConnectedException",
	"error.message": "[elasticsearch-sample-es-default-0][10.92.146.15:9300] Node not connected",
	"error.stack_trace": "org.elasticsearch.transport.NodeNotConnectedException: [elasticsearch-sample-es-default-0][10.92.146.15:9300] Node not connected"
}
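The mismatch can be confirmed by comparing the address in the `Node not connected` error against the pod's current IP reported by Kubernetes. A minimal sketch, using the log text quoted above (the `kubectl` step assumes access to the cluster, so it is shown commented out):

```shell
# Extract the address Elasticsearch is still dialing from the error message
# above: 10.92.146.15 is the stale IP, while the pod actually has 10.92.146.16.
ERROR_MSG='[elasticsearch-sample-es-default-0][10.92.146.15:9300] Node not connected'
STALE_IP=$(printf '%s' "$ERROR_MSG" | sed -n 's/.*\[\([0-9.]*\):9300\].*/\1/p')
echo "ES is dialing: $STALE_IP"

# Compare with the pod's current IP (requires cluster access):
# kubectl get pod elasticsearch-sample-es-default-0 -o jsonpath='{.status.podIP}'
```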

@barkbay
Contributor

barkbay commented Jan 11, 2023

The root cause of the issue is a bug in Elasticsearch; see elastic/elasticsearch#92812 for more details.
We are updating the release notes in #6307 #6312

@barkbay
Contributor

barkbay commented Jan 12, 2023

I'm closing this issue as ECK 2.6.1 has been released.

If some Elasticsearch control-plane nodes are not joining the cluster, with a message similar to `have discovered possible quorum ... joining [...] has status [waiting for response] after [...]`, then you may need to restart those nodes using the following command:

kubectl delete pods -l elasticsearch.k8s.elastic.co/node-master=true,elasticsearch.k8s.elastic.co/cluster-name=$CLUSTERNAME
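For example, with the quickstart cluster from the reproduction steps above (the cluster name is an assumption; substitute your own Elasticsearch resource name). The destructive `kubectl` call itself is left commented out:

```shell
# Restart only the master-eligible pods of the named cluster.
# CLUSTERNAME=quickstart is hypothetical; set it to your own resource name.
CLUSTERNAME=quickstart
SELECTOR="elasticsearch.k8s.elastic.co/node-master=true,elasticsearch.k8s.elastic.co/cluster-name=${CLUSTERNAME}"
echo "deleting pods matching: ${SELECTOR}"
# kubectl delete pods -l "${SELECTOR}"
```

The StatefulSet controller recreates the deleted pods, and the restarted nodes then discover their peers at the current addresses.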

@barkbay barkbay closed this as completed Jan 12, 2023
@morphalus

Thanks @barkbay!
