
Upgrading ECK to 2.6.0 and ES to 8.6.0 causes ES to fail to bootstrap/form a cluster #6303

Closed
SpencerLN opened this issue Jan 11, 2023 · 4 comments
Labels
>bug Something isn't working

Comments

@SpencerLN

Bug Report

What did you do?

Upgraded ECK to 2.6.0 and ES to 8.6.0

What did you expect to see?
Elasticsearch to perform a rolling upgrade and continue functioning.

What did you see instead? Under which circumstances?
The master nodes failed to successfully form a cluster after ECK performed the rolling upgrade.

{"@timestamp":"2023-01-11T01:36:15.170Z", "log.level": "WARN", "message":"master not discovered or elected yet, an election requires at least 2 nodes with ids from [GDLzi1FtQsmIynZWAABPrw, vnn4kgHxQYWJHGADX67xsQ, zfwXTetqR-uSfUkuUo7LKQ], have discovered possible quorum [{quickstart-es-default-1}{zfwXTetqR-uSfUkuUo7LKQ}{trTZbnD4Rzms0uusHaOknA}{quickstart-es-default-1}{10.64.1.8}{10.64.1.8:9300}{cdfhilmrstw}, {quickstart-es-default-2}{vnn4kgHxQYWJHGADX67xsQ}{bAcNneIsRtOoaII1s2JkCg}{quickstart-es-default-2}{10.64.0.9}{10.64.0.9:9300}{cdfhilmrstw}, {quickstart-es-default-0}{GDLzi1FtQsmIynZWAABPrw}{kJ-auB7nTyyvRumjk183ng}{quickstart-es-default-0}{10.64.2.4}{10.64.2.4:9300}{cdfhilmrstw}]; discovery will continue using [10.64.0.9:9300, 10.64.2.4:9300] from hosts providers and [{quickstart-es-default-1}{zfwXTetqR-uSfUkuUo7LKQ}{trTZbnD4Rzms0uusHaOknA}{quickstart-es-default-1}{10.64.1.8}{10.64.1.8:9300}{cdfhilmrstw}] from last-known cluster state; node term 3, last-accepted version 105 in term 2; joining [{quickstart-es-default-2}{vnn4kgHxQYWJHGADX67xsQ}{bAcNneIsRtOoaII1s2JkCg}{quickstart-es-default-2}{10.64.0.9}{10.64.0.9:9300}{cdfhilmrstw}] in term [3] has status [waiting for response] after [9.9m/599788ms]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[quickstart-es-default-1][cluster_coordination][T#1]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"quickstart-es-default-1","elasticsearch.cluster.name":"quickstart"}

Environment

  • ECK version:

2.5.0 → 2.6.0

  • Kubernetes information:
    GKE
% kubectl version
Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.0", GitCommit:"4ce5a8954017644c5420bae81d72b09b735c21f0", GitTreeState:"clean", BuildDate:"2022-05-03T13:46:05Z", GoVersion:"go1.18.1", Compiler:"gc", Platform:"darwin/arm64"}
Kustomize Version: v4.5.4
Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.7-gke.900", GitCommit:"e35c4457f66187eff006dda6d2c0fe12144ef2ec", GitTreeState:"clean", BuildDate:"2022-10-26T09:25:34Z", GoVersion:"go1.18.7b7", Compiler:"gc", Platform:"linux/amd64"}
  • Resource definition:
kubectl create -f https://download.elastic.co/downloads/eck/2.5.0/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.5.0/operator.yaml
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.5.3
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
EOF

kubectl apply -f https://download.elastic.co/downloads/eck/2.6.0/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.6.0/operator.yaml
cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 8.6.0
  nodeSets:
  - name: default
    count: 3
    config:
      node.store.allow_mmap: false
EOF

Upgrading to 8.6.0 without upgrading ECK to 2.6.0 does not seem to cause the same problem.

@botelastic botelastic bot added the triage label Jan 11, 2023
@barkbay barkbay added >bug Something isn't working and removed triage labels Jan 11, 2023
@barkbay
Contributor

barkbay commented Jan 11, 2023

This seems to be related to the file-based settings feature (#6148); if I manually disable it in the code, the upgrade runs fine.

From one of the nodes that cannot join the cluster:

{
	"@timestamp": "2023-01-11T07:53:06.100Z",
	"log.level": "WARN",
	"message": "master not discovered or elected yet, an election requires at least 2 nodes with ids from [YpB5rhdaSuOEMb_fG3npaA, ROL_DH0rQ4W_6oxuL-an8Q, Hgd4D5KsT8SgE5jRF0rlbg], have discovered possible quorum [{elasticsearch-sample-es-default-0}{ROL_DH0rQ4W_6oxuL-an8Q}{CXv0SmFDSNeUjeXS2lwqQw}{elasticsearch-sample-es-default-0}{10.92.146.16}{10.92.146.16:9300}{dilm}, {elasticsearch-sample-es-default-2}{YpB5rhdaSuOEMb_fG3npaA}{6n90WvzbRriBvtG6Ne8s7Q}{elasticsearch-sample-es-default-2}{10.92.144.22}{10.92.144.22:9300}{dilm}, {elasticsearch-sample-es-default-1}{Hgd4D5KsT8SgE5jRF0rlbg}{Rxoftt8STh-veVmP7VKTxQ}{elasticsearch-sample-es-default-1}{10.92.145.15}{10.92.145.15:9300}{dilm}]; discovery will continue using [10.92.144.22:9300, 10.92.145.15:9300] from hosts providers and [{elasticsearch-sample-es-default-0}{ROL_DH0rQ4W_6oxuL-an8Q}{CXv0SmFDSNeUjeXS2lwqQw}{elasticsearch-sample-es-default-0}{10.92.146.16}{10.92.146.16:9300}{dilm}] from last-known cluster state; node term 14, last-accepted version 168 in term 13; joining [{elasticsearch-sample-es-default-2}{YpB5rhdaSuOEMb_fG3npaA}{6n90WvzbRriBvtG6Ne8s7Q}{elasticsearch-sample-es-default-2}{10.92.144.22}{10.92.144.22:9300}{dilm}] in term [14] has status [waiting for response] after [6.6m/398570ms]",
	"ecs.version": "1.2.0",
	"service.name": "ES_ECS",
	"event.dataset": "elasticsearch.server",
	"process.thread.name": "elasticsearch[elasticsearch-sample-es-default-0][cluster_coordination][T#1]",
	"log.logger": "org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper",
	"elasticsearch.node.name": "elasticsearch-sample-es-default-0",
	"elasticsearch.cluster.name": "elasticsearch-sample"
}

The above IP addresses are correct:

  • elasticsearch-sample-es-default-2 has 10.92.144.22
  • elasticsearch-sample-es-default-0 has 10.92.146.16

On the other hand, elasticsearch-sample-es-default-2 seems to be using an old IP:

{
	"@timestamp": "2023-01-11T07:59:15.273Z",
	"log.level": "WARN",
	"message": "failed to retrieve stats for node [ROL_DH0rQ4W_6oxuL-an8Q]",
	"ecs.version": "1.2.0",
	"service.name": "ES_ECS",
	"event.dataset": "elasticsearch.server",
	"process.thread.name": "elasticsearch[elasticsearch-sample-es-default-2][generic][T#4]",
	"log.logger": "org.elasticsearch.cluster.InternalClusterInfoService",
	"elasticsearch.cluster.uuid": "WluJfwzIRnex8xc5DV8yKQ",
	"elasticsearch.node.id": "YpB5rhdaSuOEMb_fG3npaA",
	"elasticsearch.node.name": "elasticsearch-sample-es-default-2",
	"elasticsearch.cluster.name": "elasticsearch-sample",
	"error.type": "org.elasticsearch.transport.NodeNotConnectedException",
	"error.message": "[elasticsearch-sample-es-default-0][10.92.146.15:9300] Node not connected",
	"error.stack_trace": "org.elasticsearch.transport.NodeNotConnectedException: [elasticsearch-sample-es-default-0][10.92.146.15:9300] Node not connected"
}
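The mismatch can be confirmed by comparing the address in the `Node not connected` error against the pod's current IP reported by Kubernetes. A minimal sketch, using the log text quoted above (the `kubectl` step assumes access to the cluster, so it is shown commented out):

```shell
# Extract the address Elasticsearch is still dialing from the error message
# above: 10.92.146.15 is the stale IP, while the pod actually has 10.92.146.16.
ERROR_MSG='[elasticsearch-sample-es-default-0][10.92.146.15:9300] Node not connected'
STALE_IP=$(printf '%s' "$ERROR_MSG" | sed -n 's/.*\[\([0-9.]*\):9300\].*/\1/p')
echo "ES is dialing: $STALE_IP"

# Compare with the pod's current IP (requires cluster access):
# kubectl get pod elasticsearch-sample-es-default-0 -o jsonpath='{.status.podIP}'
```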

@barkbay
Contributor

barkbay commented Jan 11, 2023

The root cause of the issue is a bug in Elasticsearch; see elastic/elasticsearch#92812 for more details.
We are updating the release notes in #6307 #6312

@barkbay
Contributor

barkbay commented Jan 12, 2023

I'm closing this issue as ECK 2.6.1 has been released.

If some Elasticsearch control-plane nodes are not joining the cluster, with a message similar to `have discovered possible quorum ... joining [...] has status [waiting for response] after [...]`, then you may need to restart those nodes using the following command:

kubectl delete pods -l elasticsearch.k8s.elastic.co/node-master=true,elasticsearch.k8s.elastic.co/cluster-name=$CLUSTERNAME
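For example, with the quickstart cluster from the reproduction steps above (the cluster name is an assumption; substitute your own Elasticsearch resource name). The destructive `kubectl` call itself is left commented out:

```shell
# Restart only the master-eligible pods of the named cluster.
# CLUSTERNAME=quickstart is hypothetical; set it to your own resource name.
CLUSTERNAME=quickstart
SELECTOR="elasticsearch.k8s.elastic.co/node-master=true,elasticsearch.k8s.elastic.co/cluster-name=${CLUSTERNAME}"
echo "deleting pods matching: ${SELECTOR}"
# kubectl delete pods -l "${SELECTOR}"
```

The StatefulSet controller recreates the deleted pods, and the restarted nodes then discover their peers at the current addresses.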

@barkbay barkbay closed this as completed Jan 12, 2023
@morphalus

Thanks @barkbay!
