Conversation

@david-kow (Contributor) commented Dec 10, 2019

This PR fixes #1927 and adds:

  • a HEADLESS_SERVICE_NAME env variable to ES Pods
  • pre-stop-hook-script.sh to the scripts ConfigMap
  • a preStop lifecycle hook to ES Pods
  • an e2e test with a watcher issuing high-RPS requests against ES during Pod recycling
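Conceptually, the injected hook can be sketched as below. This is a minimal Go sketch using simplified stand-in types rather than the real k8s.io/api/core/v1 structs; the script mount path and the example service name are assumptions for illustration, not the operator's actual values.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types the operator populates.
type EnvVar struct{ Name, Value string }
type Lifecycle struct{ PreStopCommand []string }
type Container struct {
	Env       []EnvVar
	Lifecycle Lifecycle
}

// withPreStopHook sketches the idea of the PR: inject the headless
// service name into the container environment and register a preStop
// hook that runs the script mounted from the ConfigMap. The mount
// path below is an assumption for illustration.
func withPreStopHook(c Container, headlessService string) Container {
	c.Env = append(c.Env, EnvVar{Name: "HEADLESS_SERVICE_NAME", Value: headlessService})
	c.Lifecycle = Lifecycle{PreStopCommand: []string{
		"bash", "-c", "/scripts/pre-stop-hook-script.sh",
	}}
	return c
}

func main() {
	c := withPreStopHook(Container{}, "my-cluster-es-default")
	fmt.Printf("env: %s=%s\n", c.Env[0].Name, c.Env[0].Value)
	fmt.Println("preStop:", c.Lifecycle.PreStopCommand)
}
```

The point of the hook is that the Pod can wait for its endpoint to be removed (and for clients to stop resolving it) before Elasticsearch is actually stopped.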

@barkbay (Contributor) left a comment
I only left some nitpicks regarding the migration to V1.
I still need to run some tests.

@pebrc (Collaborator) left a comment

LGTM!

@barkbay (Contributor) left a comment

Got this failure when running the e2e test:

--- PASS: TestMutationWhileLoadTesting/Data_added_initially_should_still_be_present (0.13s)
    --- FAIL: TestMutationWhileLoadTesting/Stopping_to_load_test (0.00s)
        mutation_test.go:281:
                Error Trace:    mutation_test.go:281
                                                        watcher.go:77
                Error:          Not equal:
                                expected: 1
                                actual  : 2
                Test:           TestMutationWhileLoadTesting/Stopping_to_load_test
                Messages:       [metrics: %v {"latencies":{"total":8815880891193,"mean":445719242,"50th":5751658,"95th":9378657,"99th":30000172532,"max":30000303574},"bytes_in":{"total":8631495,"mean":436.39693614439557},"bytes_out":{"total":0,"mean":0},"earliest":"2019-12-11T12:36:34.30051482Z","latest":"2019-12-11T12:39:52.330499653Z","end":"2019-12-11T12:39:52.342198435Z","duration":198029984833,"wait":11698782,"requests":19779,"rate":99.87881389113757,"throughput":0,"success":0,"status_codes":{"0":974,"401":18805},"errors":["Get https://34.77.139.214:9200/: dial tcp 0.0.0.0:0-\u003e34.77.139.214:9200: connect: connection refused","401 Unauthorized","Get https://34.77.139.214:9200/: dial tcp 0.0.0.0:0-\u003e34.77.139.214:9200: connect: no route to host","Get https://34.77.139.214:9200/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"]}]

@david-kow (Contributor, Author) commented

@barkbay I saw it too after I created the PR.

While looking into it, I noticed that waiting for LB IP provisioning has to be separated into its own step, because it might take so long that we miss the entire update without firing a single request. After fixing that I couldn't reproduce the failure anymore, but I also don't understand how it could have affected it :) Even if we start hitting ES in the middle of an upgrade, we should always be safe...

Anyway, "0":974 shows almost 1k errors, and it seems to me that it took a while (over 1s) for clients to stop using the IP address that was returned by DNS. The Pod IP was either already not routable (connect: no route to host) at that point, or it was routable but nothing was listening ("99th":30000172532 -> over 30 seconds). As I couldn't reproduce it with the latest commit (out of ~10 tries), I'd leave it as is and come back to it if the test turns out to be flaky.

@barkbay (Contributor) left a comment

I left a comment regarding the use of a load balancer.
Could we not just use the DNS name provided by the service to test the endpoint (https://test-while-load-testing-rch4-es-http.e2e-xxxx-mercury.svc.cluster.local:9200)?
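The suggestion above can be sketched as a small helper that builds the cluster-internal service DNS name instead of using the load balancer IP. The "cluster.local" suffix assumes the default cluster domain, and the namespace in the usage example is hypothetical.

```go
package main

import "fmt"

// serviceDNS builds the cluster-internal DNS name of the ES HTTP
// service, following the standard Kubernetes service naming scheme
// <service>.<namespace>.svc.<cluster-domain>. The default
// "cluster.local" domain is assumed here.
func serviceDNS(service, namespace string) string {
	return fmt.Sprintf("https://%s.%s.svc.cluster.local:9200", service, namespace)
}

func main() {
	// The namespace name here is a hypothetical example.
	fmt.Println(serviceDNS("test-while-load-testing-rch4-es-http", "e2e-ns"))
}
```

This only resolves from inside the cluster, which is why the discussion below distinguishes local from in-cluster e2e execution.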

@pebrc pebrc added >enhancement Enhancement of existing functionality v1.0.0 labels Dec 12, 2019
@david-kow (Contributor, Author) commented

A small update.

I tried the port-forwarding approach, providing our HTTP client with a port-forwarding dialer, but it turned out I needed to reduce the request rate to avoid hitting file descriptor limits. Even then the test felt flaky: every few runs it would fail with a few failing requests, which I assume is due to long-tail latency.

Then, as discussed offline, I tried detecting local vs. in-cluster e2e execution and switching between the LB and the service name respectively. Surprisingly, this also had flakiness issues when running in the cluster. Increasing ADDITIONAL_WAIT_TIME helps, but I didn't want to tune that setting for the test case and make upgrades longer. Also, I feel 1s is a good default, while anything above that feels arbitrary.

I took a step back and realized that what we want to validate here is not the default value, but the draining mechanism we use. So instead of trying to adjust everything else to work under 1s, I'll override the additional wait time just for the test. Users can do the same to adjust that value if they feel it's necessary. This will let us make sure that the hook is not doing anything harmful, that the upgrade can progress, and that draining is not broken in any fundamental way. While the test is not watertight, it achieves (I believe) all it can given the long-tail latency I'm seeing.

@david-kow david-kow requested a review from barkbay December 13, 2019 10:06
@barkbay (Contributor) left a comment

LGTM

@thbkrkr thbkrkr changed the title Minimize downtime during pod recycling Minimize downtime during Pod recycling Jan 9, 2020
mjmbischoff pushed a commit to mjmbischoff/cloud-on-k8s that referenced this pull request Jan 13, 2020
* Minimize downtime during pod recycling

* Add missing imports

* Fix PR comments

* PR comments fixes

* Update go.sum

* Fix panic in ContinuousHealthCheck

* Add UT for .WithPreStopHook, rename parameter

* Move waiting for LB IP provisioning to separate step

* Use forwarding for local and svc for in cluster execution

* Fix conditional variable assignment

Labels

>enhancement Enhancement of existing functionality v1.0.0


Development

Successfully merging this pull request may close these issues.

Minimize downtime by ensuring ES Pods endpoints are removed before ES is stopped

4 participants