Conversation

@david-kow (Contributor) commented Dec 10, 2019

This PR fixes #1927 and adds:

  • a HEADLESS_SERVICE_NAME env variable to ES Pods
  • pre-stop-hook-script.sh to the scripts ConfigMap
  • a preStop lifecycle hook to ES Pods
  • an e2e test with a watcher issuing high-RPS requests against ES during Pod recycling
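Conceptually, the injected hook can be sketched as below. This is a minimal Go sketch using simplified stand-in types rather than the real k8s.io/api/core/v1 structs; the script mount path and the example service name are assumptions for illustration, not the operator's actual values.

```go
package main

import "fmt"

// Simplified stand-ins for the corev1 types the operator populates.
type EnvVar struct{ Name, Value string }
type Lifecycle struct{ PreStopCommand []string }
type Container struct {
	Env       []EnvVar
	Lifecycle Lifecycle
}

// withPreStopHook sketches the idea of the PR: inject the headless
// service name into the container environment and register a preStop
// hook that runs the script mounted from the ConfigMap. The mount
// path below is an assumption for illustration.
func withPreStopHook(c Container, headlessService string) Container {
	c.Env = append(c.Env, EnvVar{Name: "HEADLESS_SERVICE_NAME", Value: headlessService})
	c.Lifecycle = Lifecycle{PreStopCommand: []string{
		"bash", "-c", "/scripts/pre-stop-hook-script.sh",
	}}
	return c
}

func main() {
	c := withPreStopHook(Container{}, "my-cluster-es-default")
	fmt.Printf("env: %s=%s\n", c.Env[0].Name, c.Env[0].Value)
	fmt.Println("preStop:", c.Lifecycle.PreStopCommand)
}
```

The point of the hook is that the Pod can wait for its endpoint to be removed (and for clients to stop resolving it) before Elasticsearch is actually stopped.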

@barkbay (Contributor) left a comment
I only left some nitpicks regarding the migration to V1.
I still need to run some tests.

@pebrc (Collaborator) left a comment

LGTM!

@barkbay (Contributor) left a comment

Got this failure when running the e2e test:

--- PASS: TestMutationWhileLoadTesting/Data_added_initially_should_still_be_present (0.13s)
    --- FAIL: TestMutationWhileLoadTesting/Stopping_to_load_test (0.00s)
        mutation_test.go:281:
                Error Trace:    mutation_test.go:281
                                                        watcher.go:77
                Error:          Not equal:
                                expected: 1
                                actual  : 2
                Test:           TestMutationWhileLoadTesting/Stopping_to_load_test
                Messages:       [metrics: %v {"latencies":{"total":8815880891193,"mean":445719242,"50th":5751658,"95th":9378657,"99th":30000172532,"max":30000303574},"bytes_in":{"total":8631495,"mean":436.39693614439557},"bytes_out":{"total":0,"mean":0},"earliest":"2019-12-11T12:36:34.30051482Z","latest":"2019-12-11T12:39:52.330499653Z","end":"2019-12-11T12:39:52.342198435Z","duration":198029984833,"wait":11698782,"requests":19779,"rate":99.87881389113757,"throughput":0,"success":0,"status_codes":{"0":974,"401":18805},"errors":["Get https://34.77.139.214:9200/: dial tcp 0.0.0.0:0-\u003e34.77.139.214:9200: connect: connection refused","401 Unauthorized","Get https://34.77.139.214:9200/: dial tcp 0.0.0.0:0-\u003e34.77.139.214:9200: connect: no route to host","Get https://34.77.139.214:9200/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"]}]

@david-kow (Contributor, Author) commented

@barkbay I saw it too after I created the PR.

While looking into it, I noticed that waiting for LB IP provisioning has to be separated into its own step, because it might take so long that we miss the entire update without firing a single request. After fixing that I couldn't reproduce the failure anymore, but I also don't understand how it could have affected it :) Even if we start hitting ES in the middle of an upgrade, we should always be safe...

Anyway, "0":974 shows almost 1k errors, and it seems to me that it took a while (over 1s) for clients to stop using the IP address that was returned by DNS. The Pod IP was either already not routable (connect: no route to host) at that point, or it was routable but nothing was listening ("99th":30000172532 -> over 30 seconds). As I couldn't reproduce it with the latest commit (out of ~10 tries), I'd leave it as is and come back to it if the test turns out to be flaky.

@barkbay (Contributor) left a comment

I left a comment regarding the use of a load balancer.
Could we not just use the DNS name provided by the service to test the endpoint (https://test-while-load-testing-rch4-es-http.e2e-xxxx-mercury.svc.cluster.local:9200)?
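The suggestion above can be sketched as a small helper that builds the cluster-internal service DNS name instead of using the load balancer IP. The "cluster.local" suffix assumes the default cluster domain, and the namespace in the usage example is hypothetical.

```go
package main

import "fmt"

// serviceDNS builds the cluster-internal DNS name of the ES HTTP
// service, following the standard Kubernetes service naming scheme
// <service>.<namespace>.svc.<cluster-domain>. The default
// "cluster.local" domain is assumed here.
func serviceDNS(service, namespace string) string {
	return fmt.Sprintf("https://%s.%s.svc.cluster.local:9200", service, namespace)
}

func main() {
	// The namespace name here is a hypothetical example.
	fmt.Println(serviceDNS("test-while-load-testing-rch4-es-http", "e2e-ns"))
}
```

This only resolves from inside the cluster, which is why the discussion below distinguishes local from in-cluster e2e execution.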

@pebrc pebrc added >enhancement Enhancement of existing functionality v1.0.0 labels Dec 12, 2019
@david-kow (Contributor, Author) commented

A small update.

I tried the port-forwarding approach, providing our HTTP client with a port-forwarding dialer, but it turned out I needed to reduce the request rate to avoid hitting file descriptor limits. Even then the test felt flaky: every few runs it would fail with a few failing requests, which I assume is due to long-tail latency.

Then, as discussed offline, I tried detecting local vs. in-cluster e2e execution and switching between the LB and the service name respectively. Surprisingly, this also had flakiness issues when running in the cluster. Increasing ADDITIONAL_WAIT_TIME helps, but I didn't want to tune that setting for the test case and make upgrades longer. Also, I feel 1s is a good default, while anything above that feels arbitrary.

I took a step back and realized that what we want to validate here is not the default value, but the draining mechanism we use. So instead of trying to adjust everything else to work under 1s, I'll override the additional wait time just for the test. Users can do the same to adjust that value if they feel it's necessary. This will let us make sure that the hook is not doing anything harmful, that the upgrade can progress, and that draining is not broken in any fundamental way. While the test is not watertight, it achieves (I believe) all it can given the long-tail latency I'm seeing.

@david-kow david-kow requested a review from barkbay December 13, 2019 10:06
@barkbay (Contributor) left a comment

LGTM

@thbkrkr thbkrkr changed the title Minimize downtime during pod recycling Minimize downtime during Pod recycling Jan 9, 2020
mjmbischoff pushed a commit to mjmbischoff/cloud-on-k8s that referenced this pull request Jan 13, 2020
* Minimize downtime during pod recycling

* Add missing imports

* Fix PR comments

* PR comments fixes

* Update go.sum

* Fix panic in ContinuousHealthCheck

* Add UT for .WithPreStopHook, rename parameter

* Move waiting for LB IP provisioning to separate step

* Use forwarding for local and svc for in cluster execution

* Fix conditional variable assignment

Labels

>enhancement Enhancement of existing functionality v1.0.0


Development

Successfully merging this pull request may close these issues.

Minimize downtime by ensuring ES Pods endpoints are removed before ES is stopped

4 participants