Minimize downtime during Pod recycling #2233
Conversation
barkbay
left a comment
I only left some nitpicks regarding the migration to V1. I still need to do some tests.
Force-pushed e74a1a6 to 42de81f
pebrc
left a comment
LGTM!
barkbay
left a comment
Got this failure when running the e2e test:
```
--- PASS: TestMutationWhileLoadTesting/Data_added_initially_should_still_be_present (0.13s)
--- FAIL: TestMutationWhileLoadTesting/Stopping_to_load_test (0.00s)
    mutation_test.go:281:
        Error Trace: mutation_test.go:281
                     watcher.go:77
        Error:       Not equal:
                     expected: 1
                     actual  : 2
        Test:        TestMutationWhileLoadTesting/Stopping_to_load_test
        Messages:    metrics: {"latencies":{"total":8815880891193,"mean":445719242,"50th":5751658,"95th":9378657,"99th":30000172532,"max":30000303574},"bytes_in":{"total":8631495,"mean":436.39693614439557},"bytes_out":{"total":0,"mean":0},"earliest":"2019-12-11T12:36:34.30051482Z","latest":"2019-12-11T12:39:52.330499653Z","end":"2019-12-11T12:39:52.342198435Z","duration":198029984833,"wait":11698782,"requests":19779,"rate":99.87881389113757,"throughput":0,"success":0,"status_codes":{"0":974,"401":18805},"errors":["Get https://34.77.139.214:9200/: dial tcp 0.0.0.0:0-\u003e34.77.139.214:9200: connect: connection refused","401 Unauthorized","Get https://34.77.139.214:9200/: dial tcp 0.0.0.0:0-\u003e34.77.139.214:9200: connect: no route to host","Get https://34.77.139.214:9200/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"]}
```
@barkbay I saw it too after I created the PR. While looking into it I noticed that the wait for LB IP provisioning has to be separated out into its own step, because provisioning might take so long that we miss the entire update without firing a single request. After fixing that I couldn't repro it anymore, but I also don't understand how it could have affected the result :) Even if we start hitting the endpoint in the middle of an upgrade, we should always be safe... Anyway
barkbay
left a comment
I left a comment regarding the use of a load balancer.
Could we not just use the DNS name provided by the Service to test the endpoint (https://test-while-load-testing-rch4-es-http.e2e-xxxx-mercury.svc.cluster.local:9200)?
A small update. I tried the port-forwarding approach, providing our HTTP client with a port-forwarding dialer, but it turned out I needed to reduce the request rate to avoid hitting fd limits. Even then the test felt flaky: every few runs it would fail with a few requests failing, which I assume is due to long-tail latency. Then, as discussed offline, I tried detecting local vs. in-cluster e2e execution and switching between the LB and the service name respectively. Surprisingly, this also had flakiness issues when running in the cluster. Increasing […] I took a step back and thought that what we want to validate here is not the default value, but the draining mechanism we use. So instead of trying to adjust all the other things to work under […]
barkbay
left a comment
LGTM
* Minimize downtime during pod recycling
* Add missing imports
* Fix PR comments
* PR comments fixes
* Update go.sum
* Fix panic in ContinuousHealthCheck
* Add UT for .WithPreStopHook, rename parameter
* Move waiting for LB IP provisioning to separate step
* Use forwarding for local and svc for in cluster execution
* Fix conditional variable assignment
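The `.WithPreStopHook` change in the commit list presumably renders a lifecycle hook on the Elasticsearch container. A generic Kubernetes pod spec fragment illustrating the pattern (the `sleep 30` command and duration are illustrative only, not necessarily what ECK configures):

```yaml
# Illustrative only: a preStop hook keeps the container alive briefly after
# the Pod is marked terminating, giving endpoint/kube-proxy updates time to
# stop routing new connections to it before the process shuts down.
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 30"]
```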
This PR fixes #1927 and adds: