
[discovery-ec2] Plugin doesn't retry EC2 describe instance calls #50462

Closed
Bukhtawar opened this issue Dec 23, 2019 · 6 comments · Fixed by #50550
Labels: >bug, :Distributed Coordination/Discovery-Plugins (anything related to our integration plugins with EC2, GCP and Azure)

Comments

@Bukhtawar
Contributor

Problem Description

The discovery-ec2 plugin doesn't retry EC2 describe calls on throttling or other exceptions because its retry policy uses NO_RETRY_CONDITION, which has the potential to make the throttling worse. Although there is logic to back off exponentially in such cases, the backoff never kicks in, since no retry is ever attempted.

RetryPolicy retryPolicy = new RetryPolicy(
    RetryPolicy.RetryCondition.NO_RETRY_CONDITION,
    (originalRequest, exception, retriesAttempted) -> {
        // with 10 retries the max delay time is 320s/320000ms (10 * 2^5 * 1 * 1000)
        logger.warn("EC2 API request failed, retry again. Reason was:", exception);
        return 1000L * (long) (10d * Math.pow(2, retriesAttempted / 2.0d) * (1.0d + rand.nextDouble()));
    },
    10,
    false);
clientConfiguration.setRetryPolicy(retryPolicy);

AWS reference
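
For context, here is a minimal sketch (not the SDK's actual source; it assumes the accessors on AWS SDK for Java v1's com.amazonaws.retry.RetryPolicy) of why the backoff lambda above is dead code: the SDK consults the retry condition first and only asks the backoff strategy for a delay once a retry has been approved.

// Simplified retry-loop sketch; variables are those from the snippet above.
boolean shouldRetry = retryPolicy.getRetryCondition()
    .shouldRetry(originalRequest, exception, retriesAttempted);
if (shouldRetry == false) {
    throw exception; // NO_RETRY_CONDITION always takes this branch
}
long delayMillis = retryPolicy.getBackoffStrategy()
    .delayBeforeNextRetry(originalRequest, exception, retriesAttempted);
Thread.sleep(delayMillis); // the exponential backoff would only run here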

original-brownbear added the :Distributed Coordination/Discovery-Plugins label Dec 23, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Discovery-Plugins)

Bukhtawar changed the title [discovery-ec2] Plugin doesn't retry EC2 describe calls → [discovery-ec2] Plugin doesn't retry EC2 describe instance calls Dec 23, 2019
original-brownbear self-assigned this Dec 24, 2019
@Bukhtawar
Contributor Author

Bukhtawar commented Dec 24, 2019

Do you mind if I raise a PR for this, @original-brownbear? It's a small change, but I'm fine either way.

@original-brownbear
Member

original-brownbear commented Dec 24, 2019

@Bukhtawar I'm not sure this is entirely a small change. It seems we're not just using the EC2 SDK to make calls to the EC2 metadata service; we also use org.elasticsearch.discovery.ec2.Ec2NameResolver#resolve, which works directly from java.net.URL#openConnection() to load single metadata values. Wouldn't we have to fix those calls as well to really get proper retrying here? If we just moved to the default retry condition in the code you pasted, it seems we'd only be making part of the calls retry.

(I ran into this when I tried to test retries by faking 500s in the EC2 mock we use. If we only fixed this one spot, which I'm not sure is all that valuable, it still wouldn't be entirely trivial to test the fix either, would it? Or do you have an idea?)
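
For reference, a hedged sketch of the kind of plain-HTTP metadata lookup referred to above (the real logic lives in org.elasticsearch.discovery.ec2.Ec2NameResolver; the class name and metadata path below are illustrative): a single URL#openConnection() call that bypasses the SDK and its retry policy entirely.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class MetadataLookupSketch {
    public static void main(String[] args) throws IOException {
        // IMDSv1-style lookup of a single metadata value
        URL url = new URL("http://169.254.169.254/latest/meta-data/local-ipv4");
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(2000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            // any IOException propagates immediately; there is no retry here
            System.out.println(reader.readLine());
        }
    }
}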

@Bukhtawar
Contributor Author

Bukhtawar commented Dec 25, 2019

@original-brownbear Please correct me if I misunderstood your point, but aren't the two calls made to different endpoints, and hence logically different entities?

  1. DescribeInstances - EC2 control plane (via the SDK)
  2. Instance metadata - http://169.254.169.254/latest/meta-data/ (via URL#openConnection())

For (1), we need retries for cases where a large cluster (100 nodes or more) loses its master and the PeerFinder on every node then makes an EC2 call per second, causing account-level limits to be breached.
(2) might be different (we haven't encountered throttling there so far), but I certainly think fixing (1) is valuable and worth the change; a sketch of the change follows below.
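
A minimal sketch of that change, assuming AWS SDK for Java v1's PredefinedRetryPolicies: only the retry condition is swapped, so the existing backoff lambda is finally exercised on throttled and retryable failures.

import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.retry.RetryPolicy;

// Same policy as before, but the default retry condition lets the SDK
// actually retry (and therefore back off) on throttling and 5xx errors.
RetryPolicy retryPolicy = new RetryPolicy(
    PredefinedRetryPolicies.DEFAULT_RETRY_CONDITION, // was NO_RETRY_CONDITION
    (originalRequest, exception, retriesAttempted) -> {
        logger.warn("EC2 API request failed, retry again. Reason was:", exception);
        return 1000L * (long) (10d * Math.pow(2, retriesAttempted / 2.0d) * (1.0d + rand.nextDouble()));
    },
    10,
    false);
clientConfiguration.setRetryPolicy(retryPolicy);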

Testing
We tested (1) by restarting masters on a 150-node cluster: without the fix, more calls got throttled, while DEFAULT_RETRY ensured backoff kicked in. We could see this from the timestamps in the exception stack traces. Having said that, we are working on a better test plan.

@original-brownbear
Member

Thanks @Bukhtawar, you are correct I think. (Thanks @DaveCTurner for chiming in here!) I think I found a way to write a proper test and will raise a PR for this one shortly. We have a bunch of HTTP mocking infrastructure from the snapshot repository tests that we can reuse here, and I think I was able to craft a valid test from that :)
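
For the curious, a hedged sketch of the kind of retry test described (the class and handler below are illustrative, not the actual Elasticsearch test fixture): a local com.sun.net.httpserver.HttpServer stands in for the EC2 endpoint, fails the first request, and a counter verifies that a second attempt arrived.

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

public class Ec2RetryMockSketch {
    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            if (attempts.incrementAndGet() == 1) {
                // first call fails; a retrying client should come back
                exchange.sendResponseHeaders(503, -1);
            } else {
                byte[] body = "<DescribeInstancesResponse/>".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            }
            exchange.close();
        });
        server.start();
        // point the EC2 client's endpoint at this port, call DescribeInstances,
        // then assert the call succeeded and attempts.get() > 1
        System.out.println("mock EC2 endpoint on port " + server.getAddress().getPort());
        server.stop(0);
    }
}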

@Bukhtawar
Contributor Author

Thanks @original-brownbear and @DaveCTurner. I had one more proposal regarding the backoff strategy: use the SDKDefaultBackoffStrategy, which in newer AWS SDK versions comes configured with equal/full jitter for throttled/non-throttled exceptions. Defaults for some configurations can be overridden as below.

RetryPolicy retryPolicy = new RetryPolicy(
    PredefinedRetryPolicies.DEFAULT_RETRY_CONDITION,
    (originalRequest, exception, retriesAttempted) -> {
        logger.warn("EC2 API request failed, retry again. Reason was:", exception);
        // arguments are presumably (baseDelay, throttledBaseDelay, maxBackoff) in ms
        return new PredefinedBackoffStrategies.SDKDefaultBackoffStrategy(500, 1000, 200 * 1000)
            .delayBeforeNextRetry(originalRequest, exception, retriesAttempted);
    },
    10,
    false);
clientConfiguration.setRetryPolicy(retryPolicy);

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 2, 2020
Use the default retry condition instead of never retrying in the discovery plugin causing hot retries upstream and add a test that verifies retrying works.

Closes elastic#50462
original-brownbear added a commit that referenced this issue Jan 2, 2020
Use the default retry condition instead of never retrying in the discovery plugin causing hot retries upstream and add a test that verifies retrying works.

Closes #50462
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
Use the default retry condition instead of never retrying in the discovery plugin causing hot retries upstream and add a test that verifies retrying works.

Closes elastic#50462