
[discovery-ec2] Plugin doesn't retry EC2 describe instance calls #50462

Closed
Bukhtawar opened this issue Dec 23, 2019 · 6 comments · Fixed by #50550
Labels: >bug, :Distributed Coordination/Discovery-Plugins (anything related to our integration plugins with EC2, GCP and Azure)

Comments

@Bukhtawar
Contributor

Problem Description

The discovery-ec2 plugin doesn't retry EC2 describe calls on throttling or other exceptions because its retry policy uses NO_RETRY_CONDITION, which has the potential to make the throttling worse. Although there is logic to back off exponentially in such cases, the backoff never kicks in, since no retry is ever attempted.

RetryPolicy retryPolicy = new RetryPolicy(
    RetryPolicy.RetryCondition.NO_RETRY_CONDITION,
    (originalRequest, exception, retriesAttempted) -> {
        // with 10 retries the max delay time is 320s/320000ms (10 * 2^5 * 1 * 1000)
        logger.warn("EC2 API request failed, retry again. Reason was:", exception);
        return 1000L * (long) (10d * Math.pow(2, retriesAttempted / 2.0d) * (1.0d + rand.nextDouble()));
    },
    10,
    false);
clientConfiguration.setRetryPolicy(retryPolicy);

AWS reference
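
For context, here is a minimal sketch (not the SDK's actual source; it assumes the accessors on AWS SDK for Java v1's com.amazonaws.retry.RetryPolicy) of why the backoff lambda above is dead code: the SDK consults the retry condition first and only asks the backoff strategy for a delay once a retry has been approved.

// Simplified retry-loop sketch; variables are those from the snippet above.
boolean shouldRetry = retryPolicy.getRetryCondition()
    .shouldRetry(originalRequest, exception, retriesAttempted);
if (shouldRetry == false) {
    throw exception; // NO_RETRY_CONDITION always takes this branch
}
long delayMillis = retryPolicy.getBackoffStrategy()
    .delayBeforeNextRetry(originalRequest, exception, retriesAttempted);
Thread.sleep(delayMillis); // the exponential backoff would only run here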

original-brownbear added the :Distributed Coordination/Discovery-Plugins label Dec 23, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Discovery-Plugins)

Bukhtawar changed the title [discovery-ec2] Plugin doesn't retry EC2 describe calls → [discovery-ec2] Plugin doesn't retry EC2 describe instance calls Dec 23, 2019
original-brownbear self-assigned this Dec 24, 2019
@Bukhtawar
Contributor Author

Bukhtawar commented Dec 24, 2019

Do you mind if I raise a PR for this, @original-brownbear? It's a small change, but I'm fine either way.

@original-brownbear
Member

original-brownbear commented Dec 24, 2019

@Bukhtawar I'm not sure this is entirely a small change. It seems we're not just using the EC2 SDK to make calls to the EC2 metadata service; we also use org.elasticsearch.discovery.ec2.Ec2NameResolver#resolve, which works directly from java.net.URL#openConnection() to load single metadata values. Wouldn't we have to fix those calls as well to really get proper retrying here? If we just moved to the default retry condition in the code you pasted, it seems we'd only be making part of the calls retry.

(I ran into this when I tried to test retries by faking 500s in the EC2 mock we use. If we only fixed this one spot, which I'm not sure is all that valuable, it still wouldn't be entirely trivial to test the fix either, would it? Or do you have an idea?)
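
For reference, a hedged sketch of the kind of plain-HTTP metadata lookup referred to above (the real logic lives in org.elasticsearch.discovery.ec2.Ec2NameResolver; the class name and metadata path below are illustrative): a single URL#openConnection() call that bypasses the SDK and its retry policy entirely.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class MetadataLookupSketch {
    public static void main(String[] args) throws IOException {
        // IMDSv1-style lookup of a single metadata value
        URL url = new URL("http://169.254.169.254/latest/meta-data/local-ipv4");
        URLConnection connection = url.openConnection();
        connection.setConnectTimeout(2000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            // any IOException propagates immediately; there is no retry here
            System.out.println(reader.readLine());
        }
    }
}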

@Bukhtawar
Contributor Author

Bukhtawar commented Dec 25, 2019

@original-brownbear Please correct me if I misunderstood your point, but aren't the two calls made to different endpoints, and hence logically different entities?

  1. DescribeInstances - EC2 control plane (via the SDK)
  2. Instance metadata - http://169.254.169.254/latest/meta-data/ (via URL#openConnection())

For (1), we need retries for cases where a large cluster (100 nodes or more) loses its master and the PeerFinder on every node then makes an EC2 call per second, causing account-level limits to be breached.
(2) might be different (we haven't encountered throttling there so far), but I certainly think fixing (1) is valuable and worth the change; a sketch of the change follows below.
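
A minimal sketch of that change, assuming AWS SDK for Java v1's PredefinedRetryPolicies: only the retry condition is swapped, so the existing backoff lambda is finally exercised on throttled and retryable failures.

import com.amazonaws.retry.PredefinedRetryPolicies;
import com.amazonaws.retry.RetryPolicy;

// Same policy as before, but the default retry condition lets the SDK
// actually retry (and therefore back off) on throttling and 5xx errors.
RetryPolicy retryPolicy = new RetryPolicy(
    PredefinedRetryPolicies.DEFAULT_RETRY_CONDITION, // was NO_RETRY_CONDITION
    (originalRequest, exception, retriesAttempted) -> {
        logger.warn("EC2 API request failed, retry again. Reason was:", exception);
        return 1000L * (long) (10d * Math.pow(2, retriesAttempted / 2.0d) * (1.0d + rand.nextDouble()));
    },
    10,
    false);
clientConfiguration.setRetryPolicy(retryPolicy);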

Testing
We tested (1) by restarting masters on a 150-node cluster: without the fix, more calls got throttled, while DEFAULT_RETRY ensured backoff kicked in. We could see this from the timestamps in the exception stack traces. Having said that, we are working on a better test plan.

@original-brownbear
Member

Thanks @Bukhtawar, you are correct I think. (Thanks @DaveCTurner for chiming in here!) I think I found a way to write a proper test and will raise a PR for this one shortly. We have a bunch of HTTP mocking infrastructure from the snapshot repository tests that we can reuse here, and I think I was able to craft a valid test from that :)
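
For the curious, a hedged sketch of the kind of retry test described (the class and handler below are illustrative, not the actual Elasticsearch test fixture): a local com.sun.net.httpserver.HttpServer stands in for the EC2 endpoint, fails the first request, and a counter verifies that a second attempt arrived.

import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicInteger;

public class Ec2RetryMockSketch {
    public static void main(String[] args) throws Exception {
        AtomicInteger attempts = new AtomicInteger();
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/", exchange -> {
            if (attempts.incrementAndGet() == 1) {
                // first call fails; a retrying client should come back
                exchange.sendResponseHeaders(503, -1);
            } else {
                byte[] body = "<DescribeInstancesResponse/>".getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                exchange.getResponseBody().write(body);
            }
            exchange.close();
        });
        server.start();
        // point the EC2 client's endpoint at this port, call DescribeInstances,
        // then assert the call succeeded and attempts.get() > 1
        System.out.println("mock EC2 endpoint on port " + server.getAddress().getPort());
        server.stop(0);
    }
}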

@Bukhtawar
Contributor Author

Thanks @original-brownbear and @DaveCTurner. I had one more proposal regarding the backoff strategy: use the SDKDefaultBackoffStrategy, which in newer AWS SDK versions comes configured with equal/full jitter for throttled/non-throttled exceptions. Defaults for some configurations can be overridden as below.

RetryPolicy retryPolicy = new RetryPolicy(
    PredefinedRetryPolicies.DEFAULT_RETRY_CONDITION,
    (originalRequest, exception, retriesAttempted) -> {
        logger.warn("EC2 API request failed, retry again. Reason was:", exception);
        // arguments are presumably (baseDelay, throttledBaseDelay, maxBackoff) in ms
        return new PredefinedBackoffStrategies.SDKDefaultBackoffStrategy(500, 1000, 200 * 1000)
            .delayBeforeNextRetry(originalRequest, exception, retriesAttempted);
    },
    10,
    false);
clientConfiguration.setRetryPolicy(retryPolicy);

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jan 2, 2020
Use the default retry condition instead of never retrying in the discovery plugin causing hot retries upstream and add a test that verifies retrying works.

Closes elastic#50462
original-brownbear added a commit that referenced this issue Jan 2, 2020
Use the default retry condition instead of never retrying in the discovery plugin causing hot retries upstream and add a test that verifies retrying works.

Closes #50462
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
Use the default retry condition instead of never retrying in the discovery plugin causing hot retries upstream and add a test that verifies retrying works.

Closes elastic#50462