Skip to content

Conversation

@andrew4699
Copy link
Contributor

Previously BasePolarisCatalog had very aggressive retries. It would do up to 12 retries and would retry unauthorized, not found, and other typically non-retryable errors.

This PR makes the retry behavior more sensible. I added a config for the max number of retries, with a default value of 12 for backward compatibility. This is an improvement over the hard-coded value of 12. I also restricted the retries to timeouts, rate limits, 500, 503, and unknown errors.

@andrew4699 andrew4699 requested a review from dimas-b February 22, 2025 01:18
Copy link
Contributor

@dimas-b dimas-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

@andrew4699 andrew4699 force-pushed the aguterman-improve-basepolariscatalog-retries branch from 30b91e2 to d475c5a Compare February 24, 2025 17:52
Comment on lines 269 to 275
public static final PolarisConfiguration<Integer> MAX_METADATA_REFRESH_RETRIES =
PolarisConfiguration.<Integer>builder()
.key("MAX_METADATA_REFRESH_RETRIES")
.description(
"How many times to retry refreshing metadata when the previous error was retryable")
.defaultValue(12)
.build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the backward compatibility,
[doubt] These retries are on top of SDK retries ?
why are we did we go with a such high number ? default for hive catalog impl is 2 more details
: apache/iceberg#5510 (comment)

Copy link
Contributor

@flyrain flyrain Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These retries are on top of SDK retries

I believe so. This is done by Iceberg Task under the hood. Is it possible to make the default a reasonable number while changing the current deployment config to 12 for compatibility? The concern is that users normally don't change the default value until they hit issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

current deployment config

Curious what you mean by this. Do you mean setting this in one of the k8s/application properties files?

I'd be very much in favor of lowering the default as 12 is quite large and potentially dangerous. I'd be surprised if anyone is getting benefit from it being so high.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean setting this in one of the k8s/application properties files?

I was talking about the Snowfalke's Polaris deployment.

I'd be surprised if anyone is getting benefit from it being so high.

I believe not. The only compatibility concern is the current production deployment, including Snowflake, Zsaler, etc. Given that we don't have a release yet, the impact will manageable. In general, it should be fine to prefer a reasonable retry number(2 or 3) to compatibility. cc @MonkeyCanCode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated it to 2!

Comment on lines 269 to 275
public static final PolarisConfiguration<Integer> MAX_METADATA_REFRESH_RETRIES =
PolarisConfiguration.<Integer>builder()
.key("MAX_METADATA_REFRESH_RETRIES")
.description(
"How many times to retry refreshing metadata when the previous error was retryable")
.defaultValue(12)
.build();
Copy link
Contributor

@flyrain flyrain Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These retries are on top of SDK retries

I believe so. This is done by Iceberg Task under the hood. Is it possible to make the default a reasonable number while changing the current deployment config to 12 for compatibility? The concern is that users normally don't change the default value until they hit issues.

Copy link
Contributor

@flyrain flyrain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @andrew4699 !

/** Signifies that we could not extract an HTTP code from a given cloud exception */
public static final int UNKNOWN_CLOUD_HTTP_CODE = -1;

@VisibleForTesting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we could remove this tag as it needs to be public for both non-test code and test code.

return true;
}
return false;
/** Helper to retrieve the max number of metadata refresh retries */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I guess this comment isn't necessary as the method name is descriptive enough.

Arguments.of(
new HttpResponseException("", new FakeAzureHttpResponse(code), ""),
true)))
.flatMap(Function.identity());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I believe there's a "concat" method for Streams.

@dimas-b dimas-b enabled auto-merge (squash) February 26, 2025 22:24
@dimas-b dimas-b merged commit a4c62b3 into apache:main Feb 26, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from Ready to merge to Done in Basic Kanban Board Feb 26, 2025
gh-yzou pushed a commit to gh-yzou/polaris that referenced this pull request Mar 6, 2025
* More sensible BasePolarisCatalog retry behavior

* Reuse AWS/GCP retryable property

* Update default to 2
travis-bowen pushed a commit to travis-bowen/polaris that referenced this pull request Jun 20, 2025
…1046)" (apache#32)

* More sensible BasePolarisCatalog retry behavior (apache#1046)

* More sensible BasePolarisCatalog retry behavior

* Reuse AWS/GCP retryable property

* Update default to 2

* Fix conflicts

---------

Co-authored-by: Andrew Guterman <andrew.guterman1@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants