
GH-2116: Add blocking retries to RT #2124

Merged
merged 5 commits into spring-projects:main on Feb 24, 2022

Conversation

tomazfernandes
Contributor

Previously we hardcoded a no-op back off in the DefaultErrorHandler used by the Retryable Topics feature.
This PR adds a setter to let the user provide their own back off policy and configure blocking retries in conjunction with RT.

I also thought about making this configuration possible on a per-topic basis by adding it to the builder and annotation, but I think this is enough for this release, since we won't have much time for review. If we want the per-topic option we can open an issue for it, and I can provide a PR soon.

Contributor

@garyrussell left a comment

I don't think I will merge this for 2.8.3; need to think it through some more. As I understand the logic, it will always go through blocking retries first and then through non-blocking (assuming default config). I think we need the reverse logic - i.e. don't retry (blocking) for any exceptions unless explicitly set by the user; I think that will need changes in the DLPR.

@tomazfernandes
Contributor Author

tomazfernandes commented Feb 21, 2022

> I don't think I will merge this for 2.8.3; need to think it through some more. As I understand the logic, it will always go through blocking retries first and then through non-blocking (assuming default config). I think we need the reverse logic - i.e. don't retry (blocking) for any exceptions unless explicitly set by the user; I think that will need changes in the DLPR.

Yes, I thought about this too. I think we could solve it by allowing the ExceptionClassifier class to support retryOn (i.e. an allow list), if that's acceptable for the other subclasses, and setting the default classification behavior to "false" in the DEH used by the RT logic.

@tomazfernandes
Contributor Author

We'd then have something like:

@Bean(name = RetryTopicInternalBeanNames.LISTENER_CONTAINER_FACTORY_CONFIGURER_NAME)
public ListenerContainerFactoryConfigurer lcfc(KafkaConsumerBackoffManager kafkaConsumerBackoffManager,
		DeadLetterPublishingRecovererFactory deadLetterPublishingRecovererFactory,
		@Qualifier(RetryTopicInternalBeanNames.INTERNAL_BACKOFF_CLOCK_BEAN_NAME) Clock clock) {

	ListenerContainerFactoryConfigurer lcfc =
			new ListenerContainerFactoryConfigurer(kafkaConsumerBackoffManager, deadLetterPublishingRecovererFactory, clock);
	// Blocking retries: up to 3 in-memory attempts, 50 ms apart.
	lcfc.setBlockingRetriesBackOff(new FixedBackOff(50, 3));
	// Only these exceptions are retried via blocking.
	lcfc.setErrorHandlerCustomizer(commonErrorHandler -> ((DefaultErrorHandler) commonErrorHandler)
			.addRetryableExceptions(ShouldRetryViaBlockingException.class, ShouldRetryViaBothException.class));
	return lcfc;
}

@tomazfernandes
Contributor Author

And just to be clear, with this PR as is, the default behavior would stay the same, with the no-op BackOff policy. Blocking retries would kick in only if the user provides their own BackOff policy.

@garyrussell
Contributor

Right; I get that; I just have a problem with the default behavior if you just add a back off.

Thanks for all your work on these issues, by the way.

@tomazfernandes
Contributor Author

Sure, no worries, I'm glad we were able to go through them in time for this release, thanks.

Let me know if there's anything else I can help with for this issue. We can close this PR and open another one when we have a clearer picture if you prefer.

If nothing more pressing comes up, I should start working on improving the way RT's components are configured in general, if that's OK, so that users will hopefully have a nice, clean API for these features instead of all the current non-intuitive verbosity. I'll open an issue when I have a draft so we can all take a look at it before actual implementation starts.

@garyrussell
Contributor

I am OK with this PR; I think we just need a modification to ExceptionClassifier so it can be configured with a classifier that doesn't retry by default, as you suggested in #2124 (comment).

@garyrussell
Contributor

Here's a proposal to support classifying exceptions as false by default...

garyrussell@a6bf67b

@garyrussell
Contributor

Correction: garyrussell@63c5a2f
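For readers following along, here is a minimal sketch of what "classify exceptions as false by default" means. It is not the proposal commit itself; it just uses Spring Retry's BinaryExceptionClassifier (which, if I recall correctly, backs the spring-kafka ExceptionClassifier) with a hypothetical exception type:

import java.util.HashMap;
import java.util.Map;

import org.springframework.classify.BinaryExceptionClassifier;

public class DefaultFalseClassificationSketch {

	// Hypothetical exception type, for illustration only.
	static class MyBlockingRetryException extends RuntimeException {
	}

	public static void main(String[] args) {
		Map<Class<? extends Throwable>, Boolean> retryable = new HashMap<>();
		retryable.put(MyBlockingRetryException.class, true);

		// defaultValue = false: anything not explicitly registered classifies as
		// "do not retry", which is the semantics being proposed for the DEH in the RT flow.
		BinaryExceptionClassifier classifier = new BinaryExceptionClassifier(retryable, false);

		System.out.println(classifier.classify(new MyBlockingRetryException())); // true
		System.out.println(classifier.classify(new IllegalStateException())); // false
	}
}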

garyrussell added a commit to garyrussell/spring-kafka that referenced this pull request Feb 22, 2022
artembilan pushed a commit that referenced this pull request Feb 22, 2022
Option to not retry by default and retry specific.

* Polishing and Tests

In preparation for #2124
@tomazfernandes
Contributor Author

With that improvement in the ExceptionClassifier I can resume working on this, thanks. I should have something later today for your review.

@tomazfernandes
Contributor Author

Now that we have the DEH set to default false, we no longer need the no-op back off for the retryable topics - the user can just add the retryable exceptions and use the DEH's default back off.

Since the logic for when the back off policy has no retries and the logic for when the classification returns false are a bit different, I had to make some minor adjustments to other classes so that the DLPR partition checking behavior stays the same.

As far as I could check, this adjustment should have no impact besides enabling partition checking for users who use the DLPR with an exception classification that returns false, which I think is a good thing for consistency, and it is easy to disable if needed since there's a flag for that.

Of course, if that's a problem I can revert to using the no-op back off.

Thanks

@tomazfernandes
Contributor Author

I'll just add a small improvement to the API and the docs.

* Set a {@link BackOff} to be used with blocking retries.
* You can specify the exceptions to be retried using the method
* {@link org.springframework.kafka.listener.ExceptionClassifier#addRetryableExceptions(Class[])}
* By default, no exceptions are retried via blocking.
Contributor

We should add something like "If the backoff execution returns STOP (retries are exhausted), the record will then go to the next retry topic (if present)."

Contributor Author

Yes, I think the combined behavior of blocking and non-blocking retries can be better documented in general. I'll see what I can come up with.
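Until that documentation lands, here is a rough sketch of the combined behavior being discussed, assuming a plain @RetryableTopic listener plus the blocking back off configured on the ListenerContainerFactoryConfigurer as in the earlier bean example (the listener, topic name, and exception choice are made up):

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.annotation.RetryableTopic;
import org.springframework.retry.annotation.Backoff;
import org.springframework.stereotype.Component;

@Component
class OrderListenerSketch {

	@RetryableTopic(backoff = @Backoff(delay = 2000)) // non-blocking retries via the retry topics
	@KafkaListener(topics = "orders") // hypothetical topic
	void listen(String in) {
		// Assuming this exception type was registered as blocking-retryable, the
		// DefaultErrorHandler first retries the record in-memory per the blocking
		// BackOff; once those attempts are exhausted (the back off execution returns
		// STOP), the record is forwarded to the next retry topic, where the
		// non-blocking delay applies before redelivery.
		throw new IllegalStateException("simulated failure");
	}
}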

protected DefaultErrorHandler createDefaultErrorHandlerInstance(DeadLetterPublishingRecoverer deadLetterPublishingRecoverer) {
	// Use the user-provided blocking BackOff if there is one; otherwise fall back to the DefaultErrorHandler default.
	return this.providedBlockingBackOff != null
			? new DefaultErrorHandler(deadLetterPublishingRecoverer, this.providedBlockingBackOff)
			: new DefaultErrorHandler(deadLetterPublishingRecoverer);
}
Contributor

This isn't right; the default fixed back off is (0, 9), we need (0, 0).

Why not just set the field to new FixedBackOff(0, 0);? We then don't need this test. Rename the field too.

Contributor Author

Well, considering all exceptions will be classified as false by default, it doesn't really matter what the back off policy is - we can just leave the default one, and then the user only has to add the retryable exceptions, which makes the default or provided back off kick in.

Anyway, we can set the no-op back off too, no worries. Either way, I think the modifications to the other classes to pass the consumer to the DLPR are still needed, since we might go that route for exception classifications that return false.

Contributor

Oh; Ok; makes sense to leave it.
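As an aside, a tiny runnable sketch of the (0, 9) vs. (0, 0) point above, using Spring's FixedBackOff directly:

import org.springframework.util.backoff.BackOffExecution;
import org.springframework.util.backoff.FixedBackOff;

public class NoOpBackOffSketch {

	public static void main(String[] args) {
		// FixedBackOff(interval, maxAttempts): with maxAttempts = 0 the very first
		// nextBackOff() call returns STOP, i.e. no blocking retries at all.
		BackOffExecution noRetries = new FixedBackOff(0, 0).start();
		System.out.println(noRetries.nextBackOff()); // -1 (BackOffExecution.STOP)

		// The DefaultErrorHandler default mentioned above, (0, 9), allows up to nine
		// immediate blocking retries before giving up.
		BackOffExecution nineRetries = new FixedBackOff(0, 9).start();
		System.out.println(nineRetries.nextBackOff()); // 0 (retry immediately)
	}
}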

@tomazfernandes
Contributor Author

Improved the documentation and made a slight improvement to the API.

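For context, a rough sketch of what setting the retryable exceptions directly on the configurer could look like after this change; the setBlockingRetryableExceptions method name and the back off values are assumptions for illustration, not taken from this thread:

@Bean(name = RetryTopicInternalBeanNames.LISTENER_CONTAINER_FACTORY_CONFIGURER_NAME)
public ListenerContainerFactoryConfigurer lcfc(KafkaConsumerBackoffManager kafkaConsumerBackoffManager,
		DeadLetterPublishingRecovererFactory deadLetterPublishingRecovererFactory,
		@Qualifier(RetryTopicInternalBeanNames.INTERNAL_BACKOFF_CLOCK_BEAN_NAME) Clock clock) {

	ListenerContainerFactoryConfigurer lcfc =
			new ListenerContainerFactoryConfigurer(kafkaConsumerBackoffManager, deadLetterPublishingRecovererFactory, clock);
	// Blocking retries: up to 3 in-memory attempts, 1 second apart.
	lcfc.setBlockingRetriesBackOff(new FixedBackOff(1000, 3));
	// Assumed setter: registers the exceptions to retry via blocking, replacing the
	// error handler customizer used in the earlier example.
	lcfc.setBlockingRetryableExceptions(ShouldRetryViaBlockingException.class, ShouldRetryViaBothException.class);
	return lcfc;
}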
Contributor

@garyrussell left a comment

Great work! Thanks. Just one question.

@@ -185,7 +186,8 @@ public boolean removeNotRetryableException(Class<? extends Exception> exceptionT
 	 * @see #addNotRetryableExceptions(Class...)
 	 * @see #setClassifications(Map, boolean)
 	 */
-	public boolean removeClassification(Class<? extends Exception> exceptionType) {
+	@Nullable
+	public Boolean removeClassification(Class<? extends Exception> exceptionType) {
Contributor

Why do we need this change?

Contributor Author

Thanks! This change is because, as is, if we try to remove an exception that isn't there we get an NPE from converting null to a primitive boolean. With this change it returns null and the user can deal with that as they wish.

I for one thought it would return false when the exception isn't found in the map, but it turns out Map.remove returns the previous value, or null if the key isn't found:

* <p>Returns the value to which this map previously associated the key,
* or {@code null} if the map contained no mapping for the key.

It's not strictly necessary; I can roll it back if you prefer. Or maybe it's worth adding some explanation to the javadocs.
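A minimal illustration of the unboxing issue (plain java.util.Map, nothing spring-kafka specific):

import java.util.HashMap;
import java.util.Map;

public class UnboxingNpeSketch {

	public static void main(String[] args) {
		Map<Class<? extends Exception>, Boolean> classified = new HashMap<>();
		classified.put(IllegalStateException.class, true);

		// Map.remove returns the previous value, or null if the key was absent.
		Boolean removed = classified.remove(IllegalArgumentException.class); // null

		// With a primitive boolean return type this auto-unboxes and throws a
		// NullPointerException; returning Boolean (annotated @Nullable) avoids that.
		boolean unboxed = removed; // throws NullPointerException here
		System.out.println(unboxed);
	}
}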

Contributor

Oops; thanks!

Contributor

We should change the javadoc to @return the classification of the exception if the removal was successful; null otherwise.

I can do it during the merge if you like; but I have to run out for a short while.

Contributor Author

I can do that, no worries, thanks!

Contributor Author

Changed the javadoc and added two assertions to the new LCFC methods.

@garyrussell merged commit 01549a6 into spring-projects:main on Feb 24, 2022
garyrussell pushed a commit that referenced this pull request Feb 24, 2022
* GH-2116: Add blocking retries to RT

Before we hardcoded a no-ops back off in the DefaultErrorHandler used in the Retryable Topics feature.
Adds a setter to let the user provide their own back off policy and configure blocking retries in conjunction with RT.

* Change DHE in LCFC to defaultFalse

With this we no longer need a no ops back off.
Some minor adjustments were needed to maintain behavior when the logic gets to DLPR.


* Improve API and docs

Now retryable exceptions can be set directly in the lcfc class.
Improved the docs on how to combine blocking and non-blocking behaviors.
Added what's new entry for this feature.

* Improve ExceptionClassifier JavaDoc

Also add assertions to the LCFC new methods to warn the user if they already set the blocking configurations.
@garyrussell
Contributor

...and cherry-picked as a7621a1
