
Question: Is maxUncommittedMessagesToHandle infinitely growable? #490

Closed
ang-zeyu opened this issue Dec 6, 2022 · 8 comments
Labels
question Further information is requested

Comments

@ang-zeyu

ang-zeyu commented Dec 6, 2022

Hi there! Thank you for this great library. We are exploring it as an option for a cross-team use case where we cannot easily increase the number of partitions to scale consumption.

One question regarding a parameter: is maxUncommittedMessagesToHandle infinitely growable in terms of memory usage and overflow (I assume the ceiling is something like u64::MAX)? Or is there anything else to be aware of?

The scenario we are planning for is that:

  • We encounter a poison pill that is being infinitely retried
  • We don't have the resources to set up DLTs or retry topics
  • We would still like to continue processing incoming messages in a timely manner, until manual intervention is done to move the committed offset past the poison pill, and restart the consumer.

If relevant we are thinking of using unordered or "ordered by key" processing modes.
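For context, roughly how we're imagining wiring this up. This is just a sketch based on the README; names like maxConcurrency, createEosStreamProcessor, and getSingleConsumerRecord are what we gathered from the docs and may not match the current API exactly:

```java
import io.confluent.parallelconsumer.ParallelConsumerOptions;
import io.confluent.parallelconsumer.ParallelConsumerOptions.ProcessingOrder;
import io.confluent.parallelconsumer.ParallelStreamProcessor;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.List;
import java.util.Properties;

public class KeyOrderedConsumerSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "our-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // PC manages offsets itself
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        var kafkaConsumer = new KafkaConsumer<String, String>(props);

        var options = ParallelConsumerOptions.<String, String>builder()
                .consumer(kafkaConsumer)
                .ordering(ProcessingOrder.KEY)   // or ProcessingOrder.UNORDERED
                .maxConcurrency(100)             // scale consumption beyond the partition count
                .build();

        ParallelStreamProcessor<String, String> processor =
                ParallelStreamProcessor.createEosStreamProcessor(options);
        processor.subscribe(List.of("our-topic"));

        // throwing from the handler triggers a retry of just that record
        processor.poll(context -> handle(context.getSingleConsumerRecord()));
    }

    static void handle(ConsumerRecord<String, String> record) {
        // placeholder for our business logic
    }
}
```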

@astubbs
Contributor

astubbs commented Dec 8, 2022

You're most welcome! I really enjoy working on it. And there are some big improvements about to be released too (keep an eye on my fork if curious).

Where did you see maxUncommittedMessagesToHandle, or are you speaking hypothetically?
Ah! I just found it - oh dear. Yes, it's in the README, but the configuration was removed a long time ago, when we added per-record ack persistence; we didn't think it was needed, as its original intent was to prevent a big replay upon rebalance (back when per-record acks weren't persisted). I'll remove it from the docs.

At the moment, there is no limit enforced (although there used to be). If you would like this, please file a feature request. It hasn't been something many have asked about, but I still think it's a good idea.

So people should still monitor their traditional consumer lag as normal.

You can also monitor the record context in your user function, to see if a record has failed too many times, and decide what to do about it. So although I think it should be implemented, there are ways to handle the situation as is. But I'm keen to hear your thoughts!
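Roughly like this (a sketch from memory; the accessor names such as getSingleRecord() / getNumberOfFailedAttempts() depend on your version, so check the javadoc, and log / businessLogic are your own code):

```java
processor.poll(context -> {
    ConsumerRecord<String, String> rec = context.getSingleConsumerRecord();
    // the record context exposes how many times this record has failed so far
    int failures = context.getSingleRecord().getNumberOfFailedAttempts();

    if (failures > 10) {
        // decide what to do: log it, park the payload somewhere, or just return --
        // returning without throwing marks the record as successfully handled
        log.error("giving up on key {} after {} attempts", rec.key(), failures);
        return;
    }

    businessLogic(rec); // throwing here schedules a retry
});
```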

astubbs added a commit that referenced this issue Dec 8, 2022
The configuration was removed a long time ago, when we added per-record ack persistence; we didn't think it was needed, as its original intent was to prevent a big replay upon rebalance (back when per-record acks weren't persisted).

See #490
@astubbs
Contributor

astubbs commented Dec 8, 2022

Removing in #503

@astubbs
Contributor

astubbs commented Dec 8, 2022

We would still like to continue processing incoming messages in a timely manner, until manual intervention is done to move the committed offset past the poison pill, and restart the consumer.

FYI just in case - you can do this programmatically by simply not throwing an error from your function, and PC will assume the record processed successfully.

In newer versions you will be able to throw a Skip exception, but the effect is the same, just more readable and with better logs.
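i.e. something along these lines (a sketch only; MyPoisonPillException, businessLogic, and log are stand-ins for your own code):

```java
processor.poll(context -> {
    ConsumerRecord<String, String> rec = context.getSingleConsumerRecord();
    try {
        businessLogic(rec);
    } catch (MyPoisonPillException e) {
        // swallowing the exception means PC considers the record processed, i.e. it is
        // skipped -- the same effect the future Skip exception will give, just less explicit
        log.warn("skipping poison pill at {}-{} offset {}",
                rec.topic(), rec.partition(), rec.offset(), e);
    }
});
```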

astubbs added a commit that referenced this issue Dec 8, 2022
…#503)

The configuration was removed a long time ago, when we added per-record ack persistence; we didn't think it was needed, as its original intent was to prevent a big replay upon rebalance (back when per-record acks weren't persisted).

See #490
astubbs added the question label Dec 8, 2022
@ang-zeyu
Author

Thank you for clarifying! Having no limit is actually just what we need.

Message loss is not desired in our case, so we're actually thinking of having infinite retries. (Let the consumer lag alert kick in and notify us; we then fix the underlying problem, or at least whitelist that specific poison pill scenario.)

We'd still like to have the other incoming messages that are fine get processed normally, while poison pills continuously get retried with backoff. I think it's in the README here too.

Wondering if there are limits to how far past the committed offset PC can still process messages:

x - poison pill
y - message that would be processed successfully

x y y y 
x y y y times 10000 - will the 10000th message be processed normally?
x y y y times 10000000 - are there any other concerns at this point? (e.g. memory usage)

Big replay upon restarting the consumer is ok too, it is more important for us not to lose any messages and be notified of the issue.

@ang-zeyu
Author

Slightly unrelated, but a little more about our use case too:

Poison pills should be quite rare or non-existent. We have had RabbitMQ up and running with infinite retries and no DLQs for a couple of years now, and are planning a migration to Kafka. So DLT/retry topics are lower priority, but we'd still like to confirm what would happen in the event one occurs.

Our event volume is also fairly high (~< 10 million per day).

@astubbs
Contributor

astubbs commented Dec 21, 2022

Wondering if there are limits to how far past the committed offset PC can still process messages:

It's effectively Integer.MAX_VALUE, and in the next version Long.MAX_VALUE

x y y y times 10000 - will the 10000th message be processed normally?

yes :) - you can try this in a simple test.

x y y y times 10000000 - are there any other concerns at this point? (e.g. memory usage)

nope :) . Memory usage is proportional mainly to the max concurrency target and the number of /failing/ messages.

Big replay upon restarting the consumer is ok too, it is more important for us not to lose any messages and be notified of the issue.

Should be no replay of ack'd records.

So DLT/retry topics are lower priority, but we'd still like to confirm what would happen in the event one occurs.

poison pills - your configured retry settings kick in (so infinite would be Integer.MAX_VALUE retries - however, we can make it /actually/ infinite if you'd like?).

And you can have any retry delay policy you wish.
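e.g. an exponential backoff could look roughly like this (a sketch from memory; the builder method names defaultMessageRetryDelay / retryDelayProvider and the getNumberOfFailedAttempts accessor may differ slightly in your version):

```java
import java.time.Duration;

var options = ParallelConsumerOptions.<String, String>builder()
        .consumer(kafkaConsumer)
        .defaultMessageRetryDelay(Duration.ofSeconds(1))   // flat fallback delay
        .retryDelayProvider(recordContext -> {
            // exponential backoff: 2^attempts seconds, capped at 10 minutes
            int attempts = recordContext.getNumberOfFailedAttempts();
            long seconds = Math.min(600, 1L << Math.min(attempts, 30));
            return Duration.ofSeconds(seconds);
        })
        .build();
```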

Our event volume is also fairly high (~< 10 million per day).

awesome! looking forward to hearing about your journey some more! and your feature requests :)

@ang-zeyu
Author

@astubbs Thank you for the detailed reply. 😄

It's effectively Integer.MAX_VALUE, and in the next version Long.MAX_VALUE

so infinite would be Integer.MAX_VALUE retries - however, we can make it /actually/ infinite if you'd like?

That's great to know, and quite a bit of headroom for our use case; I think we would long have fixed the issue before ever reaching these limits.

My general thoughts on the second item are that it would be very rare in practice to hit this limit, especially with something like exponential retry backoff. I also don't think a quick retry policy would make much sense with PC anyway, since consumption goes on in parallel and resources aren't wasted waiting on the failing record.

I'm also thinking the retry count in PC would simply overflow (correct me if I'm wrong), which is not critical for any application implementing infinite retry (the retry count wouldn't be used anyway).

Should be no replay of ack'd records.

I hadn't noticed this section before. That is really amazing stuff 🤩

@ang-zeyu
Author

I'll close this as that's all my questions for now. Thanks for the help!
