add multiple caches for accelerating available container count calculation #667
Conversation
There is a bit missing for tracking the counts in ProcessingShard.removeStaleWorkContainersFromShard() - I think it would need to remove the container from retryQueue, check and decrement expiredRetryContainerCnt, and decrement the main availableWorkContainerCnt.
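As a minimal sketch of that bookkeeping - not the actual ProcessingShard code; the fields and helpers here (entries, isStale, isRetryDelayElapsed) are illustrative stand-ins - removing a stale container would also undo its contribution to the cached counters:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for the shard, only to illustrate the suggested counter bookkeeping.
class ShardSketch {
    Map<Long, WorkSketch> entries;          // offset -> container held by the shard
    Queue<WorkSketch> retryQueue;           // containers waiting out a retry delay
    AtomicLong availableWorkContainerCnt = new AtomicLong();
    AtomicLong expiredRetryContainerCnt = new AtomicLong();

    void removeStaleWorkContainersFromShard() {
        Iterator<WorkSketch> it = entries.values().iterator();
        while (it.hasNext()) {
            WorkSketch wc = it.next();
            if (!wc.isStale()) continue;

            it.remove();                                        // drop it from the shard
            boolean wasQueuedForRetry = retryQueue.remove(wc);  // and from the retry queue
            if (wasQueuedForRetry && wc.isRetryDelayElapsed()) {
                expiredRetryContainerCnt.decrementAndGet();     // it was counted as an expired retry
            }
            availableWorkContainerCnt.decrementAndGet();        // it was counted as available work
        }
    }

    // Illustrative container type - the real WorkContainer exposes different methods.
    interface WorkSketch {
        boolean isStale();
        boolean isRetryDelayElapsed();
    }
}
```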
Hmm, the more I get my head round this, the more I think we should solve it with a two-fold approach. So what I am proposing is to keep the
@rkolesnev Thanks a lot for your review. I think your solution is also a good idea, but it still has to scan through the retryQueue. I totally understand your concern, but please take a look at the flow diagram above and the explanations. Let me give you an example: for old flow:
for new flow:
Conclusion
@sangreal - I had tested the calculations using integration tests and they match my explanation below; I can provide the test if needed. I had an existing test modified in place to have a large retry queue that is drained slowly, and observed the

The problem with updating the available work that is in the retry queue and has had its delay elapse is that the
For example, if we have 1000 items in the retry queue, all with a delay of 5 seconds - once the delay has elapsed, all 1000 are available for work - but we would only update the available work as we take items into processing. Say with 16 processing threads we would take only 16, and the rest would still be in the retry queue but not counted as available work.

It all goes back to the fact that the retry delay is time based - and the only way to really know which / how many containers in the retry queue have their retry delay elapsed and are available as work is to scan the retry queue. It does not have to be a full scan, as the retry queue is sorted by retry due time.

In general I don't think scanning the retry queue will have big performance implications - the thinking being that if the retry queue is small then the scan is fast, and if the retry queue is large then we are in a bad state anyway and probably not that concerned about the overhead introduced by scanning it, as processing is already slowed down by having a lot of messages to retry.
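To illustrate the early-exit scan over the sorted retry queue (a sketch only - RetryEntry and its dueAt field are invented stand-ins for the real WorkContainer and its retry-due timestamp):

```java
import java.time.Instant;
import java.util.NavigableSet;

class RetryScanSketch {
    // Invented stand-in for the real WorkContainer: just an offset and a retry-due timestamp.
    record RetryEntry(long offset, Instant dueAt) {}

    // Counts how many queued retries are already due. Because the set is ordered by due
    // time (earliest first), the scan can stop at the first entry that is not yet due.
    static int countExpiredRetries(NavigableSet<RetryEntry> retryQueue, Instant now) {
        int expired = 0;
        for (RetryEntry e : retryQueue) {
            if (e.dueAt().isAfter(now)) {
                break;          // every later entry is due even later - no need to keep scanning
            }
            expired++;
        }
        return expired;
    }
}
```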
@rkolesnev Thanks for the detailed explanation. I got your points.
@rkolesnev I have updated the PR according to your suggestions; meanwhile I keep the
@rkolesnev Thanks for your detailed review. I have fixed everything according to your comments, except for one item. Please review again.
Regarding this,
Sure - but we have to keep the logic uniform - we cannot exclude in-flight work that was never retried but include in-flight work that is being retried; that would just give a weird number / behaviour that differs based on whether the messages were retried or not. Let me have another look at the code - I am thinking if
Ok - yeah - that is already taken care of by decrementing
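Roughly, the uniform bookkeeping being discussed could look like the sketch below - all names are illustrative, not the project's actual API - where a container stops counting as available the moment it is handed to a processing thread, whether it is a first attempt or a retry:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.atomic.AtomicLong;

class TakeWorkSketch {
    // Illustrative container: only tracks whether it is a retry attempt.
    record Work(long offset, boolean retry) {}

    final Deque<Work> available = new ArrayDeque<>();   // first attempts and due retries
    final AtomicLong availableWorkContainerCnt = new AtomicLong();
    final AtomicLong expiredRetryContainerCnt = new AtomicLong();

    // Same decrement regardless of whether the work was ever retried, so the "available"
    // number never depends on the retry history of the in-flight work.
    Work takeWorkForProcessing() {
        Work wc = available.poll();
        if (wc != null) {
            availableWorkContainerCnt.decrementAndGet();    // in-flight work is no longer available
            if (wc.retry()) {
                expiredRetryContainerCnt.decrementAndGet(); // it was also counted as an expired retry
            }
        }
        return wc;
    }
}
```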
Ok - so I am happy enough with it - thank you very much for going back and forth on the PR with me.
Thanks for your time on the review! Let me change my previous PR's code a bit, related to stale container handling, and get back to you.
@rkolesnev Please check the updates on stale container removal for the retryQueue; thanks a lot for your review again.
Hi @sangreal - the PR is ready to be merged - can you please sign the Contributor License Agreement (CLA)?
@rkolesnev I find it quite weird, since I already signed it last year when I contributed before. And when I try to sign again, I cannot, because it shows I have already signed. Please let me know if this is a blocker; if it is, I will revoke this one and try to sign again.
Description
This PR reopens the previously approved PR #644.
Explanation
actually available count = all available container count - (count of retry containers whose retry delay has not yet elapsed)
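For example, a quick worked instance of that formula (the figures are invented purely for illustration):

```java
// 100 containers are held as available in total, 30 of them are retries whose
// retry delay has not elapsed yet, so only 70 can actually be handed out as work.
long availableWorkContainerCnt = 100;  // all available containers, retries included
long pendingRetryCnt = 30;             // retries still waiting out their delay
long actuallyAvailable = availableWorkContainerCnt - pendingRetryCnt;  // = 70
```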
prev flow
current flow
Possible Question
Q: Previously the expired retry count was calculated on the fly, which looks more accurate than the new flow?
A: Actually they will eventually be the same, since the controlLoop will wait for <= (latest retry time) and then update the caches. In the previous flow, the available work container count also only updated after (latest retry time).

Checklist