Simplify BatchingBolt implementation to just use tick tuples #125
Comments
Not necessarily a smooth transition. You can only set one tick value cluster-wide. I suppose if you always set it to e.g. "1", then the I do agree that part of the purpose of tick tuples is to avoid the need for multi-threaded "batch/flush" cycles. Perhaps we could just improve BatchingBolt such that you can provide
I like that idea. I also think this would make it a lot easier for us to test
I've been working on some other ideas regarding the AsyncBolt I showed Dan this morning, but I think for the interim using the ticks could work. I'm still not a huge fan of only having time-based batching, but eliminating the threading could definitely work. Have we verified that tick tuples keep coming through even when the topology is shutting down? The lack of tuples coming in is the current reason for threaded batching.
Is this actually a problem? I mean, isn't the normal use case for Storm "Turn it on and never turn it off"? And if you're shutting down your topology, Storm will just kill your Bolts along with your Spouts, so how would you guarantee they were processed even with the threading?
There's a topology shutdown period that's equal to your tuple timeout value where it waits for everything in-process to finish. It stops reading new tuples from the spout, so there's no new work coming in, but it still gives you an opportunity to finish anything in-flight. If you're waiting for more tuples to come in before checking the time and releasing the next batch, those tuples never show up and you never handle the tuples that are waiting -- the machine just locks up. This doesn't sound too bad if you have good bookkeeping of what's been acked, but if you're using Kafka with auto-ack, those tuples have now fallen into a black hole.
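To make the stall concrete, here's a rough sketch in plain Python. It is not streamparse code: the class name `ArrivalDrivenBatcher`, its `interval` and `clock` parameters, and the fake clock are all made up for illustration. It only assumes a batcher that checks the elapsed time when a new tuple arrives, which is the situation described above.

```python
class ArrivalDrivenBatcher:
    """Hypothetical batcher: the flush check runs only when a tuple arrives."""

    def __init__(self, interval, clock):
        self.interval = interval      # seconds between flushes (assumed parameter)
        self.clock = clock            # callable returning the current time
        self.pending = []             # tuples waiting for the next batch
        self.flushed = []             # batches that have been released
        self.last_flush = clock()

    def process(self, tup):
        self.pending.append(tup)
        # This is the only place the elapsed time is ever checked.
        if self.clock() - self.last_flush >= self.interval:
            self.flushed.append(list(self.pending))
            self.pending.clear()
            self.last_flush = self.clock()


# Fake clock so the example is deterministic.
now = [0.0]
batcher = ArrivalDrivenBatcher(interval=2.0, clock=lambda: now[0])

batcher.process("a")   # t=0
now[0] = 1.0
batcher.process("b")   # t=1: interval not reached, both tuples still pending

# The spout stops emitting (shutdown). Time passes, but process() is never
# called again, so the interval check never runs.
now[0] = 60.0
stranded = list(batcher.pending)
```

After the last line, `stranded` still holds both tuples and nothing has been flushed. Something other than tuple arrival (a background thread today, or a tick) has to trigger the flush.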
Ah, that explains it then. When I was at ETS our system had a kind of crazy ack setup, so I never used auto-ack.
It just occurred to me that now that we have a process_tick method (#124), we could just use that for BatchingBolt instead of process_batch (or just make process_tick call process_batch if we want to not break the API). I think we could also do away with the threading complications entirely using that approach. @kbourgoin, what do you think?
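Roughly what I have in mind, as a standalone sketch rather than the real BatchingBolt. The method names `process`, `process_tick` and `process_batch` come from the discussion here, but the class `TickBatchingSketch`, the signatures, and the in-memory lists are assumptions made only so the example runs on its own.

```python
class TickBatchingSketch:
    """Hypothetical stand-in for a bolt that batches on ticks, with no threads."""

    def __init__(self):
        self.pending = []   # tuples received since the last tick
        self.batches = []   # record of what process_batch was given

    def process(self, tup):
        # Regular tuples are only buffered; no timing logic here.
        self.pending.append(tup)

    def process_tick(self, tick_tup=None):
        # A tick releases whatever has accumulated. Empty ticks are skipped.
        if self.pending:
            batch, self.pending = self.pending, []
            self.process_batch(batch)

    def process_batch(self, tups):
        # User code would go here; the sketch just records the batch.
        self.batches.append(tups)


bolt = TickBatchingSketch()
bolt.process("a")
bolt.process("b")
bolt.process_tick()   # flushes ["a", "b"]
bolt.process("c")
bolt.process_tick()   # flushes ["c"]
bolt.process_tick()   # nothing pending, so process_batch is not called
```

Keeping `process_batch` as the user-facing hook means existing subclasses would not have to change, and the flush runs on the same thread that handles tuples.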