
Simplify BatchingBolt implementation to just use tick tuples #125

Closed
dan-blanchard opened this issue Apr 14, 2015 · 6 comments · Fixed by #137
@dan-blanchard
Member

It just occurred to me that now that we have a process_tick method (#124), we could just use that for BatchingBolt instead of process_batch (or make process_tick call process_batch if we don't want to break the API). I think we could also do away with the threading complications entirely with that approach. @kbourgoin, what do you think?
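A minimal sketch of what that could look like. Only the process/process_tick/process_batch hook names mirror the library; the plain-object base, class name, and buffer/flush logic are illustrative, not the actual BatchingBolt code:

```python
# Minimal sketch, not the real streamparse BatchingBolt. In streamparse this
# would subclass Bolt; everything here besides the hook names is illustrative.
class TickFlushingBolt(object):
    def initialize(self, conf, ctx):
        self._batch = []

    def process(self, tup):
        # Buffer tuples instead of handing them off to a background thread.
        self._batch.append(tup)

    def process_tick(self, tick_tup):
        # Flush whenever Storm sends a tick tuple; no timer thread needed.
        if self._batch:
            batch, self._batch = self._batch, []
            self.process_batch(batch)

    def process_batch(self, batch):
        # Subclasses implement this, just like the existing BatchingBolt API.
        raise NotImplementedError
```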

@amontalenti
Contributor

Not necessarily a smooth transition. You can only set one tick frequency for the whole topology. I suppose if you always set it to e.g. "1", then the BatchingBolt could do something like "every N ticks, flush". But if you wanted different batch rates for different bolts, things might get complicated.

I do agree that part of the purpose of tick tuples is to avoid the need for multi-threaded "batch/flush" cycles. Perhaps we could just improve BatchingBolt such that you can provide ticks_between_batches instead of secs_between_batches?
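One way the ticks_between_batches idea could sit on top of the sketch above; the attribute name follows the proposal in this comment, not a shipped API:

```python
# Hedged sketch of "every N ticks, flush": count tick tuples and only flush
# the buffer every ticks_between_batches ticks.
class CountingTickBolt(TickFlushingBolt):
    ticks_between_batches = 5  # e.g. with a 1-second tick frequency, ~5s batches

    def initialize(self, conf, ctx):
        super(CountingTickBolt, self).initialize(conf, ctx)
        self._tick_count = 0

    def process_tick(self, tick_tup):
        self._tick_count += 1
        if self._tick_count >= self.ticks_between_batches:
            self._tick_count = 0
            # Delegate the actual flush to the base sketch.
            super(CountingTickBolt, self).process_tick(tick_tup)
```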

@dan-blanchard
Member Author

> Perhaps we could just improve BatchingBolt such that you can provide ticks_between_batches instead of secs_between_batches?

I like that idea. I also think this would make it a lot easier for us to test BatchingBolt, because currently the tests require sleeps to make sure enough time has gone by, which is prone to issues when the Travis machines are overloaded.
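For illustration, a test against the sketches above needs no sleeps at all, which is the point being made here; this uses plain unittest rather than any streamparse test machinery:

```python
# Illustrative test: drive the bolt with explicit tick calls instead of
# waiting for wall-clock time to pass.
import unittest


class RecordingBolt(TickFlushingBolt):
    def initialize(self, conf, ctx):
        super(RecordingBolt, self).initialize(conf, ctx)
        self.flushed = []

    def process_batch(self, batch):
        self.flushed.append(batch)


class TickBatchingTest(unittest.TestCase):
    def test_flush_on_tick(self):
        bolt = RecordingBolt()
        bolt.initialize({}, None)
        bolt.process("tuple-1")
        bolt.process("tuple-2")
        self.assertEqual(bolt.flushed, [])   # nothing flushed yet
        bolt.process_tick(None)              # simulate Storm delivering a tick
        self.assertEqual(bolt.flushed, [["tuple-1", "tuple-2"]])
```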

@kbourgoin
Member

I've been working on some other ideas regarding the AsyncBolt I showed Dan this morning, but I think for the interim using the ticks could work. I'm still not a huge fan of only having time-based batching, but eliminating the threading could definitely work. Have we verified that tick tuples keep coming through even when the topology is shutting down? The lack of tuples coming in is the current reason for threaded batching.

@dan-blanchard
Member Author

> Have we verified that tick tuples keep coming through even when the topology is shutting down?

Is this actually a problem? I mean, isn't the normal use case for Storm "turn it on and never turn it off"? And if you're shutting down your topology, Storm will just kill your Bolts along with your Spouts, so how would you guarantee the buffered tuples were processed even with the threading?

@kbourgoin
Member

There's a topology shutdown period, equal to your tuple timeout value, where it waits for everything in-flight to finish. It stops reading new tuples from the spout, so there's no new work coming in, but it still gives you an opportunity to finish anything in-flight. If you're waiting for more tuples to come in before checking the time and releasing the next batch, those tuples never show up, so the tuples already sitting in the batch are never handled -- the bolt just locks up.

This doesn't sound too bad if you have good bookkeeping of what's been acked, but if you're using Kafka with auto-ack, those tuples have now fallen into a black hole.

@dan-blanchard
Member Author

> if you're using Kafka with auto-ack, those tuples have now fallen into a black hole.

Ah, that explains it then. When I was at ETS, our system had a kind of crazy ack setup, so I never used auto-ack.
