
BulkIngester - Running listener code in separate thread pool #830

Merged
6 commits merged into main from bulk-ingester-listener-pool on Jun 18, 2024

Conversation

@l-trotta (Contributor) commented Jun 5, 2024

The current logic makes the calling thread execute whatever code is in the listener, so if that code gets stuck or slows down, every ingester thread can eventually end up blocked. This PR fixes the issue by running the listener code in the same thread pool as the flusher thread.
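
To illustrate the idea (a self-contained sketch, not the client's actual code; the Listener interface and class name here are made up):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Conceptual sketch: the listener callback is handed to a shared scheduler pool
// instead of running on the thread that adds/flushes operations.
public class ListenerOffloadSketch {
    interface Listener { void afterBulk(long executionId); }

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
        Listener listener = executionId ->
                System.out.println("afterBulk #" + executionId + " on " + Thread.currentThread().getName());

        // Before: the ingester thread called the listener directly, so a slow or stuck
        // listener blocked the thread performing add/flush operations.
        // listener.afterBulk(1L);

        // After: the callback runs on the scheduler/flusher pool and the submitting
        // thread returns immediately.
        scheduler.submit(() -> listener.afterBulk(1L));

        scheduler.shutdown();
    }
}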

@l-trotta requested a review from swallez on June 5, 2024 10:13
@l-trotta force-pushed the bulk-ingester-listener-pool branch from 26d5542 to 1ea3164 on June 5, 2024 14:57
@swallez (Member) left a comment

LGTM!

@l-trotta merged commit 71d6401 into main on Jun 18, 2024
7 checks passed
@l-trotta deleted the bulk-ingester-listener-pool branch on June 18, 2024 09:39
blackwinter added a commit to hbz/limetrans that referenced this pull request Aug 12, 2024
blackwinter added a commit to hbz/limetrans that referenced this pull request Aug 12, 2024
blackwinter added a commit to hbz/limetrans that referenced this pull request Aug 12, 2024
blackwinter added a commit to hbz/limetrans that referenced this pull request Aug 13, 2024
blackwinter added a commit to hbz/limetrans that referenced this pull request Aug 14, 2024
@marcreichman-pfi

@swallez @l-trotta Can someone help me understand the impact of these changes if listener threads do some non-trivial work and take time? Would the overall processor slow down because of the tied-up flusher thread? We've written code which has, to date, scaled with the number of listener threads doing work, which blocks new ingesting work in a predictable way. The current code worked equivalently to the HLRC bulk processor, and to the TransportClient bulk processor before that. I'm concerned that this makes a fundamental shift in the (desired) blocking role these threads have held against new indexing while listeners are executing their methods.

blackwinter added a commit to blackwinter/limetrans that referenced this pull request Aug 16, 2024
@l-trotta (Contributor, Author)

Hello @marcreichman-pfi, sorry for the wait; we were still in the process of fixing bulk-ingester-related bugs and wanted the situation to be stable before answering.

As explained in the first comment, before these changes the Bulk Ingester ran Listener tasks directly on the same threads that perform the add and flush operations, meaning that heavy Listener tasks would slow down the Bulk Ingester; in extreme cases, where an operation gets stuck waiting, every available ingester thread could end up stuck there, stopping execution entirely.

The new approach has a thread pool execute Listener tasks; this pool can either be managed by the user (external) or be a simple one provided by default (internal). Either way, it is used both to schedule flush operations and to run Listener tasks, which are submitted to the pool right after an operation has been sent to Elasticsearch.
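
For illustration, a trivial listener that just logs which thread runs the callbacks: with these changes the output shows the scheduler's threads (for the internal default, the "bulk-ingester-executor#..." ones) rather than the thread that called add(). Signatures follow the client's BulkListener<Context> interface with Context = Void here; double-check them against your client version.

BulkListener<Void> loggingListener = new BulkListener<Void>() {
    @Override
    public void beforeBulk(long executionId, BulkRequest request, List<Void> contexts) {
        System.out.println("beforeBulk #" + executionId + " on " + Thread.currentThread().getName());
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, List<Void> contexts, BulkResponse response) {
        System.out.println("afterBulk #" + executionId + " on " + Thread.currentThread().getName()
                + " (" + response.items().size() + " items)");
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, List<Void> contexts, Throwable failure) {
        System.out.println("afterBulk #" + executionId + " failed on " + Thread.currentThread().getName());
        failure.printStackTrace();
    }
};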

Initially we didn't consider that Listener tasks could take longer and not be done by the time the Bulk Ingester finishes its operations and gets close()d. PR #867 fixes this by introducing a synchronization mechanism that makes sure every Listener task has completed before the Bulk Ingester starts its shutdown procedure.
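
Conceptually, that synchronization can be pictured like the sketch below (illustration only, not the actual code of PR #867; scheduler, listener and the callback arguments are assumed to be in scope):

// Each submitted listener task registers with a Phaser; close() waits until all of
// them have arrived before the ingester shuts down.
Phaser pendingListenerTasks = new Phaser(1); // party 0 is the ingester itself

// When a bulk request has been sent and its listener callback is scheduled:
pendingListenerTasks.register();
scheduler.submit(() -> {
    try {
        listener.afterBulk(executionId, request, contexts, response);
    } finally {
        pendingListenerTasks.arriveAndDeregister();
    }
});

// In close(): block until every registered listener task has completed.
pendingListenerTasks.arriveAndAwaitAdvance();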

I hope this clarifies the new changes; let me know if there's anything else.

@marcreichman-pfi

@l-trotta Thanks for your response. I do understand the rationale for this and the subsequent change. My biggest concern is that, since the days of the original BulkProcessor in the transport client, and then the versions in the HLRC and in this client, the blocking on bulk index (and delete) calls when all requests are going out has been helpful for establishing a sliding window of sorts. My worry is that with these changes the indexing calls will all be accepted, causing memory usage to balloon.

I will have to experiment and see how things go. Which is the first shipping version of this code, and is there documentation on the option you mentioned for providing a custom thread pool for the listener execution threads?

@l-trotta (Contributor, Author)

@marcreichman-pfi If you have a blocking mechanism that throttles incoming requests based on the execution of Listener tasks, I think it should still work the same. You can test this right now in version 8.15.0, but be aware that when the Bulk Ingester is closed some Listener tasks could still be enqueued and end up being ignored; the fix for that should be out in the next patch.
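
As a sketch of such an external blocking mechanism (not something the client provides; ingester, doc and the capacity of 10_000 are placeholders, and the listener would be registered on the builder):

Semaphore inFlight = new Semaphore(10_000);

BulkListener<Void> listener = new BulkListener<Void>() {
    @Override
    public void beforeBulk(long executionId, BulkRequest request, List<Void> contexts) {
        // nothing to do before the request goes out
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, List<Void> contexts, BulkResponse response) {
        // ... do the listener work here, then free capacity:
        inFlight.release(request.operations().size());
    }

    @Override
    public void afterBulk(long executionId, BulkRequest request, List<Void> contexts, Throwable failure) {
        inFlight.release(request.operations().size());
    }
};

// Producer side: block here, outside the ingester, instead of relying on its internal queue.
inFlight.acquire(); // throws InterruptedException; handle as appropriate
ingester.add(op -> op.index(idx -> idx.index("my-index").document(doc)));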

The explanation for the custom thread pool is in the method signatures of the Bulk Ingester builder; to summarize:

  • You can use any thread pool that implements ScheduledExecutorService and pass it either to scheduler() or to flushInterval() to set up the scheduled flusher thread (see the usage sketch after the snippet below).
  • The default one is defined like so:
Executors.newScheduledThreadPool(maxRequests + 1, (r) -> {
    Thread t = Executors.defaultThreadFactory().newThread(r);
    t.setName("bulk-ingester-executor#" + ingesterId + "#" + t.getId());
    t.setDaemon(true);
    return t;
});
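
For example, a minimal sketch of wiring in an external pool (esClient and listener are assumed to already exist; method names follow the builder options described above):

ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(4, r -> {
    Thread t = Executors.defaultThreadFactory().newThread(r);
    t.setName("my-bulk-ingester#" + t.getId());
    t.setDaemon(true);
    return t;
});

BulkIngester<Void> ingester = BulkIngester.of(b -> b
    .client(esClient)                    // ElasticsearchClient, assumed to exist
    .maxOperations(1000)
    .flushInterval(5, TimeUnit.SECONDS)
    .scheduler(scheduler)                // external pool: runs flushes and Listener tasks
    .listener(listener)                  // BulkListener<Void>, assumed to exist
);

// ... add operations, then:
ingester.close();       // flushes what's left and waits for pending requests
scheduler.shutdown();   // an external scheduler is the caller's responsibility to shut down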

@marcreichman-pfi

@l-trotta I will take a look and test; thank you for your explanations and responses.

My instinct is that in the original model, any bulk request looking to go out would block if all threads were tied up, whether they were making their own bulk requests or waiting for the listener response. In the new model, regardless of which pool is involved, if listeners are doing non-trivial work the listening pool (the same as the flushing pool?) could grow a large queue of work while bulk requests continue to go out. Given that listening and flushing share the same pool, maybe it will all work out to the same equivalent pattern, but I'd be hesitant that what was once an outside-observable queue/blocking wait on calls to .index, .delete, etc. would now fly through those methods while an inside, non-observable thread pool queue builds up inside the bulk ingester code. That would be concerning since it's harder to figure out what is going on.
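
One way to keep that queue observable, assuming an external executor is supplied via scheduler() as sketched above, is to poll the executor's own queue:

ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(4);
// build the BulkIngester with .scheduler(scheduler), then periodically:
int pending = scheduler.getQueue().size();   // flush + listener tasks still waiting to run
int active = scheduler.getActiveCount();     // tasks currently executing
System.out.println("ingester pool backlog: " + pending + " queued, " + active + " active");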
