-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove usage of IndexSearcher#Search(Query, Collector) from monitor package #13735
Conversation
OK, I think this is still a safe implementation but the change is that multiple collection threads will now be concurrently calling @romseygeek I think you originally authored most of this. Do you have time to have a look at this change? |
I am not familiar with this code but I believe you are correct. This should work despite we use a plain |
Also, thanks a lot for the help here, much appreciated! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @gsmiller
Concurrency in the monitor is handled in Parallel/PartitionMatcher outside of the index searcher, and the searcher is always built internally and not exposed to clients so there's no opportunity for an executor to be set on it.
Separately, I wonder if it's worth adding some sugar methods to CollectorManager that wrap up single-threaded collectors? I'm guessing that there are going to be many instances of the pattern "return a collector from newCollector and null from reduce", and it seems a bit of a retrograde step to replace one line of code with 13, especially if you don't actually use multithreading.
Thanks @romseygeek for providing more detail on the monitor implementation and confirming this is safe. As for your suggestion of providing a sugar method in CollectorManager to do the wrapping for situations like this, I'm a bit conflicted. I took a crack at updating the PR here with a revision that does this for discussion. On one hand, I agree with you that it's pretty verbose (and annoying if I'm honest) for users to have to do their own Collector wrapping like this, so a convenience method would be nice. On the other hand, I'm concerned it could be trappy for users that don't read the "fine print" and end up creating concurrency bugs by incorrectly wrapping in a case where an IndexSearcher is using an Executor (and their Collector isn't thread-safe). I think it's a balance of how often this single-threaded-collector pattern occurs and how potentially trappy this sugar method is. Right now, I'm leaning away from providing this sugar method. I want to avoid situations where separate user code is creating the CollectorManager and creating the IndexSearcher (which is easy to imagine). In that case, I worry that this could be done correct at first, only to have the IndexSearcher instantiation logic change to use an Executor, introducing tricky bugs. Thoughts? Or maybe there's a better way to do this that's still convenient but less trappy? |
There is a test that does the following:
This uses public API and allows to provide a searcher externally, doesn't it? Do we need to enforce that the searcher used in monitor should not have an executor and/or should not try to apply concurrency? |
* {@link IndexSearcher#IndexSearcher(IndexReader, Executor)}), or the provided collector is | ||
* threadsafe. | ||
*/ | ||
static <C extends Collector> CollectorManager<C, ?> wrap(C in) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not fantastic but I suspect we may lean on this solution for other leftover usages of Collector vs CollectorManager. Shall we clarify in the name that this is for sequential execution / single threaded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I'll revise the name (but if you have suggestions, I'm open to them; naming can be tricky)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming is the most difficult bit :) Some options: forSequentialExecution
, createSequentialManager
, singleThreadedManager
, or something along those lines.
@@ -37,7 +38,7 @@ abstract class CollectingMatcher<T extends QueryMatch> extends CandidateMatcher< | |||
@Override | |||
public void matchQuery(final String queryId, Query matchQuery, Map<String, String> metadata) | |||
throws IOException { | |||
searcher.search(matchQuery, new MatchCollector(queryId, scoreMode)); | |||
searcher.search(matchQuery, CollectorManager.wrap(new MatchCollector(queryId, scoreMode))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a comment about whether this is entirely safe, I was under the impression that one can provide an external searcher but I am not familiar with monitor code ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you could call it with an externally built searcher, but it wouldn't be very useful as the whole point is that it's a searcher over a batch of documents submitted to the monitor for checking. It's public to allow clients to build wrapping matchers.
There may be a nicer way to do this though, let me experiment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can make the public-facing methods all take DocumentBatch
here rather than IndexSearcher
, which will make building the index searcher an implementation detail. I'll put together a PR.
OK, I think the latest iteration of this is ready to merge, pending any remaining feedback, concerns, etc. I voiced some concerns earlier about the static wrapping method, wanting to make sure we don't provide some trappy implementation for users, but the suggestions around error checking/asserts convinced me this is reasonable (still don't love it, but I think it's reasonable right now). I'll get this merged in the next couple days if I don't hear any additional concerns, otherwise happy to iterate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ship it! :) thanks again for picking this up. I think we will reuse the method you added in a couple of other places.
Heads up: I saw there were merge conflicts caused by my recent push to main hence I went ahead and addressed those. |
Thanks @javanna, @romseygeek ! FYI @msfroh just merged and back ported. I saw you had another PR waiting on this. |
I've been changing my mind a few times on the topic, but I am concluding that it probably was not a good idea to expose the sequential collector manager. I think we are ok using it internally at our own risk, for situations where we know for sure that we can't possibly have a searcher with an executor set. But exposing it publicly makes it too easy for users to get unexpected failures, and the disclaimer "don't use this with a searcher that has an executor set" is kind of odd. What if those users do have other queries that make already use of concurrency, and use the same searcher for these others that get converted to leveraging the sequential collector manager? It is also trappy that they get errors only once there's more than one slice, so the behaviour may be hard to follow for users. I would propose that we make the new collector manager private with a bug disclaimer on when it should be used, and make a plan to remove it perhaps in the medium to long term. We should really design all new features with concurrency in mind, and migrate old ones to support concurrency. There is some urgency around this especially for the 9.12 release, I am happy to open a PR. |
We have recently (see apache#13735) introduced this utility method that creates a collector manager which only works when a searcher does not have an executor set, otherwise it throws exception once we attempt to create a new collector for more than one slice. While we discussed it should be safe to use in some specific scenarios like the monitor module, we should be careful exposing this utility publicly, because while we'd like to ease migration from the search(Query, Collector) method, we may end up making users like even worse, in that it exposes them to failures whenever an executor is set and there are more than one slice created, which is hard to follow and does not provide a good user experience. My proposal is that we use a similar collector manager locally, where safe and required, but we don't expose it to users. In most places, we should rather expose collector managers that do support search concurrency, rather than working around the lack of those.
We have recently (see #13735) introduced this utility method that creates a collector manager which only works when a searcher does not have an executor set, otherwise it throws exception once we attempt to create a new collector for more than one slice. While we discussed it should be safe to use in some specific scenarios like the monitor module, we should be careful exposing this utility publicly, because while we'd like to ease migration from the search(Query, Collector) method, we may end up making users like even worse, in that it exposes them to failures whenever an executor is set and there are more than one slice created, which is hard to follow and does not provide a good user experience. My proposal is that we use a similar collector manager locally, where safe and required, but we don't expose it to users. In most places, we should rather expose collector managers that do support search concurrency, rather than working around the lack of those.
We have recently (see #13735) introduced this utility method that creates a collector manager which only works when a searcher does not have an executor set, otherwise it throws exception once we attempt to create a new collector for more than one slice. While we discussed it should be safe to use in some specific scenarios like the monitor module, we should be careful exposing this utility publicly, because while we'd like to ease migration from the search(Query, Collector) method, we may end up making users like even worse, in that it exposes them to failures whenever an executor is set and there are more than one slice created, which is hard to follow and does not provide a good user experience. My proposal is that we use a similar collector manager locally, where safe and required, but we don't expose it to users. In most places, we should rather expose collector managers that do support search concurrency, rather than working around the lack of those.
Relates to: #12892