Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove usage of IndexSearcher#Search(Query, Collector) from monitor package #13735

Merged
merged 7 commits into from
Sep 11, 2024

Conversation

gsmiller
Copy link
Contributor

@gsmiller gsmiller commented Sep 6, 2024

Relates to: #12892

@gsmiller
Copy link
Contributor Author

gsmiller commented Sep 6, 2024

OK, I think this is still a safe implementation but the change is that multiple collection threads will now be concurrently calling CandidateMatcher#addMatch. I believe this is safe but need another set of eyes on it. The reason I think this works is that we should never be interleaving calls with the same doc value since these are global docIDs.

@romseygeek I think you originally authored most of this. Do you have time to have a look at this change?

@javanna
Copy link
Contributor

javanna commented Sep 6, 2024

I am not familiar with this code but I believe you are correct. This should work despite we use a plain ArrayList, because each thread should only get and compute its own docs. Testing would verify that, but I believe we never provide an executor to the searcher we use, neither in test nor in the prod code. Should we try and add that?

@javanna
Copy link
Contributor

javanna commented Sep 6, 2024

Also, thanks a lot for the help here, much appreciated!

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @gsmiller

Concurrency in the monitor is handled in Parallel/PartitionMatcher outside of the index searcher, and the searcher is always built internally and not exposed to clients so there's no opportunity for an executor to be set on it.

Separately, I wonder if it's worth adding some sugar methods to CollectorManager that wrap up single-threaded collectors? I'm guessing that there are going to be many instances of the pattern "return a collector from newCollector and null from reduce", and it seems a bit of a retrograde step to replace one line of code with 13, especially if you don't actually use multithreading.

@gsmiller
Copy link
Contributor Author

gsmiller commented Sep 9, 2024

Thanks @romseygeek for providing more detail on the monitor implementation and confirming this is safe.

As for your suggestion of providing a sugar method in CollectorManager to do the wrapping for situations like this, I'm a bit conflicted. I took a crack at updating the PR here with a revision that does this for discussion. On one hand, I agree with you that it's pretty verbose (and annoying if I'm honest) for users to have to do their own Collector wrapping like this, so a convenience method would be nice. On the other hand, I'm concerned it could be trappy for users that don't read the "fine print" and end up creating concurrency bugs by incorrectly wrapping in a case where an IndexSearcher is using an Executor (and their Collector isn't thread-safe).

I think it's a balance of how often this single-threaded-collector pattern occurs and how potentially trappy this sugar method is. Right now, I'm leaning away from providing this sugar method. I want to avoid situations where separate user code is creating the CollectorManager and creating the IndexSearcher (which is easy to imagine). In that case, I worry that this could be done correct at first, only to have the IndexSearcher instantiation logic change to use an Executor, introducing tricky bugs. Thoughts?

Or maybe there's a better way to do this that's still convenient but less trappy?

@javanna
Copy link
Contributor

javanna commented Sep 9, 2024

There is a test that does the following:

QueryMatch.SIMPLE_MATCHER.createMatcher(searcher);

This uses public API and allows to provide a searcher externally, doesn't it? Do we need to enforce that the searcher used in monitor should not have an executor and/or should not try to apply concurrency?

* {@link IndexSearcher#IndexSearcher(IndexReader, Executor)}), or the provided collector is
* threadsafe.
*/
static <C extends Collector> CollectorManager<C, ?> wrap(C in) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not fantastic but I suspect we may lean on this solution for other leftover usages of Collector vs CollectorManager. Shall we clarify in the name that this is for sequential execution / single threaded?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I'll revise the name (but if you have suggestions, I'm open to them; naming can be tricky)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming is the most difficult bit :) Some options: forSequentialExecution , createSequentialManager, singleThreadedManager , or something along those lines.

@@ -37,7 +38,7 @@ abstract class CollectingMatcher<T extends QueryMatch> extends CandidateMatcher<
@Override
public void matchQuery(final String queryId, Query matchQuery, Map<String, String> metadata)
throws IOException {
searcher.search(matchQuery, new MatchCollector(queryId, scoreMode));
searcher.search(matchQuery, CollectorManager.wrap(new MatchCollector(queryId, scoreMode)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment about whether this is entirely safe, I was under the impression that one can provide an external searcher but I am not familiar with monitor code ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you could call it with an externally built searcher, but it wouldn't be very useful as the whole point is that it's a searcher over a batch of documents submitted to the monitor for checking. It's public to allow clients to build wrapping matchers.

There may be a nicer way to do this though, let me experiment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make the public-facing methods all take DocumentBatch here rather than IndexSearcher, which will make building the index searcher an implementation detail. I'll put together a PR.

@gsmiller
Copy link
Contributor Author

OK, I think the latest iteration of this is ready to merge, pending any remaining feedback, concerns, etc. I voiced some concerns earlier about the static wrapping method, wanting to make sure we don't provide some trappy implementation for users, but the suggestions around error checking/asserts convinced me this is reasonable (still don't love it, but I think it's reasonable right now).

I'll get this merged in the next couple days if I don't hear any additional concerns, otherwise happy to iterate.

Copy link
Contributor

@javanna javanna left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ship it! :) thanks again for picking this up. I think we will reuse the method you added in a couple of other places.

@javanna
Copy link
Contributor

javanna commented Sep 10, 2024

Heads up: I saw there were merge conflicts caused by my recent push to main hence I went ahead and addressed those.

@gsmiller gsmiller merged commit c26c2d6 into apache:main Sep 11, 2024
3 checks passed
@gsmiller gsmiller deleted the GH/collector-dep/monitor branch September 11, 2024 13:26
@gsmiller
Copy link
Contributor Author

Thanks @javanna, @romseygeek ! FYI @msfroh just merged and back ported. I saw you had another PR waiting on this.

@gsmiller gsmiller added this to the 9.12.0 milestone Sep 11, 2024
@javanna
Copy link
Contributor

javanna commented Sep 14, 2024

I've been changing my mind a few times on the topic, but I am concluding that it probably was not a good idea to expose the sequential collector manager. I think we are ok using it internally at our own risk, for situations where we know for sure that we can't possibly have a searcher with an executor set. But exposing it publicly makes it too easy for users to get unexpected failures, and the disclaimer "don't use this with a searcher that has an executor set" is kind of odd. What if those users do have other queries that make already use of concurrency, and use the same searcher for these others that get converted to leveraging the sequential collector manager? It is also trappy that they get errors only once there's more than one slice, so the behaviour may be hard to follow for users.

I would propose that we make the new collector manager private with a bug disclaimer on when it should be used, and make a plan to remove it perhaps in the medium to long term. We should really design all new features with concurrency in mind, and migrate old ones to support concurrency.

There is some urgency around this especially for the 9.12 release, I am happy to open a PR.

javanna added a commit to javanna/lucene that referenced this pull request Sep 14, 2024
We have recently (see apache#13735) introduced this utility method that creates a
collector manager which only works when a searcher does not have an executor
set, otherwise it throws exception once we attempt to create a new collector
for more than one slice.

While we discussed it should be safe to use in some specific scenarios like the
monitor module, we should be careful exposing this utility publicly, because
while we'd like to ease migration from the search(Query, Collector) method, we
may end up making users like even worse, in that it exposes them to failures
whenever an executor is set and there are more than one slice created, which
is hard to follow and does not provide a good user experience.

My proposal is that we use a similar collector manager locally, where safe and
required, but we don't expose it to users. In most places, we should rather
expose collector managers that do support search concurrency, rather than working
around the lack of those.
javanna added a commit that referenced this pull request Sep 17, 2024
We have recently (see #13735) introduced this utility method that creates a
collector manager which only works when a searcher does not have an executor
set, otherwise it throws exception once we attempt to create a new collector
for more than one slice.

While we discussed it should be safe to use in some specific scenarios like the
monitor module, we should be careful exposing this utility publicly, because
while we'd like to ease migration from the search(Query, Collector) method, we
may end up making users like even worse, in that it exposes them to failures
whenever an executor is set and there are more than one slice created, which
is hard to follow and does not provide a good user experience.

My proposal is that we use a similar collector manager locally, where safe and
required, but we don't expose it to users. In most places, we should rather
expose collector managers that do support search concurrency, rather than working
around the lack of those.
javanna added a commit that referenced this pull request Sep 17, 2024
We have recently (see #13735) introduced this utility method that creates a
collector manager which only works when a searcher does not have an executor
set, otherwise it throws exception once we attempt to create a new collector
for more than one slice.

While we discussed it should be safe to use in some specific scenarios like the
monitor module, we should be careful exposing this utility publicly, because
while we'd like to ease migration from the search(Query, Collector) method, we
may end up making users like even worse, in that it exposes them to failures
whenever an executor is set and there are more than one slice created, which
is hard to follow and does not provide a good user experience.

My proposal is that we use a similar collector manager locally, where safe and
required, but we don't expose it to users. In most places, we should rather
expose collector managers that do support search concurrency, rather than working
around the lack of those.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants