Skip to content

consider reusing IndexSearcher objects #1132

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
vladak opened this issue Jul 11, 2016 · 12 comments · Fixed by #1186
Closed

consider reusing IndexSearcher objects #1132

vladak opened this issue Jul 11, 2016 · 12 comments · Fixed by #1186
Milestone

Comments

@vladak
Copy link
Member

vladak commented Jul 11, 2016

In issue #1116 there was a suggestion to find out if the initial part of searchMultiDatabase() where FSDirectory.open is called and MultiReader is instantiated could be called just once / after each reindex. For webapp, the place for potential modification is SearchHelper#prepareExec().

If this is doable, this should speed up multi-project searches with high number of projects significantly (assuming file system cache is warmed up).

@vladak
Copy link
Member Author

vladak commented Jul 11, 2016

FWIW MMapDirectory.map() can perform multiple mmap() syscalls per index - it splits the mapping into chunks (either 2^30 bytes = 1GB in case of 64-bit JRE or 2^28 = 256 MB) so the cost is even higher.

@vladak
Copy link
Member Author

vladak commented Jul 11, 2016

The proof of concept code shaved thousands of syscalls to mere hundreds. I don't have the performance numbers handy but it feels definitely faster.

@vladak
Copy link
Member Author

vladak commented Jul 11, 2016

Note-to-self: the key for good testing coverage is to also perform couple of back-to-back negative searches as they trigger suggestions which also use IndexReaders (and close it in the end).

@tarzanek
Copy link
Contributor

nice so they can get reused eventually?

@vladak
Copy link
Member Author

vladak commented Jul 12, 2016

That's the whole point.

@tarzanek
Copy link
Contributor

lovely :)

@vladak
Copy link
Member Author

vladak commented Jul 12, 2016

Performed some performance measurements. The file system cache was warmed beforehands with vmtouch, took 10 measurements in each case by searching all 774 projects with keyword 'current' in quick succession (using browser) and recorded the times reported by the webapp.

Before (0.13rc1):
2108
2243
2119
2147
1790
1892
2124
2166
2168
2288

After:
3492
921
677
1067
1416
724
741
811
733
1122

Obviously, the first search 'after' had to mmap/open all index files so it took longer. After that the object reuse is visible.

In total, the average search times (2104 vs. 1170 miliseconds) make it even more clear that the performance win is substantial (i.e. almost 2 times faster).

@vladak
Copy link
Member Author

vladak commented Jul 12, 2016

The question is what happens with a search request during reindex if the index files change while the mappings are still established. This could be happening before this change, just not on the same scale.

@vladak
Copy link
Member Author

vladak commented Jul 12, 2016

This should be fine as each reindex creates new files (e.g. _1_Lucene54_0.dvd, _2_Lucene54_0.dvd, ...) writes data to them and unlinks the old files afterwards (plus there is a rename happening for the pending_segments_XX -> segments_YY file) and reading from mmap'ed segment of unlinked file (and its file descriptor closed) still works.

I wonder if this behavior is given by these 2 lines in IndexDatabase.java#update():

343            IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
344            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);

and whether it is desired or not for this use case.

@vladak
Copy link
Member Author

vladak commented Jul 15, 2016

I asked on the Lucene java-users mailing list and Uwe Schindler responded:

You should keep the IndexReader open for the whole time! Otherwise there are more bottlenecks and slowdowns.

If you are updating the Index, you should use SearcherManager that reopens the index reader accordingly. After updating the index you should also not completely close and reopen the index. SearcherManager uses the DirectoryReader.reopen() method, which just updates the "view" currently seen and involves minimal syscalls (none at all if nothing changes).

My worry is what happens if indexer runs and writes to the index files
while they are mmap'ed in memory - could this lead to corrupted search ?

No, because Lucene never changes existing files. All stuff is done in new files which get visible after flushing/committing or reopening as described above. In addition merging of those immutable segments is done in the background while indexing, but all files currently referred by IndexReaders/IndexSearchers are still immutable and stay alive until the IndexReader is closed.

@vladak vladak added this to the 0.13 milestone Jul 15, 2016
@tarzanek
Copy link
Contributor

@vladak
Copy link
Member Author

vladak commented Jul 18, 2016

IndexReader has a method which should save the reopen/mmap on every search request:
https://lucene.apache.org/core/6_1_0/core/org/apache/lucene/index/DirectoryReader.html#openIfChanged-org.apache.lucene.index.DirectoryReader- however we would have to close the old IndexReader and make sure there is no search underway using it. This could be achieved with SearcherManager.

More on this from Uwe:

Have a separate searcher manager for every directory. On every incoming search request, fetch the actual DirectoryReaders from the searcher managers and build a MultiReader from it. This costs nothing, as MultiReader is just a thin wrapper where no caching is involved. On top of this MultiReader create an IndexSearcher (which is also cheap).

The actual search and caching is always executed on the segments collected from all indexes, the wrapping with MultiReaders and IndexSearchers is neglectible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants