-
Notifications
You must be signed in to change notification settings - Fork 673
Bump to Lucene 5.5.2 #1168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bump to Lucene 5.5.2 #1168
Conversation
I do notice that it pulls in new dependencies that were previously not there, and that don't seem to be used anywhere. Like |
@fzs possibly not, they are not needed. I believe they have been pulled in as transitive dependencies during the build. However, I'll try removing them and see how ti goes :-) |
18736f6
to
bad1802
Compare
@fzs OK, seems to work even without them. It was actually the build process that was modifying them, possibly because the new Lucene version changes the list of transitive dependencies. |
I have excluded the new dependencies in the |
@fzs thanks for the further contribution, that worked for me ! |
Hi Luca!
I don't think 4) is a good option as I expect it to have an impact on performance. Maybe James wants to chime in on what would be the usual way to address this in Gitblit. |
We discussed this issue in the Gerrit mailing list and actually we don't have index migration problems there. According to @dborowitz a Lucene library update should not introduce binary format compatibility issues. In Gerrit we have a reindexing tool that is mostly used when the data format changes inside the index, not the Lucene library version. With regards to the options you listed, in Gerrit we have:
Luca. |
But it does.
So more specifically it would be
Maybe 3) and 4) are the same, but I have no idea. But so far I foresee searches breaking when we ship with only the version update of the jars used so far. |
Got the point: so Dave was referring to the backward codecs. I believe we should ship with them anyway, because regardless of the option that we will chose, there will always be the need to read the old ones from the new code-base. My suggestion is always to do one step at a time :-) |
There are 2 types of Lucene indexes in Gitblit.
Repos are indexed by the LuceneService. This service understands versioning. As part of this PR we should advance the version number. Tickets are another beast. I forgot to add a version number when designing the indexing. :( Gitblit does have a tool for manually reindexing tickets which can be run offline or online. I think we could probably figure out an appropriate place to stick a ticket index version number and force an automatic reindex using this tool. |
By the way, since the lucene version constants are not used in the changed
source files anymore, they should be removed if they are also not used
anywhere else anymore.
|
Why not go with 6.x ? |
The purpose of the bump is to ease integration with Gerrit for Luca's Gerrit-Gitblit plugin. I assume Gerrit is using 5.5.x. |
oh i see. thanks for explaining |
I gave this a test, and it doesn't work. The latter is because there is a sort used on ticket fields and that won't work with the current index and the new Lucene anymore. Of course that can be worked around or fixed. Either by changing the index to created index fields with DocValues, or by wrapping the reader in one that will create them on-the-fly. I haven't implemented any of that so far. There is a general point of index versions in there somewhere. I'll come back to that again later, as right now I'm too tired for it. Because I spent hours fighting with Wicket, and losing. On the Lucene search page, there is a link to the Lucene Query Parser Syntax, which is by now outdated and probably should be kept updated when the Lucene version changes. But that link is in the internationalization files, so updating it would mean updating it in all files.
The page:
The code:
The properties:
@gitblit, any idea how to get Wicket to behave? |
@fzs That looks like a neat trick, if you can get it to work. I was unaware that was possible. Sorry that I don't have more sage advice. |
More thoughts. Update scenario: install the new version in parallel, run it to see if it works for me. If something is wrong, switch back to the old version. In most cases that shouldn't be a problem. But as for indexes, if the index is replaced with a new one, I now have won two reindexing sessions. Again, they might take some time. For smaller installations (I don't know what normal would be) a no-hands automatic reindexing is probably most convenient. For large installations, I believe an administrator would like more control over the process and keep the old index until it is no longer needed. In general I am talking about reindexing due to a change of version, either of the index structure or of the Lucene codec. Since the index version needs to be saved somewhere, how about introducing a new subdirectory under I addition to that a configuration setting can either have the index automatically updated and the old one deleted, or the new index was created offline (with a tool, before the switch) or online and an administrator needs to delete the old one manually. In general I believe indexing could use some improvement by running it in a different thread. (But that is a different discussion, not related to this PR). So, what are other opinions and how much should be done now? Is the update needed soon so that the minimal compatibility should be implemented now and the rest later? Or all at once? |
To add a number: indexing the master branch of Torvald's kernel repository took 2:20 on my machine. ⌛ |
Re-indexing repos is easy enough by changing the internal version - probably should be run on a different thread. I really like your idea of versioned index directories. Very simple and downgrade friendly. Re-indexing the tickets is the current pain point, but it shouldn't be that bad/expensive - certainly less than reindexing master of the Linux kernel, I would expect. That's an extreme example, I think @ 650K commits & 57K blobs. Is that 2m:20s or 2h:20m ? |
That was 2m 22s. So I guess we should switch to versioned index directories. Reindexing would still be automatic. In a second step a tool can be provided to generate indexes off-line, which can be used by administrators. An open question is then when and how the old index directory should be deleted. |
I have moved the link to the syntax page to the page code, out of the language files. I have tried it with default and |
@lucamilanesio Hi, could we fork gitblit to use this lucene version please? It's for the gitblit plugin which will be broken until we support this. |
d788d79
to
6875dbb
Compare
@lucamilanesio because the current gitblit plugin is broken on master due to gerrit core using lucene 5. See here https://gerrit-review.googlesource.com/#/c/96281/ where I am fixing building the plugin against master. Though the resources doint seem to be loading but they are included in the plugin. |
But yes looks like this could be merged as it includes lucene-backward-codecs |
Luca, the blocking issue is with tickets. Right now the update will break the tickets index, which will result in no tickets being shown at all any more, since the ticket list already is based on a search. |
@fzs I thought that @gitblit mentioned: In theory, Tickets can be reindexed upon upgrade and, actually, that could be a good thing as we may then be able to stick a versioning to it. |
ping ... |
Sorry, I currently have a hard time spending the amount of time I'd like to on Gitblit. |
@gitblit , here's a question for you regarding ticket index: |
Answering my own question. Now with a ticket index version introduced, after starting the server, no tickets are visible until they are reindexed. So the question is when and how that should happen. I guess it should happen automatically at startup. Check if there is no index and reindex tickets. |
Thanks for the update: I am planning to spend more time in the GitBlit / Gerrit integration. |
@gitblit, I opted for a), even though this changes the interface. I doubt that there are a lot of other custom implementations of a ticket service out there. It keeps the logic for reindexing local to the ticket service. Sorry, I forgot to squash the fix-ups. |
Exclude Lucene dependencies `lucene-spatial` and `lucene-join`. They were added during the update but are not needed. This patch excludes them explicitly so that they do not show up in the generated IDE files and `ext` directory.
To be able to read and migrate Lucene indices from old (4.x) formats to new (5.x) ones, add the `lucene-backward-codecs` library to the project. It is added to the `ext` directory and therefore to the classpath. According to the Lucene documentation, having it in the classpath can affect performance. But right now the `ext` directory is the only one available and even for a separate tool for offline migration the library would be needed.
… page. Update the link target to the query parser syntax page of the 5.5 version. Refactor the `LuceneSearchPage` to use an `ExternalLink` for the link to the lucene page, so that the link target is kept and updated in the Java code. Move the link out of the language files. This was way too cumbersome to update the link target (which is probably why no one ever did). The query help text is changed to contain a variable: `gb.queryHelp = here be some ${querySyntax} help`, which is replaced by Wicket with a link. The link text is a new lange file property: `gb.querySyntax`.
Also replace deprecated `search` method with the one without a filter argument, since the filter isn't used anyhow.
In order to support sorting, Lucene 5 needs DocValue fields in an index. So in order to make the ticket index work, i.e. show any tickets on the tickets page, the ticket index needs to be changed, adding a DocValues field. The DocValuesFields are implemented for the current index, which does not use multiple values for a field. Should at any time in the future an existing numeric field get multiple values stored in a document, then the index needs to know that and use SortedNumeric DocValues and SortFields instead.
In order to be able to update the index definition, the ticket index is assigned a version number, 2. This way the definiton can be updated and compatability with existing index files can be checked. The actual index is stored in a directory of name `indexVersion_codecVersion`. This wayit is veriy easy to check if an index of a certain version exists on the filesystem. It allows to have multiple indexes of different versions present, so that a downgrade of the software is possible without having to reindex again. Of coure, this is only possible if no new tickets were created since these would be missing in the old index. A new class `LuceneIndexStore` is introduced, which abstracts away the versioned index directory. The idea is, that this provides one place to keep the Lucene codec version and to allow to code compatibility rules into this class, so that older indices can still be used if they are compatible.
Change from the index version of a repository index being stored in a config file to also using index directories with the version in the name. For that, `LuceneRepoIndexStore` is added, which adds the fixed `lucene` part to the path. It also gives out the location of the `lucene.conf` file, which is now stored in the index directory. This way it is automatically deleted when the directory is deleted. I believe that it should also provide means to store branch aliases and tips, i.e. hide the config file completely. But this isn't implemented with this commit, the `LuceneService` is still aware that a config file is used.
This reverts commit 662fb90.
Check if tickets need to be reindexed when the server starts. This is the case if no ticket index exists. In that case the ticket index is built. This is done during the start of the `ITicketService`. For this the interface of `ITicketService` needed to change. The `start` method was defined abstract and the specific ticket services had to implement it. None does any real starting stuff in it. The `start` method is now final. It calls a new abstract method `onStart` which the specific ticket services need to implement. In the existing implementations I just changed `start` to `onStart`.
dcc59b8
to
63dbdfd
Compare
I rebased the branch on gitblit/master. I tested it some locally, and believe this will do the trick. |
I'm traveling for work this week and I won't have time.
|
Well, I don't know how else to test this, since I have no public, non-work related installation or test installation. Should probably get one. So I'll merge this, then, believing that it works. |
No description provided.