Skip to content

Bump to Lucene 5.5.2 #1168

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 18, 2017
Merged

Conversation

lucamilanesio
Copy link
Contributor

No description provided.

@flaix
Copy link
Member

flaix commented Dec 25, 2016

I do notice that it pulls in new dependencies that were previously not there, and that don't seem to be used anywhere. Like lucene-spatial and spatial4j. Is this really necessary?
Gitblit is large enough as it is, already. I think we should avoid adding external packages that aren't even needed. There should be a way to exclude these from the project definition.

@lucamilanesio
Copy link
Contributor Author

@fzs possibly not, they are not needed. I believe they have been pulled in as transitive dependencies during the build. However, I'll try removing them and see how ti goes :-)

@lucamilanesio
Copy link
Contributor Author

@fzs OK, seems to work even without them. It was actually the build process that was modifying them, possibly because the new Lucene version changes the list of transitive dependencies.

@flaix
Copy link
Member

flaix commented Jan 2, 2017

I have excluded the new dependencies in the build.moxie file so that the IDE files are kept consistent. In your commit the Eclipse file was different from the IntelliJ file, making it hard to see which version works now. Do you want to run a clean build and see if this still works for you?

@lucamilanesio
Copy link
Contributor Author

@fzs thanks for the further contribution, that worked for me !

@lucamilanesio
Copy link
Contributor Author

@fzs @gitblit are we good to go with this upgrade? Do you foresee any problem?

@flaix
Copy link
Member

flaix commented Jan 6, 2017

Hi Luca!
We still haven't addressed the issue of index migration, have we? After the update existing indexes will not work anymore. I see four ways to approach this:

  1. Document this and have the user recreate all indexes.
  2. Document this and provide a tool/script for the user to migrate the existing indexes to the new format.
  3. Detect an old index format at Gitblit start and automatically migrate it to the new format.
  4. Include backward compatibility in Gitblit so that old indexes can stay in the old format and still work.

I don't think 4) is a good option as I expect it to have an impact on performance. Maybe James wants to chime in on what would be the usual way to address this in Gitblit.
Personally I think 3) is best for users but the most effort to implement. 1) is the least effort on our side but I don't know what happens when someone has to recreate indexes for many and large repositories.

@lucamilanesio
Copy link
Contributor Author

We discussed this issue in the Gerrit mailing list and actually we don't have index migration problems there. According to @dborowitz a Lucene library update should not introduce binary format compatibility issues.

In Gerrit we have a reindexing tool that is mostly used when the data format changes inside the index, not the Lucene library version.

With regards to the options you listed, in Gerrit we have:

  1. We call this "off-line reindexing": still by far the best option for users. Nobody really complained, apart from the time required to execute the reindex.

  2. We don't do index data migration, never.

  3. We have an "online reindex" mechanism: if you haven't reindexed the changes using the new data format, we'll keep the old one and create in background the new one. Once the new index is ready, it will be swapped with the old one.

  4. It seems that Lucene should do the job for us, possibly is there something missing in Giblit?

Luca.

@flaix
Copy link
Member

flaix commented Jan 6, 2017

According to @dborowitz a Lucene library update should not introduce binary format compatibility issues.

But it does.
The Lucene migration guide says:

Lucene 5 will by default only read indexes created with Lucene 5. To read and upgrade Lucene 4.x indexes, you must add the lucene-backward-codecs.jar to the classpath. It is recommended to upgrade the old indexes with the IndexUpgrader tool, so you can remove the backward-codecs module from classpath. This will also improve performance.

So more specifically it would be

  1. Have the user reindex everything. Although I would consider this an on-line, not off-line.
  2. Ship with the IndexUpgrader tool and have the user run it off-line.
  3. Do some detection of an old index (fails to load) and run 2. automatically inside Gitblit.
  4. Simply include the backward codecs JAR (as Dave also mentioned) and hopefully Lucene will use it to read the old index format. Or does it then automatically upgrade them, too?

Maybe 3) and 4) are the same, but I have no idea. But so far I foresee searches breaking when we ship with only the version update of the jars used so far.

@lucamilanesio
Copy link
Contributor Author

Got the point: so Dave was referring to the backward codecs.

I believe we should ship with them anyway, because regardless of the option that we will chose, there will always be the need to read the old ones from the new code-base.

My suggestion is always to do one step at a time :-)

@gitblit
Copy link
Collaborator

gitblit commented Jan 6, 2017

There are 2 types of Lucene indexes in Gitblit.

  1. Per-repo index of commits/blobs
  2. Global index of tickets

Repos are indexed by the LuceneService. This service understands versioning. As part of this PR we should advance the version number.

Tickets are another beast. I forgot to add a version number when designing the indexing. :( Gitblit does have a tool for manually reindexing tickets which can be run offline or online. I think we could probably figure out an appropriate place to stick a ticket index version number and force an automatic reindex using this tool.

@flaix
Copy link
Member

flaix commented Jan 6, 2017 via email

@paladox
Copy link
Contributor

paladox commented Jan 9, 2017

Why not go with 6.x ?

@gitblit
Copy link
Collaborator

gitblit commented Jan 9, 2017

The purpose of the bump is to ease integration with Gerrit for Luca's Gerrit-Gitblit plugin. I assume Gerrit is using 5.5.x.

@paladox
Copy link
Contributor

paladox commented Jan 9, 2017

oh i see. thanks for explaining

@flaix
Copy link
Member

flaix commented Jan 14, 2017

I gave this a test, and it doesn't work.
More precisely: After updating the version I tried the search on the existing old index. As expected I got no results without the backwards codec JAR. With it, the search for commits worked again. But tickets could not be searched, and with how they are implemented, they don't show up at all anymore.

The latter is because there is a sort used on ticket fields and that won't work with the current index and the new Lucene anymore. Of course that can be worked around or fixed. Either by changing the index to created index fields with DocValues, or by wrapping the reader in one that will create them on-the-fly. I haven't implemented any of that so far. There is a general point of index versions in there somewhere. I'll come back to that again later, as right now I'm too tired for it.

Because I spent hours fighting with Wicket, and losing. On the Lucene search page, there is a link to the Lucene Query Parser Syntax, which is by now outdated and probably should be kept updated when the Lucene version changes. But that link is in the internationalization files, so updating it would mean updating it in all files.
There must be a better way, so I tried to move the link into the page, like so. Which failed, no matter what I did, with

2017-01-14 23:59:11 [ERROR] No get method defined for class: class com.gitblit.wicket.pages.LuceneSearchPage$1 expression: querySyntaxLink
org.apache.wicket.WicketRuntimeException: No get method defined for class: class com.gitblit.wicket.pages.LuceneSearchPage$1 expression: querySyntaxLink

The page:

<wicket:message key="gb.queryHelp">Standard query syntax is supported.
		<a target="_new" href="#" wicket:id="querySyntaxLink"><wicket:message key="gb.querySyntaxLabel"/></a>
</wicket:message>

The code:

ExternalLink syntaxLink = new ExternalLink("querySyntaxLink", "https://lucene.apache.org/core/5_5_2/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description");
add(syntaxLink);
syntaxLink.add(new Label("querySyntaxLabel", new ResourceModel("gb.querySyntaxLink")));

The properties:

gb.queryHelp = Standard query syntax is supported.<p/><p/>Please see ${querySyntaxLink} for details.
gb.querySyntaxLink = Lucene Query Parser Syntax

@gitblit, any idea how to get Wicket to behave?

@flaix flaix added this to the 1.9.0 milestone Jan 15, 2017
@gitblit
Copy link
Collaborator

gitblit commented Jan 16, 2017

@fzs That looks like a neat trick, if you can get it to work. I was unaware that was possible. Sorry that I don't have more sage advice.

@flaix
Copy link
Member

flaix commented Jan 16, 2017

More thoughts.
In the current case, reindexing is unavoidable, for tickets.
I have no experience with really large repositories but I could imagine that reindexing (in general, also commits) can take a long time and some users would rather avoid it. Or at least have more control over it.

Update scenario: install the new version in parallel, run it to see if it works for me. If something is wrong, switch back to the old version. In most cases that shouldn't be a problem. But as for indexes, if the index is replaced with a new one, I now have won two reindexing sessions. Again, they might take some time.

For smaller installations (I don't know what normal would be) a no-hands automatic reindexing is probably most convenient. For large installations, I believe an administrator would like more control over the process and keep the old index until it is no longer needed.

In general I am talking about reindexing due to a change of version, either of the index structure or of the Lucene codec.

Since the index version needs to be saved somewhere, how about introducing a new subdirectory under lucene, named after the version? Personally, I would include the codec version, too, as I imagine that running the compatibility ones can hamper performance. So I would have something like lucene/41_7/.... Now a new index would be placed in lucene/53_7/. Or for tickets in lucene/52_1/.
This way the old index matching the old version is still available.

I addition to that a configuration setting can either have the index automatically updated and the old one deleted, or the new index was created offline (with a tool, before the switch) or online and an administrator needs to delete the old one manually.

In general I believe indexing could use some improvement by running it in a different thread. (But that is a different discussion, not related to this PR).

So, what are other opinions and how much should be done now? Is the update needed soon so that the minimal compatibility should be implemented now and the rest later? Or all at once?

@flaix
Copy link
Member

flaix commented Jan 18, 2017

To add a number: indexing the master branch of Torvald's kernel repository took 2:20 on my machine. ⌛

@gitblit
Copy link
Collaborator

gitblit commented Jan 18, 2017

Re-indexing repos is easy enough by changing the internal version - probably should be run on a different thread. I really like your idea of versioned index directories. Very simple and downgrade friendly. Re-indexing the tickets is the current pain point, but it shouldn't be that bad/expensive - certainly less than reindexing master of the Linux kernel, I would expect. That's an extreme example, I think @ 650K commits & 57K blobs. Is that 2m:20s or 2h:20m ?

@flaix
Copy link
Member

flaix commented Jan 18, 2017

That was 2m 22s.
Not too shabby, really. I'm not sure about extreme. I believe that Gitblit is used in many professional settings. So many and large repositories are probably a real thing. For example development of an Android based device. Then it is not only the Kernel but Android, extra packages and on top the applications running on it.

So I guess we should switch to versioned index directories. Reindexing would still be automatic. In a second step a tool can be provided to generate indexes off-line, which can be used by administrators. An open question is then when and how the old index directory should be deleted.

@flaix
Copy link
Member

flaix commented Jan 21, 2017

I have moved the link to the syntax page to the page code, out of the language files. I have tried it with default and _de, but not the other languages. I didn't have a way to tell Chrome to send a different language. Starting it with --lang=fr changed the interface but the Gitblit page didn't see the "fr" locale.

@paladox
Copy link
Contributor

paladox commented Feb 9, 2017

@lucamilanesio Hi, could we fork gitblit to use this lucene version please? It's for the gitblit plugin which will be broken until we support this.

@lucamilanesio
Copy link
Contributor Author

@paladox why should we fork if we are on the right track to get this merged? @fzs @gitblit is there any blocking issue preventing this PR to go?

@paladox
Copy link
Contributor

paladox commented Feb 10, 2017

@lucamilanesio because the current gitblit plugin is broken on master due to gerrit core using lucene 5.

See here https://gerrit-review.googlesource.com/#/c/96281/ where I am fixing building the plugin against master.

Though the resources doint seem to be loading but they are included in the plugin.

@paladox
Copy link
Contributor

paladox commented Feb 10, 2017

But yes looks like this could be merged as it includes lucene-backward-codecs

@flaix
Copy link
Member

flaix commented Feb 13, 2017

Luca, the blocking issue is with tickets. Right now the update will break the tickets index, which will result in no tickets being shown at all any more, since the ticket list already is based on a search.
The second issue is to version the ticket index and move to a versioned directory model. That might not be a blocking issue for a merge, just for a new release.

@lucamilanesio
Copy link
Contributor Author

@fzs I thought that @gitblit mentioned:
"Tickets are another beast. I forgot to add a version number when designing the indexing. :( Gitblit does have a tool for manually reindexing tickets which can be run offline or online. I think we could probably figure out an appropriate place to stick a ticket index version number and force an automatic reindex using this tool."

In theory, Tickets can be reindexed upon upgrade and, actually, that could be a good thing as we may then be able to stick a versioning to it.

@lucamilanesio
Copy link
Contributor Author

ping ...

@flaix
Copy link
Member

flaix commented Feb 26, 2017

Sorry, I currently have a hard time spending the amount of time I'd like to on Gitblit.
The first issue with tickets has nothing to do with re-indexing. Even if you re-index the tickets, the new index will not work with the new Lucene version. It needs a change in the index definition, adding DocValues. I'll have a look at that now.
The second change would be to move to version directories. Which is not a blocking issue, if you are fine with basing the plugin on master, not a released version.

@flaix
Copy link
Member

flaix commented Mar 3, 2017

@gitblit , here's a question for you regarding ticket index:
When no ticket index is found during startup, should one be created? Either by checking if there is an older existing index, or by checking if there is any ticket in the ticket service? Right now, if the index version changes no tickets are visible anymore without a reindex. Which would seem odd to a user that knows that tickets do exist.

@flaix
Copy link
Member

flaix commented Mar 4, 2017

Answering my own question. Now with a ticket index version introduced, after starting the server, no tickets are visible until they are reindexed. So the question is when and how that should happen.

I guess it should happen automatically at startup. Check if there is no index and reindex tickets.
The question is the how. I guess it should be done in the startup phase, so that no other thread can interfere and that tickets are available immediately. Which might be uncomfortable if it takes a long time.
I see two possibilities.
a) Change the ITicketService.start method not to be abstract, but introduce a different one that derived services have to implement. Then use start to start the ticket service implementation and afterwards reindex if necessary.
b) Introduce ITicketService.reindexIfNecessary which is called explicitly from GitblitContext after starting the ITicketService.

@lucamilanesio
Copy link
Contributor Author

Thanks for the update: I am planning to spend more time in the GitBlit / Gerrit integration.

@flaix
Copy link
Member

flaix commented Mar 5, 2017

@gitblit, I opted for a), even though this changes the interface. I doubt that there are a lot of other custom implementations of a ticket service out there. It keeps the logic for reindexing local to the ticket service.

Sorry, I forgot to squash the fix-ups. ☹️ I'll do that when rebasing on master to update the new locale file.

lucamilanesio and others added 11 commits March 5, 2017 20:20
Exclude Lucene dependencies `lucene-spatial` and `lucene-join`.
They were added during the update but are not needed. This patch
excludes them explicitly so that they do not show up in the
generated IDE files and `ext` directory.
To be able to read and migrate Lucene indices from old (4.x)
formats to new (5.x) ones, add the `lucene-backward-codecs`
library to the project.
It is added to the `ext` directory and therefore to the classpath.
According to the Lucene documentation, having it in the classpath
can affect performance. But right now the `ext` directory is the
only one available and even for a separate tool for offline
migration the library would be needed.
… page.

Update the link target to the query parser syntax page of the 5.5 version.

Refactor the `LuceneSearchPage` to use an `ExternalLink` for the link
to the lucene page, so that the link target is kept and updated in the
Java code. Move the link out of the language files. This was way too
cumbersome to update the link target (which is probably why no one ever
did).

The query help text is changed to contain a variable:
`gb.queryHelp = here be some ${querySyntax} help`, which is replaced by
Wicket with a link.
The link text is a new lange file property: `gb.querySyntax`.
Also replace deprecated `search` method with the one without a filter
argument, since the filter isn't used anyhow.
In order to support sorting, Lucene 5 needs DocValue fields in an index.
So in order to make the ticket index work, i.e. show any tickets on the
tickets page, the ticket index needs to be changed, adding a DocValues
field.

The DocValuesFields are implemented for the current index, which does not
use multiple values for a field. Should at any time in the future an
existing numeric field get multiple values stored in a document, then
the index needs to know that and use SortedNumeric DocValues and SortFields
instead.
In order to be able to update the index definition, the ticket index
is assigned a version number, 2. This way the definiton can be updated
and compatability with existing index files can be checked.
The actual index is stored in a directory of name `indexVersion_codecVersion`.
This wayit is veriy easy to check if an index of a certain version exists on the
filesystem. It allows to have multiple indexes of different versions present,
so that a downgrade of the software is possible without having to reindex
again. Of coure, this is only possible if no new tickets were created since these
would be missing in the old index.

A new class `LuceneIndexStore` is introduced, which abstracts away the versioned
index directory. The idea is, that this provides one place to keep the Lucene
codec version and to allow to code compatibility rules into this class, so that
older indices can still be used if they are compatible.
Change from the index version of a repository index being stored in a config
file to also using index directories with the version in the name. For that,
`LuceneRepoIndexStore` is added, which adds the fixed `lucene` part to the path.
It also gives out the location of the `lucene.conf` file, which is now stored in
the index directory. This way it is automatically deleted when the directory is
deleted.

I believe that it should also provide means to store branch aliases and tips,
i.e. hide the config file completely. But this isn't implemented with this
commit, the `LuceneService` is still aware that a config file is used.
Check if tickets need to be reindexed when the server starts. This is the
case if no ticket index exists. In that case the ticket index is built.

This is done during the start of the `ITicketService`.

For this the interface of `ITicketService` needed to change. The `start`
method was defined abstract and the specific ticket services had to
implement it. None does any real starting stuff in it.
The `start` method is now final. It calls a new abstract method `onStart`
which the specific ticket services need to implement. In the existing
implementations I just changed `start` to `onStart`.
@flaix flaix force-pushed the bump-to-lucene-5.5.2 branch from dcc59b8 to 63dbdfd Compare March 5, 2017 19:30
@flaix flaix requested a review from gitblit March 5, 2017 19:58
@flaix
Copy link
Member

flaix commented Mar 5, 2017

I rebased the branch on gitblit/master. I tested it some locally, and believe this will do the trick.
@gitblit, if you have a minute, you may want to look over it if you see any problems. Do you have a larger installation on which you could try it out? I only tested it with five tickets or so, and two repos getting indexed.

@gitblit
Copy link
Collaborator

gitblit commented Mar 5, 2017 via email

@flaix
Copy link
Member

flaix commented Mar 18, 2017

Well, I don't know how else to test this, since I have no public, non-work related installation or test installation. Should probably get one. So I'll merge this, then, believing that it works.

@flaix flaix merged commit 2f74123 into gitblit-org:master Mar 18, 2017
@lucamilanesio
Copy link
Contributor Author

@fzs @gitblit thanks for your help, GitBlit is very much used with Gerrit, looking forward to the new updated plugin !

@lucamilanesio lucamilanesio deleted the bump-to-lucene-5.5.2 branch March 2, 2018 17:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants