Duplicate Results while searching in Opengrok #4180
Comments
One more thing to add here is that say I made a commit in a repo, say made a function Some more information - Indexer is not run on localhost, indexing happens twice daily with history enabled. I tried to reproduce this, but couldn't with a new instance. I'll try to look into the index files maybe using Luke. Any suggestions on what might be going wrong here ? |
This might be a dup of #4030, specifically see #4030 (comment) |
I'm sure the last version which is not affected is 1.7.31 but this version is also mentioned in #4030 |
What was the sequence of reindex actions after you upgraded to 1.8.1 ? |
All of my testing was done without re-using the index. Brand new docker container and data volume. The mirror sync script periodically pulls new changes into the repo. |
Do you have an idea how many times the sync/reindex cycle happened since switching to 1.8.1 ? |
I think the docker default is every 10 minutes? I set this up a couple hours after 1.8.1 came out and checked on it maybe 10 hours later, so that works out to about 60 re-indexes |
Could you tell if there were any changes to the file within these 60 iterations ? Also, could you test after the 1st from-scratch reindex ? |
No, the files for those were not changed or even merged during the time I had 1.8.1 running. |
@dpsi I already mentioned that the last good version is 1.7.31. The bug was introduced by commits after that release. |
You did completely wipe out the data root (i.e. the index) before upgrading to 1.8.x, right ? |
In my case I did wipe the data and even the sources. I couldn't find a pattern in these nondeterministic dups. E.g. I run a full search for a class name. It is found in 3 git repositories, but duplicates occur in only 2 of them. Commits seem to be irrelevant, as history ends in 2022 in one of the problematic repos. The number of duplicate search results increases by 1 after every indexing run. OpenGrok and Tomcat run in the official Docker image. And once again: the problem has existed for months; it was not introduced in 1.8+. |
That's interesting. Just to confirm: for the problematic repo that has history till 2022, do the multiple search results happen immediately after single reindex from scratch with 1.8.x ? |
Yes. In fact it was still indexing when the first duplicate appeared in repo A. But the second duplicate didn't appear in repo B until I restarted the Docker container. Browser-side caching is unlikely to be the cause (I was clicking the search button and pressing Ctrl+R). Repo C still contains only one valid entry. BTW my setup contains several projects, but one project is much "fatter": there are hundreds of git repos inside, and repos A, B and C belong to this massive project. The same configuration works fine with OpenGrok 1.7.31. |
Could you make an experiment and place repo A into separate project, preferably inside distinct container with its own source+data root directories and see if the duplicate results appear ? |
I did as you suggested: just repo A in src, with empty data before the run. Unfortunately I get the same duplicate results. The project is small; the file contains just 45 lines. When I find time I will diff 1.7.31 against the next release. |
That's actually good news w.r.t. reproducibility. I will try to come up with some changes in the upcoming releases that will hopefully provide some clues. |
1.8.2 contains some changes to help debug this issue. Firstly, the log level has to be raised in the Tomcat |
Hi @vladak I have installed 1.8.1 on my UAT environment. I cannot provide feedback regarding the fix yet. Regarding the --indexCheck option, I don't understand what enhancement it brings to the indexing process. |
The If you run the indexer with
|
Ok... Will try it! Thanks! |
For the record, here's how to bump the log level for the webapp running in Docker container:
Also, this should be done in a way to capture the initial reindex (i.e. the one with empty data root), so if the indexing has already started while copying the Here's the rationale: the problem with multiple live documents sharing the same path described in #4030 surfaced in reindex only after deleted documents were added to the index. That is, for the initial reindex where there should be only added files, this should not happen. So, I am curious what is going on in that case. |
Upgrade to 1.8.2 done. Tomcat logging level upgraded. I can see logging when searching. |
I'm having the same issue with OpenGrok 1.7.42 and 1.8.1.
|
Could you upgrade to 1.8.2 and gather some debug output ? Specifically:
|
Also, assuming these are Git repositories for which this happens, I wonder what is the last changeset, i.e. the |
Upgraded to 1.8.2, but Indexer doesn't seem to do anything with either
Search related logs are:
|
Run the indexer with just The web app logs basically say that there are multiple documents (the number is the document ID, which is unique in the index) sharing the same path. This should not happen and is clearly a bug. The question is which bug this is. What does the output from |
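To illustrate the condition the web app logs describe (several document IDs sharing one path), here is a minimal Python sketch. It is not OpenGrok or Lucene code: the function name `find_duplicate_paths`, the `(doc_id, path)` tuples, and the sample paths are all hypothetical and only stand in for whatever a real index inspection would return.

```python
from collections import defaultdict


def find_duplicate_paths(docs):
    """Group live documents by path and keep paths with more than one doc ID.

    `docs` is an iterable of (doc_id, path) pairs. This shape is an
    assumption made for illustration, not an OpenGrok data structure.
    """
    by_path = defaultdict(list)
    for doc_id, path in docs:
        by_path[path].append(doc_id)
    # A healthy index is expected to have exactly one live document per path.
    return {path: sorted(ids) for path, ids in by_path.items() if len(ids) > 1}


# Hypothetical sample: document IDs 3 and 12 both claim the same path.
live_docs = [
    (3, "/repoA/Foo.java"),
    (7, "/repoA/Bar.java"),
    (12, "/repoA/Foo.java"),
]
duplicates = find_duplicate_paths(live_docs)
print(duplicates)
```

Any path that shows up in the result corresponds to one search hit per listed document ID, which is what appears as duplicate results in the search window.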
Hi @vladak I am trying --checkIndex. |
Just run indexer with |
Is this logging expected?
|
Thanks for the tip. The logs are:
It is indeed a git merge commit with more than one parent, as you mentioned earlier. |
I find it strange that the index check has passed. Anyway:
If you run Could you raise the indexer log level to |
Unless you are running a version older than 1.8.2, that's not expected. It seems to me that the |
Pfff... My bad @vladak :-( I ran this command on my OG PRD instance running 1.7.35... Without documents, it is ok. |
207 out of 2146 index checks failed, even though I had reindexed from scratch when installing 1.8.2. |
So, is there anything special about these 207 repositories ? What was the actual output of the index check ? |
I cannot share the data unfortunately. |
Looking for 2 pages reported by the search, I get this on Tomcat log:
|
That's actually excellent news for reproducibility and root causing. What is the situation for that repository w.r.t. index check right after reindex from scratch ? |
No dup after index from scratch.
When searching, I got:
|
Okay, that's an important data point, I'd say. Currently I am inclined towards this being a bug in the handling of merge changesets, aka a dup of #4027. Could you gather index logs for a couple of subsequent reindexes (without changing the repository), with the log level being |
Also, @ChristopheBordieu I'd be interested in (redacted) output from |
Just to be sure. |
I have just reindexed once and got... dup:
|
|
|
I have reindexed a second time (so 3 indexings since removal of data) and I have got... 2 dups! |
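The growth pattern reported in this thread (one extra duplicate per indexing run, so 3 indexings give 2 dups) can be reproduced with a toy model. The Python sketch below is purely illustrative of the symptom: it assumes a reindex that adds a new document for a path without marking the old one as deleted. It does not model the actual cause tracked in #4027, and every name in it (`reindex`, `live_hits`, the dict fields) is hypothetical.

```python
def reindex(index, path, delete_old):
    """Add a fresh document for `path`; optionally mark older ones deleted.

    `index` is a plain list of dicts, a stand-in for a real search index.
    """
    next_id = max((doc["id"] for doc in index), default=-1) + 1
    if delete_old:
        for doc in index:
            if doc["path"] == path:
                doc["live"] = False
    index.append({"id": next_id, "path": path, "live": True})


def live_hits(index, path):
    """Count the live documents a search for `path` would return."""
    return sum(1 for doc in index if doc["live"] and doc["path"] == path)


path = "/repoA/Foo.java"  # hypothetical file

# Faulty behaviour: the old document is never marked deleted.
buggy = []
for _ in range(3):
    reindex(buggy, path, delete_old=False)

# Expected behaviour: each reindex replaces the previous document.
correct = []
for _ in range(3):
    reindex(correct, path, delete_old=True)

buggy_dups = live_hits(buggy, path) - 1
correct_dups = live_hits(correct, path) - 1
print(buggy_dups, correct_dups)
```

In this model the duplicate count after n indexing runs is n - 1 in the faulty case and stays at 0 when the old document is replaced, which matches the "+1 after every indexing run" observation above.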
Thanks for the data. I can now also reproduce locally and I am pretty sure this is #4027 so closing as such. |
For the record, the check done by |
Describe the bug
When I search in OpenGrok, I get duplicate results in the search window, which is unexpected. The duplication seems to be related to the number of commits made. We run the indexer twice a day, and we have observed that duplicates appear only for files that have received new commits.
Opengrok - 1.7.42
JDK - Java 11
OS - RHEL7/RHEL8
Tomcat - 10.0.27
SCM - git
To Reproduce
Steps to reproduce the behavior:
We cannot come up with a reproducer, but this happens after indexing. Previously we had 2 copies of a single file; now we have 6 copies of the same file, in which regular commits are made.
Expected behavior
Search results should not be duplicated.