
Duplicate Results while searching in Opengrok #4180

Closed
Leo-Adlakha opened this issue Feb 23, 2023 · 53 comments

@Leo-Adlakha

Describe the bug
When I search in OpenGrok, I get duplicate results in the search window, which is unexpected. The duplication seems to be related to the number of commits made. We run the indexer twice every day, and we have observed that duplication happens only for files in which new commits have been made.

Opengrok - 1.7.42
JDK - Java 11
OS - RHEL7/RHEL8
Tomcat - 10.0.27
SCM - git

To Reproduce
Steps to reproduce the behavior:

We could not come up with a reproducer, but this happens after indexing. Previously we had 2 copies of a single file; now we have 6 copies of the same file, for files in which commits are made regularly.

Expected behavior
Search results should not be duplicated

@Leo-Adlakha
Author

One more thing to add here: say I made a commit in a repo, e.g. added a function test_read_xyz() to a file; in some instances I am not able to find it when searching in OpenGrok. It doesn't show any results, but this happens only for some instances. I also see OpenGrokLatestRev in the projects, so I am not sure why this behaviour is happening.

Some more information: the indexer is not run on localhost, and indexing happens twice daily with history enabled.

I tried to reproduce this with a new instance, but couldn't. I'll try to look into the index files, maybe using Luke. Any suggestions on what might be going wrong here?

@vladak
Member

vladak commented Feb 23, 2023

This might be a dup of #4030, specifically see #4030 (comment)

@piotr-g

piotr-g commented Feb 27, 2023

I'm sure the last version which is not affected is 1.7.31, but that version is also mentioned in #4030.

@dpsi

dpsi commented Feb 28, 2023

I am still seeing duplicate results on 1.8.1, see image:
[screenshot showing duplicate search results]

I know 1.6.4 doesn't have any issues with duplicate search results or history.

@vladak
Member

vladak commented Feb 28, 2023

I know 1.6.4 doesn't have any issues with duplicate search results or history.

What was the sequence of reindex actions after you upgraded to 1.8.1 ?

@dpsi

dpsi commented Feb 28, 2023

All of my testing was done without re-using the index. Brand new docker container and data volume. The mirror sync script periodically pulls new changes into the repo.

@vladak
Member

vladak commented Feb 28, 2023

Do you have an idea how many times the sync/reindex cycle happened since switching to 1.8.1 ?

@dpsi

dpsi commented Feb 28, 2023

I think the docker default is every 10 minutes? I set this up a couple hours after 1.8.1 came out and checked on it maybe 10 hours later, so that works out to about 60 re-indexes

@vladak
Member

vladak commented Feb 28, 2023

Could you tell if there were any changes to the file within these 60 iterations ? Also, could you test after the 1st from-scratch reindex ?

@dpsi

dpsi commented Mar 1, 2023

No, the files for those were not changed or even merged during the time I had 1.8.1 running.
There were several merge commits during this time period, but none involved the file in the screenshot.
I am trying to bisect the issue in my spare time. So far 1.6.5 is also clean and I am testing 1.7.0 now.

@piotr-g

piotr-g commented Mar 1, 2023

@dpsi I already mentioned the last good version is 1.7.31. The bug was introduced by commits after that release.

@vladak
Member

vladak commented Mar 1, 2023

No, the files for those were not changed or even merged during the time I had 1.8.1 running. There were several merge commits during this time period, but none involved the file in the screenshot. I am trying to bisect the issue in my spare time. So far 1.6.5 is also clean and I am testing 1.7.0 now.

You did completely wipe out the data root (i.e. the index) before upgrading to 1.8.x, right ?

@piotr-g

piotr-g commented Mar 2, 2023

In my case I did wipe the data and even the sources. I couldn't find a pattern in these nondeterministic dups. E.g. I do a full search for a class name; it is found in 3 git repositories, but duplicates occur only in 2 of them. Commits seem to be irrelevant, as history ends in 2022 in one of the problematic repos. The number of duplicate search results increases by 1 after every indexing run. OpenGrok and Tomcat run in the official docker image. And once again: the problem has existed for months; it was not introduced in 1.8+.

@vladak
Member

vladak commented Mar 2, 2023

In my case I did wipe the data and even the sources. I couldn't find a pattern in these nondeterministic dups. E.g. I do a full search for a class name; it is found in 3 git repositories, but duplicates occur only in 2 of them. Commits seem to be irrelevant, as history ends in 2022 in one of the problematic repos. The number of duplicate search results increases by 1 after every indexing run. OpenGrok and Tomcat run in the official docker image. And once again: the problem has existed for months; it was not introduced in 1.8+.

That's interesting. Just to confirm: for the problematic repo that has history till 2022, do the multiple search results happen immediately after a single reindex from scratch with 1.8.x ?

@piotr-g

piotr-g commented Mar 2, 2023

Yes. In fact it was still indexing when the first duplicate appeared in repo A. But... the second duplicate didn't appear in repo B until I restarted the docker container. I'm not sure about caching on the browser side, but it is unlikely (I was clicking the search button and pressing ctrl+r). Repo C still contains only one valid entry. BTW my setup contains several projects, but one project is much "fatter" - there are hundreds of git repos inside, and repos A, B and C belong to this massive project. But the same configuration works fine with OpenGrok 1.7.31.

@vladak
Member

vladak commented Mar 3, 2023

Could you run an experiment and place repo A into a separate project, preferably inside a distinct container with its own source and data root directories, and see if the duplicate results appear ?

@piotr-g

piotr-g commented Mar 3, 2023

I did as you had suggested: just repo A in src, with empty data before the run. Unfortunately, the same duplicate results. The project is small, and the file contains just 45 lines. When I find time I will make a diff between 1.7.31 and the next release.

@vladak
Member

vladak commented Mar 3, 2023

I did as you had suggested: just repo A in src, with empty data before the run. Unfortunately, the same duplicate results. The project is small, and the file contains just 45 lines. When I find time I will make a diff between 1.7.31 and the next release.

That's actually good news w.r.t. reproducibility. I will try to come up with some changes in the upcoming releases that will hopefully provide some clues.

@vladak
Member

vladak commented Mar 3, 2023

1.8.2 contains some changes to help debug this issue. First, the log level has to be raised to FINEST in the Tomcat logging.properties file and the container restarted for this to take effect. Then, when a search is performed via the UI, the webapp will produce a sequence of log messages for each document found on the given search results page. This will help to determine whether there are multiple live documents in the index sharing the same path.

@ChristopheBordieu

ChristopheBordieu commented Mar 3, 2023

Hi @vladak

I have installed 1.8.1 on my UAT environment. It is not yet possible for me to provide feedback regarding the fix.
I will upgrade now to 1.8.2 to benefit from the debug logging.

Regarding the --indexCheck option, I don't understand what enhancement it brings to the indexing process.
Is any documentation available for this argument?
What am I missing?

@vladak
Member

vladak commented Mar 3, 2023

Hi @vladak

I have installed 1.8.1 on my UAT environment. It is not yet possible for me to provide feedback regarding the fix. I will upgrade now to 1.8.2 to benefit from the debug logging.

Regarding the --indexCheck option, I don't understand what enhancement it brings to the indexing process. Is any documentation available for this argument? What am I missing?

The --indexCheck option merely performs some basic checks of the index(es) and exits the indexer with the respective error code. In this case, I'd like to use it for debugging.

If you run the indexer with --help it will display help for the commands. In this case it emits the following:

  --checkIndex [mode]
        Check index, exit with 0 on success,
        with 1 on failure.
        With no mode specified, performs basic index check.
        Selectable modes (performs given test on top of the basic check):
          documents - checks duplicate documents in the index
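
For illustration, an invocation could look like this (the configuration path is a placeholder; it has to be the main configuration written by the indexer with -W, not the read-only one):

  java -jar $OPENGROK_JAR -R /opengrok/etc/configuration.xml --checkIndex documents
  echo $?    # 0 on success, 1 on failure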

@ChristopheBordieu

Ok... Will try it! Thanks!

@vladak
Member

vladak commented Mar 3, 2023

For the record, here's how to bump the log level for the webapp running in a Docker container:

  1. start the container
  2. copy the logging.properties from the container: docker cp <container_id>:/usr/local/tomcat/conf/logging.properties .
  3. modify the logging.properties file in the following way:
     • change java.util.logging.ConsoleHandler.level = FINE to FINEST
     • add an org.opengrok.level = FINEST line to the end of the file
  4. start the container again with the local file mounted on top of the old file. For docker run, use the -v argument (see the sketch below). For docker-compose, the relevant snippet of the YAML configuration would look like this:
    volumes:
       - '~/opengrok/logging.properties:/usr/local/tomcat/conf/logging.properties' # webapp debugging
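
For plain docker run, a minimal sketch of the equivalent setup (image tag, host paths and port mapping are illustrative) would be:

  docker run -d -p 8080:8080 \
      -v ~/opengrok/src:/opengrok/src \
      -v ~/opengrok/data:/opengrok/data \
      -v ~/opengrok/logging.properties:/usr/local/tomcat/conf/logging.properties \
      opengrok/docker:latest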

Also, this should be done in a way that captures the initial reindex (i.e. the one with an empty data root), so if indexing has already started while the logging.properties file was being copied from the container or after it was modified, the data root should be wiped out before starting the container again in step 4.

Here's the rationale: the problem with multiple live documents sharing the same path described in #4030 surfaced in reindex only after deleted documents were added to the index. That is, for the initial reindex where there should be only added files, this should not happen. So, I am curious what is going on in that case.

@ChristopheBordieu

ChristopheBordieu commented Mar 3, 2023

Upgrade to 1.8.2 done. Tomcat logging level upgraded. I can see logging when searching.
Let's wait now...

@dpsi

dpsi commented Mar 8, 2023

After a while, duplicates start to appear using 1.7.32. Same setup as my previous ones, clean index, data, etc. I ran 1.7.31 for a week and saw no issues.
[screenshot showing duplicate search results]

The results that are duplicated have not been changed in the time that OpenGrok was set up and running.
My guess is it's something to do with branching and merges.

@yingnan

yingnan commented Mar 8, 2023

I'm having the same issue with OpenGrok 1.7.42 and 1.8.1.

  • Indexing from scratch shows no duplicates, but each reindex seems to add one.
  • There are 50+ projects in the index; the duplicate results are found in some of them, but not in every project. Previous comments mentioned it might have something to do with branches, commits, and merges; however, I haven't noticed any obvious pattern yet.

[screenshot showing duplicate search results]

@vladak
Member

vladak commented Mar 9, 2023

I'm having the same issue with OpenGrok 1.7.42 and 1.8.1.

* Indexing from scratch shows no duplicates, but each reindex seems to add one.

* There are 50+ projects in the index; the duplicate results are found in some of them, but not in every project. Previous comments mentioned it might have something to do with branches, commits, and merges; however, I haven't noticed any obvious pattern yet.

Could you upgrade to 1.8.2 and gather some debug output ? Specifically:

  • run the indexer with the usual options plus --checkIndex documents and provide the output
  • raise the log level in Tomcat (see Duplicate Results while searching in Opengrok #4180 (comment) for guidance on how to do this for Docker; for a standalone install it will be much simpler - just edit the logging.properties file inside the Tomcat configuration directory and restart Tomcat; a sketch follows below), perform the search that results in multiple documents with the same path and provide the relevant part of the Tomcat logs.
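
For a standalone Tomcat install, the logging.properties edit is a minimal sketch along these lines (the conf path depends on your Tomcat layout; it mirrors the Docker instructions above):

  # $CATALINA_BASE/conf/logging.properties
  java.util.logging.ConsoleHandler.level = FINEST
  org.opengrok.level = FINEST

followed by a Tomcat restart.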

@vladak
Member

vladak commented Mar 9, 2023

Also, assuming these are Git repositories for which this happens, I wonder what the last changeset (i.e. the HEAD) is. Is it perhaps a merge changeset ? (one that has more than one parent)

@vladak
Member

vladak commented Mar 9, 2023

After a while, duplicates start to appear using 1.7.32. Same setup as my previous ones, clean

That's an interesting data point, thanks for that. In 1.7.32, history-based reindex was introduced (PR #3951), so my guess is that you are hitting #4027.

@yingnan

yingnan commented Mar 9, 2023

  • run the indexer with the usual options plus --checkIndex documents and provide the output

Upgraded to 1.8.2, but the indexer doesn't seem to do anything with either the --checkIndex or --checkIndex documents option. No errors are thrown and the indexer exits with 0. The output is:

Mar 09, 2023 8:02:14 PM org.opengrok.indexer.configuration.Configuration read
INFO: Reading configuration from /home/opengrok/etc/read-only.xml
Mar 09, 2023 8:02:15 PM org.opengrok.indexer.index.Indexer parseOptions
INFO: Indexer options: [-c, /usr/bin/ctags, -s, src, -d, index, -H, -P, -S, -G, -R, etc/read-only.xml, -W, etc/configuration.xml, -U, http://localhost:8080/, --checkIndex, documents]
  • raise log level..., perform the search that results in multiple documents with the same path and provide the relevant part of the Tomcat logs.

Search related logs are:

FINEST: directory hash contents for search hits (0,5):
Mar 09, 2023 9:12:34 PM org.opengrok.indexer.search.Results createMap
FINEST: 8,779: /redacted/src/test/java/redacted/SampleTest.java
Mar 09, 2023 9:12:34 PM org.opengrok.indexer.search.Results createMap
FINEST: 9,021: /redacted/src/test/java/redacted/SampleTest.java
Mar 09, 2023 9:12:34 PM org.opengrok.indexer.search.Results createMap
FINEST: 9,052: /redacted/src/test/java/redacted/SampleTest.java
Mar 09, 2023 9:12:34 PM org.opengrok.indexer.search.Results createMap
FINEST: 9,141: /redacted/src/test/java/redacted/SampleTest.java
Mar 09, 2023 9:12:34 PM org.opengrok.indexer.search.Results createMap
FINEST: 9,294: /redacted/src/test/java/redacted/SampleTest.java

@vladak
Member

vladak commented Mar 10, 2023

Run the indexer with just -R configuration.xml --checkIndex documents, where configuration.xml is the path to the configuration created previously by the indexer with -W.

The web app logs basically say that there are multiple documents (the number is the document ID, which is unique in the index) sharing the same path. This should not happen and is clearly a bug. The question is which bug this is.

What does the output from git log -1 HEAD look like ?

@ChristopheBordieu

Hi @vladak

I am trying --checkIndex.
What is the expected syntax for document?

@vladak
Member

vladak commented Mar 10, 2023

Hi @vladak

I am trying --checkIndex. What is the expected syntax for document?

Just run the indexer with -R configuration.xml --checkIndex documents, where configuration.xml is the configuration written by the indexer, not the read-only configuration.

@ChristopheBordieu

Is this logging expected?

$ $JAVA_HOME/bin/java -Djava.util.logging.config.file=/srv/opengrok-var/log/logging.properties -jar $OPENGROK_JAR -R /srv/opengrok-var/etc/configuration.xml --checkIndex documents; echo $?
16:36:30 FINE: Installing default uncaught exception handler
16:36:30 INFO: Reading configuration from /srv/opengrok-var/etc/configuration.xml
16:36:31 INFO: Indexer options: [-R, /srv/opengrok-var/etc/configuration.xml, --checkIndex, documents]
16:36:31 INFO: History based reindex is on, however history is off. History has to be enabled for history based reindex.
16:36:32 FINE: Checking for Lucene index version mismatch in /srv/opengrok-var/index
16:36:32 FINER: Checking Lucene index version in project /srv/opengrok-src/documents
16:36:32 FINE: no index found in 'MMapDirectory@/srv/opengrok-var/index/srv/opengrok-src/documents lockFactory=org.apache.lucene.store.NativeFSLockFactory@6d4d66d2'
0

@yingnan

yingnan commented Mar 10, 2023

Run the indexer with just -R configuration.xml --checkIndex documents where configuration.xml is path to the configuration created by the indexer previously with -W.

Thanks for the tip. The logs are:

Mar 10, 2023 7:30:53 PM org.opengrok.indexer.configuration.Configuration read
INFO: Reading configuration from /home/opengrok/etc/configuration.xml
Mar 10, 2023 7:30:53 PM org.opengrok.indexer.index.Indexer parseOptions
INFO: Indexer options: [-c, /usr/bin/ctags, -s, src, -d, index, -H, -P, -S, -G, -R, etc/configuration.xml, --checkIndex, documents]
Mar 10, 2023 7:30:54 PM org.opengrok.indexer.index.IndexCheck checkDirNoExceptions
INFO: Checking index in '/home/opengrok/index/index/redacted'
Mar 10, 2023 7:30:56 PM org.opengrok.indexer.index.IndexCheck checkDirNoExceptions
INFO: Index check for directory '/home/opengrok/index/index/redacted' passed

What does the output from git log -1 HEAD look like ?

It is indeed a git merge commit with more than one parent, as you mentioned earlier.

@vladak
Member

vladak commented Mar 13, 2023

I find it strange that the index check has passed. Anyway:

What does the output from git log -1 HEAD look like ?

It is indeed a git merge commit with more than one parent, as you mentioned earlier.

If you run git log --stat -1 HEAD, what kind of files are displayed, if any ?

Could you raise the indexer log level to FINE and let the indexer run for a bunch of iterations without changing the source root ? I'd be interested in the log contents of these indexer runs, in particular in the Add: and Remove file: log entries.
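
For example, a minimal sketch (assuming the indexer is started with -Djava.util.logging.config.file pointing to a custom logging.properties, as shown elsewhere in this thread; indexer.log is just a placeholder for wherever the indexer output is captured):

  # logging.properties for the indexer
  handlers = java.util.logging.ConsoleHandler
  java.util.logging.ConsoleHandler.level = FINE
  org.opengrok.level = FINE

  # after several reindex runs, extract the per-file decisions
  grep -E 'Add:|Remove file:' indexer.log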

@vladak
Member

vladak commented Mar 13, 2023

Is this logging expected?

$ $JAVA_HOME/bin/java -Djava.util.logging.config.file=/srv/opengrok-var/log/logging.properties -jar $OPENGROK_JAR -R /srv/opengrok-var/etc/configuration.xml --checkIndex documents; echo $?
16:36:30 FINE: Installing default uncaught exception handler
16:36:30 INFO: Reading configuration from /srv/opengrok-var/etc/configuration.xml
16:36:31 INFO: Indexer options: [-R, /srv/opengrok-var/etc/configuration.xml, --checkIndex, documents]
16:36:31 INFO: History based reindex is on, however history is off. History has to be enabled for history based reindex.
16:36:32 FINE: Checking for Lucene index version mismatch in /srv/opengrok-var/index
16:36:32 FINER: Checking Lucene index version in project /srv/opengrok-src/documents
16:36:32 FINE: no index found in 'MMapDirectory@/srv/opengrok-var/index/srv/opengrok-src/documents lockFactory=org.apache.lucene.store.NativeFSLockFactory@6d4d66d2'
0

Unless you are running a version older than 1.8.2, that's not expected. It seems to me that the documents argument was lumped into the project list.

@ChristopheBordieu

Pfff... My bad @vladak :-( I ran this command on my OG PRD instance running 1.7.35... Without documents, it is ok.
I am running it now on my OG UAT instance running 1.8.2. I have plenty of indexes that fail the check. The run is still ongoing.

@ChristopheBordieu

207 out of 2146 index checks failed, despite the fact that I had reindexed from scratch when installing 1.8.2.

@vladak
Member

vladak commented Mar 13, 2023

207 out of 2146 index checks failed, despite the fact that I had reindexed from scratch when installing 1.8.2.

So, is there anything special about these 207 repositories ? What was the actual output of the index check ?

@ChristopheBordieu

ChristopheBordieu commented Mar 13, 2023

I cannot share the data unfortunately.
I am focusing on a project with few discrepancies:
09:50:15 FINE: index in '/srv/opengrok-var/index/LIA' has document path set 47 vs document list 49 discrepancy
On this project, I was able to find a search reporting 13 dups on 2 files.
The branch has 11 commits, of which 3 are merge commits. Only 1 merge commit was made after the full reindex from scratch.
I do not see anything special about this code (I am not the owner of it).
What commands would you like me to run on this repo or these files reported as dups?

@ChristopheBordieu

ChristopheBordieu commented Mar 13, 2023

Looking at 2 pages reported by the search, I get this in the Tomcat log:

13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 70: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 85: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 100: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 118: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 156: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 174: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 194: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 232: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 261: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 279: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 305: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 30: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 53: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 68: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 83: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 99: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 117: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 154: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 172: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 193: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.718 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 230: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.719 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 259: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.719 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 278: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.719 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 303: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.719 FINEST [http-nio-8080-exec-5] org.opengrok.indexer.search.Results.createMap 25: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:52:41.736 FINE [http-nio-8080-exec-5] org.opengrok.indexer.search.context.OGKUnifiedHighlighter.getRepoFileContent Last-modified differs for: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.738 FINE [http-nio-8080-exec-5] org.opengrok.indexer.search.context.OGKUnifiedHighlighter.getRepoFileContent Last-modified differs for: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 10:52:41.755 FINE [http-nio-8080-exec-5] org.opengrok.indexer.search.context.OGKUnifiedHighlighter.getRepoFileContent Last-modified differs for: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:56:35.788 FINEST [http-nio-8080-exec-7] org.opengrok.indexer.search.Results.createMap 51: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 10:56:35.790 FINE [http-nio-8080-exec-7] org.opengrok.indexer.search.context.OGKUnifiedHighlighter.getRepoFileContent Last-modified differs for: /LIA/repoWithDups/workbench.yaml

@vladak
Member

vladak commented Mar 13, 2023

I cannot share the data unfortunately. I am focusing on a project with few discrepancies: 09:50:15 FINE: index in '/srv/opengrok-var/index/LIA' has document path set 47 vs document list 49 discrepancy On this project, I was able to find a search reporting 13 dups on 2 files. The branch has 11 commits, of which 3 are merge commits. Only 1 merge commit was made after the full reindex from scratch. I do not see anything special about this code (I am not the owner of it). What commands would you like me to run on this repo or these files reported as dups?

That's actually excellent news for reproducibility and root-causing. What is the situation for that repository w.r.t. the index check right after a reindex from scratch ?

@ChristopheBordieu

No dups after indexing from scratch.
I ran these commands:

rm -rf index/LIA historycache/LIA xref/LIA/
opengrok-indexer -a $OPENGROK_JAR -- -c $CTAGS_BIN/ctags -H -U http://localhost:8080/ -R /srv/opengrok-var/etc/configuration.xml -P LIA

When searching, I got:

13-Mar-2023 13:23:12.728 FINEST [http-nio-8080-exec-1] org.opengrok.indexer.search.Results.createMap 24: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 13:23:12.728 FINEST [http-nio-8080-exec-1] org.opengrok.indexer.search.Results.createMap 23: /LIA/repoWithDups/workbench.yaml

@vladak
Member

vladak commented Mar 13, 2023

No dups after indexing from scratch. I ran these commands:

Okay, that's an important data point, I'd say. Currently I am inclined towards this being a bug in handling merge changesets, i.e. a dup of #4027.

Could you gather indexer logs for a couple of subsequent reindexes (without changing the repository), with the log level set to FINE (or higher) as above, and present them in some redacted way?

@vladak
Member

vladak commented Mar 13, 2023

Also, @ChristopheBordieu, I'd be interested in the (redacted) output from git log --stat -2 HEAD for that repository, and also the difference (if any) between the HEAD changeset ID and the changeset stored inside the historycache/OpenGroklatestRev file.
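
For example, something along these lines (a sketch; the repository and data root paths mirror those used elsewhere in this thread):

  git -C /srv/opengrok-src/LIA/repoWithDups log --stat -2 HEAD
  git -C /srv/opengrok-src/LIA/repoWithDups rev-parse HEAD
  cat /srv/opengrok-var/historycache/LIA/repoWithDups/OpenGroklatestRev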

@ChristopheBordieu

Could you gather indexer logs for a couple of subsequent reindexes (without changing the repository), with the log level set to FINE (or higher) as above, and present them in some redacted way?

Just to be sure.
I do not change the code.
I reindex, then I search, then I provide the Tomcat log.
I reindex, then I search, then I provide the Tomcat log.
...
Is this the process you want me to follow, and are these the logs you want (you don't want the logs of the indexer java job)?

@ChristopheBordieu

I have just reindexed once and got... dup:

13-Mar-2023 15:54:05.481 FINEST [http-nio-8080-exec-1] org.opengrok.indexer.search.Results.createMap 13: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 15:54:05.482 FINEST [http-nio-8080-exec-1] org.opengrok.indexer.search.Results.createMap 59: /LIA/repoWithDups/Jenkinsfile
13-Mar-2023 15:54:05.482 FINEST [http-nio-8080-exec-1] org.opengrok.indexer.search.Results.createMap 12: /LIA/repoWithDups/workbench.yaml
13-Mar-2023 15:54:05.482 FINEST [http-nio-8080-exec-1] org.opengrok.indexer.search.Results.createMap 81: /LIA/repoWithDups/workbench.yaml

@ChristopheBordieu

~/repoWithDups$ git log --stat -2 HEAD
commit 2abc5cf574ade45d837a92fa6ae35b417461565f
Merge: 7145c5f a08450a
Author: ------------- <------------------------->
Date:   Thu Mar 2 13:09:49 2023 +0000

    Pull request #-: ---------------
    
    Merge in LIA/repoWithDups from ~-------/repoWithDups:master to master
    
    * commit 'a08450a3be63d2b8aa0d3e5ead3deaacc2a6a5f3':
      ---------------

commit a08450a3be63d2b8aa0d3e5ead3deaacc2a6a5f3
Author: ------- <------------------------->
Date:   Thu Mar 2 12:56:31 2023 +0000

    ---------------

 Jenkinsfile    | 2 +-
 workbench.yaml | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)
~/repoWithDups$ 

@ChristopheBordieu

~/repoWithDups$ more .git/refs/heads/master 
2abc5cf574ade45d837a92fa6ae35b417461565f
~/repoWithDups$ more /srv/opengrok-var/historycache/LIA/repoWithDups/OpenGroklatestRev 
a08450a3
~/repoWithDups$ 

@ChristopheBordieu

ChristopheBordieu commented Mar 13, 2023

I have reindexed a second time (so 3 indexing runs since the data removal) and I have got... 2 dups!
FYI, for these tests, each time I (re)index, I restart Tomcat.

@vladak
Member

vladak commented Mar 13, 2023

Thanks for the data. I can now also reproduce locally and I am pretty sure this is #4027 so closing as such.

@vladak
Member

vladak commented Mar 14, 2023

For the record, the check done by --checkIndex documents does not detect this sort of situation, because it uses uid term index traversal and the duplicate documents have identical uids, as they match the same set of files in a single changeset.
