
Solr: Try Soft Commit on Indexing #10547

Merged

Conversation

@qqmyers (Member) commented May 8, 2024

What this PR does / why we need it:
This is a quick test of using soft commits and the commitWithin functionality to see whether it improves Solr performance. If it works, we might want to make COMMIT_WITHIN configurable.

Note that the PR contains code updates to use commitWithin and avoid explicit hard commits, and a solrconfig.xml update to enable autoSoftCommit and increase the auto (hard) commit time. To test both, you have to install DV, swap in the solrconfig.xml file, and restart Solr. Nominally they could be tried separately.
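For illustration, here is a minimal SolrJ sketch of the kind of change described above - not the actual Dataverse code; the class name and the 30-second commit-within value are assumptions for the example:

```java
import java.io.IOException;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

public class SoftCommitIndexingSketch {

    // Illustrative value; the PR suggests COMMIT_WITHIN might become configurable.
    private static final int COMMIT_WITHIN_MS = 30_000;

    void indexDocs(SolrClient solrClient, List<SolrInputDocument> docs)
            throws SolrServerException, IOException {
        // Before: every indexing call was followed by an explicit hard commit.
        // solrClient.add(docs);
        // solrClient.commit();

        // After: hand the documents to Solr with a commit-within deadline and let
        // autoSoftCommit (solrconfig.xml) control when they become searchable.
        solrClient.add(docs, COMMIT_WITHIN_MS);
    }

    void deleteDoc(SolrClient solrClient, String docId)
            throws SolrServerException, IOException {
        // Deletes get the same treatment: no explicit commit() call.
        solrClient.deleteById(docId, COMMIT_WITHIN_MS);
    }
}
```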

Which issue(s) this PR closes:

Closes #

Special notes for your reviewer:

Suggestions on how to test this: There are two parts to this PR. One is a solrconfig.xml change to turn on autoSoftCommit and increase the hard commit interval; the other is code changes to use add/delete with a COMMIT_WITHIN parameter and to drop the separate commit() call for indexing. These can be tested separately. Nominally the only effect should be on performance, so the main testing should verify that each part independently speeds things up. The solrconfig.xml change can be deployed to an existing machine and tested with a Solr restart; the code changes have to be deployed via the war.

There is one ~unrelated test change in this PR, for CacheFactoryBeanTest. While testing, I observed a failure where testAuthenticatedUserGettingRateLimited() failed at line 169 because the final count was 122 instead of 120. I'm guessing this was due to some other test running in parallel and contributing to the count. I added an @ResourceLock(value = "cache") annotation on the test to prevent this. I haven't seen a repeat since, but it is clearly rare, so I'm not sure how to test it for QA.
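For reference, a minimal sketch of the kind of JUnit 5 annotation described above (the class and helper below are hypothetical stand-ins, not the real CacheFactoryBeanTest):

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.parallel.ResourceLock;

class RateLimitSketchTest {

    // Serializes tests that touch the shared "cache" resource so a test running
    // in parallel cannot contribute to the count being asserted here.
    @ResourceLock(value = "cache")
    @Test
    void testAuthenticatedUserGettingRateLimited() {
        int finalCount = driveRateLimitedCalls(); // hypothetical helper standing in for the real test logic
        assertEquals(120, finalCount);
    }

    private int driveRateLimitedCalls() {
        return 120; // placeholder
    }
}
```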

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@qqmyers qqmyers added the Size: 3 A percentage of a sprint. 2.1 hours. label May 8, 2024
@qqmyers (Member, Author) commented May 8, 2024

One issue with this is tests - I see several failing. I'm guessing this might be because the changes add a small delay. That might be acceptable in practice, but it might require adding delays in the tests. (It could also be a real problem/bug; not sure yet.)

@landreev (Contributor)

If I were to experiment with/benchmark these code changes, should I be using this branch, or solr-soft-commit2?

@qqmyers (Member, Author) commented May 20, 2024

This branch has just the autoSoftCommit related changes. The other branch currently has that and changes to avoid deleting file docs unless needed. I'm pretty sure the changes here are all good. I haven't done much testing of the delete changes yet.

@qqmyers (Member, Author) commented May 20, 2024

Note, though, that this branch doesn't yet have the timing adjustments in the tests to make them pass, so it's OK to benchmark but not ready to merge.

@landreev (Contributor)

Could you please sync the branch with develop - thanks.

@coveralls commented May 20, 2024

Coverage Status

coverage: 20.585% (+0.005%) from 20.58% when pulling 1c6fd45 on GlobalDataverseCommunityConsortium:solr-soft-commit into e6b5856 on IQSS:develop.

@qqmyers qqmyers marked this pull request as ready for review May 22, 2024 16:44
@qqmyers qqmyers added this to the 6.3 milestone May 22, 2024
@qqmyers qqmyers removed their assignment May 22, 2024
@pdurbin (Member) commented May 22, 2024

@qqmyers is going to take a look at failing tests (thanks).

@jp-tosca (Contributor)

Hi @qqmyers 👋🏼!

I did some more testing, and adding a wait of 5 seconds before:

Response searchPublishedSubtreeSuper = UtilIT.search(searchPart, apiTokenSuper, "&subtree="+parentDataverseAlias);

worked for me. No matter how long a wait is placed before or after:

assertTrue(UtilIT.sleepForSearch(searchPart, apiToken, "&subtree="+parentDataverseAlias, UtilIT.MAXIMUM_INGEST_LOCK_DURATION), "Failed test if search exceeds max duration " + searchPart);

it produces the same error.

I can dig a bit more into what is happening in the test, but adding a wait here worked for me; it needs to be done before/with searchPublishedSubtreeSuper.

@jp-tosca (Contributor)

One change that would make this test pass would be simply swapping lines 117-118, since the sleep is done on line 118 - but from what I got from the comments yesterday, the superuser shouldn't have to wait?

@qqmyers (Member, Author) commented May 30, 2024

I think this is OK to go forward. For some reason, the testDeepLinks test appears to be significantly slower with this PR, although the overall effect is more than a factor of 2 speed improvement when reindexing ~225 datasets / ~4,400 files. @jp-tosca reports some sensitivity to the order of two searches that exists in the current develop branch, which might be part of the increase (since I've swapped the order so the search that requires all indexing to complete runs before the one that would succeed without everything being complete). (Has work been done to avoid repeat indexing of collections, like @ErykKul did for datasets?)

In any case, tests are now passing (except for the confirmEmail one, fixed in a separate PR), so I think review/QA can proceed, and I'll remove myself as an assignee.

@qqmyers qqmyers removed their assignment May 30, 2024
@jp-tosca (Contributor) commented Jun 4, 2024

👋🏼 @qqmyers @pdurbin @landreev

I was doing some reading on soft/hard commits, and in particular I was wondering whether we need to do something additional. Here is some of the documentation:

A common configuration might be to 'hard' auto-commit every 1-10 minutes and 'soft' auto-commit every second.
With this configuration, new documents will show up within about a second of being added, and if the power goes out, you will be certain to have a consistent index up to the last 'hard' commit.

@qqmyers (Member, Author) commented Jun 4, 2024

Line 293 in the solrconfig file should be doing a hard commit at most 30 seconds after a given addition.
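For context, this is roughly what the two auto-commit blocks in a solrconfig.xml look like; the values mirror the ones discussed in this thread (30 s hard, 1 s soft), but the exact contents and line numbers of the PR's file may differ:

```xml
<!-- Inside <updateHandler> in solrconfig.xml -->
<autoCommit>
  <maxTime>30000</maxTime>            <!-- hard commit: flush to stable storage within 30 s -->
  <openSearcher>false</openSearcher>  <!-- durability only; visibility comes from soft commits -->
</autoCommit>

<autoSoftCommit>
  <maxTime>1000</maxTime>             <!-- soft commit: new documents searchable within ~1 s -->
</autoSoftCommit>
```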

@jp-tosca (Contributor) commented Jun 4, 2024

But if something happens within those 30 seconds, wouldn't the document be lost? I was looking at this SO answer, where they mention:

the commitWithin is a soft-commit by default. Soft-commits are very efficient in terms of making the added documents immediately searchable. But! They are not on the disk yet. That means the documents are being committed into RAM. In this setup you would use updateLog to be solr instance crash tolerant.

@qqmyers (Member, Author) commented Jun 4, 2024

Lines 266-269 in the solrconfig file turn on transaction logging, which is what is required to avoid losing info during a crash (see the comments starting at line 253). At startup, the transaction log is replayed to update the indexes - basically doing a hard commit. Nominally, hard commits can be less frequent than every 30 seconds, but the longer the interval and the bigger the log, the more time is required to restart after a crash.
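For reference, a Solr transaction log is enabled with an updateLog block like the following - a standard Solr example, not necessarily byte-for-byte what the PR's solrconfig.xml contains:

```xml
<!-- Inside <updateHandler>: uncommitted updates are written to the tlog and
     replayed at startup after a crash, up to the last hard commit boundary. -->
<updateLog>
  <str name="dir">${solr.ulog.dir:}</str>
</updateLog>
```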

@jp-tosca (Contributor) commented Jun 4, 2024

Thanks! That makes sense. FWIW for whoever tests this, I killed my instance before the indexing and the item was indexed just as the instance went back up! 👍🏼

@jp-tosca (Contributor) left a review comment

👍🏼 I reviewed the changes and left some questions, but everything seems to be fine. All tests are passing after the changes and the latest update from develop.

@jp-tosca jp-tosca removed their assignment Jun 4, 2024
@pdurbin (Member) commented Jun 4, 2024

Something else to keep in mind is that the Solr data is just a copy of the data in PostgreSQL. So while it would be sad to lose some Solr transactions, we can always reindex, if necessary.

@stevenwinship stevenwinship self-assigned this Jun 6, 2024
@stevenwinship (Contributor) commented Jun 6, 2024

I'm not sure if this is good or not. I'm running a test multiple times against this branch and 2 others, including develop.

Time from dataset create to seeing it in a search:
develop - 500ms
this branch - 1,160ms

@qqmyers (Member, Author) commented Jun 10, 2024

The intent of the PR is not to decrease the latency of any given indexing operation - it makes changes to reduce the overall cost of indexing while keeping latency 'acceptable'. Right now, autoSoftCommit in solrconfig.xml is set to 1 second, so the maximum latency should be ~1 second, consistent with what your test shows. If that's not acceptable, the autoSoftCommit time can be reduced. (The same goes for the hard commit interval - it can be changed from 30 seconds if the time to restart, which requires replaying everything since the last hard commit, is not acceptable.)

The key test is probably the full reindex time on a significantly sized database. On a QDR test box with only thousands of files, the two parts of this PR made the full reindex more than twice as fast.

@stevenwinship stevenwinship merged commit 5bf6b6d into IQSS:develop Jun 10, 2024
15 checks passed
@stevenwinship stevenwinship removed their assignment Jun 10, 2024
@landreev (Contributor)

Happy to see this PR merged!
🎉

luddaniel pushed a commit to Recherche-Data-Gouv/dataverse that referenced this pull request Jun 12, 2024
* try soft commit

* keep softcommit short to avoid delays in visibility

* add test delay for autosoft, make hardcommit 30s like auto setting

* add 1-2 second delays in tests for softAutocomplete at 1s

* more sleeps

* more delays

* remove commented out deletes

* more commented out code to remove

* add 1 sec on failing tests

* add missing perm reindex

* change waiting

* fix index object and add null check for unit test

* remove test-specific null check

* reindex linking dv

* general solr release note

* more fixes

* revert change - was correct

* another sleepforsearch

* test adding explicit reindexing

* avoid other uses of cache in test that looks for exact counts

* Adding longer max sleep, add count param to sleep method

* Revert "add missing perm reindex"

This reverts commit 317038a.
@landreev (Contributor)

The key test is probably the full reindex time on a significantly sized database. On a QDR test box with only thousands of files, the two parts of this PR made the full reindex more than twice as fast.

Just want to put it on record that, re-indexing a copy of the IQSS prod database, the result is essentially the same: the speedup is just a bit under 2X.

@dsmiley commented Aug 28, 2024

If you can differentiate a full reindex (aka the "zero day" scenario, aka bulk updates) from regular incremental updates, then you could eschew the auto soft commits in solrconfig.xml and instead have only the incremental update requests add commitWithin, since only those should be expected to be searchable within a second.
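A hedged sketch of that suggestion follows; the bulkReindex flag, class name, and 1-second value are hypothetical illustrations, not an existing Dataverse API:

```java
import java.io.IOException;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

class IncrementalVsBulkIndexingSketch {

    // Hypothetical: only incremental updates need to become searchable quickly.
    private static final int INCREMENTAL_COMMIT_WITHIN_MS = 1_000;

    void index(SolrClient client, List<SolrInputDocument> docs, boolean bulkReindex)
            throws SolrServerException, IOException {
        if (bulkReindex) {
            // Bulk/"zero day" path: no commitWithin; rely on autoCommit for durability
            // and issue a single commit when the whole reindex finishes.
            client.add(docs);
        } else {
            // Incremental path: ask Solr to make this update visible within ~1 s.
            client.add(docs, INCREMENTAL_COMMIT_WITHIN_MS);
        }
    }

    void finishBulkReindex(SolrClient client) throws SolrServerException, IOException {
        client.commit(); // one commit at the end of the bulk run
    }
}
```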
