Solr: don't delete docs that will just change #10579
This is only used in determining the most recent version a dataset is in on the file page (for example, https://demo.dataverse.org/file.xhtml?persistentId=doi:10.70122/FK2/FO0MPQ/KNG6PA&version=3.0). I confirmed that demo shows version 1 in this example, whereas it should show version 2 (which this commit fixes).
Looks great!
(One somewhat unexpected thing is that you didn't have to touch SolrIndexServiceBean at all in order to achieve this. Cool!)
An interesting performance result: as a definite positive, it really looks like a reindex of a large db over an existing, fully-populated index no longer takes longer than the first, from-scratch indexing (i.e., run a full reindex starting with an empty Solr instance, then run a reindex again, and compare the times at a few set increments: the numbers are virtually identical). I'll take this tradeoff any day. But does it actually make sense, and should it be expected, for the first indexing (when there are no existing documents to delete) to take longer with the code in this PR?
Also, could you please sync the branch with develop? (What I'm testing is my own checkout of the branch, which I synced with develop locally.)
I wouldn't have expected much of a slow-down for an initial index since, as far as I can tell, the bulk reindex is done with doNormalSolrCleanup = false, which skips deleting docs entirely. Just in case, I added a commit that skips trying to remove file docs from the delete list if the list is empty to start with. It's possible I've picked up some query that is expensive. There is also a writeDebugInfo method that is called when FINE logging is enabled, and that could be slow. The one thing I do expect to be more expensive is the /status and /clear-orphan calls, where more work is now being done to find and remove permission docs that have no corresponding content doc.
I guess there is a possibility that when I synced my local checkout with upstream develop something went sideways, and I ended up with the "soft commit" parts different from what's now in develop.
What this PR does / why we need it: The existing code made a list of all files in all versions of a dataset (a list, so with repeats), then added all of the file docs actually in Solr, and tried to delete them all. It then tried to delete them again per card. The PR changes that: it finds which docs exist in Solr (same query as before), checks which of those are not in the one or two versions of the dataset that will be indexed, and sends a delete request only for those orphaned docs.
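The change described above boils down to a set difference. The following is a minimal sketch of that idea only, not the code in this PR: the class name, method name, and doc id strings are hypothetical, and the real implementation queries Solr rather than taking collections as arguments.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; names here are not taken from the Dataverse code base.
class OrphanDocSketch {

    /**
     * Returns the ids of docs that exist in Solr but are not among the docs
     * about to be (re)indexed. Only these would need a delete request.
     */
    static Set<String> findOrphans(Collection<String> docIdsInSolr,
                                   Collection<String> docIdsToIndex) {
        // Nothing in Solr means nothing to delete, so skip the comparison.
        if (docIdsInSolr.isEmpty()) {
            return Collections.emptySet();
        }
        // Using sets removes the repeats a plain list of files across versions would carry.
        Set<String> orphans = new LinkedHashSet<>(docIdsInSolr);
        orphans.removeAll(new HashSet<>(docIdsToIndex));
        return orphans;
    }

    public static void main(String[] args) {
        // Made-up ids: two docs will be rewritten in place, one is left over.
        List<String> inSolr = List.of("datafile_1", "datafile_2", "datafile_3_draft");
        List<String> toIndex = List.of("datafile_1", "datafile_2");
        System.out.println("Would delete: " + findOrphans(inSolr, toIndex));
    }
}
```

Docs that will simply be overwritten by the new index request never appear in the result, so they are not deleted first and then recreated.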
It also adds one minor improvement - skipping the file indexing doc creation loop when indexableDataset.isFilesShouldBeIndexed() is false (deaccessioned datasets) instead of going through doc creation and then not submitting it.
Which issue(s) this PR closes:
Closes #
Special notes for your reviewer:
Suggestions on how to test this: FWIW, I've been testing with FINE logging to see which docs exist and which ones are deleted in various cases (creating, editing, publishing, and deaccessioning while adding/deleting files). Nominally, just checking api/admin/index/status and verifying that there are no unexpected orphans at the end should be sufficient: there shouldn't be any orphan content docs. The PR improves detection of permission orphans, so you may see datafile_id_draft_permission orphan docs being reported (for a file in a published version whose metadata isn't changed in a new draft). The PR isn't creating them, just making them visible. Using the feature flag to turn the new code on/off might help confirm that, e.g. make the changes and reindex using the old code, then verify that with the flag on the /status call sees the orphan.
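To make the notion of a permission orphan concrete, here is a rough sketch of the check, under stated assumptions: it assumes a permission doc id is the content doc id plus a "_permission" suffix (inferred from the datafile_id_draft_permission name above), and the class and method names are hypothetical. The actual /status implementation may work differently.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; not the code behind api/admin/index/status.
class PermissionOrphanSketch {

    // Assumed naming convention: <content doc id> + "_permission".
    static final String PERMISSION_SUFFIX = "_permission";

    /** Returns permission doc ids that have no matching content doc id. */
    static List<String> findPermissionOrphans(Collection<String> contentDocIds,
                                              Collection<String> permissionDocIds) {
        Set<String> content = new HashSet<>(contentDocIds);
        List<String> orphans = new ArrayList<>();
        for (String permissionId : permissionDocIds) {
            if (!permissionId.endsWith(PERMISSION_SUFFIX)) {
                continue; // not a permission doc under the assumed convention
            }
            String contentId = permissionId.substring(
                    0, permissionId.length() - PERMISSION_SUFFIX.length());
            if (!content.contains(contentId)) {
                orphans.add(permissionId);
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        // Made-up ids: the draft permission doc has no draft content doc.
        List<String> content = List.of("datafile_7");
        List<String> permissions = List.of("datafile_7_permission", "datafile_7_draft_permission");
        System.out.println("Permission orphans: " + findPermissionOrphans(content, permissions));
    }
}
```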
I'm not sure how to assess performance changes. Nominally, doing a clear-orphans to get rid of the permission doc orphans that weren't visible before could speed things up. Avoiding the extra deletes should help too, but it looks like the reindex-all API call skips the 'normalSolrCleanup' step, which is what deletes the old docs. So the performance/load reduction may only be visible when reindexing is triggered by dataset edits, etc., rather than showing up as a faster in-place reindex-all run.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: included
Additional documentation: some thought over in #10469 (comment), mostly about possible additions now that old docs aren't automatically deleted.