-
Notifications
You must be signed in to change notification settings - Fork 492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
3437 OAI records for datasets that still exist, but are temporarily un-indexed should not be marked deleted #10222
Conversation
I want to have it in order to be able to create an api test for for a specific OAI set export case. But I figure it could be useful otherwise. #3437
This comment has been minimized.
This comment has been minimized.
I know why the Jenkins run failed. The test I just added destroyed a dataset that another test in that class relied on. |
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a quick unofficial review of docs.
The tests are passing now. |
This comment has been minimized.
This comment has been minimized.
My other OAI-PMH PR got merged yesterday; the testing class HarvestingServerIT.java was modified in that PR as well, but a different part of it - so there are no merge conflicts. I've synced the branch, Jenkins running now. |
This comment has been minimized.
This comment has been minimized.
📦 Pushed preview images as
🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name. |
What this PR does / why we need it:
A rare case of an issue with a very low number that's still relevant.
Our OAI sets are defined by search (solr) queries *) therefore when a set is re-exported and the query no longer finds a certain dataset, its OAI record is marked as "deleted" - to communicate to remote clients who may have harvested it that they need to purge it.
A side effect of this is that if an admin decides to clear their index and run a full reindex overnight - and it can take a while on a large database - if it coincides with a set re-export, many existing datasets can end up marked as "deleted".
As Kevin explains in the opening comment, like many OAI problems this one is not fatal, but transient: once the holdings are reindexed, the next set re-export will undo this and mark all such datasets as active again. However, a remote client may be forced to delete and re-harvest a large number of our datasets unnecessarily in the process. So we may as well fix this.
The PR fixes this by adding an extra database check - if the dataset that has disappeared from the search is still there, published and exported properly, the record will be kept in the set as active.
*) This is true for all sets, except for the main, default, "everything" set - which is defined by a database query. The code in the PR recognizes this case, and avoids unnecessary secondary database checks when the main set is re-exported.
Which issue(s) this PR closes:
Special notes for your reviewer:
Note that I have added a simple Index API endpoint -
-X DELETE .../api/admin/index/datasets/<id>
- for clearing an individual dataset from Solr. I needed it to help create a meaningful API test for this functionality; I could use the existing API for clearing Solr completely, but this felt cleaner, and I feel like it could be useful on its own.Suggestions on how to test this:
Create an OAI set in the harvesting server dashboard. It can be a set with a single published dataset in it (can be defined with a query like
dsPersistentId:B9CEYI
(for example).Export the set. Run
http://localhost:8080/oai?verb=ListIdentifiers&set=YOURSETNAME&metadataPrefix=oai_dc
, make sure the record is there.Clear your solr index. Re-export the set from the dashboard. Run the above again, confirm that the record is NOT marked as deleted.
As a control, destroy the dataset, and re-export the set again. This time, the record should be showing with
<header status="deleted">
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?:
Additional documentation: