-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TDL/7493 Batch Archiving #8610
TDL/7493 Batch Archiving #8610
Conversation
TDL/7493-batch_archiving-only
FWIW: tagging this as 3a since it includes changes to the archive api (POST vs GET, namechange) that we should have before trying to sync with 3A services again. |
@scolapasta - 3A could/should have the changes to the API naming that is in the PR. I can pull that out elsewhere if we don't want to handle the new batch call now. |
*/ | ||
public List<DatasetVersion> getUnarchivedDatasetVersions(){ | ||
|
||
String queryString = "SELECT OBJECT(o) FROM DatasetVersion AS o WHERE o.releaseTime IS NOT NULL and o.archivalCopyLocation IS NULL"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this a named query?
// Note - the user is being set in the session so it becomes part of the | ||
// DataverseRequest and is sent to the back-end command where it is used to get | ||
// the API Token which is then used to retrieve files (e.g. via S3 direct | ||
// downloads) to create the Bag | ||
session.setUser(au); // TODO: Stop using session. Use createDataverseRequest instead. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we make this change while we're in this code?
// DataverseRequest and is sent to the back-end command where it is used to get | ||
// the API Token which is then used to retrieve files (e.g. via S3 direct | ||
// downloads) to create the Bag | ||
session.setUser(au); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, if we do decide to make the change
String className = settingsService.getValueForKey(SettingsServiceBean.Key.ArchiverClassName); | ||
AbstractSubmitToArchiveCommand cmd = ArchiverUtil.createSubmitToArchiveCommand(className, dvRequestService.getDataverseRequest(), dsl.get(0)); | ||
final DataverseRequest request = dvRequestService.getDataverseRequest(); | ||
if (cmd != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you explain what this line is doing? (i.e. why is a command being created before this and checked for null)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The create method is trying to read the property and instantiate the specified class (a subclass of AbstractSubmitToArchiveCommand) via reflection. If the class doesn't exist, it would fail/return null.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha ot could actually return null, then. Maybe just q quick comment to make that clear, since it's different than when we normally create a command?
logger.info("Archiving complete: " + successes + " Successes, " + failures + " Failures. See prior log messages for details."); | ||
} | ||
}).start(); | ||
return ok("Archiving all unarchived published dataset versions using " + cmd.getClass().getCanonicalName() + ". Processing can take significant time for large datasets/ large numbers of dataset versions. View log and/or check archive for results."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh is it to havr the command here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really - I could return the className string as done in the null case in the else clause.
|
||
The archiveAllUnarchivedDatasetVersions call takes 3 optional configuration parameters. | ||
* listonly=true will cause the API to list dataset versions that would be archived but will not take any action. | ||
* limit=<n> will limit the number of dataset versions archived in one api call to <= <n>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get how this works, but what's the reason to limit this way? (counting both successes and failures)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
listonly=true gives you a list, so with limit working this way you can make sure that only the things you listed will get processed when you drop listonly=true.
Overall, the concern is about load, particularly if/when something is misconfigured and everything will fail after all the work to create a bag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these guaranteed to work with the list in the same order each time (i.e. if something added, it would be added at the end, so limit is guaranteed to get the things from the last listAll)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
~yes - it's the new named query that is getting the list so unless something affects the return order from that (which I think will go in id order by default), it wouldn't change.
TDL/7493-batch_archiving-only
|
||
The archiveAllUnarchivedDatasetVersions call takes 3 optional configuration parameters. | ||
* listonly=true will cause the API to list dataset versions that would be archived but will not take any action. | ||
* limit=<n> will limit the number of dataset versions archived in one api call to <= <n>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these guaranteed to work with the list in the same order each time (i.e. if something added, it would be added at the end, so limit is guaranteed to get the things from the last listAll)?
|
||
try { | ||
@SuppressWarnings("unchecked") | ||
List<DatasetVersion> dsl = em.createNamedQuery("DatasetVersion.findUnarchivedReleasedVersion").getResultList(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think Named Query should still be able to take a type, si you don't need the SuppressWarnings.
String className = settingsService.getValueForKey(SettingsServiceBean.Key.ArchiverClassName); | ||
AbstractSubmitToArchiveCommand cmd = ArchiverUtil.createSubmitToArchiveCommand(className, dvRequestService.getDataverseRequest(), dsl.get(0)); | ||
final DataverseRequest request = dvRequestService.getDataverseRequest(); | ||
if (cmd != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha ot could actually return null, then. Maybe just q quick comment to make that clear, since it's different than when we normally create a command?
// the API Token which is then used to retrieve files (e.g. via S3 direct | ||
// downloads) to create the Bag | ||
session.setUser(au); // TODO: Stop using session. Use createDataverseRequest instead. | ||
// Note - the user is being set in the session so it becomes part of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just noticed I think this comment can go, no?
try { | ||
AuthenticatedUser au = findAuthenticatedUserOrDie(); | ||
|
||
// Note - the user is being set in the session so it becomes part of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated
Not seeing json output, logger needs to be fine rather than info, seeing null ptr at end of log output. A system exception occurred during an invocation on EJB Admin, method: public javax.ws.rs.core.Response edu.harvard.iq.dataverse.api.Admin.archiveAllUnarchivedDatasetVersions(boolean,java.lang.Integer,boolean)]] Caused by: java.lang.NullPointerException |
What this PR does / why we need it: This PR implements an api call to allow archiving of all unarchived datasets/only the latest version of unarchived datasets that can be used in lieu of configuring a post-publish workflow, or used as a catchup mechanism to archive datasets that were published prior to an archiving workflow being configured. The api also has a view-only param that will list dataset versions that will be archived without doing it.
Which issue(s) this PR closes:
Closes #7493
Special notes for your reviewer: The code includes a ToDo to update the logging - once the changes to support Failure/Pending/Success status (#8696) is merged, this code should/will be updated to appropriately count Failure status as failed and Pending status as a success (the processing is async so pending and success are both successful launches w.r.t. this code.
Suggestions on how to test this: Setup your favorite archiver (e.g. the Local one), publisha few datasets /versions and then run this api to see if all/the latest versions cause bags to be generated and sent to your archive location. (Can also try the list-only option described in the docs to just list which ones will be done and then confirm that the same list is actually done.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
Is there a release notes update needed for this change?: planning one release note across PRs
Additional documentation: