Validate the physical files when publishing a dataset #6558
@landreev agreed this is a good idea, and that it can reuse the publishing-in-progress locks that file PIDs use. (It could also conceivably use the publish workflows that @michbarsinai added.) File exists and checksum matches seem like the right checks.
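A minimal sketch of those two checks (file exists, checksum matches). This assumes nothing about Dataverse's internals: the class, method name, and the idea of passing the stored checksum and algorithm in directly are placeholders, not the actual DataFile API. (Requires Java 17+ for HexFormat.)

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.HexFormat;

public class PhysicalFileCheck {
    /**
     * Hypothetical check: the physical file is present and its checksum
     * matches the value stored in the database. storedChecksum and
     * algorithm (e.g. "MD5" or "SHA-256") stand in for whatever the
     * real DataFile entity exposes.
     */
    static boolean physicalFileIsValid(Path storedPath, String storedChecksum,
                                       String algorithm) throws Exception {
        if (!Files.exists(storedPath)) {
            return false; // file is missing on disk
        }
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = new DigestInputStream(
                Files.newInputStream(storedPath), md)) {
            in.transferTo(OutputStream.nullOutputStream()); // stream through, digesting
        }
        String actual = HexFormat.of().formatHex(md.digest());
        return actual.equalsIgnoreCase(storedChecksum);
    }
}
```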
I moved this to ready. Since this is a case where there would need to be intervention from the installation's support contact, the red alert message to contact support should be presented.
We ran into a problem creating a RestAssured failure test - specifically, how to create an invalid datafile with a checksum that doesn't match the content. Since RestAssured tests run remotely, there's no direct access to the filesystem to tamper with the file (delete it, overwrite it, etc.) after the datafile is created via the API.
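For illustration, a sketch of the shape such a failure test might take; step 2 is exactly the part that can't be done remotely. The publish endpoint and API-key header are Dataverse's, but the variable values and the expected status code are assumptions.

```java
import static io.restassured.RestAssured.given;

import org.junit.Test;

public class PublishValidationIT {
    @Test
    public void testPublishFailsWhenPhysicalFileIsCorrupt() {
        String apiToken = "...";  // placeholder: token of the dataset owner
        int datasetId = 42;       // placeholder: dataset created earlier in the test

        // 1. Create the dataset and upload a file via the API (elided).

        // 2. Corrupt or delete the stored physical file -- the unsolved part:
        //    RestAssured runs against a remote server, so the test has no
        //    filesystem access here.

        // 3. Attempt to publish and expect the validation failure.
        given()
            .header("X-Dataverse-key", apiToken)
            .post("/api/datasets/" + datasetId + "/actions/:publish?type=major")
            .then()
            .statusCode(409); // assumed: server refuses to publish the version
    }
}
```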
…ve failed validation, that need to be fixed or removed before the dataset can be published. (#6558)
@kcondon - The scenario you brought up yesterday - when somebody is publishing a version with nothing but a single typo correction in the metadata (and no new files added) - it really does feel wrong to go through validating every file in the dataset again.
I haven't looked at the details, but (related to what you suggest) could we only do these checks if/when files change? So even when uploading a single new file, maybe we still check them all, but we don't have to do the checks if only metadata changes. Or, similarly, do the check for any major version. (This could still mean only metadata changed, as the user can select that, but it a) may keep the logic simple, and b) might be a good check to happen in those circumstances anyway.)
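The two triggers proposed here could reduce to something as small as the following sketch; both flags are placeholders for checks against the real dataset version, not existing Dataverse methods.

```java
public class ValidationTrigger {
    /**
     * Decide whether to run the (potentially slow) physical-file validation
     * pass at publish time. Both flags are placeholders for whatever the
     * real DatasetVersion can answer.
     */
    static boolean shouldValidatePhysicalFiles(boolean filesAddedOrChanged,
                                               boolean isMajorVersion) {
        // Validate when the file set changed, or on any major release as a
        // periodic safety net, even for metadata-only updates.
        return filesAddedOrChanged || isMajorVersion;
    }
}
```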
@landreev Well, your validation scenario is to confirm that the set of files was not modified in some way by an errant write operation, such as an unsuccessful delete. I was focusing more on validation of new files, so I did not consider as deeply what you were trying to detect. I suppose that, however unlikely, any write operation might impact files, so reverifying the set makes sense, contingent on performance. I just wonder whether there might ultimately be a UX impact, as there is now, when updating a dataset with a large number of files but where the current changes are minimal. Currently, all file DOI metadata entries are rewritten at DataCite regardless of what has changed, and that can take a while with lots of files.
We never try to delete the physical file associated with a datafile once it's published. At the same time, it's not entirely accurate to say that we NEVER touch the physical file once the datafile is published. We have the re-ingest and un-ingest APIs now, for example. Let me think about it some more. Maybe something along the lines of @scolapasta's suggestions - major versions only? - would be a good balance.
…t's modified during the first stage of the async publishing. #6558.
This is a proposal/an idea to consider.
Analyzing the issues we've had with users' physical files over the years, I found that most were caused by something going wrong while the files were still in the Draft/unpublished state.
This is because we only attempt/allow deleting the physical file before the datafile gets published, and that's when something can go wrong - most commonly, an attempted delete or ingest removes or corrupts the physical file while preserving the datafile entry in the database. The user then publishes the version without realizing the physical file is missing.
This makes me think we should consider going through the files as they are being published and validating/confirming that they are still present, that the checksums still match, etc. - somewhat similarly to how we attempt to register global IDs for datafiles on publish.
Doing so would prevent most, if not all, situations where a physical file ends up missing from a published dataset.
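A sketch of what such a publish-time pass might look like, reusing the hypothetical physicalFileIsValid helper from the earlier sketch; StoredFile is a stand-in record, not the real DataFile entity.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PublishTimeValidation {
    // Stand-in for the real DataFile entity: just the fields the check needs.
    record StoredFile(String label, Path path, String checksum, String algorithm) {}

    /**
     * Hypothetical pass over a dataset's files during publish. Returns
     * human-readable failures; an empty list means it is safe to proceed.
     */
    static List<String> validateBeforePublish(List<StoredFile> files) throws Exception {
        List<String> failures = new ArrayList<>();
        for (StoredFile f : files) {
            if (!PhysicalFileCheck.physicalFileIsValid(
                    f.path(), f.checksum(), f.algorithm())) {
                failures.add(f.label() + ": physical file missing or checksum mismatch");
            }
        }
        return failures;
    }
}
```

Collecting all failures rather than failing fast is a deliberate choice in this sketch: if the list is non-empty, publishing would abort and the user would see everything that needs fixing at once, via the red alert message to contact support mentioned above.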