Validate the physical files when publishing a dataset #6558
@landreev agreed this is a good idea, and that it can reuse the publishing-in-progress locks that file PIDs use. (It could also conceivably use the publish workflows that @michbarsinai added.) File exists and checksum matches seem like the right checks.
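A minimal sketch of those two checks (file exists, checksum matches). This assumes nothing about Dataverse's internals: the class, method name, and the idea of passing the stored checksum and algorithm in directly are placeholders, not the actual DataFile API. (Requires Java 17+ for HexFormat.)

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.util.HexFormat;

public class PhysicalFileCheck {
    /**
     * Hypothetical check: the physical file is present and its checksum
     * matches the value stored in the database. storedChecksum and
     * algorithm (e.g. "MD5" or "SHA-256") stand in for whatever the
     * real DataFile entity exposes.
     */
    static boolean physicalFileIsValid(Path storedPath, String storedChecksum,
                                       String algorithm) throws Exception {
        if (!Files.exists(storedPath)) {
            return false; // file is missing on disk
        }
        MessageDigest md = MessageDigest.getInstance(algorithm);
        try (InputStream in = new DigestInputStream(
                Files.newInputStream(storedPath), md)) {
            in.transferTo(OutputStream.nullOutputStream()); // stream through, digesting
        }
        String actual = HexFormat.of().formatHex(md.digest());
        return actual.equalsIgnoreCase(storedChecksum);
    }
}
```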
I moved this to ready. Since this is a case where there would need to be intervention from the installation's support contact, the red alert message to contact support should be presented.
We ran into a problem creating a RestAssured failure test - specifically, how to create an invalid datafile with a checksum that doesn't match the content. Since RestAssured tests run remotely, there's no direct access to the filesystem to tamper with the file (delete it, overwrite it, etc.) after the datafile is created via the API.
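For illustration, a sketch of the shape such a failure test might take; step 2 is exactly the part that can't be done remotely. The publish endpoint and API-key header are Dataverse's, but the variable values and the expected status code are assumptions.

```java
import static io.restassured.RestAssured.given;

import org.junit.Test;

public class PublishValidationIT {
    @Test
    public void testPublishFailsWhenPhysicalFileIsCorrupt() {
        String apiToken = "...";  // placeholder: token of the dataset owner
        int datasetId = 42;       // placeholder: dataset created earlier in the test

        // 1. Create the dataset and upload a file via the API (elided).

        // 2. Corrupt or delete the stored physical file -- the unsolved part:
        //    RestAssured runs against a remote server, so the test has no
        //    filesystem access here.

        // 3. Attempt to publish and expect the validation failure.
        given()
            .header("X-Dataverse-key", apiToken)
            .post("/api/datasets/" + datasetId + "/actions/:publish?type=major")
            .then()
            .statusCode(409); // assumed: server refuses to publish the version
    }
}
```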
…ve failed validation, that need to be fixed or removed before the dataset can be published. (#6558)
@kcondon - The scenario you brought up yesterday - when somebody is publishing a version with nothing but a single typo correction in the metadata (and no new files added) - it really does feel wrong to go through validating every file in the dataset again.
I haven't looked at the details, but (related to what you suggest) could we only do these checks if/when files change? So even when uploading a single new file, maybe we still check them all, but we don't have to do the checks if only metadata changes. Or, similarly, do the check for any major version. (This could still mean only metadata changed, as the user can select that, but it a) may keep the logic simple, and b) might be a good check to happen in those circumstances anyway.)
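The two triggers proposed here could reduce to something as small as the following sketch; both flags are placeholders for checks against the real dataset version, not existing Dataverse methods.

```java
public class ValidationTrigger {
    /**
     * Decide whether to run the (potentially slow) physical-file validation
     * pass at publish time. Both flags are placeholders for whatever the
     * real DatasetVersion can answer.
     */
    static boolean shouldValidatePhysicalFiles(boolean filesAddedOrChanged,
                                               boolean isMajorVersion) {
        // Validate when the file set changed, or on any major release as a
        // periodic safety net, even for metadata-only updates.
        return filesAddedOrChanged || isMajorVersion;
    }
}
```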
@landreev Well, your validation scenario is to confirm that the set of files was not modified in some way by an errant write operation, such as an unsuccessful delete. I was focusing more on validation of new files, so I did not consider as deeply what you were trying to detect. I suppose that, however unlikely, any write operation might impact files, so reverifying the set makes sense, contingent on performance. I just wonder whether there might ultimately be a UX impact, as there is now, when updating a dataset with a large number of files but where the current changes are minimal. Currently, all file DOI metadata entries are rewritten at DataCite regardless of what has changed, and that can take a while with lots of files.
We never try to delete the physical file associated with a datafile once it's published. At the same time, it's not entirely accurate to say that we NEVER touch the physical file once the datafile is published. We have the re-ingest and un-ingest APIs now, for example. Let me think about it some more. Maybe something along the lines of @scolapasta's suggestions - major versions only? - would be a good balance.
…t's modified during the first stage of the async publishing. #6558.
This is a proposal/an idea to consider.
Analyzing the issues we've had with users' physical files over the years, I found that most were caused by something going wrong while the files were still in the Draft/unpublished state.
This is because we only attempt/allow deleting the physical file before the datafile gets published, and that's when something can go wrong - most commonly, an attempted delete or ingest removes or corrupts the physical file while preserving the datafile entry in the database. The user then publishes the version without realizing the physical file is missing.
This makes me think we should consider going through the files as they are being published and validating/confirming that they are still present, that the checksums still match, etc. - somewhat similarly to how we attempt to register global IDs for datafiles on publish.
Doing so would prevent most, if not all, situations where a physical file ends up missing from a published dataset.
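A sketch of what such a publish-time pass might look like, reusing the hypothetical physicalFileIsValid helper from the earlier sketch; StoredFile is a stand-in record, not the real DataFile entity.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class PublishTimeValidation {
    // Stand-in for the real DataFile entity: just the fields the check needs.
    record StoredFile(String label, Path path, String checksum, String algorithm) {}

    /**
     * Hypothetical pass over a dataset's files during publish. Returns
     * human-readable failures; an empty list means it is safe to proceed.
     */
    static List<String> validateBeforePublish(List<StoredFile> files) throws Exception {
        List<String> failures = new ArrayList<>();
        for (StoredFile f : files) {
            if (!PhysicalFileCheck.physicalFileIsValid(
                    f.path(), f.checksum(), f.algorithm())) {
                failures.add(f.label() + ": physical file missing or checksum mismatch");
            }
        }
        return failures;
    }
}
```

Collecting all failures rather than failing fast is a deliberate choice in this sketch: if the list is non-empty, publishing would abort and the user would see everything that needs fixing at once, via the red alert message to contact support mentioned above.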