Dataset files cleanup #9132
Conversation
@ErykKul - haven't yet looked at code, but thanks for this! I was just considering starting in on it. I would definitely like to see this expanded to preserve all the files that could be there, at least as an option (maybe the default) - thumbs, cached exports, aux files, external dataset thumbs, provenance file, original files (for ingested tabular files). For S3, something that would also remove partial multipart uploads would be useful but could/should probably be a separate PR. A list-only version that just identifies extra files without deleting them would be useful as well. I'm happy to contribute, test on S3, etc. to help you move this forward - just let me know when/where you want help. As a first step I'll try to look through your current code soon.
Yes, thank you for this. Is it possible to have the endpoint default to a …
I have added the …
It's about cleaning up files in S3 [edit by leonid: not just S3 - any file storage]. Files like those from direct upload. It's cleaning up cruft. You need to be sure that you are only deleting the cruft and not a file that is actually still required. This one needs some care in reviewing.
This may be related to work Oliver did on #8983
Thank you for this PR.
This is the biggest potential problem I'm seeing so far. I don't think it is safe at all to assume that the only useful dataset-level auxiliary files are the cached metadata export files. The export files are at least mostly safe to erase (they will be regenerated automatically when needed). But the … Not sure what the best/cleanest way to deal with this is. It's possible that the best we can do is to only delete the files with filenames that match that regex. Anyone? - Does the regex above really match all possible (reasonably modern) datafile-level physical files? (Does this work with RemoteOverlay?) OK, we'll figure this out. (Sorry to be difficult :)
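For illustration only: the regex discussed above is not reproduced in this excerpt, so the pattern below is an assumption about what generated physical file names look like - a hex timestamp part, a dash, a hex random part, optionally followed by a dotted suffix for auxiliary files. A minimal sketch of the kind of check being discussed:

```java
import java.util.regex.Pattern;

public class LeftoverNameCheck {

    // Assumed shape of a generated physical file name: a hex timestamp part,
    // a dash, a hex random part, optionally followed by ".<tag>" for auxiliary
    // files (thumbnails, saved originals, etc.). The lengths used here are an
    // assumption; the exact regex from the comment above is not shown in this excerpt.
    private static final Pattern GENERATED_NAME =
            Pattern.compile("^[0-9a-f]{11}-[0-9a-f]{12}(\\..+)?$");

    static boolean looksLikeDatafile(String storageName) {
        return GENERATED_NAME.matcher(storageName).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeDatafile("17a8c2f8e5b-1c2d3e4f5a6b"));          // true: main file
        System.out.println(looksLikeDatafile("17a8c2f8e5b-1c2d3e4f5a6b.thumb64"));  // true: aux file
        System.out.println(looksLikeDatafile("provenance.json"));                   // false: dataset-level file (name illustrative)
    }
}
```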
I agree that just deleting datafiles not in the database (…). There are definitely more files than just export_* (which, as you say, will be regenerated) at the dataset level that need to remain - the json provenance file is one of them - that is user input and can't be recreated. Right now, creating a whitelist would be more trouble than a blacklist approach. If/when the dataset-level aux files are organized like the file-level ones (all with the same naming pattern), a whitelist might be easier. Not sure what that means for this PR. Having the ability to clean failed uploads is probably 90% of the problem and arguably safer, so it would be nice if that were an/the option. If there's interest in trying to assemble a whitelist of all additional files that can exist, I'd be happy to help trying to track down the sources, but I think it would be OK to make that future work as well. One other thing to be aware of - it is probably an infrequent case, but files in a dataset may not all be in the same store. One can change the store used by the dataset to save new files after some files have been uploaded. I think the auxiliary files for the dataset are always in the store for the dataset, but the datafiles and their aux files could be spread out.
I think we should just go ahead and adopt the regex approach. In practice, I can't think of anything else that could be wasting any appreciable amount of space, other than leftover datafiles from failed uploads or deletes. I would NOT be super comfortable with the idea of a whitelist; I can't think of a truly sure way to keep it up-to-date.
Yeah, we messed up this part, the storage names for the dataset-level files. It's also because of this that StorageIO can listAuxObjects for a file, but not for a dataset. We save a file-level aux file with the tag "xyz" as "${basename}.xyz", but on the dataset level, we save it as just "xyz". Either some reserved dataset-level basename, or simply ".xyz", would be a much better scheme. (Obviously, this is way outside this PR.)
(Sorry, that regex I first posted was specifically for aux files - i.e., for filenames with extensions; it needed one extra character to match the main files too; I corrected it in my comments above.)
Thank you @ErykKul. I'm looking at the latest changes now.
Sorry for the delay with reviewing the PR; I keep getting distracted by other things, but will try to wrap it up asap.
The first part of our generated physical file name - …
You might also want to remove cached export files or some temporary files, thumbnails, etc.
All the files stored in the Dataset storage location that are not in the file list of that Dataset can be removed, as shown in the example below.
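The example referenced above is not included in this excerpt. For illustration only, a minimal sketch of what such a call might look like from a Java client; the endpoint path, HTTP method, and the server/token/id values are assumptions, not confirmed by this thread - only the "dryrun" parameter is mentioned in the discussion:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DatasetStorageCleanup {
    public static void main(String[] args) throws Exception {
        String serverUrl = "https://dataverse.example.edu";        // hypothetical installation
        String apiToken = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx";  // superuser API token
        String datasetId = "42";                                   // database id of the dataset

        // Always start with dryrun=true: it should only report the files that
        // would be removed, without deleting anything. The endpoint path used
        // here is an assumption for illustration; see the PR's documentation
        // for the actual one.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(serverUrl + "/api/datasets/" + datasetId
                        + "/cleanStorage?dryrun=true"))
                .header("X-Dataverse-key", apiToken)
                .GET()
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Only after reviewing the dry-run output would one repeat the call without the dryrun parameter to actually delete the listed files.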
Is it necessary to clarify here that only the files that look like leftover datafiles will be deleted, and that Dataverse will skip any files that it can't identify as such for sure? So, if they happen to have anything else useful stored in the same directory - not that we ever recommend doing that - it will be spared?
I'm not sure whether it is in fact necessary to explain this, so I'm leaving this up to you.
But it would be great to add a reminder here that this is an experimental feature, like in the release note, and a suggestion to make sure their backups are up-to-date before attempting this on production servers. And also, could you please document the "dryrun" parameter, and tell them sternly to start with that. Thanks!
Thanks. I will rework the documentation to reflect the current solution and add the "experimental feature" warning there. The dry run parameter is a very good idea for testing before using it, and should be explained (I had not done that yet after adding the feature...). I will add an extra warning for the Swift users, as it is not yet tested on Swift...
@landreev
Thank you for your review! I have made the remaining changes and updated the documentation. Can you have a look at the latest commits? Thanks.
I tested it under S3 a bit; it's all looking good - thank you!
Swift is a problem. I cannot test it either. The current solution with your regex is much safer: we do not attempt to remove files that do not match the pattern. And if the removal fails, the file remains in the system. I think it is OK to release it with an additional warning that it is not tested on Swift. The better solution would be to find someone who uses Swift and can test it...
Borealis (née Scholars Portal) is using Swift, I believe. They wrote "Ontario Library Research Cloud (SWIFT-S3 setup), locally managed with 5 nodes across Canada" under "indicate where Dataverse deployed" in the 2022 GDCC survey. They also ticked the box for using Swift. I'm not sure who would know for sure or who might be able to test this PR, so I'll just spam some folks from Borealis (sorry! 😅): @amberleahey @lubitchv @JayanthyChengan @meghangoodchild
I am going to go ahead and approve the PR. Thank you for all the work on this!
As for Swift - thank you @pdurbin for spamming some potential testers of this functionality.
I personally don't think the Swift part of it should hold up the PR, or be considered a blocker. So, if testing under Swift becomes even remotely problematic, I'm just going to suggest that the cleanup implementation in the Swift driver be commented out and replaced with a "not yet supported" exception, and a message to any potential Swift user suggesting that they review the code and complete the task of adding the functionality.
What this PR does / why we need it:
In order to conserve storage space, you might want to clean up files in a dataset's storage location. Most importantly, with direct upload, or through other user actions, there can be files that failed to upload fully and remain unlinked to the dataset. With the new API call, you can remove all the files that are not linked to the dataset.
Which issue(s) this PR closes:
Closes #9130
Special notes for your reviewer:
This implementation removes all the files that are not in the file list of a specific dataset. Maybe the thumbnails and export cache files should remain in storage? It could also be useful to be able to remove these files, e.g., in datasets that are not accessed very often. You could also have thumbnails of files from previous versions of the dataset that are possibly no longer present in the newest version, etc.
For now, I have kept it simple, and the code removes all unlinked files. If required, I could add functionality so that files whose names start with the name of a linked file are left untouched (a rough sketch of that idea follows below).
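For illustration only - not the PR's actual code - a minimal sketch of the selection logic described above: keep everything linked to the dataset, optionally also keep any file whose name starts with a linked file's name (which would cover derived files such as thumbnails or saved originals), and treat the rest as candidates for deletion. The file names used here are hypothetical.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class CleanupSelectionSketch {

    // Returns the stored file names that are candidates for deletion: those not
    // linked to the dataset. If keepPrefixed is true, files whose names start
    // with the name of a linked file (thumbnails, saved originals, ...) are kept.
    static List<String> filesToDelete(List<String> storedFiles,
                                      Set<String> linkedFileNames,
                                      boolean keepPrefixed) {
        return storedFiles.stream()
                .filter(name -> !linkedFileNames.contains(name))
                .filter(name -> !keepPrefixed
                        || linkedFileNames.stream().noneMatch(name::startsWith))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> stored = List.of(
                "17a8c2f8e5b-1c2d3e4f5a6b",           // linked datafile (hypothetical name)
                "17a8c2f8e5b-1c2d3e4f5a6b.thumb64",   // its thumbnail
                "17a8c2f8e5b-aaaaaaaaaaaa");          // leftover from a failed upload
        Set<String> linked = Set.of("17a8c2f8e5b-1c2d3e4f5a6b");
        System.out.println(filesToDelete(stored, linked, true));
        // prints: [17a8c2f8e5b-aaaaaaaaaaaa]
    }
}
```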
Most importantly, I have only tested this code with local file storage. It could be dangerous if used in production without prior testing...