Dataset files cleanup #9132

Merged: 15 commits merged from ErykKul:dataset_files_cleanup into IQSS:develop on Jan 27, 2023

Conversation

@ErykKul (Collaborator) commented Nov 4, 2022

What this PR does / why we need it:
To conserve storage space, you might want to clean up files in a dataset's storage location. Most importantly, with direct upload, or through certain user actions, there can be files that failed to upload fully and remain unlinked to the dataset. With the new API call, you can remove all the files not linked to the dataset.

Which issue(s) this PR closes:

Closes #9130

Special notes for your reviewer:
This implementation removes all the files that are not in the file list of a specific dataset. Maybe the thumbnails and export cache files should remain in the storage? It could still be useful to be able to remove these files, e.g., in datasets that are not accessed very often. You could also have thumbnails of files from previous versions of the dataset that are possibly no longer present in the newest version, etc.

For now, I have kept it simple, and the code removes all unlinked files. If required, I could add functionality so that files whose names start with the name of a linked file are left untouched.

Most importantly, I have only tested this code with local file storage. It could be dangerous if used in production without prior testing...

@coveralls commented Nov 4, 2022

Coverage Status

Coverage: 19.967% (-0.03%) from 20.0% when pulling 635f32e on ErykKul:dataset_files_cleanup into ecc23c0 on IQSS:develop.

@qqmyers (Member) commented Nov 4, 2022

@ErykKul - haven't yet looked at code, but thanks for this! I was just considering starting in on it. I would definitely like to see this expanded to preserve all the files that could be there, at least as an option (maybe the default) - thumbs, cached exports, aux files, external dataset thumbs, provenance file, original files (for ingested tabular files). For S3, something that would also remove partial multipart uploads would be useful but could/should probably be a separate PR. A list-only version that just identifies extra files without deleting them would be useful as well.

I'm happy to contribute, test on S3, etc. to help you move this forward - just let me know when/where you want help. As a first step I'll try to look through your current code soon.

@donsizemore (Contributor) commented:

Yes, thank you for this. Is it possible to have the endpoint default to a listonly or dryrun mode, then actually fire when passed an execute or force flag? Just so admins can see what would be deleted first...

@ErykKul (Collaborator, Author) commented Nov 8, 2022

I have added the dryrun query parameter. I have also updated the filter to include the export files:

Predicate<String> filter = f -> {
    return !f.startsWith("export_") || files.stream().noneMatch(x -> f.startsWith(x));
};

@mreekie commented Jan 4, 2023

It's about cleaning up files in S3 [edit by leonid: not just S3 - any file storage]. Files like those from direct upload. It's cleaning up cruft. You need to be sure that you are only deleting the cruft and not a file that is actually still required. This one needs to be looked at with some care.
It's possible that the reviewer may find something significant to change.
There are no tests defined for this, so the work should include adding tests by the original developer. (This will likely block progress at first.)
Size: 33

@mreekie added the "Size: 30" label (A percentage of a sprint. 21 hours; formerly size:33) on Jan 4, 2023
@mreekie commented Jan 4, 2023

This may be related to work Oliver did on #8983

@landreev self-requested a review on Jan 12, 2023
@landreev self-assigned this on Jan 12, 2023
@landreev (Contributor) commented:

Thank you for this PR.
For some reason I have always assumed it was easier to perform this cleanup outside of the application, with scripts that can directly access both the storage and the database. (This is how I've been doing it on Harvard prod. servers). I did realize at some point that we never even mention this issue in the guides (even saying "hey, it's the responsibility of the admin to clean up your storage and delete leftover junk, from incomplete uploads and/or deletes, etc. etc." would be better than nothing). I was thinking of making a documentation PR for this, but your solution is of course much better.
I'm looking at the code now, making sure no "good" files are being overlooked. (We may want to make the warnings contain even stronger language when we release this: about the feature being experimental and potentially dangerous, and how nobody should even think of using it in production without ensuring they have full backups of everything, etc.)

@landreev (Contributor) commented Jan 20, 2023

... I have also updated the filter to include the export files:

...
    return !f.startsWith("export_") || ...;

This is the biggest potential problem I'm seeing so far. I don't think it is safe at all, to assume that the only useful dataset-level auxiliary files are the cached metadata export files. The export files are at least mostly safe to erase (they will be regenerated automatically when needed). But the dataset_logo* files (custom dataset thumbnails) that it would end up deleting with the filter above are not easily recoverable. The only other real-life examples of non-export dataset-level aux. files I can think of right away were some GEO/WorldMap-specific files that are no longer in use. But we should definitely assume that it is possible, for such files to exist. I.e., if we hard-code the valid filename prefixes that are in use now, future developers will need to know/remember to add to the list if they ever need to add aux. files that start with something else (and that would be asking for trouble).

Not sure what the best/cleanest way to deal with this is. It's possible that the best we can do is to only delete the files with the filenames that match ^[0-9a-f]{11}-[0-9a-f]{12}\.?.*. That should catch all the unlinked datafile-level main files and their derivatives under our current standard file naming scheme, I think. This will NOT catch some ancient files for a (very few) legacy installations - such as Harvard and Odum and a couple of others - that have grandfathered files that don't look like this. But I don't think this is a big issue - if any of us old-timer Dataverses still have some junk files from a decade+ back wasting storage space - that's probably on us.

Anyone? - Does the regex above really match all possible (reasonably modern) datafile-level physical files? (Does this work with RemoteOverlay?)

OK, we'll figure this out. (Sorry to be difficult :)
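
(A quick, hypothetical way to sanity-check the proposed pattern, e.g. pasted into jshell; the sample file names below are made up for illustration and are not taken from a real installation:)

    import java.util.regex.Pattern;

    // proposed pattern for "modern" datafile-level physical files and their derivatives
    Pattern p = Pattern.compile("^[0-9a-f]{11}-[0-9a-f]{12}\\.?.*");

    System.out.println(p.matcher("18b39722140-f73786754f85").matches());      // main datafile: true
    System.out.println(p.matcher("18b39722140-f73786754f85.orig").matches()); // derivative/aux file: true
    System.out.println(p.matcher("dataset_logo.thumb48").matches());          // custom dataset thumbnail: false (spared)
    System.out.println(p.matcher("export_schema.org.cached").matches());      // cached metadata export: false (spared)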

@qqmyers (Member) commented Jan 20, 2023

I agree that just deleting datafiles not in the database ( ^[0-9a-f]{11}-[0-9a-f]{12}\..* ) would be a safer option. I didn't check that regex just now, but the new code that ensures you can only send in valid storage identifiers should be using a regex, so we can compare with that.

There are definitely more files than just export_* (which, as you say, will be regenerated) at the dataset level that need to remain - the JSON provenance file is one of them - that is user input and can't be recreated. Right now, creating a whitelist would be more trouble than a blacklist approach. If/when the dataset-level aux files are organized like the file-level ones (all with the same naming pattern), a whitelist might be easier.

Not sure what that means for this PR. Having the ability to clean failed uploads is probably 90% of the problem and arguably safer, so it would be nice if that were an/the option. If there's interest in trying to assemble a whitelist of all additional files that can exist, I'd be happy to help trying to track down the sources, but I think it would be OK to make that future work as well.

One other thing to be aware of - it is probably an infrequent case but files in a dataset may not all be in the same store. One can change the store used by the dataset to save new files after some files have been uploaded. I think the auxiliary files for the dataset are always in the store for the dataset but the datafiles and their auxfiles could be spread out.

@landreev (Contributor) commented Jan 20, 2023

I agree that just deleting datafiles not in the database ( ^[0-9a-f]{11}-[0-9a-f]{12}\.?.* ) would be a safer option.

I think we should just go ahead and adopt the regex approach. In practice, I can't think of anything else that could be wasting any appreciable amount of space, other than leftover datafiles, from failed uploads or deletes.

I would NOT be super comfortable with the idea of a white list; can't think of a truly sure way to keep it up-to-date.

If/when the dataset-level aux files are organized like the file level ones (all with the same naming pattern)

Yeah, we messed up this part, the storage names for the dataset-level files. It's because of this also that StorageIO can listAuxObjects for a file, but not for a dataset. We save a file-level aux file with the tag "xyz" as "${basename}.xyz", but on the dataset level, we save it as just "xyz". Either some reserved dataset-level basename, or simply ".xyz", would be a way better scheme. (Obviously, this is way outside the scope of this PR.)

@landreev (Contributor) commented:

(sorry, that regex I first posted was specifically for aux. files - i.e., for filenames with extensions; it needed one extra character to match the main files too; I corrected it in my comments above)

@landreev (Contributor) commented:

Thank you @ErykKul. I'm looking at the latest changes now.

@landreev (Contributor) commented:

Sorry for the delay with reviewing the PR; I keep getting distracted by other things, but will try to wrap it up asap.
I just tested it with filesystem-stored files; happy to see that my regex appears to be working (w/ the extra backslash to escape the literal . - yes). I mean, I only tried it with egrep, wasn't 100% sure if it was going to work w/ Java matcher.
I do want to test it on s3 some more, so I'll go do that.
(I may suggest that we tighten that regex a bit further, let me think about that).

@landreev (Contributor) commented:

The first part of our generated physical file name - [0-9a-f]{11} - that's the timestamp of when the file was created, the number of milliseconds since 01-01-1970, hex-encoded.
If my quick math is correct, the regex will be valid for 500+ more years before the number becomes 12 digits... so, I think we are good?
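
(A back-of-the-envelope check of that estimate, assuming, as described above, that the first token is the creation time in milliseconds, hex-encoded; a sketch, not code from this PR:)

    // 16^11 is the first millisecond value that needs 12 hex digits
    long twelveHexDigits = 0x100000000000L;                          // 17,592,186,044,416 ms
    double years = twelveHexDigits / 1000.0 / 60 / 60 / 24 / 365.25;
    System.out.printf("11 hex digits are enough until roughly the year %d%n", 1970 + (int) years);
    // prints roughly 2527, i.e. a bit over 500 years from now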


Review thread on the documentation changes (quoted doc text):

    You might also want to remove cached export files or some temporary files, thumbnails, etc.

    All the files stored in the Dataset storage location that are not in the file list of that Dataset can be removed, as shown in the example below.
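
(For reference, a call along these lines is what gets documented here; the exact endpoint path shown is assumed and should be verified against the merged guides. Note the dryrun parameter discussed earlier in this thread; start with dryrun=true:)

    # hypothetical example; verify the path and parameters against the released API guide
    curl -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/cleanStorage?persistentId=$PERSISTENT_ID&dryrun=true"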
@landreev (Contributor) commented Jan 24, 2023:

Is it necessary to clarify here that only files that look like leftover datafiles will be deleted, and that Dataverse will skip any files it can't positively identify as such? So, if they happen to have anything else useful stored in the same directory (not that we ever recommend doing that), it will be spared?
I'm not sure whether it is in fact necessary to explain this, so I'm leaving it up to you.

@landreev (Contributor) commented:

But it would be great to add a reminder here that this is an experimental feature, like in the release note, and a suggestion to make sure their backups are up-to-date before attempting this on prod. servers. Also, could you please document the "dryrun" parameter and tell them sternly to start with that. Thanks!

@ErykKul (Collaborator, Author) replied:

Thanks. I will rework the documentation to reflect the current solution and add the "experimental feature" warning there. The dry run parameter is a very good idea for testing before using the feature, and it should be explained (I had not done that yet after adding it...). I will add an extra warning for swift users, as it is not yet tested on swift...

@ErykKul (Collaborator, Author) commented:

@landreev
Thank you for your review! I have done the remaining changes and updated the documentation. Can you have a look at the latest commits? Thanks.

@landreev (Contributor) commented:

I tested it under S3 a bit, it's all looking good - thank you!
I don't think it's necessary to mess with that regex any further.
So, the LAST thing I'm worried about: swift. Do you happen to have a way to test it? Because I don't think we do at this point.

@ErykKul (Collaborator, Author) commented Jan 25, 2023

Swift is a problem; I cannot test it either. The current solution with your regex is much safer: we do not attempt to remove files that do not match the pattern. And if the removal fails, the file remains in the system. I think it is OK to release it with an additional warning for swift that it is not tested. The better solution would be to find someone who uses swift and can test it...

@pdurbin (Member) commented Jan 25, 2023

Borealis (née Scholars Portal) is using Swift, I believe. They wrote "Ontario Library Research Cloud (SWIFT-S3 setup), locally managed with 5 nodes across Canada" under "indicate where Dataverse deployed" in the 2022 GDCC survey. They also ticked the box for using Swift:

[Screenshot: 2022 GDCC survey response from Borealis with the Swift option ticked]

I'm not sure who would know for sure or who might be able to test this PR so I'll just spam some folks from Borealis (sorry! 😅 ): @amberleahey @lubitchv @JayanthyChengan @meghangoodchild

@landreev (Contributor) left a review:

I am going to go ahead and approve the PR. Thank you for all the work on this!
As for swift - thank you @pdurbin for spamming some potential testers of this functionality.

I personally don't think the swift part of it should hold up the PR, or be considered a blocker. So, if testing under swift becomes even remotely problematic, I'm just going to suggest that the cleanup implementation in the swift driver be commented out and replaced with a "not yet supported" exception, plus a message suggesting that any potential swift user review the code and complete the task of adding the functionality.

@kcondon self-assigned this on Jan 26, 2023
@mreekie commented Jan 26, 2023

sprint kickoff

  • include in QATail for Jan 11, 2023

@mreekie added the "Size: SprintTail" label (Issues that have made it into Ready for QA or further right) and removed the "Size: 30" label (A percentage of a sprint. 21 hours; formerly size:33) on Jan 26, 2023
@kcondon merged commit 6bbbf38 into IQSS:develop on Jan 27, 2023
@pdurbin added this to the 5.13 milestone on Jan 27, 2023
Labels
Size: SprintTail (Issues that have made it into Ready for QA or further right)
Development

Successfully merging this pull request may close these issues.

Cleanup of the files in dataset storage that are not linked to that dataset
8 participants