
Large File Upload API: Allow Super Admin to upload beyond limit files via API, REST or Rsync #6102

Closed
kcondon opened this issue Aug 19, 2019 · 12 comments

Comments

@kcondon (Contributor) commented Aug 19, 2019

We've had three requests in the past week or so to upload files larger than the file upload limit. This appears to be a trend. The current process of manually uploading the file, uploading a small placeholder through the UI, and then manually updating the db with file stats such as md5, size, and file type is time consuming and inefficient.
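(For reference, the manual db fix-up amounts to roughly the sketch below. The table and column names are assumptions based on a typical Dataverse schema and vary by version, so treat it as an illustration of the workaround rather than a supported procedure.)

```python
# Illustrative only: after swapping the real file into storage in place of the
# placeholder, an admin patches the placeholder's database record by hand.
# Table/column names (datafile, checksumvalue, filesize, contenttype) are
# assumptions and may differ across Dataverse versions.
import hashlib
import os

import psycopg2

def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 of the real file so the stored checksum matches it."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def patch_placeholder(conn, datafile_id, path, mime_type):
    """Overwrite the placeholder's stats with those of the real file."""
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE datafile
               SET checksumvalue = %s,
                   filesize      = %s,
                   contenttype   = %s
             WHERE id = %s
            """,
            (md5_of(path), os.path.getsize(path), mime_type, datafile_id),
        )

# patch_placeholder(psycopg2.connect(dbname="dvndb"), 12345,
#                   "/data/big_upload.tar", "application/x-tar")
```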

Adding a way for super admins to bypass the configured limit on a case-by-case basis through an API, either the current REST API or some variation of rsync, would be helpful and would empower the Curation Team and others to do this work directly.

@djbrooke (Contributor)

Thanks for reporting @kcondon. I think a better answer here would be to enable large file transfers at the individual user level, but that's a bigger task, and we should indeed make the workaround process easier as an intermediate step.

I'm especially interested in taking what we learn from #6093 and applying it here re: lambda functions, but other suggestions are welcome.

@kcondon (Contributor, Author) commented Aug 19, 2019

@djbrooke Well, if we enable it at the user level, we should just remove the upload limit entirely, right? Or were you thinking of only enabling it at the user level through the API, since the UI would likely have some issues with larger files?

@djbrooke (Contributor)

@kcondon I'd like for Harvard Dataverse to do a better job of large file uploads for everybody, and specifically I'd like to enable rsync on Harvard Dataverse so that we can support more disciplines (and without administrator intervention). Since our installation is open worldwide, I haven't done that yet because I'm concerned about storage/transfer costs. Before we implement rsync on Harvard Dataverse, I'd like to get some tooling in place to allow specific users/groups to take advantage of large data transfers, and superusers would definitely be in that group.

To expand on what I wrote above, as an intermediate step I'd be interested in exploring a process where we as admins can kick off a data transfer to the Harvard Dataverse S3 bucket using AWS DataSync (or something else), which then uses Lambda functions to call the APIs needed to get the file(s) represented in Dataverse (didn't we build those for SBGrid? :)).
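(A minimal sketch of that kind of glue, assuming an S3 ObjectCreated trigger and a registration endpoint shaped like the native add-file API; the endpoint, jsonData fields, and environment variable names here are assumptions, not a finished design.)

```python
# Sketch: when DataSync (or anything else) drops an object into the S3 bucket,
# register it with Dataverse via an add-file call that takes a storageIdentifier
# instead of file bytes. All names below are illustrative assumptions.
import json
import os
import urllib.parse

import requests  # bundled with the Lambda deployment package

DATAVERSE_URL = os.environ["DATAVERSE_URL"]    # e.g. https://dataverse.example.edu
API_TOKEN = os.environ["DATAVERSE_API_TOKEN"]  # superuser API token
DATASET_PID = os.environ["DATASET_PID"]        # e.g. doi:10.xxxx/yyyy

def handler(event, context):
    """Triggered by S3 ObjectCreated events; tells Dataverse about each new object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        json_data = {
            # The bytes are already in place, so only metadata is sent.
            "storageIdentifier": f"s3://{bucket}:{key}",
            "fileName": key.rsplit("/", 1)[-1],
            "mimeType": "application/octet-stream",
            "description": "Registered by transfer Lambda",
        }
        resp = requests.post(
            f"{DATAVERSE_URL}/api/datasets/:persistentId/add",
            params={"persistentId": DATASET_PID},
            headers={"X-Dataverse-key": API_TOKEN},
            # Sent as a multipart form field, mirroring `curl -F jsonData=...`.
            files={"jsonData": (None, json.dumps(json_data))},
            timeout=60,
        )
        print(resp.status_code, resp.text)
```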

@scolapasta (Contributor) commented Aug 20, 2019

Should we just add logic now for rsync to be superuser-only? The code is already there and tested.

We can do this by either:
• making it fully superuser-only in the command and the UI PermissionWrapper (for UI renders). Should be a "smedium".
• having the setting be: off, on, or on for superusers only. This would be a little bigger, as it would require the permission check to be dynamic based on the setting, but it could be very useful for any installation.

Note that the first solution would require an installation like SBGrid to maintain a fork until we do something like the latter.

@scolapasta (Contributor)

Discussed with @kcondon; this doesn't perfectly solve the problem, as rsync has some specialized aspects that wouldn't work in all use cases.

Another idea is that we investigate the APIs that rsync calls to inform Dataverse that the file is now in storage. If they can already be called by a superuser (or if making that possible is a small change), then at the very least a sysadmin could manually put the file in the correct location in storage and then call this API to add the correct metadata to the DB. (These are the same APIs that would also need to be used by any AWS Lambda type solution.)

@kcondon (Contributor, Author) commented Dec 4, 2019

We seem to be getting more large file upload requests, as well as iterative large file uploads, possibly indicating a new use case or workflow. Any movement on this, or on a simpler admin-only facility? @djbrooke @scolapasta

@pdurbin (Member) commented Dec 4, 2019

@JayanthyChengan just mentioned she's working on large file upload. I'm not sure about the specifics though. 😄

@qqmyers has also been working on large file uploads, using a different approach, I believe.

There are some notes about both approaches from the 2019-11-05 Dataverse Community Call at https://groups.google.com/d/msg/dataverse-community/0hu9xXrwOPI/TaMZwOhFAwAJ

@qqmyers (Member) commented Dec 4, 2019

FWIW: I have a working proof of concept @ TDL allowing direct uploads of large files to S3 via the API. I'm currently finishing work to allow use of a second S3 store (multiple stores in general) that is intended for large files (and for TDL is actually closer, network-wise, to where the large files would come from). The next step is to see if I can do the same direct upload through the existing GUI (just like direct S3 download already works through the download button).
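(A rough client-side sketch of that direct-to-S3 flow: ask Dataverse for a pre-signed upload URL, PUT the bytes straight to the bucket, then register the object with the dataset. The /uploadurls endpoint and response fields are assumptions based on what this proof of concept later became; check the installation's API guide before relying on them.)

```python
# Sketch of a direct upload client; endpoint and field names are assumptions.
import os

import requests

def direct_upload(base_url, api_token, dataset_pid, path):
    headers = {"X-Dataverse-key": api_token}
    size = os.path.getsize(path)

    # 1. Ask Dataverse for a pre-signed URL plus a storage identifier.
    resp = requests.get(
        f"{base_url}/api/datasets/:persistentId/uploadurls",
        params={"persistentId": dataset_pid, "size": size},
        headers=headers,
        timeout=60,
    )
    resp.raise_for_status()
    data = resp.json()["data"]

    # 2. PUT the bytes directly to S3, bypassing the application server's
    #    upload path. (Single-part case only; very large files would use the
    #    multipart URLs instead.)
    with open(path, "rb") as f:
        requests.put(data["url"], data=f, timeout=3600).raise_for_status()

    # 3. Register the object with the dataset via the same
    #    /api/datasets/:persistentId/add call sketched in the Lambda example
    #    above, passing data["storageIdentifier"] instead of the file bytes.
    return data["storageIdentifier"]
```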

This doesn't directly allow bypassing the upload size limit, but might be easier to admin than rsync (just another S3 store). As with any upload method, an override to allow admins to upload larger files could be implemented.

I'm hoping to email the big data list soon with a progress update and questions/decisions with respect to whether this works for the community, but I'd be happy to discuss at any point.

@djbrooke (Contributor) commented Dec 5, 2019

@kcondon Yes, the two efforts mentioned above. This is something we're very much interested in.

@djbrooke (Contributor)

I'm going to close this issue, as I've been managing this process for the Harvard Dataverse Repository with multiple stores and direct uploads over the last few months. 👍 We've gotten some large files in and have received reports of 250+ GB files being brought in at other installations.

@kcondon (Contributor, Author) commented May 13, 2021

Quick q: how do you get around the upload size limit? Does direct upload provide that?

@djbrooke (Contributor)

Hey @kcondon - the way it's set up is that we have L, XL, XXL, and XXXL stores (5 GB, 10 GB, 20 GB, and 50 GB, or something similar) set up on the installation. I temporarily switch the dataset to one of these stores to do the large upload, then switch back to the regular store once the upload is completed, so that users still get all the usual file-level functionality that comes with regular upload. I could do this at the Dataverse collection level as well (set it and forget it!), but it's been easy enough to do it at the dataset level.
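(A sketch of that store-swap routine, assuming a per-dataset storage driver endpoint; the /api/datasets/{id}/storageDriver path and the store labels are assumptions for illustration, so check the admin guide for the exact calls.)

```python
# Sketch: point the dataset at a larger store, do the upload, then point it
# back so later uploads use the regular store. Endpoint and labels are assumed.
import requests

def upload_via_big_store(base_url, api_token, dataset_id,
                         big_store, regular_store, upload_fn):
    headers = {"X-Dataverse-key": api_token}
    driver_url = f"{base_url}/api/datasets/{dataset_id}/storageDriver"

    # 1. Temporarily assign the dataset to the large-file store (e.g. "XXL").
    requests.put(driver_url, headers=headers, data=big_store,
                 timeout=60).raise_for_status()
    try:
        # 2. Perform the large upload (direct upload, UI, or any other method).
        upload_fn()
    finally:
        # 3. Switch the dataset back to the regular store.
        requests.put(driver_url, headers=headers, data=regular_store,
                     timeout=60).raise_for_status()

# upload_via_big_store("https://dataverse.example.edu", TOKEN, 42,
#                      "XXL", "s3", lambda: None)
```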
