Large File Upload API: Allow Super Admin to upload beyond limit files via API, REST or Rsync #6102
Comments
Thanks for reporting @kcondon. I think a better answer here would be to enable the large file transfers at the individual user level, but that's a bigger task and we should indeed make the workaround process easier as an intermediate step. I'm especially interested in taking what we learn from #6093 and applying it here re: lambda functions, but other suggestions are welcome.
@djbrooke Well, if we enable it at the user level, shouldn't we just remove the upload limit entirely? Or were you thinking of enabling it at the user level only through the API, since the UI would likely have some issues with larger files?
@kcondon I'd like for Harvard Dataverse to do a better job of large file uploads for everybody, and specifically I'd like to enable rsync on Harvard Dataverse so that we can support more disciplines (and without administrator intervention). Since our installation is open worldwide, I haven't done that yet because I'm concerned about storage/transfer costs. Before we implement rsync on Harvard Dataverse I'd like to get some tooling in place to allow specific users/groups to take advantage of large data transfers, and superusers would definitely be in that group. To expand on what I wrote above, as an intermediate step I'd be interested in exploring a process where we as admins can kick off a data transfer to the Harvard Dataverse S3 bucket using AWS DataSync (or something else), which then uses Lambda functions to call the APIs needed to get the file(s) represented in Dataverse (didn't we build those for SBGrid? :)).
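For concreteness, a minimal sketch of what the Lambda piece of that idea could look like: an S3 `ObjectCreated` trigger on the bucket that DataSync writes into, handing each landed object off to a Dataverse registration call. The helper `register_file_in_dataverse` is hypothetical here; the actual API call is sketched in more detail a couple of comments further down.

```python
# Minimal sketch of the Lambda idea above: fires on S3 ObjectCreated events for files
# landed by AWS DataSync and hands each object off to a Dataverse registration call.
# register_file_in_dataverse() is a hypothetical placeholder, not an existing helper.

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        # Hand off to whatever Dataverse API ends up registering an object that is already in S3.
        register_file_in_dataverse(bucket=bucket, key=key, size=size)


def register_file_in_dataverse(bucket: str, key: str, size: int) -> None:
    """Hypothetical: call the Dataverse API that records an already-stored file
    (see the API sketch after a later comment)."""
    raise NotImplementedError
```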
Should we just add logic now for rsync to be superuser only? The code is already there and tested. We could do this in a couple of ways; note that the first option would require an installation like SBGrid to maintain a fork until we do something like the latter.
Discussed with @kcondon, this doesn't perfectly solve the problem, as rsync has some specialized aspects which wouldn't work in all use cases. Another idea is that we investigate the APIs that rsync calls to inform Dataverse that the file is now in storage. If they can already be called by a superuser (or if that's a small change), then at the very least a sysadmin could manually put the file in the correct location on storage and then call this API to add the correct metadata to the DB. (These are the same APIs that would also need to be used by any AWS Lambda type solution.)
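As a rough illustration of that "place the file in storage, then register it" path, here is a sketch built around the native add-file endpoint with a `jsonData` payload. The field names (`storageIdentifier`, `md5Hash`) mirror the direct-upload registration flow as it later took shape and should be treated as assumptions, as should the server URL, token, and dataset PID.

```python
# Sketch of the "file is already in storage, now tell Dataverse about it" step.
# The jsonData fields (storageIdentifier, md5Hash) are assumptions, not a confirmed contract.
import hashlib
import json
import requests

SERVER = "https://dataverse.example.edu"   # assumption: your installation's URL
API_TOKEN = "xxxx-superuser-token"         # assumption: a superuser API token
DATASET_PID = "doi:10.5072/FK2/EXAMPLE"    # assumption: target dataset persistent ID

def md5_of(path: str) -> str:
    """Compute the MD5 that the curation team currently has to record by hand."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()

def register(local_copy: str, storage_identifier: str, mime_type: str) -> dict:
    """Register a file that is already sitting in the S3 store."""
    json_data = {
        "storageIdentifier": storage_identifier,   # e.g. "s3://bucket:6102-bigfile.tar"
        "fileName": local_copy.rsplit("/", 1)[-1],
        "mimeType": mime_type,
        "md5Hash": md5_of(local_copy),
    }
    r = requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": DATASET_PID},
        headers={"X-Dataverse-key": API_TOKEN},
        data={"jsonData": json.dumps(json_data)},
    )
    r.raise_for_status()
    return r.json()
```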
We seem to be getting more large file upload requests as well as iterative large file uploads, possibly indicating a new use case or workflow. Any movement on this, or a simpler admin-only facility? @djbrooke @scolapasta
@JayanthyChengan just mentioned she's working on large file upload. I'm not sure about the specifics though. 😄 @qqmyers has also been working on large file uploads. A different approach, I believe. There are some notes about both approaches from the 2019-11-05 Dataverse Community Call at https://groups.google.com/d/msg/dataverse-community/0hu9xXrwOPI/TaMZwOhFAwAJ
FWIW: I have a working proof of concept @ TDL allowing direct uploads of large files to S3 via the API. Currently finishing work to allow use of a second S3 store (multiple stores in general) that is intended for large files (and for TDL is actually closer network-wise to where the large files would be from). Next step is to see if I can do the same direct upload using the existing GUI (just like the direct S3 download just works through the download button). This doesn't directly allow bypassing the upload size limit, but might be easier to admin than rsync (just another S3 store). As with any upload method, an override to allow admins to upload larger files could be implemented. I'm hoping to email the big data list soon with a progress update and questions/decisions with respect to whether this works for the community, but I'd be happy to discuss at any point.
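A rough sketch of what that direct-to-S3 flow looks like from a client's point of view, assuming the shape the direct-upload API later took: ask Dataverse for a pre-signed URL plus a storage identifier, PUT the bytes straight to S3, then register the file with the same add-file call sketched above. The endpoint name (`uploadurls`), response fields, and the temp-tagging header are assumptions, since this work was still in progress at the time.

```python
# Sketch of the direct-upload-to-S3 flow: request a pre-signed URL, PUT the bytes to S3,
# then register via the add-file API shown earlier. Endpoint and field names are assumptions.
import os
import requests

SERVER = "https://dataverse.example.edu"   # assumption
API_TOKEN = "xxxx-api-token"               # assumption
DATASET_PID = "doi:10.5072/FK2/EXAMPLE"    # assumption

def direct_upload(path: str) -> str:
    """Upload one file straight to the S3 store and return its storageIdentifier."""
    size = os.path.getsize(path)
    r = requests.get(
        f"{SERVER}/api/datasets/:persistentId/uploadurls",
        params={"persistentId": DATASET_PID, "size": size},
        headers={"X-Dataverse-key": API_TOKEN},
    )
    r.raise_for_status()
    data = r.json()["data"]
    # Single-part case: one pre-signed URL; larger files would use a set of multipart URLs.
    with open(path, "rb") as f:
        put = requests.put(
            data["url"],
            data=f,
            headers={"x-amz-tagging": "dv-state=temp"},  # assumption: temp tag until registered
        )
    put.raise_for_status()
    return data["storageIdentifier"]
```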
@kcondon Yes, the two efforts mentioned above. This is something we're very much interested in.
I'm going to close this issue, as I've been managing this process for the Harvard Dataverse Repository with multiple stores and direct uploads over the last few months. 👍 We've got some large files in, and we have received reports of 250+ GB files brought in at other installations.
Quick question: how do you get around the upload size limit? Does direct upload provide that?
Hey @kcondon - the way it's set up is that we have L, XL, XXL, and XXXL stores (5 GB, 10 GB, 20 GB, 50 GB, or something similar) set up on the installation, so I temporarily switch the dataset to one of these stores to allow me to do the large upload, then I switch back to the regular store once the upload is completed so that users can still get all the usual file-level functionality that comes with regular upload. I could do this at the Dataverse collection level as well (set it and forget it!) but it's been easy enough to do it at the dataset level.
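For reference, the collection-level version of that switch can be scripted against the storageDriver admin endpoint (the dataset-level switch works along the same lines). A minimal sketch, assuming the endpoint path, that the raw request body is the store label, and the store labels themselves:

```python
# Minimal sketch of the "temporarily point at a bigger store, then switch back" routine,
# using the collection-level storageDriver admin endpoint. Path, body format, and store
# labels are assumptions for illustration.
import requests

SERVER = "https://dataverse.example.edu"   # assumption
API_TOKEN = "xxxx-superuser-token"         # assumption
COLLECTION_ALIAS = "my-collection"         # assumption: alias of the target collection

def set_store(label: str) -> None:
    """Point the collection at the store identified by `label`."""
    r = requests.put(
        f"{SERVER}/api/admin/dataverse/{COLLECTION_ALIAS}/storageDriver",
        headers={"X-Dataverse-key": API_TOKEN},
        data=label,
    )
    r.raise_for_status()

# Usage: switch to a larger store, do the large upload, then switch back.
set_store("XL")   # assumption: "XL" is the label of the 10 GB store
# ... perform the large upload here ...
set_store("s3")   # assumption: "s3" is the label of the regular store
```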
We've had three requests in the past week or so to upload files larger than the file upload limit. This appears to be a trend. The current process of manually uploading the file, uploading a small placeholder through the UI, then manually updating the DB with file stats such as MD5, size, and file type, is time consuming and inefficient.
Adding a way for super admins to bypass the configured limit on a case-by-case basis, through an API (either the current REST API or some variation or modification of rsync), would be helpful and would empower the Curation Team and others to do this work directly.