Optimize Zipping Process on the backend #6505
It seems like a good idea to have this kind of batch processing offloaded with one of the great options around these days (JBatch, FaaS, a separate webservice, ...). While reading this I wondered if this kind of work isn't even more widespread in Dataverse. I know this is kinda off-scope for this issue, but as you already generalized from #6093: wouldn't it make sense to discuss a more general offloading modularization first? I'm thinking of ingest processing, ZIP file creation and fulltext indexing here as the most resource-intensive tasks in Dataverse I am aware of.
@poikilotherm, I'm not opposed to a more general architecture discussion and will defer to @scolapasta, but I'm very interested in implementing this for zipping first. Note that ingest modularization is included in the TRSA branch.
In the issue __ @shlake pointed out that "a tabular file in Harvard Dataverse v4.5.1." does not include the "All File Formats + Information" option under the download dropdown menu. In another comment @scolapasta points out the "need to retest and confirm that it's no longer causing trouble, if/when we re-enable these bundle downloads". I suggest that this use case, and the request to return this feature to the UI, also be considered in this download ZIP optimization effort.
@mheppler thanks for linking up that issue. I'd like to keep anything that may have a front-end impact, or other feature requests, as separate issues and not consider those use cases as part of this issue. This is solely about optimizing how things work on the backend, as it would tremendously reduce the incomplete-zipping issues and increase stability. I'll update the title to clarify.
After discussing with @djbrooke and @scolapasta, the file-level ZIP creation might not fit into the final solution for this issue, but the use case can still be considered in the discussion. From what I can tell, to return this functionality, we only need to remove the
Discussed with the team during tech hours - our approach here will be to separate this out from the core Dataverse app as a service that can be called. Some details will need to be hashed out in development, but generally:
To outline where I am at the moment and what still needs to be figured out (a lot):

First of all, what it is that we are trying to address: with increasingly large files and datasets, serving zip file bundles can potentially tie up many worker threads in these long-running tasks, or otherwise hog a significant amount of resources. We currently try to manage it by limiting the amount of data allowed in a zip bundle. This of course is unpopular with our users, who want an easier mechanism for downloading an entire dataset or a significant portion thereof. This too is bound to become only more of an issue with larger datasets.

A solution being considered is to be able to offload this job, potentially onto a different server (this would be available as an option). It's not necessary that the zipped stream is generated there faster or more efficiently (although that would help). The main idea is that if too many users attempt this at once on too many files, it will degrade or crash this additional server, and not the main application.

Note that the problem has a trivial/brute force solution, using mostly already-available technology: this "extra zipping server" is simply another instance of Dataverse, with access controls blocking everything but /api/access/datafiles/* from the outside. On the main application server, FileDownloadService uses the url of the external server above when a bundle download is requested. The only currently available solution for authentication/authorization for any restricted files in this setup, however, would be to embed the user's API token into the redirect URL. (This "solution" is for illustration purposes only - but do note that it would actually work.)

Things that need to be figured out for a real practical solution:
There is definitely more to discuss (to be continued).
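For illustration only, a minimal sketch of the brute-force redirect described above; the host name, the file id list, and the idea of putting the user's API token in the query string are all made-up assumptions, and the token-in-URL part is exactly what a real solution would need to replace:

```java
public class BruteForceRedirect {
    // Hypothetical: where a bundle-download request gets sent instead of being
    // handled by the main application. All values here are illustrative.
    static String buildRedirectUrl(String fileIds, String userApiToken) {
        String zipperHost = "https://zipper.example.edu";   // the "extra zipping server" (another Dataverse instance)
        // Embedding the user's API token in the URL is what makes this
        // illustration-only; a real solution needs a safer handoff.
        return zipperHost + "/api/access/datafiles/" + fileIds + "?key=" + userApiToken;
    }

    public static void main(String[] args) {
        System.out.println(buildRedirectUrl("1234,1235,1236", "xxxx-api-token"));
    }
}
```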
FWIW: At least for S3, the download URLs are presigned, so they are essentially a 1-time token already.
They are; I meant that if this is a truly standalone service, it needs to authenticate itself to the Dataverse first, before it gets the presigned S3 urls. Of course, alternatively, the Dataverse could call the service directly, giving it all the urls, before sending the user there. One way or another, this part is doable, yes.
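For reference, presigned S3 urls like the ones mentioned above can be generated with the AWS SDK for Java roughly along these lines; the bucket, key and expiration are made-up values, and this is not meant to reflect how Dataverse's S3 storage driver actually does it:

```java
import com.amazonaws.HttpMethod;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.GeneratePresignedUrlRequest;

import java.net.URL;
import java.util.Date;

public class PresignExample {
    /** Generates a limited-time GET url for a single object; after expiration it stops working. */
    public static URL presign(AmazonS3 s3, String bucket, String key) {
        Date expiration = new Date(System.currentTimeMillis() + 15 * 60 * 1000);   // 15 minutes
        GeneratePresignedUrlRequest request =
                new GeneratePresignedUrlRequest(bucket, key)
                        .withMethod(HttpMethod.GET)
                        .withExpiration(expiration);
        return s3.generatePresignedUrl(request);
    }

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        System.out.println(presign(s3, "my-dataverse-bucket", "some/datafile.tab"));   // illustrative names
    }
}
```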
That is what I had originally envisioned: user goes to Dataverse, attempts to download, Dataverse confirms which files they are authorized for, then sends those urls to the service which does the zipping.
@scolapasta That's how I read your earlier summary too. But then this extra call (from Dataverse to the "zip service") could be skipped by issuing this "one-time token" authorizing the downloads and embedding it into the download url that the user is sent to.
Good point. When we added the one-time urls for S3, we always said it would be nice to have something similar (a cgi script) for a local filesystem that could do the same thing. So possibly it could be that? Or, in lieu of that, it could be the direct location of the file, since this service could have access to the file system directly?
So that's what I was talking about - a temporary token that gets deleted once the file is served; an equivalent of those pre-signed S3 urls (which, technically, may not be "one time", but limited time).
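A minimal sketch of what such a home-grown equivalent of a presigned url could look like - an HMAC-signed, limited-time token covering the requested file ids. The shared secret, the payload format and the expiration are all assumptions for illustration, not anything Dataverse currently implements:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.time.Instant;
import java.util.Base64;

public class TimedDownloadToken {
    // Secret shared between the main app and the zipping service (illustrative only).
    private static final byte[] SECRET = "change-me".getBytes(StandardCharsets.UTF_8);

    /** Signs "fileIds:expiresAt" so the zipping service can verify the request without a callback. */
    public static String issue(String fileIds, long ttlSeconds) throws Exception {
        long expiresAt = Instant.now().getEpochSecond() + ttlSeconds;
        String payload = fileIds + ":" + expiresAt;
        return payload + ":" + hmac(payload);
    }

    /** Accepts the token only if the signature matches and it has not expired yet. */
    public static boolean verify(String token) throws Exception {
        int lastColon = token.lastIndexOf(':');
        String payload = token.substring(0, lastColon);
        String signature = token.substring(lastColon + 1);
        long expiresAt = Long.parseLong(payload.substring(payload.lastIndexOf(':') + 1));
        return signature.equals(hmac(payload)) && Instant.now().getEpochSecond() < expiresAt;
    }

    private static String hmac(String payload) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(SECRET, "HmacSHA256"));
        byte[] sig = mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
        return Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
    }
}
```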
That part - direct access to the files and/or the database - I mentioned as a technical possibility, but thought that it was too hacky to consider - ?
(I actually meant to ask about this, but we never got to it)
In my earlier summary I tried to start with defining the problem we are trying to solve. Since it kept coming up, let me try again:

Zipping up files is not super CPU- or memory-intensive. But it takes time, which is especially true when the source files live on S3 and have to be accessed over the network. I'm pretty sure that is the finite resource that would be the stress point if we removed the total size limit and left things as they are otherwise. Theoretically, if the feature becomes popular enough, these long-running zipping jobs will max out the thread pool (simplifying a tiny bit: with each extra simultaneous zipping job the download becomes slower for every user, since they all share the network throughput to S3). Once the thread pool is maxed out, the application can no longer accept new requests, serve pages, etc.

How likely it is that we would actually reach that state - we don't know. At the moment the API is not actually that popular. But, once again, this is the finite resource that we are talking about and this is the problem we are trying to address. There will always be this possibility; it's safe to assume that it's becoming more real as the data becomes "bigger". So we should assume that it can and will happen, and then it will have to be dealt with, somehow, by limiting it on some level. Whether we do this limiting on the application side or offload the job remotely (there will be a finite number of threads on that external server too!), and whether we'll have to start telling the users "your files are being zipped asynchronously, we'll send you a notification with the download url once it's done" (that was basically the approach we agreed on back in March, above), or simply "we are too busy now, try again later" - we have to operate under the assumption that there will be situations where we won't be able to start streaming the data the moment the user clicks the download button.

How much effort we should be willing to invest into solving this issue, which hasn't started biting us in the butt just yet, should be a legitimate question though. The fact that S3 downloads are expensive literally - in terms of the money they cost - I don't view as a technical problem (or mine).
The point @donsizemore made - that HTTP is simply not a suitable download method past a certain size - is real though. But that may be outside the scope of this issue.
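To make the "we are too busy now, try again later" option above concrete, one way to cap this finite resource is a bounded pool dedicated to zipping jobs, so request threads are never exhausted. The pool and queue sizes below are arbitrary, and this is only a sketch of the idea, not how Dataverse currently handles it:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ZipJobPool {
    // At most 4 zip jobs run concurrently and at most 16 wait in the queue;
    // anything beyond that is rejected immediately instead of tying up threads.
    private static final ThreadPoolExecutor POOL = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(16),
            new ThreadPoolExecutor.AbortPolicy());

    /** Returns false when the pool is saturated, so the caller can answer "try again later" (e.g. HTTP 503). */
    public static boolean trySubmit(Runnable zipJob) {
        try {
            POOL.submit(zipJob);
            return true;
        } catch (RejectedExecutionException tooBusy) {
            return false;
        }
    }
}
```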
@mankoff
Strictly speaking, we are not "waiting for zipping" - we start streaming the zipped bytes right away, as soon as the first buffer becomes available. But I am with you in principle that trying to compress the content is a waste of cpu cycles in many cases.

It's the "download starts instantly" part that I have questions about. I mean, I don't see how it starts instantly, or how it starts at all. That ".../download" call merely exposed the directory structure - i.e. it produced some html with a bunch of links. It's still the client's job to issue the download requests for these links. I'm assuming what you mean is that the user can point something like wget at this virtual directory and tell it to crawl it. And then the downloads will indeed "start instantly", and wget will handle all the individual downloads and replicate the folder structure locally, etc. I agree that this would save a command line user a whole lot of scripting that's currently needed - first listing the files in the dataset, parsing the output, issuing individual download requests for the files, etc.

But as for the web users - the fact that it would require a browser extension for a folder download to happen automatically makes me think we're not going to be able to use this as the only multiple file download method. (I would definitely prefer not to have this rely on a ton of custom client-side javascript for crawling through the folders and issuing download requests either...) (Or is it now possible to create HTML5 folders that a user can actually drag-and-drop onto their system - an entire folder at once? Last time I checked it couldn't be done; but even if it can be done, I would expect it not to be universally supported in all browsers/on all OSes...)

My understanding of this is that even if this can be implemented as a viable solution, for both the API and web users, we'll still have to support the zipping mechanism. Of which, honestly, I'm not a big fan either. It's kind of a legacy thing - something we need to support, because that's the format most/many users want. As you said, the actual compression that's performed when we create that ZIP stream is most likely a waste of cycles. With large files specifically - most of those in our own production are either compressed already by the user, or are in a format that uses internal compression. So this means we are using ZIP not for compression, but simply as a format for packaging multiple files in a single archive stream. Which is kind of annoying, true. But I've been told that "users want ZIP".

But please don't get me wrong, I am interested in implementing this, even if it does not replace the ZIP download.
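A minimal sketch of the "package, don't compress" point above, using java.util.zip with the compression level set to zero so the ZIP stream is just a container and bytes flow to the client as they are read; the method name and the Map-of-streams input are illustrative, not Dataverse's actual code:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.zip.Deflater;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class ZipPackager {
    /** Streams the given files (name -> content) as a zip container with no actual compression. */
    public static void streamBundle(Map<String, InputStream> files, OutputStream response) throws IOException {
        try (ZipOutputStream zip = new ZipOutputStream(response)) {
            zip.setLevel(Deflater.NO_COMPRESSION);   // package only; skip the wasted cpu cycles
            byte[] buffer = new byte[64 * 1024];
            for (Map.Entry<String, InputStream> file : files.entrySet()) {
                zip.putNextEntry(new ZipEntry(file.getKey()));
                try (InputStream in = file.getValue()) {
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        zip.write(buffer, 0, read);   // bytes go out as soon as they are read
                    }
                }
                zip.closeEntry();
            }
        }
    }
}
```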
Hi @landreev - you're right, this does not start the download. I was assuming As for browser-users, sometimes I forget that interaction mode, but you are right, they would still need a way to download multiple files, hence zipping. If zipping is on-the-fly streaming and you don't have to wait for n files, n GB in size, to all get zipped, then it isn't as painful/bad as I assumed. You are right, exposing a virtual folder for CLI users (or DownThemAll browser extension users) and bulk download are separate issues. Perhaps this virtual folder should be its own Issue here on GitHub to keep things separate.
I realize that if appending
@mankoff
I like
But I readily acknowledge that it's still bad and painful, even with streaming.
Just a side note: one might be tempted to create a WebDAV interface, which could be included in things like Nextcloud.
We currently restrict zip file downloads to a max size (based on a setting) due to the performance cost of zipping large files via Glassfish. We should investigate:
This will allow for better system stability, and avoid the issue where the user downloads a zip file and doesn't find out until after the fact that it's not a complete archive (because it exceeded the limit).
For S3, it will be interesting to architect this with an eye towards cost. That is, if we store zipped versions of the dataset so that we can provide zips without processing time, we incur storage cost; but if we optimize the zipping so it takes place on demand, we incur some computation cost on Lambda/Fargate or whatever. Not sure what's preferred.