Optimize Zipping Process on the backend #6505

Closed
djbrooke opened this issue Jan 10, 2020 · 28 comments · Fixed by #6986
Comments

@djbrooke
Contributor

djbrooke commented Jan 10, 2020

We currently restrict zip file downloads to a max size (based on a setting) due to the performance cost of zipping large files via Glassfish. We should investigate:

  • how to store files in such a way that on-demand zipping is not needed,
  • how to modularize the functionality so that it does not tax the application server, or
  • some other option.

This will allow for better system stability and avoid the issue where the user downloads a zip file and doesn't find out until after the fact that it's not a complete archive (because it exceeded the size limit).

For S3, it will be interesting to architect this with an eye towards cost. That is, if we store zipped versions of the dataset so that we can provide zips without processing time, we incur a storage cost; but if we optimize the zipping so that it happens on demand, we incur some computation cost on Lambda/Fargate or whatever we use. Not sure which is preferred.

@poikilotherm
Contributor

poikilotherm commented Jan 12, 2020

It seems like a good idea to have this kind of batch processing offloaded to one of the great options around these days (JBatch, FaaS, a separate web service, ...).

While reading this I wondered whether this kind of work isn't even more widespread in Dataverse. I know this is kinda off-scope for this issue, but as you already generalized from #6093: wouldn't it make sense to discuss a more general offloading/modularization approach first?

I'm thinking of ingest processing, ZIP file creation, and full-text indexing here, as the most resource-intensive tasks in Dataverse that I am aware of.

@djbrooke
Contributor Author

@poikilotherm, I'm not opposed to a more general architecture discussion and will defer to @scolapasta, but I'm very interested in implementing this for zipping first. Note that ingest modularization is included in the TRSA branch.

@mheppler
Contributor

In the issue __, @shlake pointed out that a tabular file in Harvard Dataverse v4.5.1 does not include the "All File Formats + Information" option under the download dropdown menu. In another comment, @scolapasta points out the "need to retest and confirm that it's no longer causing trouble, if/when we re-enable these bundle downloads".

I suggest that this use case and the request to return this feature to the UI also be considered in this download ZIP optimization effort.

@djbrooke
Contributor Author

@mheppler thanks for linking up that issue. I'd like to keep anything that may have a front-end impact, or other feature requests, as separate issues and not consider those use cases as part of this one. This is solely about optimizing how things work on the backend, as that would tremendously reduce the incomplete-zip issues and increase stability. I'll update the title to clarify.

@djbrooke djbrooke changed the title Optimize Zipping Optimize Zipping Process on the backend Feb 11, 2020
@mheppler
Contributor

After discussing with @djbrooke and @scolapasta, we concluded that the file-level ZIP creation might not fit into the final solution for this issue, but the use case can still be considered in the discussion.

From what I can tell, to return this functionality, we only need to remove the ui:remove tags around the "All File Formats + Information" link in file-download-button-fragment.xhtml.

            <ui:remove>
                <li>
                    <p:commandLink styleClass="highlightBold" rendered="#{!(downloadPopupRequired)}"
                                   process="@this"
                                   actionListener="#{fileDownloadService.writeGuestbookAndStartFileDownload(guestbookResponse, fileMetadata, 'bundle')}">
                        #{bundle['file.downloadBtn.format.all']}
                    </p:commandLink>
                    <p:commandLink styleClass="highlightBold" rendered="#{downloadPopupRequired}"
                                   process="@this"
                                   action="#{guestbookResponseService.modifyDatafileAndFormat(guestbookResponse, fileMetadata, 'bundle' )}"
                                   update="@widgetVar(downloadPopup)"
                                   oncomplete="PF('downloadPopup').show();handleResizeDialog('downloadPopup');">
                        #{bundle['file.downloadBtn.format.all']}
                    </p:commandLink>
                </li>
                <li role="presentation" class="divider"></li>
            </ui:remove>

@scolapasta
Contributor

Discussed with the team during tech hours - our approach here will be to separate it from the core Dataverse app as a service that can be called. Some details will need to be hashed out in development, but generally (a rough sketch follows below):
• a request for multiple files that need to be zipped comes in to the main app
• permissions are checked, and a list of files (or, more likely, file locations) is sent to the separate service
• the service creates the zip and either streams it back to the end user, or puts it in a temp location and informs the user it is ready for download (we may want this second option if we think we need to create multiple zips for one download)
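
A minimal sketch of what that hand-off could look like, assuming a hypothetical standalone zipping service with a /zip endpoint that accepts a JSON list of file locations; the endpoint, payload shape, and class name are illustrative, not part of any agreed design:

    // Hypothetical hand-off from the main Dataverse app to a standalone zipping
    // service. The "/zip" endpoint, the JSON payload shape, and the class name are
    // all assumptions for illustration -- not part of any agreed design.
    import java.util.List;
    import javax.json.Json;
    import javax.json.JsonArrayBuilder;
    import javax.ws.rs.client.Client;
    import javax.ws.rs.client.ClientBuilder;
    import javax.ws.rs.client.Entity;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.core.Response;

    public class ZipServiceClient {

        private final Client client = ClientBuilder.newClient();

        /**
         * After the main app has checked permissions, it sends the locations of the
         * authorized files to the external service, which streams the zip back
         * (or stages it in a temp location and returns a URL for later download).
         */
        public Response requestZip(String zipServiceBaseUrl, List<String> authorizedFileLocations) {
            JsonArrayBuilder files = Json.createArrayBuilder();
            authorizedFileLocations.forEach(files::add);

            return client.target(zipServiceBaseUrl)
                    .path("zip")
                    .request()
                    .post(Entity.entity(files.build().toString(), MediaType.APPLICATION_JSON));
        }
    }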

@landreev
Contributor

To outline where I am now, and what still needs to be figured out (a lot):

First of all, what it is that we are trying to address: with increasingly large files and datasets, serving zip file bundles can potentially tie up many worker threads in these long-running tasks or otherwise hog a significant amount of resources. We currently try to manage it by limiting the amount of data allowed in a zip bundle. This of course is unpopular with our users who want to have an easier mechanism to download an entire dataset or a significant portion thereof. This too is bound to become only more of an issue with larger datasets.

A solution being considered is to offload this job, potentially onto a different server (this would be available as an option). It's not necessary that the zipped stream be generated there faster or more efficiently (although that would help). The main idea is that if too many users attempt this at once on too many files, it will degrade or crash this additional server, and not the main application.

Note that the problem has a trivial/brute-force solution, using mostly already-available technology: this "extra zipping server" is simply another instance of Dataverse, with access controls blocking everything but /api/access/datafiles/* from the outside. On the main application server, FileDownloadService uses the URL of the external server above when a bundle download is requested. The only currently available solution for authentication/authorization of restricted files in this setup, however, would be to embed the user's API token into the redirect URL. (This "solution" is for illustration purposes only - but do note that it would actually work.)
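
Just to make that brute-force illustration concrete (and to make it obvious why it's not acceptable), the redirect would look roughly like this; the host name and file ids are made up:

    // Brute-force illustration only, not the proposed design: redirect the user to a
    // second Dataverse instance that serves nothing but /api/access/datafiles/*.
    // Embedding the user's API token in the URL is exactly the part we want to avoid.
    public class BruteForceRedirect {

        static String buildRedirectUrl(String zipServerUrl, String commaSeparatedFileIds, String apiToken) {
            return zipServerUrl + "/api/access/datafiles/" + commaSeparatedFileIds
                    + "?key=" + apiToken;
        }

        // e.g. buildRedirectUrl("https://zipper.example.edu", "1234,1235", userToken)
        //   -> https://zipper.example.edu/api/access/datafiles/1234,1235?key=<userToken>
    }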

Things that need to be figured out for a real practical solution:

  • Is it synchronous or asynchronous? JBatch was mentioned earlier in the discussion (it's great, btw; we already use it, for "package file" custom imports). We could use it in combination with JMS, or some equivalent, to manage queuing of these batch jobs. As was also mentioned earlier in the issue, there are other areas of the application where similar asynchronous batch execution could be introduced. The problem here is that I am not sure this particular task, i.e. generating zipped download streams, is a good candidate for asynchronous handling. Specifically, I cannot think of a way of handling it asynchronously that would not require storing the generated file, at least temporarily. And I would absolutely prefer to avoid that: it makes life so much easier to generate the stream on the fly and immediately stream it to the user. Otherwise we would have to schedule the job and tell the user to "come back later", or send them a notification when it's done; but we would have to keep that generated zip in some temp storage until the user downloads it. (And if we are talking about truly "large data", using local temp space is simply not going to work for us here, or for anyone else using AWS; so it would need to be cached back on S3 - which would have its own overhead.)

  • What is the relationship between Dataverse and this extra download service? Specifically, is this essentially a limited-functionality Dataverse application - one that can use Dataverse services, StorageIO, authentication, etc.? Or is it a completely standalone entity that can only communicate with the Dataverse via its APIs? (There is a third potential solution: a standalone entity with direct access to both the Dataverse database and the filesystem or bucket where the files live; this could be fairly efficient, but I don't think we want to go there...)

  • With respect to the above, it appears that the applicability of either of the two solutions seriously depends on whether the files are stored on a local filesystem or in an S3 bucket. If it is the latter, the overhead of using the API is minimal: the "zipping service app" will need to make individual /api/access calls for each file, but it will be getting back the redirect-to-S3 URLs and following them to get the bytes, which will cost it roughly the same as it costs the Dataverse application itself to read the files from S3 over the same network connection. I am currently working on a prototype that uses this (API) approach.
    However, if the Dataverse in question stores its files on a filesystem, this approach likely has more overhead than we want, since the files would need to be read over the network; even if it is a very local network, that's still a lot of overhead compared to reading the files directly via FileAccessIO.

  • Actual zipping: it can be done better/more efficiently than what we are doing now (which simply uses java.util.zip). In my prototype I am experimenting with the org.apache.commons Zip implementation, particularly their ParallelScatterZipCreator. It's already used in Dataverse, in the BagIt archive generator (thanks to @qqmyers for the tips and consulting). It's pretty awesome in its ability to use multiple threads to generate the archive faster. It may be much better suited for an asynchronous execution scenario, however (in fact, it may take longer to start streaming the output); and it may also provide less of an overall improvement when the individual files are read from an S3 bucket vs. the filesystem. But I need to do some experimenting before I can draw any conclusions. (A rough sketch of this approach follows after this list.)

  • Access to restricted files. We don't want to resort to something as brute-force as embedding the API token into the download URL. I'm working on a prototype of a simple system of one-time access tokens that can be used instead. (To be discussed.)
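
A rough sketch of the ParallelScatterZipCreator approach mentioned in the zipping bullet above, reading from local files for simplicity (the real prototype would be fed streams from StorageIO or the access API):

    // Multi-threaded zip creation with org.apache.commons.compress, the library the
    // BagIt generator already uses. File names and the local-file stream source are
    // placeholders; this is a sketch, not the prototype itself.
    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.UncheckedIOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.zip.ZipEntry;

    import org.apache.commons.compress.archivers.zip.ParallelScatterZipCreator;
    import org.apache.commons.compress.archivers.zip.ZipArchiveEntry;
    import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
    import org.apache.commons.compress.parallel.InputStreamSupplier;

    public class ParallelZipSketch {

        /** Compresses the given files on multiple threads and writes one zip archive to out. */
        public static void zip(List<Path> files, OutputStream out) throws Exception {
            ParallelScatterZipCreator scatterCreator = new ParallelScatterZipCreator();

            for (Path file : files) {
                ZipArchiveEntry entry = new ZipArchiveEntry(file.getFileName().toString());
                entry.setMethod(ZipEntry.DEFLATED); // the method must be set before the entry is added
                InputStreamSupplier supplier = () -> {
                    try {
                        return Files.newInputStream(file);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                };
                scatterCreator.addArchiveEntry(entry, supplier);
            }

            // The archive is only assembled once the worker threads have finished compressing,
            // which is why this approach may delay the moment the output actually starts streaming.
            try (ZipArchiveOutputStream zipOut = new ZipArchiveOutputStream(out)) {
                scatterCreator.writeTo(zipOut);
            }
        }
    }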

There is definitely more to discuss (to be continued).

@qqmyers
Member

qqmyers commented May 12, 2020

FWIW: At least for S3, the download URLs are presigned, so they are essentially a 1-time token already.
General one-time tokens would be useful for external tools, etc. It's straightforward to do it in a way where the server doesn't have to track what URLs have been generated.
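
One way that stateless scheme could look, assuming a secret shared between Dataverse and the zipping service (purely a sketch; the names and token format are made up): sign the file id plus an expiry timestamp with an HMAC, so nothing needs to be stored or tracked on the server.

    // Sketch of a stateless, limited-time download token, assuming a shared HMAC secret.
    // The token format ("fileId:expiry:signature") and names are made up for illustration.
    import java.nio.charset.StandardCharsets;
    import java.time.Instant;
    import java.util.Base64;
    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;

    public class SignedDownloadToken {

        /** Issued by Dataverse: valid for ttlSeconds, no server-side bookkeeping needed. */
        public static String issue(String fileId, long ttlSeconds, byte[] sharedSecret) throws Exception {
            long expiry = Instant.now().getEpochSecond() + ttlSeconds;
            String payload = fileId + ":" + expiry;
            return payload + ":" + sign(payload, sharedSecret);
        }

        /** Verified by whichever side serves the bytes: checks the expiry and the signature. */
        public static boolean verify(String token, byte[] sharedSecret) throws Exception {
            String[] parts = token.split(":");
            if (parts.length != 3) {
                return false;
            }
            String payload = parts[0] + ":" + parts[1];
            boolean notExpired = Long.parseLong(parts[1]) >= Instant.now().getEpochSecond();
            return notExpired && sign(payload, sharedSecret).equals(parts[2]);
        }

        private static String sign(String payload, byte[] secret) throws Exception {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return Base64.getUrlEncoder().withoutPadding()
                    .encodeToString(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
        }
    }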

@landreev
Contributor

FWIW: At least for S3, the download URLs are presigned, so they are essentially a 1-time token already.

They are; I meant, if this is a truly standalone service, it needs to authenticate itself to the Dataverse first, before it gets the presigned S3 urls. Of course, alternatively, the Dataverse could call the service directly, giving it all the urls, before sending the user there. One way or another, this part is doable, yes.

@scolapasta
Contributor

That is what I had originally envisioned: user goes to Dataverse, attempts to download, Dataverse confirms which files they are authorized for, then sends those urls to the service which does the zipping.

@landreev
Contributor

@scolapasta That's how I read your earlier summary too. But then this extra call (from Dataverse to the "zip service") can be skipped by issuing this "one time token" authorizing the downloads and embedding it in the download URL that we send the user to.
(In your setup above, though, what are "those urls" for files that are on a local filesystem, and not on S3?)

@scolapasta
Contributor

Good point. When we added the one-time URLs for S3, we always said it would be nice to have something similar (a CGI script) for a local filesystem that could do the same thing. So possibly it could be that? Or, in lieu of that, it could be the direct location of the file, since this service could have access to the filesystem directly?

@landreev
Contributor

So that's what I was talking about - a temporary token that gets deleted once the file is served; an equivalent of those pre-signed S3 urls. (which, technically, may not be "one time", but limited time)

@landreev
Contributor

Or in lieu of that, it could be the direct location of the file since this service could have access to the file system directly?

that part - direct access to the files and/or database - I mentioned, as a technical possibility; but thought that it was too hacky to consider - ?

@landreev
Contributor

that part - direct access to the files and/or database - I mentioned, as a technical possibility; but thought that it was too hacky to consider - ?

(I actually meant to ask about this, but we never got to it)

@landreev
Contributor

In my earlier summary I tried to start with defining the problem we are trying to solve. Since it kept coming up, let me try again:

Zipping up files is not super CPU- or memory-intensive. But it takes time, which is especially true when the source files live on S3 and have to be accessed over the network. I'm pretty sure that time - the worker threads tied up for its duration - is the finite resource that would be the stress point if we removed the total size limit and left things as they are otherwise. Theoretically, if the feature becomes popular enough, these long-running zipping jobs will max out the thread pool (simplifying a tiny bit: with each extra simultaneous zipping job, the download becomes slower for every user, since they all share the network throughput to S3). Once the thread pool is maxed out, the application can no longer accept new requests, serve pages, etc.

How likely it is that we would actually reach that state - we don't know. At the moment the API is not actually that popular. But, once again, this is the finite resource that we are talking about, and this is the problem we are trying to address. There will always be this possibility, and it's safe to assume it's becoming more real as the data gets "bigger". So we should assume that it can and will happen, and that it will have to be dealt with, somehow, by limiting it on some level. Whether we do this limiting on the application side or offload the job remotely (there will be a finite number of threads on that external server too!), and whether we tell the users "your files are being zipped asynchronously, we'll send you a notification with the download url once it's done" (that was basically the approach we agreed on back in March, above) or simply "we are too busy now, try again later" - we have to operate under the assumption that there will be situations where we won't be able to start streaming the data the moment the user clicks the download button.
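
Whichever server ends up doing the zipping, the "we are too busy now, try again later" kind of limiting could be as simple as a bounded count of concurrent zip jobs; a minimal sketch (the limit of 10 and the suggested 503 response are made up):

    // Minimal sketch of limiting concurrent zip jobs on whichever server does the work.
    // The limit of 10 and the suggested 503 response are illustrative only.
    import java.util.concurrent.Semaphore;

    public class ZipJobLimiter {

        // At most this many zip streams may be generated at the same time.
        private final Semaphore slots = new Semaphore(10);

        /** Runs the zipping job if a slot is free; otherwise tells the caller we are busy. */
        public boolean tryZip(Runnable zipJob) {
            if (!slots.tryAcquire()) {
                return false; // caller would respond with e.g. 503 Service Unavailable
            }
            try {
                zipJob.run();
                return true;
            } finally {
                slots.release();
            }
        }
    }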

How much effort we should be willing to invest in solving this issue, which hasn't started biting us in the butt just yet, is a legitimate question, though.

The fact that S3 downloads are literally expensive - in terms of the money they cost - is not something I view as a technical problem (or as mine).

@landreev
Contributor

The point @donsizemore made - that HTTP is simply not a suitable download method past a certain size - is real though. But that may be outside the scope of this issue.

@landreev
Contributor

@mankoff
Hello, I apologize for missing this discussion (bad/busy time of the year here).
I actually like the idea and would be interested in trying to schedule it for a near release. But I'm not sure this can completely replace the download-multiple-files-as-zip functionality.
OK, so adding "/download" to the dataset URL "exposes the files within as a virtual folder structure" - so, something that looks like your normal Apache directory listing? Again, I like the idea, but I'm not entirely sure about the next sentence:

No waiting for zipping, which could be a long wait if the dataset is 100s of GB. The download starts instantly

Strictly speaking, we are not "waiting for zipping" - we start streaming the zipped bytes right away, as soon as the first buffer becomes available. But I am with you in principle, that trying to compress the content is a waste of cpu cycles in many cases. It's the "download starts instantly" part that I have questions about. I mean, I don't see how it starts instantly, or how it starts at all. That ".../download" call merely exposed the directory structure - i.e. it produced some html with a bunch of links. It's still the client's job to issue the download requests for these links. I'm assuming what you mean is that the user can point something like wget at this virtual directory, and tell it to crawl it. And then the downloads will indeed "start instantly", and wget will handle all the individual downloads and replicate the folder structure locally, etc. I agree that this would save a command line user a whole lot of scripting that's currently needed - first listing the files in the dataset, parsing the output, issuing individual download requests for the files etc. But as for the web users - the fact that it would require a browser extension for a folder download to happen automatically makes me think we're not going to be able to use this as the only multiple file download method. (I would definitely prefer not to have this rely on a ton of custom client-side javascript, for crawling through the folders and issuing download requests either...)

(Or is it now possible to create HTML5 folders, that a user can actually drag-and-drop onto their system - an entire folder at once? Last time I checked it couldn't be done; but even if it can be done, I would expect it not to be universally supported in all browsers/on all OSes...)

My understanding of this is that even if this can be implemented as a viable solution, for both the API and web users, we'll still have to support the zipping mechanism. Of which, honestly, I'm not a big fan either. It's kind of a legacy thing - something we need to support, because that's the format most/many users want. As you said, the actual compression that's performed when we create that ZIP stream is most likely a waste of cycles. With large files specifically - most of those in our own production are either compressed already by the user, or are in a format that uses internal compression. So this means we are using ZIP not for compression - but simply as a format for packaging multiple files in a single archive stream. Which is kind of annoying, true. But I've been told that "users want ZIP".

But please don't get me wrong, I am interested in implementing this, even if it does not replace the ZIP download.
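
Since ZIP here is mostly being used as a container rather than for compression, one small optimization worth noting (not something that has been decided on) is to skip the deflate work for files that are already compressed; with java.util.zip that is just a matter of the per-entry compression level:

    // Using ZIP purely as a packaging format: skip re-compressing entries that are
    // already compressed. The extension check is a stand-in for whatever heuristic
    // (content type, ingest metadata) would actually be used.
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.Deflater;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipOutputStream;

    public class PackagingZipper {

        public static void addEntry(ZipOutputStream zipOut, String name, InputStream in)
                throws IOException {
            // No point spending cycles deflating .gz, .zip, .png, etc. -- write them
            // through at compression level 0 instead.
            boolean alreadyCompressed = name.matches("(?i).*\\.(gz|zip|png|jpg|jpeg|mp4)$");
            zipOut.setLevel(alreadyCompressed ? Deflater.NO_COMPRESSION : Deflater.DEFAULT_COMPRESSION);

            zipOut.putNextEntry(new ZipEntry(name));
            byte[] buffer = new byte[64 * 1024];
            int n;
            while ((n = in.read(buffer)) != -1) {
                zipOut.write(buffer, 0, n);
            }
            zipOut.closeEntry();
        }
    }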

@mankoff
Contributor

mankoff commented Jun 25, 2020

Hi @landreev - you're right, this does not start the download. I was assuming wget is pointed at that URL, and that starts the downloads.

As for browser-users, sometimes I forget that interaction mode, but you are right, they would still need a way to download multiple files, hence zipping. If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed. You are right, exposing a virtual folder for CLI users (or DownThemAll browser extension users) and bulk download are separate issues. Perhaps this virtual folder should be its own Issue here on GitHub to keep things separate.

@mankoff
Contributor

mankoff commented Jun 25, 2020

I realize that if appending /download to the URL doesn't start the download, as @landreev pointed out, then that may not be the best URL. Perhaps /files would be better. In that case, appending /metadata could be a way for computers to fetch the equivalent of the metadata tab that users might click on, here again via a simpler mechanism than the API.

@landreev
Contributor

@mankoff
Yes, I will open a new issue for it.

@landreev
Contributor

landreev commented Jun 25, 2020

@mankoff

I realize that if appending /download to the URL doesn't start the download ... that may not be the best URL. Perhaps /files would be better.

I like /files. Or /viewfiles? - something like that.
I would also like to point out that we don't want this option to start the download automatically, even if that were possible. Just like with zipped downloads, via either the API or the GUI, not everybody wants all the files. So we want the command-line user to be able to look at the output of this /files call and, for example, select a subfolder they want - and then tell wget to crawl it. Same with the web user.

@landreev
Contributor

@mankoff

... If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed.

But I readily acknowledge that it's still bad and painful, even with streaming.
The very fact that we are relying on one long uninterrupted HTTP GET request to potentially download a huge amount of data is "painful". And the "uninterrupted" part is a must - because it cannot be resumed from a specific point if the connection dies (by nature of having to generate the zipped stream on the fly). There are other "bad" things about this process, some we have discussed already (spending CPU cycles compressing = potential waste); and some I haven't even mentioned yet... So yes, being able to offer an alternative would be great.

@poikilotherm
Contributor

Just a side note: one might be tempted to create a WebDAV interface, which could be included in things like Nextcloud.

landreev added a commit that referenced this issue Jun 26, 2020
landreev added a commit that referenced this issue Jun 26, 2020
landreev added a commit that referenced this issue Jul 8, 2020
landreev added a commit that referenced this issue Jul 9, 2020
landreev added a commit that referenced this issue Jul 14, 2020