Implement access to the files in the dataset as a virtual folder tree #7084
Comments
From #6505:

From @mankoff: Hello. I was sent here from #4529. I'm curious why zipping is a requirement for bulk download. It has been a long time since I've admin'd a webserver, but if I recall correctly, many servers (e.g. Apache) perform on-the-fly compression for files that they transfer. I'm imagining a solution where appending … Just some thoughts about how I'd like to see bulk download exposed as an end user.

From @poikilotherm: Independent of the pros and cons of ZIP files (like for many small files), I really like the idea proposed above. The two approaches aren't mutually exclusive either, which makes it even more attractive. It should be as simple as rendering a very simple HTML page containing the links to the files. So this still allows for control of direct or indirect access to the data, even using things like secret download tokens. Obviously the same goal of bulk download could be achieved via some script, too, but using normal system tools like curl and wget is an even lower barrier for scientists/end users than using the API.

From @landreev: Strictly speaking, we are not "waiting for zipping" - we start streaming the zipped bytes right away, as soon as the first buffer becomes available. But I am with you in principle: trying to compress the content is a waste of CPU cycles in many cases. It's the "download starts instantly" part that I have questions about. I mean, I don't see how it starts instantly, or how it starts at all. That ".../download" call merely exposed the directory structure - i.e. it produced some HTML with a bunch of links. It's still the client's job to issue the download requests for these links. I'm assuming what you mean is that the user can point something like wget at this virtual directory and tell it to crawl it. And then the downloads will indeed "start instantly", and wget will handle all the individual downloads and replicate the folder structure locally, etc. I agree that this would save a command-line user a whole lot of scripting that's currently needed - first listing the files in the dataset, parsing the output, issuing individual download requests for the files, etc. But as for the web users - the fact that it would require a browser extension for a folder download to happen automatically makes me think we're not going to be able to use this as the only multiple-file download method. (I would definitely prefer not to have this rely on a ton of custom client-side JavaScript for crawling through the folders and issuing download requests either...) (Or is it now possible to create HTML5 folders that a user can actually drag-and-drop onto their system - an entire folder at once? Last time I checked it couldn't be done; but even if it can be done, I would expect it not to be universally supported in all browsers/on all OSes...) My understanding is that even if this can be implemented as a viable solution, for both the API and web users, we'll still have to support the zipping mechanism. Of which, honestly, I'm not a big fan either. It's kind of a legacy thing - something we need to support, because that's the format most/many users want. As you said, the actual compression that's performed when we create that ZIP stream is most likely a waste of cycles. With large files specifically - most of those in our own production are either compressed already by the user, or are in a format that uses internal compression. So this means we are using ZIP not for compression, but simply as a format for packaging multiple files in a single archive stream. Which is kind of annoying, true. But I've been told that "users want ZIP". But please don't get me wrong, I am interested in implementing this, even if it does not replace the ZIP download.

From @mankoff: Hi - you're right, this does not start the download. I was assuming … As for browser users, sometimes I forget that interaction mode, but you are right, they would still need a way to download multiple files, hence zipping. If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed. You are right, exposing a virtual folder for CLI users (or DownThemAll browser-extension users) and bulk download are separate issues. Perhaps this virtual folder should be its own issue here on GitHub to keep things separate.

From @mankoff: I realize that if appending …

From @landreev: I like …

From @landreev: But I readily acknowledge that it's still bad and painful, even with streaming.

From @poikilotherm: Just a side note: one might be tempted to create a WebDAV interface, which could be included in things like Nextcloud.
Related to #7174 - the …
More generally, it would be nice to always have access to the latest version of a file, even though the file DOI changes when the file updates. The behavior described here provides that feature. I'm not sure this is correct though, because that means …
@djbrooke I see you added this to a "Needs Discussion" card. Is there any part of the discussion I can help with?
Another use case that popped up today from a workshop: making such a structure available could help with integrating data in Dataverse with DataLad, https://github.com/datalad/datalad. DataLad is basically a wrapper around git and git-annex. DataLad is gaining traction especially in communities with big data needs, like neuroimaging.
@poikilotherm we're friendly with the DataLad team. In https://chat.dataverse.org the DataLad PI is "yoh" and I've had the privilege of having tacos with him in Boston and beers with him in Brussels ( https://twitter.com/philipdurbin/status/1223987847222431744 ). I really enjoyed the talk they gave at FOSDEM 2020 and you can find a recording here: https://archive.fosdem.org/2020/schedule/event/open_research_datalad/ . Anyway, we're happy to integrate with DataLad in whatever way makes sense (whenever it makes sense). @mankoff "Needs Discussion" makes more sense if you look at our project board, which goes from left to right: https://github.com/orgs/IQSS/projects/2
Basically, "Needs Discussion" means that the issue is not yet defined well enough to be estimated or to have a developer pick it up. As of this writing it looks like there are 39 of these issues, so you might need to be patient with us. 😄
@mankoff @poikilotherm @pdurbin thanks for the discussion here. I'll prioritize the team discussing this as there seem to be a few use cases that could be supported here. @scolapasta can you get this on the docket for next tech hours? (Or just discuss with @landreev if it's clear enough.) Thanks all!
I agree that this should be ready to move into the "Up Next" column. And I just want to emphasize that this is my understanding of what we want to develop: this is not another UI implementation of a tree view of the files and folders (like we already have on the dataset page, but with download links). This is not for human users (mostly), but for download clients (command-line based or browser extensions) to be able to crawl through the whole thing and download every file; hence this should output a simple HTML view of one folder at a time, with download links for files and recursive links to sub-folders. Again, similar to how files and directories on a filesystem look when exposed behind an httpd server.
Thinking about this - I agree that this API should understand version numbers; maybe serve the latest version by default, or a selected version when specified. But I'm not sure about providing a top-level access point for multiple versions at the same time, like in your example above. My problem with that is that if you have a file that happens to be in all 10 versions, a crawler will proceed to download and save 10 different copies of that file if you point it at the top-level pseudo folder.
I'm happy to hear this is moving toward implementation. I agree with your understanding of the features, functions, and purpose of this. This is also what you wrote when you opened the ticket. I was just going to repeat my 'version' comment when you posted your 2nd comment above. Yes to latest by default, with perhaps some method to access earlier versions (could be a different URL you find from the GUI, not necessarily as sub-folders under the default URL for this feature). Use case: I have a dataset A that updates every 12 days via the API. I am working on another project B that is doing something every day, and I always want the latest A. It would be good if the code in B is 1 line (wget to the latest URL, download if the server version is newer than the local version). It would not be as good if B needed to include a complicated function to access the A dataset, parse something to get the URL for the latest version, and then download that.
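For illustration only, a hypothetical sketch of the one-liner this use case asks for. The URL is invented (no stable "latest" endpoint existed at this point in the discussion), and wget's -N timestamping only skips unchanged files if the server sends usable Last-Modified headers:

```
# Hypothetical: re-fetch files from dataset A only when the server copy is newer than the local one.
wget -r -N -nH "https://dataverse.example.edu/latest-url-for-dataset-A/"
```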
Just a quick thought: what about making it WebDAV compatible? It could be integrated into Nextcloud/ownCloud this way (read-only for now).
Thanks @mankoff, I think we're all set; I was just capturing some discussion from the sprint planning meeting this afternoon. We'll start working on this soon, and I mentioned that we may run some iterations by you as we build it out.
Thinking about the behaviors requested here after reading and commenting on #7425, I see a problem. The original request was to allow easy file access for a dataset, so doi:nnnn/files exposes the files for that dataset. The request grew to add a logical feature to support easy access to the latest version of the files. "Easy" here presumably means via a fixed URL. But DOIs point to a specific version, so it is counter-intuitive for doi:nnn_v1/latest to point to something that is not v1.

I note that Zenodo provides a DOI that always points to the latest version, with clear links back to the earlier DOI'd versions. Would this behavior be a major architecture change for Dataverse? Or if you go to doi:nnnn/latest, does it automatically redirect to a different DOI, unless nnnn is the latest? I'm not sure if this is a reasonable behavior or not.

Anyway, perhaps "easy URL access to folders" and "fixed URL access to latest" should be treated as two distinct features to implement and test, although there is a connection between the two and the latter should make use of the former.
How would a …
If files are deleted or renamed, then a 404 or similar error seems fine. Note that this ticket is about exposing files and folders in a simple view, so if you use this feature to link to the latest version of a dataset (not a file within the dataset), then everything "just works", because whatever files exist in the latest version would be in that folder, by definition. Here are some use-cases:
How can we easily share this dataset with colleagues (and computers) so they always get the latest data? From your suggestions above, …
Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new data would appear there. Does that support your two cases OK? (In essence, the DOI + :latest serves as the ID for the latest version, versus there being a separate DOI for that.)
Please recall the opening description by @landreev, "similar to how static files and directories are shown on simple web servers." Picture this API as letting you browse a simple folder. This may help answer some of the questions below. If a file is deleted from a folder, it is no longer there. If a file is renamed or replaced, the latest view should be clearly defined based on our shared common OS experiences of browsing folders containing files (Mac, Windows, Linux - not VAX or the Dropbox web-view behavior). Another option that may simplify implementation: the …
Yes, this works for both use cases. I still point out that … Furthermore, if …
I personally am not concerned by this. The relationship is still available for people to see in the GUI "Versions" tab.
Hmmm. Ugh :). So I see the following choices:

File is deleted and not in the latest version: …

File is replaced and in the latest version: …

File is deleted, then added, and exists in the latest version: …

This seems overly complicated and I'd vote for "just return the latest".
@mankoff and anyone else who may be interested, the current implementation in my branch works as follows. I called the new crawlable file access API "fileaccess". The API outputs a simple HTML listing (I made it look like the standard Apache directory index), with Access API download links for the individual files.

I think it's easier to use an example, and pictures. Let's say we have a dataset version with 2 files, one of them with the folder name "subfolder" specified: … or, as viewed as a tree on the dataset page: … The output of the fileaccess API for the top-level folder (…) looks like this: … with the underlying HTML source: …

And if you follow the … link for the sub-folder: … with the HTML source as follows: …
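The screenshots and HTML source referenced above did not survive in this copy of the thread. Purely as an illustration of the Apache-style listing being described - not the actual markup - the two levels might be fetched and look roughly like this; the /api/datasets/<id>/fileaccess path and the folder parameter are assumptions based on the name given above and on the dirindex URLs that appear in later comments:

```
# Top-level folder: one file plus a crawlable link to "subfolder" (output shown as comments, purely illustrative).
curl "https://demo.dataverse.org/api/datasets/24/fileaccess/"
#   Index of folder / in dataset doi:XX/YYYY
#   file1.csv    -> /api/access/datafile/101
#   subfolder/   -> /api/datasets/24/fileaccess/?folder=subfolder

# The sub-folder listing, reached by following the recursive link above:
curl "https://demo.dataverse.org/api/datasets/24/fileaccess/?folder=subfolder"
#   Index of folder /subfolder
#   file2.txt    -> /api/access/datafile/102
```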
Note that I'm solving the problem of having …
The wget command line for crawling this API is NOT pretty, but it's what I've come up with so far that actually works:
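The actual command from this comment was lost in this copy of the thread. As a rough sketch of the kind of invocation being described - the --content-disposition flag is confirmed by the follow-up comments, while the exact URL, the --cut-dirs depth, and the robots handling are assumptions (and the path shown uses the dirindex name the API was later given, rather than the earlier "fileaccess" name):

```
# Sketch only: crawl the virtual folder tree recursively and save files under their real names.
wget -r -e robots=off -nH --cut-dirs=3 --content-disposition "https://demo.dataverse.org/api/datasets/24/dirindex/"
```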
Any feedback - comments/suggestions - is welcome.
This looks good at first pass. I did not know of the … One concern is the …
Correct, without the "--content-disposition" flag wget will download the files under names derived from the download URLs rather than their real filenames …
It understands our standard version id notations like …
… (version number added to the dir. index) #7084
@mankoff So the API path is now …; it defaults to the latest. An optional parameter … can be used to request a specific version.
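For reference, a sketch of how the finished endpoint is typically called. The /api/datasets/<id>/dirindex path is taken from the URLs quoted in the comments below; the exact syntax of the version parameter is an assumption based on the standard Dataverse version notation mentioned above:

```
# Directory index of the latest version (the default):
curl "https://demo.dataverse.org/api/datasets/24/dirindex/"

# Directory index of a specific version, assuming the standard version notation is accepted:
curl "https://demo.dataverse.org/api/datasets/24/dirindex/?version=1.0"
```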
Hello. If my institution upgrades their Dataverse, will we receive this feature? Or is it implemented in some future release and not included in the latest version installed when updating?
Hi @mankoff, this will be included in the next release, 5.4. I added the 5.4 tag to the PR. Once 5.4 shows up in https://github.com/IQSS/dataverse/releases you'll be able to install and use the release with this feature. We expect this in the next few weeks - we're just waiting on a few more issues to finish up.
Hello. I see that demo.dataverse.org is now at v5.4, so I'd like to test this. I'm reading the docs here: https://guides.dataverse.org/en/latest/api/native-api.html?highlight=dirindex#view-dataset-files-and-folders-as-a-directory-index and it seems to work only with the dataset ID. If I'm an end user, how do I find the ID? Is there a way to browse the …
Also, regarding point #5 from #7084 (comment), this API does not allow browsing. When I go to https://demo.dataverse.org/api/datasets/24/dirindex I'm given an ".index" file to download in Firefox, not something that I can view in my browser. This also means (I think?) that browser tools that I hoped would use this feature, like DownThemAll, probably won't work.
I'm also seeing an ".index.html" download with https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/HXJVJU or https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/PDRSIQ. The file contains the expected HTML page.
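As a quick way to check the output without the file-download step, fetching the same URL with curl prints the HTML listing straight to the terminal:

```
curl "https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/HXJVJU"
```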
Well, that tells me how to use this with a DOI rather than an ID. I suggest making this option clear in the API docs. I'll add an issue for that.
This is based on a suggestion from a user (@mankoff) made earlier in the "optimize zip" issue (#6505). I believe something similar had also been proposed elsewhere earlier. I'm going to copy the relevant discussion from that issue and add it here.
I do not consider this a possible replacement for the "download multiple files as zip" functionality. Unfortunately, we're pretty much stuck supporting zip, since it has become the de facto standard for sharing multi-file and folder bundles. But it could be something very useful to offer as another option.
The way it would work, there would be an API call (for example, /api/access/dataset/<id>/files) that would expose the files and folders in the dataset as a crawlable tree of links, similar to how static files and directories are shown on simple web servers. A command-line user could point a client - for example, wget - at it to crawl and save the entire tree, or a sub-folder thereof. The advantages of this method are huge: the end result is the same as downloading the entire dataset as a zip and unpacking the archive locally, in one step. But it's achieved in a dramatically better way - by wget issuing individual GET calls for the individual files; meaning that those a) can be redirected to S3 and b) the whole process is completely resumable if it is interrupted, unlike the single continuous zip download, which cannot be resumed at all.

The advantages are not as dramatic for the web UI users. None of the browsers I know of support drag-and-drop downloads of entire folders out of the box. However, plugins that do that are available for the major browsers. Still, even clicking through the folders and being able to download the files directly (unlike in the current "tree view" on the page) would be pretty awesome. Again, see the discussion re-posted at the top of this thread for more information.
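A minimal sketch of the workflow this paragraph describes, using the example path proposed here (not necessarily the path that eventually shipped); the wget flags and the dataset id are assumptions for illustration:

```
# Sketch: crawl the proposed virtual folder tree; each file is a separate GET,
# and -c lets an interrupted download pick up where it left off, unlike a single streamed zip.
wget -r -c -np -nH --content-disposition "https://dataverse.example.edu/api/access/dataset/1234/files/"
```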
I would strongly support implementing this sometime soon (soon after v5.0 that is).