Implement access to the files in the dataset as a virtual folder tree #7084

Closed · landreev opened this issue Jul 14, 2020 · 35 comments · Fixed by #7579

@landreev
Contributor

This is based on a suggestion from a user (@mankoff) made earlier in the "optimize zip" issue (#6505). I believe something similar had also been proposed elsewhere earlier. I'm going to copy the relevant discussion from that issue and add it here.

I do not consider this a possible replacement for the "download multiple files as zip" functionality. Unfortunately, we're pretty much stuck supporting zip, since it has become the de facto standard for sharing multi-file, multi-folder bundles. But it could be something very useful to offer as another option.

The way it would work: there would be an API call (for example, /api/access/dataset/<id>/files) that exposes the files and folders in the dataset as a crawl-able tree of links, similar to how static files and directories are shown on simple web servers. A command line user could point a client - for example, wget - at it to crawl and save the entire tree, or a sub-folder thereof. The advantages of this method are huge: the end result is the same as downloading the entire dataset as a zip and unpacking the archive locally, in one step, but it is achieved in a dramatically better way - by wget issuing individual GET calls for the individual files. That means the downloads a) can be redirected to S3 and b) the whole process is completely resumable if it is interrupted, unlike the single continuous zip download, which cannot be resumed at all.
The advantages are not as dramatic for web UI users. None of the browsers I know of support drag-and-drop downloads of entire folders out of the box, although plugins that do this are available for the major browsers. Still, even clicking through the folders and being able to download the files directly (unlike in the current "tree view" on the page) would be pretty awesome. Again, see the discussion re-posted below for more information.
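As a rough sketch of the command-line usage described above (the hostname and dataset id are placeholders, and the endpoint is just the example name used in this comment, not a final API path):

    # Crawl the virtual folder tree and save every file locally; wget issues an
    # individual GET per file, so an interrupted download can simply be re-run.
    wget --recursive "https://dataverse.example.edu/api/access/dataset/1234/files"

The same command pointed at one of the sub-folder links from that listing would fetch just that part of the tree.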

I would strongly support implementing this sometime soon (soon after v5.0 that is).

@landreev
Contributor Author

From 6505:

From @mankoff:

Hello. I was sent here from #4529.

I'm curious why zipping is a requirement for bulk download. It has been a long time since I've admin'd a webserver, but if I recall many servers (e.g. Apache) perform on-the-fly compression for files that they transfer.

I'm imagining a solution where appending /download/ to any dataverse or dataset URL (where this feature is enabled) exposes the files within as a virtual folder structure. The advantages of this are:

  • No waiting for zipping, which could be a long wait if the dataset is 100s of GB. The download starts instantly
  • No zipping of files that cannot be compressed. For example, NetCDF files with internal compression. I believe Apache on-the-fly compression can be configured per filetype (MIME or extension), so some files would still be transferred as compressed, but not all (a quick client-side check for this is sketched at the end of this comment)
  • wget and other default tools (including GUI "DownThemAll" browser extension, for example) could be deployed against this URL, and would support filename filtering, inclusion, exclusion, etc. This offloads a whole bunch of functionality to the end-user download tool, rather than bloating Dataverse. If you zip, I promise there is or will be a feature request to "let me bulk download but filter on filename".

Just some thoughts about how I'd like to see bulk download exposed as an end-user.
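As an aside on the compression point above: whether a server actually compresses a particular file type on the fly can be checked from the client side with something like this (the URL is a placeholder; this works against any HTTP server):

    # Request gzip, print only the response headers, discard the body;
    # a "Content-Encoding: gzip" line means the transfer was compressed on the fly.
    curl -s --compressed -D - -o /dev/null "https://dataverse.example.edu/files/example.csv" | grep -i '^content-encoding'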

From @poikilotherm:

Independent of the pros and cons of ZIP files (such as with many small files), I really like the idea proposed above. The two approaches don't exclude each other, either, which makes it even more attractive.

It should be as simple as rendering a very simple HTML page, containing the links to the files. So this still allows for control of direct or indirect access to the data, even using things like secret download tokens.

Obviously the same goal of bulk download could be achieved via some script, too, but using normal system tools like curl and wget is an even lower barrier for scientists/end users than using the API.

From @landreev:

...
I actually like the idea; and would be interested in trying to schedule it for a near release. But I'm not sure this can actually completely replace the download-multiple-files-as-zip functionality.
OK, so adding "/download" to the dataset URL "exposes the files within as a virtual folder structure" - so, something that looks like your normal Apache directory listing? Again, I like the idea, but I'm not entirely sure about the next sentence:

No waiting for zipping, which could be a long wait if the dataset is 100s of GB. The download starts instantly

Strictly speaking, we are not "waiting for zipping" - we start streaming the zipped bytes right away, as soon as the first buffer becomes available. But I am with you in principle, that trying to compress the content is a waste of cpu cycles in many cases. It's the "download starts instantly" part that I have questions about. I mean, I don't see how it starts instantly, or how it starts at all. That ".../download" call merely exposed the directory structure - i.e. it produced some html with a bunch of links. It's still the client's job to issue the download requests for these links. I'm assuming what you mean is that the user can point something like wget at this virtual directory, and tell it to crawl it. And then the downloads will indeed "start instantly", and wget will handle all the individual downloads and replicate the folder structure locally, etc. I agree that this would save a command line user a whole lot of scripting that's currently needed - first listing the files in the dataset, parsing the output, issuing individual download requests for the files etc. But as for the web users - the fact that it would require a browser extension for a folder download to happen automatically makes me think we're not going to be able to use this as the only multiple file download method. (I would definitely prefer not to have this rely on a ton of custom client-side javascript, for crawling through the folders and issuing download requests either...)

(Or is it now possible to create HTML5 folders, that a user can actually drag-and-drop onto their system - an entire folder at once? Last time I checked it couldn't be done; but even if it can be done, I would expect it not to be universally supported in all browsers/on all OSes...)

My understanding of this is that even if this can be implemented as a viable solution, for both the API and web users, we'll still have to support the zipping mechanism. Of which, honestly, I'm not a big fan either. It's kind of a legacy thing - something we need to support, because that's the format most/many users want. As you said, the actual compression that's performed when we create that ZIP stream is most likely a waste of cycles. With large files specifically - most of those in our own production are either compressed already by the user, or are in a format that uses internal compression. So this means we are using ZIP not for compression - but simply as a format for packaging multiple files in a single archive stream. Which is kind of annoying, true. But I've been told that "users want ZIP".

But please don't get me wrong, I am interested in implementing this, even if it does not replace the ZIP download.

From @mankoff:

Hi - you're right, this does not start the download. I was assuming wget is pointed at that URL, and that starts the downloads.

As for browser-users, sometimes I forget that interaction mode, but you are right, they would still need a way to download multiple files, hence zipping. If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed. You are right, exposing a virtual folder for CLI users (or DownThemAll browser extension users) and bulk download are separate issues. Perhaps this virtual folder should be its own Issue here on GitHub to keep things separate.

From @mankoff:

I realize that if appending /download to the URL doesn't start the download as @landreev pointed out, that may not be the best URL. Perhaps /files would be better. In which case, appending /metadata could be a way for computers to fetch the equivalent of the metadata tab that users might click on, here again via a simpler mechanism than the API.

From @landreev:

I realize that if appending /download to the URL doesn't start the download ... that may not be the best URL. Perhaps /files would be better.

I like /files. Or /viewfiles? - something like that.
I also would like to point out that we don't want this option to start the download automatically, even if it were possible. Just like with zipped downloads, either via the API or the GUI, not everybody wants all the files. So we want the command line user to be able to look at the output of this /files call, and, for example, select a subfolder they want - and then tell wget to crawl it. Same with the web user.

From @landreev:

... If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed.

But I readily acknowledge that it's still bad and painful, even with streaming.
The very fact that we are relying on one long uninterrupted HTTP GET request to potentially download a huge amount of data is "painful". And the "uninterrupted" part is a must - because it cannot be resumed from a specific point if the connection dies (by nature of having to generate the zipped stream on the fly). There are other "bad" things about this process, some we have discussed already (spending CPU cycles compressing = potential waste); and some I haven't even mentioned yet... So yes, being able to offer an alternative would be great.

From @poikilotherm:

Just a side note: one might be tempted to create a WebDAV interface, which could be included in things like Nextcloud.

landreev changed the title from "Implement access to the files in the dataset as a pseudo folder" to "Implement access to the files in the dataset as a pseudo folder tree" on Jul 14, 2020
djbrooke modified the milestone: Dataverse 5 on Jul 14, 2020
landreev changed the title from "Implement access to the files in the dataset as a pseudo folder tree" to "Implement access to the files in the dataset as a virtual folder tree" on Jul 14, 2020
@mankoff
Contributor

mankoff commented Aug 11, 2020

Related to #7174 - the /files view could expose versions, like this:

├── 1.0
├── 1.1
├── 2.0
├── 2.1
├── 2.2
├── 2.3
├── 2.4
├── 3.0
├── 4.0
├── 5.0
└── latest

@mankoff
Contributor

mankoff commented Aug 11, 2020

More generally, it would be nice to always have access to the latest version of a file, even though the file DOI changes when the file updates. The behavior described here provides that feature. I'm not sure this is correct though, because that means doi:nn.nnnn/path/to/doi/for/v3/files/latest/, or doi:nn.nnnn/path/to/doi/for/v3/files/2.4/ will download versions that are not v3 (the actual DOI used in this example). Could be confusing...

@mankoff
Contributor

mankoff commented Oct 28, 2020

@djbrooke I see you added this to a "Needs Discussion" card. Is there any part of the discussion I can help with?

@poikilotherm
Contributor

Another use case that popped up today in a workshop: making such a structure available could help with integrating data in Dataverse with DataLad, https://github.com/datalad/datalad. DataLad is basically a wrapper around git-annex, which allows for special remotes.

DataLad is gaining traction especially in communities with big data needs like neuroimaging.
Cross-linking datalad/datalad#393 here.

@pdurbin
Member

pdurbin commented Nov 4, 2020

@poikilotherm we're friendly with the DataLad team. In https://chat.dataverse.org the DataLad PI is "yoh" and I've had the privilege of having tacos with him in Boston and beers with him in Brussels ( https://twitter.com/philipdurbin/status/1223987847222431744 ). I really enjoyed the talk they gave at FOSDEM 2020 and you can find a recording here: https://archive.fosdem.org/2020/schedule/event/open_research_datalad/ . Anyway, we're happy to integrate with DataLad in whatever way makes sense (whenever it makes sense).

@mankoff "Needs Discussion" makes more sense if you look at our project board, which goes from left to right: https://github.com/orgs/IQSS/projects/2

  • Community Dev
  • Needs Discussion
  • Up Next
  • IQSS Team - In Progress
  • Review
  • QA
  • Done

Basically, "Needs Discussion" means that the issue is not yet defined well enough to be estimated or to have a developer pick it up. As of this writing it looks like there are 39 of these issues, so you might need to be patient with us. 😄

@djbrooke
Contributor

djbrooke commented Nov 4, 2020

@mankoff @poikilotherm @pdurbin thanks for the discussion here. I'll prioritize the team discussing this as there seem to be a few use cases that could be supported here.

@scolapasta can you get this on the docket for next tech hours? (Or just discuss with @landreev if it's clear enough.)

Thanks all!

@landreev
Contributor Author

I agree that this should be ready to move into the "Up Next" column.
Whatever decisions may still need to be made, we should be able to resolve as we work on it.
The implementation should be straightforward enough. One big-ish question is whether there is already a good package we can use to render these crawl-able links, or whether we should just go ahead and implement it from scratch (since the whole point is to have simple, straight HTML links with no fancy UI features, the latter feels like a reasonable idea).

And I just want to emphasize my understanding of what we want to develop: this is not another UI implementation of a tree view of the files and folders (like the one we already have on the dataset page, but with download links). This is mostly not for human users, but for download clients (command line-based or browser extensions) to be able to crawl through the whole thing and download every file; hence it should output a simple HTML view of one folder at a time, with download links for files and recursive links to sub-folders. Again, similar to how files and directories on a filesystem look when exposed behind an httpd server.

@landreev
Contributor Author

@mankoff

Related to #7174 - the /files view could expose versions, like this:

├── 1.0
├── 1.1
├── 2.0
├── 2.1
├── 2.2
├── 2.3
├── 2.4
├── 3.0
├── 4.0
├── 5.0
└── latest

Thinking about this - I agree that this API should understand version numbers; maybe serve the latest version by default, or a selected version when specified. But I'm not sure about providing a top-level access point for multiple versions at the same time, like in your example above. My problem with that is that if you have a file that happens to be in all 10 versions, a crawler will proceed to download and save 10 different copies of that file if you point it at the top-level pseudo folder.

@mankoff
Contributor

mankoff commented Nov 17, 2020

I'm happy to hear this is moving toward implementation. I agree with your understanding of the features, functions, and purpose of this. This is also what you wrote when you opened the ticket.

I was just going to repeat my 'version' comment when you posted your 2nd comment above. Yes to latest by default, with perhaps some method to access earlier versions (could be a different URL you find from the GUI, not necessarily sub-folders under the default URL for this feature).

Use case: I have a dataset A that updates every 12 days via the API. I am working on another project B that is doing something every day, and I always want the latest A. It would be good if the code in B is 1 line (wget to the latest URL, download if the server version is newer than the local version). It would not be as good if B needed to include a complicated function to access the A dataset, parse something to get the URL for the latest version, and then download that.
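The "download if server version newer than local version" part maps naturally onto wget's timestamping mode; a hypothetical one-liner for project B (the URL is a placeholder for whatever the "latest" access point ends up being):

    # -N / --timestamping: only download if the remote copy is newer than the local one
    wget -N "https://dataverse.example.edu/api/access/dataset/1234/files/latest/data.nc"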

@poikilotherm
Contributor

poikilotherm commented Nov 18, 2020

Just a quick thought: what about making it WebDAV compatible? It could be integrated into Nextcloud/ownCloud this way (read-only for now).

@djbrooke
Contributor

  • Are these files or datasets that we're showing here? (Versioning may be an issue - there may be some files that are not available in all versions.) One proposal is for this to work for a specific version.
  • Should this tree cover aux files and metadata files? It would be good to have a canonical URI for this and for Bag files. Consider a binary switch, similar to how we handle "download all" on the dataset page. We should name/structure the API with this in mind.
  • Not a GUI, but an API that provides this information - the file/folder layout of resources - a REST API that provides HTML (e.g. for wget or another crawler).

@mankoff
Contributor

mankoff commented Nov 18, 2020

  • The most useful minimal implementation is the latest files in a dataset: http://doi/view/latest exposes a simple wget-friendly view of all files and folders. Note that view is open for discussion - could be files or list or download or something else. Versioning would only show the files in that version, so http://doi/view/4.0 might show different files and folders.

  • Aux and metadata? I guess. I notice when I download a dataset I get MANIFEST.TXT even though I didn't ask for it. I'm not sure what happens if the dataset contains a real file called MANIFEST.TXT. But there could be a virtual folder of aux and metadata too.

  • I'm not sure what your 3rd point means. But the point of this feature is not the GUI. It's a way to make bulk download easy and accessible to the most common tools and user experience - "similar to how files and directories on a filesystem look when exposed behind an httpd server."

@djbrooke
Contributor

Thanks @mankoff, I think we're all set, I was just capturing some discussion from the sprint planning meeting this afternoon. We'll start working on this soon, and I mentioned that we may run some iterations by you as we build it out.

@mankoff
Contributor

mankoff commented Nov 19, 2020

Thinking about the behaviors requested here after reading and commenting on #7425, I see a problem.

The original request was to allow easy file access for a dataset, so doi:nnnn/files exposes the files for wget or a similar access method.

The request grew to add a logical feature to support easy access to the latest version of the files. "Easy" here presumably means via a fixed URL. But DOIs point to a specific version, so it is counter-intuitive for doi:nnn_v1/latest to point to something that is not v1.

I note that Zenodo provides a DOI that always points to the latest version, with clear links back to the DOIs of earlier versions. Would this behavior be a major architecture change for Dataverse?

Or if you go to doi:nnnn/latest does it automatically redirect to a different doi, unless nnnn is the latest? I'm not sure if this is a reasonable behavior or not.

Anyway, perhaps "easy URL access to folders" and "fixed URL access to latest" should be treated as two distinct features to implement and test, although there is a connection between the two and the latter should make use of the former.

@qqmyers
Member

qqmyers commented Nov 19, 2020

How would a /{dataset doi}/{dataset version or :latest}/{file path} URI work? That would allow a stable URI for the file of a given path/name in the latest dataset version. If files are being replaced by files with different names this wouldn't work, but it would avoid trying to have both the dataset and file versioning schemes represented in the API.

@mankoff
Contributor

mankoff commented Nov 19, 2020

If files are deleted or renamed, then a 404 or similar error seems fine.

Note that this ticket is about exposing files and folders in a simple view, so if you use this feature to link to the latest version of a dataset (not a file within the dataset), then everything "just works", because whatever files exist in the latest version would be in that folder, by definition.

Here are some use-cases:

  • A dataset with files with fixed names that are updated every day (e.g 10 updated CSV and NetCDF files).
  • A dataset with a new file YYYY-MM-DD.tif added every day

How can we easily share this dataset with colleagues (and computers) so they always get the latest data? From your suggestions above, /{dataset doi}/{dataset version} won't find the latest, but could expose the files in {dataset version} in a virtual folder. The URL with :latest/{file path} won't work because the files for tomorrow don't exist in the 2nd example, where files get added every day. The URL {dataset doi}/view/latest could expose the latest version in a simple virtual folder, but may confuse people because of the DV vs. Zenodo architecture decision, where {dataset doi} is not meant to point to the latest version.

@qqmyers
Member

qqmyers commented Nov 19, 2020

Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new dates would appear there. Does that support your two cases OK? (In essence the doi + :latest serves as the ID for the latest version versus there being a separate DOI for that.)
The concern I raised was that, because I could replace file1.csv with filen.csv in later versions, a scheme using the file path/name won't expose the relationship between file1.csv and filen.csv, but if the names are the same, or the names include dates that can be used to get the file you want, knowing Dataverse's internal relationship between those files may not be important. (Conversely - what happens if someone deleted file1 in dataverse and added a new file1, versus using the replace function? Should the API not provide a single URI that would get you to the latest?)

@mankoff
Contributor

mankoff commented Nov 19, 2020

Please recall the opening description by @landreev, "similar to how static files and directories are shown on simple web servers." Picture this API allowing browsing of a simple folder. This may help answer some of the questions below. If a file is deleted from a folder, it is no longer there. If a file is renamed or replaced, the latest view should be clearly defined based on our shared OS experience of browsing folders containing files (Mac, Windows, Linux - not VAX or the Dropbox web view).

Another option that may simplify implementation: :latest is only valid for a dataset, not a file. Recall again that we're talking about two things in this ticket: 1) :latest, and 2) :view, providing the virtual folder. If :latest is limited to datasets and not files, then combining it with :view provides access to the files within the latest dataset.

Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new dates would appear there. Does that support your two cases OK? (In essence the doi + :latest serves as the ID for the latest version versus there being a separate DOI for that.)

Yes this works for both use cases.

I still point out that 10.5072/ABCDEF is (in theory) the DOI for v1, so having it also point to the latest because of a few additional characters (i.e., :latest) could be confusing. But I think that is a requirement, given the architecture decision that there is no minted DOI that always points to the latest (as Zenodo has).

Furthermore if /10.5072/ABCDEF/:latest/ is generalized to support :v1, :v2, etc. in addition to :latest, then any DOI for any version within a dataset can be used to access any other version. For my daily updating data, after a year I have 365 DOIs, each of which can be used to access all 365 versions.

The concern I raised was that, because I could replace file1.csv with filen.csv in later versions, a scheme using the file path/name won't expose the relationship between file1.csv and filen.csv, but if the names are the same, or the names include dates that can be used to get the file you want, knowing Dataverse's internal relationship between those files may not be important.

I personally am not concerned by this. The relationship is still available for people to see in the GUI "Versions" tab.

(Conversely - what happens if someone deleted file1 in dataverse and added a new file1, versus using the replace function? Should the API not provide a single URI that would get you to the latest?)

Hmmm. Ugh :). So I see the following choices:

File is deleted and not in the latest version:

  • API for ":latest" can point to the latest available
  • API for ":latest" can return error

File is replaced and in the latest version:

  • API points to latest

File is deleted, then added, and exists in the latest version:

  • API can point to the latest available

  • API can return error: ambiguous file

  • API for ":latest" can look at the DOI used, and trace it downstream. If the DOI was for the earlier version that got deleted, then return the latest file before deletion. If the DOI was for an intermediate version where it did not exist, return error. If the DOI was for a later version after it was added, trace it downstream and return the latest one.

This seems overly complicated and I'd vote for "just return the latest".

@landreev
Contributor Author

landreev commented Feb 2, 2021

@mankoff and anyone else who may be interested, the current implementation in my branch works as follows:

I called the new crawlable file access API "fileaccess":
/api/datasets/{dataset}/versions/{version}/fileaccess
(So the name/syntax follows the existing API /api/datasets/{dataset}/versions/{version}/files, that shows the metadata for the files in a given version. I'm open to naming it something else; I'm considering "folderview", and maybe the version number should be passed as a query parameter instead).
The optional query parameter ?folder=<foldername> specifies the subfolder to list.
For the {dataset} id both the numeric and :persistentId notation are supported, like in other similar APIs.

The API outputs a simple HTML listing (I made it look like the standard Apache directory index), with Access API download links for individual files and recursive calls to the API above for sub-folders.

I think it's easier to use an example, and pictures:

Let's say we have a dataset version with 2 files, one of them with the folder named "subfolder" specified:

[screenshot: files view on the dataset page]

or, as viewed as a tree on the dataset page:
[screenshot: tree view on the dataset page]

The output of the fileaccess API for the top-level folder (/api/datasets/NNN/versions/MM/fileaccess) will be as follows:

[screenshot: directory index of the top-level folder]

with the underlying html source:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <html><head><title>Index of folder /</title></head>
    <body><h1>Index of folder / in dataset doi:XXX/YY/ZZZZ</h1>
    <table>
    <tr><th>Name</th><th>Last Modified</th><th>Size</th><th>Description</th></tr>
    <tr><th colspan="4"><hr></th></tr>
    <tr><td><a href="/api/datasets/NNNN/versions/MM/fileaccess?folder=subfolder">subfolder/</a></td><td align="right"> - </td><td align="right"> - </td><td align="right">&nbsp;</td></tr>
    <tr><td><a href="/api/access/datafile/KKKK">testfile.txt</a></td><td align="right">13-January-2021 22:35</td><td align="right">19 B</td><td align="right">&nbsp;</td></tr>
    </table></body></html>

And if you follow the ../fileaccess?folder=subfolder link above it will produce the following view:

[screenshot: directory index of the subfolder]

with the html source as follows:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
    <html><head><title>Index of folder /subfolder</title></head>
    <body><h1>Index of folder /subfolder in dataset doi:XXX/YY/ZZZZ</h1>
    <table>
    <tr><th>Name</th><th>Last Modified</th><th>Size</th><th>Description</th></tr>
    <tr><th colspan="4"><hr></th></tr>
    <tr><td><a href="/api/access/datafile/subfolder/LLLL">50by1000.tab</a></td><td align="right">11-January-2021 09:31</td><td align="right">102.5 KB</td><td align="right">&nbsp;</td></tr>
    </table></body></html>

Note that I'm solving the problem of having wget --recursive preserve the folder structure when saving files by embedding the folder name in the file access API URL: /api/access/datafile/subfolder/LLLL, instead of the normal /api/access/datafile/LLLL notation.
Yes, this is perfectly legal! You can embed an arbitrary number of slashes into a path parameter, by using a regex in the @Path notation:

@Path("datafile/{fileId:.+}")

The wget command line for crawling this API is NOT pretty, but this is what I've come up with so far that actually works:

wget --recursive -nH --cut-dirs=3 --content-disposition http://localhost:8080/api/datasets/NNNN/versions/1.0/fileaccess

Any feedback - comments or suggestions - is welcome.

@mankoff
Contributor

mankoff commented Feb 2, 2021

This looks good at first pass. I did not know of the --content-disposition flag for wget. What happens if that is left off? Are the files not named correctly? The rest of the wget command looks about as normal as most times that I use it...

One concern is the version part of the URL. Is there a way to easily always get the latest version? Either if version is left off the URL, or if it is set to latest rather than a number?

@landreev
Contributor Author

landreev commented Feb 2, 2021

I did not know of the --content-disposition flag for wget. What happens if that is left off? Are the files not named correctly? The rest of the wget command looks about as normal as most times that I use it...

Correct, without the "--content-disposition" flag wget will download http://host/api/access/datafile/1234 and save it as 1234. With this flag wget will use the real filename that we supply in the "Content-Disposition:" header. (browsers do this automatically, so this header is the reason a browser offers to save a file downloaded from our dataset page under its user-friendly name).
It is, unfortunately, impossible to use that header to supply a folder name as well. If you try something like Content-disposition: attachment; filename="folder/subfolder/testfile.txt" the "folder/subfolder" part is ignored, and the file is still saved as "testfile.txt".
So I rely on this header, plus embedding the folder name into the access URL, plus --cut-dirs=3, to download /api/access/datafile/folder/subfolder/1234 and have it saved as folder/subfolder/testfile.txt.
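To spell out how the pieces fit together, here is the earlier command again with the role of each flag noted (NNNN, LLLL and the hostname are the same placeholders as in the example above):

    # --recursive            crawl the index pages and follow the links in them
    # -nH                    don't create a local directory named after the host
    # --cut-dirs=3           strip the three leading path components (api/access/datafile/),
    #                        so /api/access/datafile/subfolder/LLLL is saved under subfolder/
    # --content-disposition  name each saved file from the Content-Disposition header
    #                        (e.g. testfile.txt instead of its numeric database id)
    wget --recursive -nH --cut-dirs=3 --content-disposition \
      "http://localhost:8080/api/datasets/NNNN/versions/1.0/fileaccess"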

@landreev
Contributor Author

landreev commented Feb 2, 2021

One concern is the version part of the URL. Is there a way to easily always get the latest version? Either if version is left off the URL, or if it is set to latest rather than a number?

It understands our standard version id notations like :draft, :latest and :latest-published.
But yes, I am indeed considering dropping the version from the path. So it would be
/api/datasets/{datasetid}/fileaccess
defaulting to the latest version available; with the optional ?version={version} query parameter for requesting a different version.

@landreev
Contributor Author

landreev commented Feb 6, 2021

@mankoff
Hi, a quick followup to the comments above: I ended up dropping the version parameter from the path.
I also renamed the API. It is now called "dirindex" - to emphasize that it presents the dataset in a way that resembles the Apache Directory Index format.

So the API path is now

/api/datasets/{dataset}/dirindex

It defaults to the latest version. An optional parameter ?version={version} can be used to specify a different version.
This is all documented in the API guide as part of the pull request linked above.
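A minimal usage sketch of the renamed API (placeholder hostname and dataset id; the API guide in the pull request is the authoritative reference):

    # latest available version (the default):
    wget --recursive -nH --cut-dirs=3 --content-disposition \
      "https://dataverse.example.edu/api/datasets/NNNN/dirindex"

    # a specific version, via the optional query parameter:
    wget --recursive -nH --cut-dirs=3 --content-disposition \
      "https://dataverse.example.edu/api/datasets/NNNN/dirindex?version=1.0"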

@mankoff
Contributor

mankoff commented Feb 24, 2021

Hello. If my institution upgrades their Dataverse, will we receive this feature? Or is it slated for some future release and not included in the latest version installed when updating?

@djbrooke
Contributor

Hi @mankoff, this will be included in the next release, 5.4. I added the 5.4 tag to the PR:

#7579

Once 5.4 shows up in https://github.com/IQSS/dataverse/releases you'll be able to install and use the release with this feature. We expect this in the next few weeks - we're just waiting on a few more issues to finish up.

@mankoff
Contributor

mankoff commented Apr 9, 2021

Hello. I see that demo.dataverse.org is now at v5.4, so I'd like to test this.

I'm reading the docs here https://guides.dataverse.org/en/latest/api/native-api.html?highlight=dirindex#view-dataset-files-and-folders-as-a-directory-index

And it seems to only work with the dataset ID. If I'm an end-user, how do I find the ID? Is there a way to browse the dirindex using the DOI? Can you provide an example with this demo data set? https://demo.dataverse.org/dataset.xhtml?persistentId=doi:10.70122/FK2/MV0TMN

@mankoff
Contributor

mankoff commented Apr 9, 2021

Also, regarding point #5 from #7084 (comment), this API does not allow browsing. When I go to https://demo.dataverse.org/api/datasets/24/dirindex I'm given a ".index" file to download in Firefox, not something that I can view in my browser. This also means (I think?) that browser tools I hoped would use this feature, like DownThemAll, probably won't work.

@poikilotherm
Contributor

poikilotherm commented Apr 9, 2021

@mankoff
Contributor

mankoff commented Apr 9, 2021

Well, that tells me how to use this with the DOI rather than the ID. I suggest making this option clear in the API docs. I'll add an issue for that.
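For anyone else landing here: the persistent-identifier form follows the usual Dataverse API convention, so a sketch against the demo dataset linked above would look roughly like this (the exact syntax is documented in the API guide):

    # :persistentId in the path, with the DOI passed as a query parameter
    wget --recursive -nH --cut-dirs=3 --content-disposition \
      "https://demo.dataverse.org/api/datasets/:persistentId/dirindex?persistentId=doi:10.70122/FK2/MV0TMN"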
