Implement access to the files in the dataset as a virtual folder tree #7084
Comments
From #6505:

From @mankoff: Hello. I was sent here from #4529. I'm curious why zipping is a requirement for bulk download. It has been a long time since I've admin'd a webserver, but if I recall correctly, many servers (e.g. Apache) perform on-the-fly compression for files that they transfer. I'm imagining a solution where appending … Just some thoughts about how I'd like to see bulk download exposed as an end user.

From @poikilotherm: Independent of the pros and cons of ZIP files (like for many small files), I really like the idea proposed above. The two approaches aren't mutually exclusive either, which makes it even more attractive. It should be as simple as rendering a very simple HTML page containing the links to the files. So this still allows for control of direct or indirect access to the data, even using things like secret download tokens. Obviously the same goal of bulk download could be achieved via some script, too, but using normal system tools like curl and wget is an even lower barrier for scientists/end users than using the API.

From @landreev: Strictly speaking, we are not "waiting for zipping" - we start streaming the zipped bytes right away, as soon as the first buffer becomes available. But I am with you in principle: trying to compress the content is a waste of CPU cycles in many cases. It's the "download starts instantly" part that I have questions about. I mean, I don't see how it starts instantly, or how it starts at all. That ".../download" call merely exposed the directory structure - i.e. it produced some HTML with a bunch of links. It's still the client's job to issue the download requests for these links. I'm assuming what you mean is that the user can point something like wget at this virtual directory and tell it to crawl it. And then the downloads will indeed "start instantly", and wget will handle all the individual downloads and replicate the folder structure locally, etc. I agree that this would save a command-line user a whole lot of scripting that's currently needed - first listing the files in the dataset, parsing the output, issuing individual download requests for the files, etc. But as for the web users - the fact that it would require a browser extension for a folder download to happen automatically makes me think we're not going to be able to use this as the only multiple-file download method. (I would definitely prefer not to have this rely on a ton of custom client-side JavaScript for crawling through the folders and issuing download requests either...) (Or is it now possible to create HTML5 folders that a user can actually drag-and-drop onto their system - an entire folder at once? Last time I checked it couldn't be done; but even if it can be done, I would expect it not to be universally supported in all browsers/on all OSes...) My understanding is that even if this can be implemented as a viable solution, for both the API and web users, we'll still have to support the zipping mechanism. Of which, honestly, I'm not a big fan either. It's kind of a legacy thing - something we need to support, because that's the format most/many users want. As you said, the actual compression that's performed when we create that ZIP stream is most likely a waste of cycles. With large files specifically - most of those in our own production are either compressed already by the user, or are in a format that uses internal compression. So this means we are using ZIP not for compression, but simply as a format for packaging multiple files in a single archive stream. Which is kind of annoying, true. But I've been told that "users want ZIP". But please don't get me wrong, I am interested in implementing this, even if it does not replace the ZIP download.

From @mankoff: Hi - you're right, this does not start the download. I was assuming … As for browser users, sometimes I forget that interaction mode, but you are right, they would still need a way to download multiple files, hence zipping. If zipping is on-the-fly streaming and you don't have to wait for n files n GB in size to all get zipped, then it isn't as painful/bad as I assumed. You are right, exposing a virtual folder for CLI users (or DownThemAll browser-extension users) and bulk download are separate issues. Perhaps this virtual folder should be its own issue here on GitHub to keep things separate.

From @mankoff: I realize that if appending …

From @landreev: I like …

From @landreev: But I readily acknowledge that it's still bad and painful, even with streaming.

From @poikilotherm: Just a side note: one might be tempted to create a WebDAV interface, which could be included in things like Nextcloud.
Related to #7174 - the …
More generally, it would be nice to always have access to the latest version of a file, even though the file DOI changes when the file updates. The behavior described here provides that feature. I'm not sure this is correct though, because that means …
@djbrooke I see you added this to a "Needs Discussion" card. Is there any part of the discussion I can help with?
Another use case that popped up today from a workshop: making such a structure available could help with integrating data in Dataverse with DataLad, https://github.com/datalad/datalad. DataLad is basically a wrapper around git and git-annex. DataLad is gaining traction especially in communities with big data needs, like neuroimaging.
@poikilotherm we're friendly with the DataLad team. In https://chat.dataverse.org the DataLad PI is "yoh" and I've had the privilege of having tacos with him in Boston and beers with him in Brussels ( https://twitter.com/philipdurbin/status/1223987847222431744 ). I really enjoyed the talk they gave at FOSDEM 2020 and you can find a recording here: https://archive.fosdem.org/2020/schedule/event/open_research_datalad/ . Anyway, we're happy to integrate with DataLad in whatever way makes sense (whenever it makes sense). @mankoff "Needs Discussion" makes more sense if you look at our project board, which goes from left to right: https://github.com/orgs/IQSS/projects/2
Basically, "Needs Discussion" means that the issue is not yet defined well enough to be estimated or to have a developer pick it up. As of this writing it looks like there are 39 of these issues, so you might need to be patient with us. 😄
@mankoff @poikilotherm @pdurbin thanks for the discussion here. I'll prioritize the team discussing this as there seem to be a few use cases that could be supported here. @scolapasta can you get this on the docket for next tech hours? (Or just discuss with @landreev if it's clear enough.) Thanks all!
I agree that this should be ready to move into the "Up Next" column. And I just want to emphasize that this is my understanding of what we want to develop: this is not another UI implementation of a tree view of the files and folders (like we already have on the dataset page, but with download links). This is not for human users (mostly), but for download clients (command-line based or browser extensions) to be able to crawl through the whole thing and download every file; hence this should output a simple HTML view of one folder at a time, with download links for files and recursive links to sub-folders. Again, similar to how files and directories on a filesystem look when exposed behind an httpd server.
Thinking about this - I agree that this API should understand version numbers; maybe serve the latest version by default, or a selected version when specified. But I'm not sure about providing a top-level access point for multiple versions at the same time, like in your example above. My problem with that is that if you have a file that happens to be in all 10 versions, a crawler will proceed to download and save 10 different copies of that file if you point it at the top-level pseudo folder.
I'm happy to hear this is moving toward implementation. I agree with your understanding of the features, functions, and purpose of this. This is also what you wrote when you opened the ticket. I was just going to repeat my 'version' comment when you posted your 2nd comment above. Yes to latest by default, with perhaps some method to access earlier versions (could be a different URL you find from the GUI, not necessarily as sub-folders under the default URL for this feature). Use case: I have a dataset A that updates every 12 days via the API. I am working on another project B that is doing something every day, and I always want the latest A. It would be good if the code in B is 1 line (wget to the latest URL, download if the server version is newer than the local version). It would not be as good if B needed to include a complicated function to access the A dataset, parse something to get the URL for the latest version, and then download that.
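For illustration only, a hypothetical sketch of the one-liner this use case asks for. The URL is invented (no stable "latest" endpoint existed at this point in the discussion), and wget's -N timestamping only skips unchanged files if the server sends usable Last-Modified headers:

```
# Hypothetical: re-fetch files from dataset A only when the server copy is newer than the local one.
wget -r -N -nH "https://dataverse.example.edu/latest-url-for-dataset-A/"
```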
Just a quick thought: what about making it WebDAV compatible? It could be integrated into Nextcloud/ownCloud this way (read-only for now).
Thanks @mankoff, I think we're all set; I was just capturing some discussion from the sprint planning meeting this afternoon. We'll start working on this soon, and I mentioned that we may run some iterations by you as we build it out.
Thinking about the behaviors requested here after reading and commenting on #7425, I see a problem. The original request was to allow easy file access for a dataset, so doi:nnnn/files exposes the files for that dataset. The request grew to add a logical feature to support easy access to the latest version of the files. "Easy" here presumably means via a fixed URL. But DOIs point to a specific version, so it is counter-intuitive for doi:nnn_v1/latest to point to something that is not v1.

I note that Zenodo provides a DOI that always points to the latest version, with clear links back to the earlier DOI'd versions. Would this behavior be a major architecture change for Dataverse? Or if you go to doi:nnnn/latest, does it automatically redirect to a different DOI, unless nnnn is the latest? I'm not sure if this is a reasonable behavior or not.

Anyway, perhaps "easy URL access to folders" and "fixed URL access to latest" should be treated as two distinct features to implement and test, although there is a connection between the two and the latter should make use of the former.
How would a …
If files are deleted or renamed, then a 404 or similar error seems fine. Note that this ticket is about exposing files and folders in a simple view, so if you use this feature to link to the latest version of a dataset (not a file within the dataset), then everything "just works", because whatever files exist in the latest version would be in that folder, by definition. Here are some use-cases:
How can we easily share this dataset with colleagues (and computers) so they always get the latest data? From your suggestions above, …
Not sure I follow. /10.5072/ABCDEF/:latest/file1.csv would always be the latest file with that name, and /10.5072/ABCDEF/:latest/ would always be a point where you'd get the latest dataset version's list of files, so new files with new data would appear there. Does that support your two cases OK? (In essence, the DOI + :latest serves as the ID for the latest version, versus there being a separate DOI for that.)
Please recall the opening description by @landreev, "similar to how static files and directories are shown on simple web servers." Picture this API as letting you browse a simple folder. This may help answer some of the questions below. If a file is deleted from a folder, it is no longer there. If a file is renamed or replaced, the latest view should be clearly defined based on our shared common OS experiences of browsing folders containing files (Mac, Windows, Linux - not VAX or the Dropbox web-view behavior). Another option that may simplify implementation: the …
Yes, this works for both use cases. I still point out that … Furthermore, if …
I personally am not concerned by this. The relationship is still available for people to see in the GUI "Versions" tab.
Hmmm. Ugh :). So I see the following choices:

File is deleted and not in the latest version: …

File is replaced and in the latest version: …

File is deleted, then added, and exists in the latest version: …

This seems overly complicated and I'd vote for "just return the latest".
@mankoff and anyone else who may be interested, the current implementation in my branch works as follows. I called the new crawlable file access API "fileaccess". The API outputs a simple HTML listing (I made it look like the standard Apache directory index), with Access API download links for the individual files.

I think it's easier to use an example, and pictures. Let's say we have a dataset version with 2 files, one of them with the folder name "subfolder" specified: … or, as viewed as a tree on the dataset page: … The output of the fileaccess API for the top-level folder (…) looks like this: … with the underlying HTML source: …

And if you follow the … link for the sub-folder: … with the HTML source as follows: …
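The screenshots and HTML source referenced above did not survive in this copy of the thread. Purely as an illustration of the Apache-style listing being described - not the actual markup - the two levels might be fetched and look roughly like this; the /api/datasets/<id>/fileaccess path and the folder parameter are assumptions based on the name given above and on the dirindex URLs that appear in later comments:

```
# Top-level folder: one file plus a crawlable link to "subfolder" (output shown as comments, purely illustrative).
curl "https://demo.dataverse.org/api/datasets/24/fileaccess/"
#   Index of folder / in dataset doi:XX/YYYY
#   file1.csv    -> /api/access/datafile/101
#   subfolder/   -> /api/datasets/24/fileaccess/?folder=subfolder

# The sub-folder listing, reached by following the recursive link above:
curl "https://demo.dataverse.org/api/datasets/24/fileaccess/?folder=subfolder"
#   Index of folder /subfolder
#   file2.txt    -> /api/access/datafile/102
```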
Note that I'm solving the problem of having …
The wget command line for crawling this API is NOT pretty, but it's what I've come up with so far that actually works:
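The actual command from this comment was lost in this copy of the thread. As a rough sketch of the kind of invocation being described - the --content-disposition flag is confirmed by the follow-up comments, while the exact URL, the --cut-dirs depth, and the robots handling are assumptions (and the path shown uses the dirindex name the API was later given, rather than the earlier "fileaccess" name):

```
# Sketch only: crawl the virtual folder tree recursively and save files under their real names.
wget -r -e robots=off -nH --cut-dirs=3 --content-disposition "https://demo.dataverse.org/api/datasets/24/dirindex/"
```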
Any feedback - comments/suggestions - is welcome.
This looks good at first pass. I did not know of the … One concern is the …
Correct, without the "--content-disposition" flag wget will download the files under names derived from the download URLs rather than their real filenames …
It understands our standard version id notations like …
… (version number added to the dir. index) #7084
@mankoff So the API path is now …; it defaults to the latest. An optional parameter … can be used to request a specific version.
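For reference, a sketch of how the finished endpoint is typically called. The /api/datasets/<id>/dirindex path is taken from the URLs quoted in the comments below; the exact syntax of the version parameter is an assumption based on the standard Dataverse version notation mentioned above:

```
# Directory index of the latest version (the default):
curl "https://demo.dataverse.org/api/datasets/24/dirindex/"

# Directory index of a specific version, assuming the standard version notation is accepted:
curl "https://demo.dataverse.org/api/datasets/24/dirindex/?version=1.0"
```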
Hello. If my institution upgrades their Dataverse, will we receive this feature? Or is it implemented in some future release and not included in the latest version installed when updating?
Hi @mankoff, this will be included in the next release, 5.4. I added the 5.4 tag to the PR. Once 5.4 shows up in https://github.com/IQSS/dataverse/releases you'll be able to install and use the release with this feature. We expect this in the next few weeks - we're just waiting on a few more issues to finish up.
Hello. I see that demo.dataverse.org is now at v5.4, so I'd like to test this. I'm reading the docs here: https://guides.dataverse.org/en/latest/api/native-api.html?highlight=dirindex#view-dataset-files-and-folders-as-a-directory-index and it seems to work only with the dataset ID. If I'm an end user, how do I find the ID? Is there a way to browse the …
Also, regarding point #5 from #7084 (comment), this API does not allow browsing. When I go to https://demo.dataverse.org/api/datasets/24/dirindex I'm given an ".index" file to download in Firefox, not something that I can view in my browser. This also means (I think?) that browser tools that I hoped would use this feature, like DownThemAll, probably won't work.
I'm also seeing an ".index.html" download with https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/HXJVJU or https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/PDRSIQ. The file contains the expected HTML page.
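As a quick way to check the output without the file-download step, fetching the same URL with curl prints the HTML listing straight to the terminal:

```
curl "https://demo.dataverse.org/api/datasets/:persistentId/dirindex/?persistentId=doi:10.70122/FK2/HXJVJU"
```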
Well, that tells me how to use this with a DOI rather than an ID. I suggest making this option clear in the API docs. I'll add an issue for that.
This is based on a suggestion from a user (@mankoff) made earlier in the "optimize zip" issue (#6505). I believe something similar had also been proposed elsewhere earlier. I'm going to copy the relevant discussion from that issue and add it here.
I do not consider this a possible replacement for the "download multiple files as zip" functionality. Unfortunately, we're pretty much stuck supporting zip, since it has become the de facto standard for sharing multi-file and folder bundles. But it could be something very useful to offer as another option.
The way it would work, there would be an API call (for example, /api/access/dataset/<id>/files) that would expose the files and folders in the dataset as a crawlable tree of links, similar to how static files and directories are shown on simple web servers. A command-line user could point a client - for example, wget - at it to crawl and save the entire tree, or a sub-folder thereof. The advantages of this method are huge: the end result is the same as downloading the entire dataset as a zip and unpacking the archive locally, in one step. But it's achieved in a dramatically better way - by wget issuing individual GET calls for the individual files; meaning that those a) can be redirected to S3 and b) the whole process is completely resumable if it is interrupted, unlike the single continuous zip download, which cannot be resumed at all.

The advantages are not as dramatic for the web UI users. None of the browsers I know of support drag-and-drop downloads of entire folders out of the box. However, plugins that do that are available for the major browsers. Still, even clicking through the folders and being able to download the files directly (unlike in the current "tree view" on the page) would be pretty awesome. Again, see the discussion re-posted at the top of this thread for more information.
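A minimal sketch of the workflow this paragraph describes, using the example path proposed here (not necessarily the path that eventually shipped); the wget flags and the dataset id are assumptions for illustration:

```
# Sketch: crawl the proposed virtual folder tree; each file is a separate GET,
# and -c lets an interrupted download pick up where it left off, unlike a single streamed zip.
wget -r -c -np -nH --content-disposition "https://dataverse.example.edu/api/access/dataset/1234/files/"
```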
I would strongly support implementing this sometime soon (soon after v5.0 that is).