Data Access API: Download by Dataset #4529
@SamiSousa thanks for the suggestion. For a little more context, see our conversation in IRC: http://irclog.iq.harvard.edu/dataverse/2018-03-20#i_64759
@SamiSousa it was nice meeting you this afternoon at BU! I couldn't remember if you had opened any issues but found this one. Like I was saying, with 800+ issues I sometimes reach out to the person who originally opened the issue to see if they're still interested. I got the impression that you may or may not be interested in this issue long term, after your class is over, which is totally fine. We did discuss this issue during backlog grooming in late March, but our primary concern is the performance implications, even though the code to address it is probably straightforward. Also, is this issue on topic enough to link your team's video from here? I ask because I just checked my email and didn't get your message yet. We have pretty aggressive spam filtering enabled, and I get a summary email at the end of the day that may give me the opportunity to see your email and have it delivered to my inbox. Thanks!
@SamiSousa never mind! I clicked "release and allow sender" in the antispam tool... and now I have a link to your video and code:
Great meeting you too Phil! In the project, we ended up using the Search and Data Access APIs to list files and download individual files, so this specific feature isn't a high priority request from me. Hope this helps!
@SamiSousa no problem. Thanks for clarifying. I just sent a message about your project to the Dataverse community at https://groups.google.com/d/msg/dataverse-community/P4llZSssZ2Q/zvhGltLpAQAJ and you are welcome to make sure I didn't misrepresent your project at all. The video is really interesting! Thanks for sharing!
@SamiSousa questions are coming in already! Please see dataverse-broker/dataverse-broker#46. Thanks!
I guess I'll vote to close this issue if we have no intention of supporting this. |
I just spoke with @jggautier about this in the context of https://help.hmdc.harvard.edu/Ticket/Display.html?id=276556 and was telling him that I do think we should implement the ability to download all the files in a dataset, based on the DOI or Handle of that dataset, via API from a script (with a config option to turn off this feature for installations that don't want it). The current workaround for figuring out from a browser which file IDs to pass to the Data Access API is to use dev tools (inspect element) and copy the curl command; both Firefox and Chrome support this. Then, once you have the crazy long URL (lots of extra junk in there), you can trim it down to the part you actually need.
The part that matters is https://demo.dataverse.org/api/access/datafiles/307909,307910,307908?gbrecs=true and this is documented at http://guides.dataverse.org/en/4.14/api/dataaccess.html#multiple-file-bundle-download |
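For reference, a minimal sketch of what the trimmed-down request could look like from the command line. The file IDs and the gbrecs parameter are copied from the example URL above; you'd substitute the IDs of the files you actually want:

```bash
# Download the listed files from demo.dataverse.org as a single zip bundle.
# The IDs 307909,307910,307908 are the ones from the example above; replace
# them with your own. The bundle is saved as files.zip.
curl -L -o files.zip \
  "https://demo.dataverse.org/api/access/datafiles/307909,307910,307908?gbrecs=true"
```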
I would like to vote for this feature. It is critical to be able to do bulk downloads via wget or some other computer-to-computer solution. This is the beauty of the classic FTP folder full of files. It would be nice to be able to point any script at any dataverse DOI with /download appended to the end and know that it will fetch everything within. This could be turned off for dataverses, on for datasets, and configurable by the site admin and each dataverse and dataset admin. Alternatively, the dataverse site could auto-generate a script (bash? Python?) for each dataset to download all the data contained in that dataset. The National Snow and Ice Data Center (NSIDC) takes this approach.
I recognize that the reason this issue is often closed is "server load issues". The advantage of generating a script the user can run is that the script could throttle the download. You'd need to trust users not to remove that part of the script, though. A script also lets the user download multiple files, not just a ZIP file, and then zipping is not required on the server backend (although perhaps the web server itself, e.g. Apache rather than Dataverse, does on-the-fly compression for transfers).
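To make the throttling idea concrete, here's a hypothetical sketch of the kind of script a server could auto-generate for a dataset. The file URLs are placeholders built from the example file IDs earlier in the thread, and --limit-rate plus a sleep are just one possible way to keep server load down:

```bash
#!/usr/bin/env bash
# Hypothetical auto-generated download script for a single dataset.
# A real implementation would have the server fill in the actual file URLs.
for url in \
  "https://demo.dataverse.org/api/access/datafile/307908" \
  "https://demo.dataverse.org/api/access/datafile/307909" \
  "https://demo.dataverse.org/api/access/datafile/307910"; do
  curl -L -O -J --limit-rate 500k "$url"  # cap per-file transfer rate
  sleep 2                                 # pause between files
done
```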
Hey @mankoff, we'll be implementing this after we make some optimizations to the zipping service in #6505. We have a full API suite documented at http://guides.dataverse.org/en/latest/api/index.html, so it would be possible to script things now. |
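For anyone who wants to script it today, here is a rough sketch under a few assumptions: it uses the native "list files in a dataset version" endpoint rather than the Search API, assumes the dataset is published and public (no API token), requires jq, and the DOI below is a placeholder:

```bash
#!/usr/bin/env bash
# Sketch: list a dataset's files with the native API, then download each one
# through the Data Access API.
SERVER="https://demo.dataverse.org"
PID="doi:10.5072/FK2/EXAMPLE"   # placeholder persistent identifier

# List the latest version's files and extract their database IDs.
ids=$(curl -s "$SERVER/api/datasets/:persistentId/versions/:latest/files?persistentId=$PID" \
  | jq -r '.data[].dataFile.id')

# Fetch each file, keeping the filename the server supplies.
for id in $ids; do
  curl -s -L -O -J "$SERVER/api/access/datafile/$id"
done
```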
I made pull request #7086 for this issue. Feedback is welcome, of course. |
Added enum so we don't have two methods both with 3 String args.
Return UNAUTHORIZED instead of BAD_REQUEST and detailed error messages.
add API to download all files by dataset #4529
Is there any method of downloading all the files of a dataset using the Data Access API? Something like using the global_id of the dataset to download all files in a zip, similar to the bundle download. Thanks!