
Downloading the whole archive for Weaver (2019) fails #5999

Closed
davidkane9 opened this issue Jul 6, 2019 · 7 comments

Comments

@davidkane9

Consider this Dataverse entry:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QTOIQF

Try downloading the entire archive, all 128 files. You will get a zip file as normal, but it won't include all the files! This is what I get:

[screenshot: contents of the downloaded zip]

But it is missing many files. For example, the main_text directory should include a bunch of stuff, not just one .R file.

@pdurbin
Member

pdurbin commented Jul 8, 2019

@davidkane9 hi! Is there any more information in MANIFEST.TXT about the missing files?

(I'm going to guess that this issue is related: Download All Error Reporting - Hidden in Manifest if not all files are downloaded (4.9.4) #5588 )

At https://twitter.com/dataverseorg/status/1144729141138165764 Harvard Dataverse tweeted about mitigating stability issues. One step was to lower :ZipDownloadLimit ( http://guides.dataverse.org/en/4.15/installation/config.html#zipdownloadlimit ) significantly, down to 10 MB, I believe.
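For installation admins, that setting is changed through the admin settings API described on the config page linked above. A minimal sketch, assuming a local installation at `localhost:8080` and a 10 MB limit (the value is in bytes):

```python
import urllib.request

# PUT the new limit (in bytes) to the Dataverse admin settings endpoint.
# 10 MB = 10485760 bytes.
req = urllib.request.Request(
    "http://localhost:8080/api/admin/settings/:ZipDownloadLimit",
    data=b"10485760",
    method="PUT",
)
# urllib.request.urlopen(req)  # uncomment against a real installation
```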

A workaround would be to write a script to download each file individually. You can get a list of file IDs for that dataset from https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/QTOIQF ("Export Metadata" then "JSON").
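A minimal sketch of that per-file workaround (my own illustration, not an official example): pull the file IDs out of the `dataverse_json` export, then fetch each file from the `/api/access/datafile/{id}` endpoint in the API Guide. For simplicity this names each downloaded file after its numeric ID; a real script would read the labels from the metadata.

```python
import json
import urllib.request

BASE = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/QTOIQF"

def file_ids(dataset_json):
    """Collect datafile IDs from a dataverse_json metadata export."""
    files = dataset_json["datasetVersion"]["files"]
    return [f["dataFile"]["id"] for f in files]

def download_all():
    """Fetch the JSON export, then download each file one by one."""
    export_url = (
        f"{BASE}/api/datasets/export"
        f"?exporter=dataverse_json&persistentId={DOI}"
    )
    with urllib.request.urlopen(export_url) as resp:
        dataset = json.load(resp)
    for fid in file_ids(dataset):
        with urllib.request.urlopen(f"{BASE}/api/access/datafile/{fid}") as resp:
            with open(str(fid), "wb") as out:
                out.write(resp.read())

# download_all()  # uncomment to run; hits the live Harvard Dataverse API
```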

@donsizemore
Contributor

@davidkane9 in case it's of any use/interest, I wrote a script to download all files of a given dataset in original format: https://github.com/OdumInstitute/dataverse-toolbox/blob/master/download_dataset_original_format.py

If you don't want all originals, I could make that a flag.

@pdurbin
Member

pdurbin commented Jul 8, 2019

@donsizemore I forgot you wrote this script! Some quick thoughts:

@pdurbin
Member

pdurbin commented Sep 5, 2019

This morning I was over at the Science Center and noticed a stack of flyers for a class that @davidkane9 teaches:

[photo: flyer for @davidkane9's class]

@davidkane9 I'm thinking I should write a script as a workaround to get all those files. It's something @sbarbosadataverse asked for recently and would make a good addition to the "Finding and Downloading Data" section of the Dataverse API Guide: http://guides.dataverse.org/en/4.16/api/getting-started.html#finding-and-downloading-data

Would that be helpful? Is Python ok? Did you try https://github.com/OdumInstitute/dataverse-toolbox/blob/master/download_dataset_original_format.py ?

Longer term we're hoping to fix this with #6093 or similar so that a script isn't necessary.

Also, that class sounds great! 😄

@qqmyers
Member

qqmyers commented Dec 4, 2019

FWIW: it might be interesting to see if the archiving functionality manages to create a zipped Bag (e.g. configure it to use Odum's file archiver to just drop the bag in the file system). If that works, it might make sense to adopt the zipping code there (based on Apache's parallel zip, retrieving the files as needed on a pool of threads). The code itself can create zips up to ~a terabyte / 100K+ files, though I haven't tried it at that scale on a Dataverse dataset. (If this doesn't work, then perhaps archiving will need #6093 as well!)

@davidkane9
Author

Thanks! And sorry for the delay in responding.

@djbrooke
Contributor

I'm going to close this issue - this should now work as expected on the Harvard Dataverse Repository, as we've enabled the external zipper service discussed in #6093. For those installations not using the external zip service, #5588 has the relevant details.
