
Downloading the whole archive for Weaver (2019) fails #5999

Closed
davidkane9 opened this issue Jul 6, 2019 · 7 comments

Comments

@davidkane9

Consider this Dataverse entry:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QTOIQF

Try downloading the entire archive, all 128 files. You will get a zip file as normal, but it won't include all the files! This is what I get:

[screenshot: contents of the downloaded zip]

But it is missing many files. For example, the main_text directory should include a bunch of stuff, not just one .R file.

@pdurbin
Member

pdurbin commented Jul 8, 2019

@davidkane9 hi! Is there any more information in MANIFEST.TXT about the missing files?

(I'm going to guess that this issue is related: Download All Error Reporting - Hidden in Manifest if not all files are downloaded (4.9.4) #5588 )

At https://twitter.com/dataverseorg/status/1144729141138165764 Harvard Dataverse tweeted about mitigating stability issues. One step was to lower :ZipDownloadLimit ( http://guides.dataverse.org/en/4.15/installation/config.html#zipdownloadlimit ) significantly, down to 10 MB, I believe.
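For installation admins, that setting is changed through the admin settings API described on the config page linked above. A minimal sketch, assuming a local installation at `localhost:8080` and a 10 MB limit (the value is in bytes):

```python
import urllib.request

# PUT the new limit (in bytes) to the Dataverse admin settings endpoint.
# 10 MB = 10485760 bytes.
req = urllib.request.Request(
    "http://localhost:8080/api/admin/settings/:ZipDownloadLimit",
    data=b"10485760",
    method="PUT",
)
# urllib.request.urlopen(req)  # uncomment against a real installation
```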

A workaround would be to write a script to download each file individually. You can get a list of file IDs for that dataset from https://dataverse.harvard.edu/api/datasets/export?exporter=dataverse_json&persistentId=doi%3A10.7910/DVN/QTOIQF ("Export Metadata" then "JSON").
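A minimal sketch of that per-file workaround (my own illustration, not an official example): pull the file IDs out of the `dataverse_json` export, then fetch each file from the `/api/access/datafile/{id}` endpoint in the API Guide. For simplicity this names each downloaded file after its numeric ID; a real script would read the labels from the metadata.

```python
import json
import urllib.request

BASE = "https://dataverse.harvard.edu"
DOI = "doi:10.7910/DVN/QTOIQF"

def file_ids(dataset_json):
    """Collect datafile IDs from a dataverse_json metadata export."""
    files = dataset_json["datasetVersion"]["files"]
    return [f["dataFile"]["id"] for f in files]

def download_all():
    """Fetch the JSON export, then download each file one by one."""
    export_url = (
        f"{BASE}/api/datasets/export"
        f"?exporter=dataverse_json&persistentId={DOI}"
    )
    with urllib.request.urlopen(export_url) as resp:
        dataset = json.load(resp)
    for fid in file_ids(dataset):
        with urllib.request.urlopen(f"{BASE}/api/access/datafile/{fid}") as resp:
            with open(str(fid), "wb") as out:
                out.write(resp.read())

# download_all()  # uncomment to run; hits the live Harvard Dataverse API
```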

@donsizemore
Contributor

@davidkane9 in case it's of any use/interest, I wrote a script to download all files of a given dataset in original format: https://github.com/OdumInstitute/dataverse-toolbox/blob/master/download_dataset_original_format.py

If you don't want all originals, I could make that a flag.

@pdurbin
Member

pdurbin commented Jul 8, 2019

@donsizemore I forgot you wrote this script! Some quick thoughts:

@pdurbin
Member

pdurbin commented Sep 5, 2019

This morning I was over at the Science Center and noticed a stack of flyers for a class that @davidkane9 teaches:

[photo: flyer for @davidkane9's class]

@davidkane9 I'm thinking I should write a script as a workaround to get all those files. It's something @sbarbosadataverse asked for recently and would make a good addition to the "Finding and Downloading Data" section of the Dataverse API Guide: http://guides.dataverse.org/en/4.16/api/getting-started.html#finding-and-downloading-data

Would that be helpful? Is Python ok? Did you try https://github.com/OdumInstitute/dataverse-toolbox/blob/master/download_dataset_original_format.py ?

Longer term we're hoping to fix this with #6093 or similar so that a script isn't necessary.

Also, that class sounds great! 😄

@qqmyers
Member

qqmyers commented Dec 4, 2019

FWIW: it might be interesting to see if the archiving functionality manages to create a zipped Bag (e.g. configure it to use Odum's file archiver to just drop the bag in the file system). If that works, it might make sense to adopt the zipping code there (based on Apache's parallel zip, retrieving the files as needed on a pool of threads). The code itself can create zips up to ~a terabyte / 100K+ files, though I haven't tried it at that scale on a Dataverse dataset. (If this doesn't work, then perhaps archiving will need #6093 as well!)

@davidkane9
Author

Thanks! And sorry for the delay in responding.

@djbrooke
Contributor

I'm going to close this issue - this should now work as expected on the Harvard Dataverse Repository, as we've enabled the external zipper service discussed in #6093. For those installations not using the external zip service, #5588 has the relevant details.
