Implement dataverse-import-files
#292
Comments
I tried this approach with a dataset with restricted files (https://dataverse.nl/dataset.xhtml?persistentId=doi:10.34894/R1TNL8&version=1.3), but to no avail:
Then I ran fsck, which failed:
At this point I wondered if the file access urls were wrong for some reason, and I tested with
Another thing I tried was to navigate to the file access url (https://dataverse.nl/api/access/datafile/46664) in a browser tab where I had already logged in to the dataverse instance, and therefore the files were not restricted (I have admin privileges for the dataset). The browser downloaded the file automatically and successfully. As expected, doing this in an incognito tab fails:
I'm out of ideas at the moment...
Have you checked how it tries to authenticate? Above I wrote:
And this is likely what you are seeing here. If you check the code here: datalad-dataverse/datalad_dataverse/dataset.py Lines 183 to 188 in d793dda
you'll see that the SWORD API wants the API token to be given as the user (not the password) with HTTP Basic auth. The dataverse special remote primarily uses the main API via pydataverse, which provides the API token as shown here: https://github.com/gdcc/pyDataverse/blob/master/src/pyDataverse/api.py#L117-L118

I am not aware of any way this could all be inferred magically. It needs a dedicated handler for such URLs.
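To make the difference between the two auth styles concrete, here is a minimal sketch. It assumes the SWORD variant uses an empty password next to the token, and that the main API accepts the token in the `X-Dataverse-key` header documented in the Dataverse API guide. Both function names are made up for illustration and are not part of datalad-dataverse or pyDataverse.

```python
import base64


def sword_auth_headers(api_token: str) -> dict:
    """HTTP Basic auth with the API token as the *user* (SWORD style).

    Assumption: the password part is left empty.
    """
    credentials = base64.b64encode(f"{api_token}:".encode()).decode()
    return {"Authorization": f"Basic {credentials}"}


def native_auth_headers(api_token: str) -> dict:
    """Token passed in a dedicated header (main/native API style).

    Assumption: the instance honors the X-Dataverse-key header.
    """
    return {"X-Dataverse-key": api_token}


if __name__ == "__main__":
    # Placeholder token, not a real credential
    print(sword_auth_headers("my-token"))
    print(native_auth_headers("my-token"))
```

A generic downloader that only knows about username/password credentials will produce neither of these on its own, which is why a URL-specific handler seems necessary.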
Thanks for the pointer, I will look into this.
The preferred way to authenticate with v5 native APIs is using

I'm still trying to figure out where exactly a dedicated handler would need to fit in, and how to implement it. My current understanding is this:
IIUC a handler for dataverse urls would essentially be the same as the current
So should there be a new urloperations class in next? I don't think so because

So it would be something like the following:
where uncurl would already need to know that it is a dataverse URL, so that it can query for the API token and add it to the header. This last part is still very blurry to me. Is this all on the right track?
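One way to picture the "knowing that it is a dataverse URL" part is a lookup from the URL's host to a token, applied only to data access paths. The sketch below is not the uncurl or datalad-next handler API. The function name, the `tokens` mapping, and the use of the `X-Dataverse-key` header are all assumptions for illustration.

```python
from urllib.parse import urlsplit

# Path prefix of the data access API, as in the file access URL quoted earlier
ACCESS_PREFIX = "/api/access/datafile/"


def dataverse_headers_for(url: str, tokens: dict) -> dict:
    """Return extra request headers for `url`, or {} if no token applies.

    `tokens` is a hypothetical mapping of dataverse hostnames to API tokens,
    e.g. read from dataset or user configuration.
    """
    parts = urlsplit(url)
    token = tokens.get(parts.hostname)
    # Only send a token over HTTPS, and only to the data access API
    if token is None or parts.scheme != "https":
        return {}
    if not parts.path.startswith(ACCESS_PREFIX):
        return {}
    return {"X-Dataverse-key": token}


if __name__ == "__main__":
    tokens = {"dataverse.nl": "my-token"}  # placeholder token
    print(dataverse_headers_for(
        "https://dataverse.nl/api/access/datafile/46664", tokens))
    print(dataverse_headers_for("https://example.org/file.dat", tokens))
```

Keeping the match on host plus path prefix means the token is never attached to requests for unrelated URLs that happen to be registered for the same key.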
I just want to register the thought that we should write this up once we have all the steps. Either in the datalad-dataverse docs, or datalad-next, or maybe the handbook.
Dataverse provides full dataset (version) file listings that also include md5 sums (and others). Therefore it would be fairly simple to support sucking in a filetree without having to go through the full complexity of supporting git-annex's importtree.
datalad-ebrains pretty much has the blueprint for that.

It is unclear to me whether such a starting point could be coupled with an export/filetree-only setup provided by
but the immediate answer is no. Git-annex refuses to try, because it has no export location on record.
Faking an export (or performing an empty one) also does not work, because a datalad dataset will contain files that are not on dataverse (and possibly cannot be, i.e. the importing agent has no write permissions).
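To illustrate what "sucking in a filetree" from a file listing could look like, here is a small sketch that turns a listing into path/md5/URL records. The JSON layout (`label`, `directoryLabel`, `dataFile.id`, `dataFile.md5`, `dataFile.filesize`) is my assumption about what the native API returns for a dataset version's files. The sample values are made up, apart from the file ID taken from the access URL quoted in the comments.

```python
import json

# Made-up sample in the assumed shape of a dataset version file listing
SAMPLE_LISTING = json.loads("""
{
  "status": "OK",
  "data": [
    {
      "label": "data.csv",
      "directoryLabel": "raw",
      "restricted": true,
      "dataFile": {
        "id": 46664,
        "filename": "data.csv",
        "filesize": 0,
        "md5": "d41d8cd98f00b204e9800998ecf8427e"
      }
    }
  ]
}
""")


def iter_file_records(listing: dict, base_url: str):
    """Yield one record per file: relative path, md5, size, access URL."""
    for entry in listing["data"]:
        datafile = entry["dataFile"]
        name = entry.get("label") or datafile["filename"]
        directory = entry.get("directoryLabel", "")
        yield {
            "path": f"{directory}/{name}" if directory else name,
            "md5": datafile.get("md5"),
            "size": datafile.get("filesize"),
            "restricted": entry.get("restricted", False),
            # Data access API URL, same pattern as the URL tested above
            "url": f"{base_url}/api/access/datafile/{datafile['id']}",
        }


if __name__ == "__main__":
    for record in iter_file_records(SAMPLE_LISTING, "https://dataverse.nl"):
        print(record)
```

Each record carries everything needed to create an annex key and attach a download URL, without ever touching the export machinery.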
A different approach would be to populate a dataset with keys that have attached URLs that point to the data access API of the respective dataverse instance. The uncurl special remote would then be able to take care of them. Possibly a dedicated handler needs to be implemented that performs the auth correctly. Such a handler can be configured in the dataset, specifically for the respective dataverse instance.

Here is a sketch
For public datasets (no auth), uncurl is not even needed. web does things alright.
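As a rough illustration of the keys-with-URLs idea (not the sketch from the original post), the following builds an MD5-backend annex key from a known checksum and size, and the git-annex calls that would place a file for that key and register its access URL. The choice of the `MD5` backend, the helper names, and the use of `fromkey --force` plus `registerurl` are my assumptions about one workable route. The commands are only assembled here, not executed.

```python
import shlex


def annex_key(md5: str, size: int) -> str:
    """Format a git-annex key for the MD5 backend: MD5-s<size>--<hexdigest>."""
    return f"MD5-s{size}--{md5}"


def annex_commands(path: str, md5: str, size: int, url: str) -> list:
    """Return argv lists that add `path` for a known key and record `url`."""
    key = annex_key(md5, size)
    return [
        # --force, because the key's content is not present locally yet
        ["git", "annex", "fromkey", "--force", key, path],
        # Record where the content can be downloaded from
        ["git", "annex", "registerurl", key, url],
    ]


if __name__ == "__main__":
    commands = annex_commands(
        "raw/data.csv",
        "d41d8cd98f00b204e9800998ecf8427e",  # made-up example checksum
        0,
        "https://dataverse.nl/api/access/datafile/46664",
    )
    for argv in commands:
        print(shlex.join(argv))
```

Because the key is derived from the checksum in the listing, git-annex can verify every download against it, regardless of which special remote ends up fetching the URL.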