-
Notifications
You must be signed in to change notification settings - Fork 497
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add API to download all files by dataset #4529 #7086
Conversation
Ok, I'm ready for more review. I updated the description above to reflect that I added docs and tests for I also did a quick test of a dataset with a guestbook. I made no attempt to address #2911 which is about how guestbooks and terms can be bypassed via API. The UI says name and email are required (screenshot below) but the API continues to let you have the files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering why we weren't handling that "format=original" parameter as an obvious @QueryParam; finding it instead by going through UriInfo.getQueryParameters()... Probably not worth the trouble of changing it; seeing how it's been like that forever, in the /access/datafiles/ - it's just strange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be nitpicking - but I'm not sure I like "downloadAll" as the name of the API. Should it be more self-explanatory, and along the lines of the existing APIs?
/api/access/datafile/<id>
downloads a datafile;
/api/access/dataset/<id>
downloads the files in a dataset?
(I'm open to hearing other opinions on this)
I'm in favor of: /api/access/dataset/ |
I'm not requesting this as a change, in this PR, but just want to put this on record: there is a potential optimization that can be added - instead of doing a full isAccessAuthorized() lookup for every file, it is possible to cache some permission information (for ex. - you don't really need to figure out whether this user has the permission to view this unpublished dataset every time from scratch). Which could in certain cases result in non-trivial savings... I remember this because we used to do that, in /api/access/datafiles/... Until we realized that even though that API call was originally created for downloading multiple files from the dataset page, arbitrary file ids spanning multiple datasets could be supplied to it directly - so extrapolating any file permission info, under the assumption that all the other files are within the same dataset was obviously wrong and dangerous. But that code is still in github history, and now that we have a call where staying within the same dataset is guaranteed, we could potentially revive it. |
Added enum so we don't have two methods both with 3 String args.
Ok, in 4da877b I renamed the endpoint from "downloadAll" to "dataset". Back to code review. |
@pdurbin issues found so far:
No information when attempting to download something you do not have perms for, just says "error". In single file download, says: |
This was recently changed in #7085 |
ThrowableHandler was added prior to #7085. When it was added, it stopped the default handing for exceptions such as javax.ws.rs.NotAcceptableException . What #7085 did was to add more exception/http code-specific handlers to manage the ones Dataverse source throws so that they don't get caught be the new default ThrowableHandler. My guess is that javax.ws.rs.NotAcceptableException is being thrown by the framework code rather than Dataverse source code, so I didn't add a specific handler for it. The fix is probably to just add yet-another handler for this specific exception and just send a 406 response as intended. The Redirect handler is probably a good example. |
@qqmyers Should it be a separate ticket? |
@kcondon -yeah - not related to this PR that I can see. Also - is this 406 error from a test? If so, it might also make sense to fix the test (unless we really wanted to test 406 instead of seeing if the dcterms metadata can be downloaded. |
@qqmyers I'm not sure what's causing it. It appears to happen on it's own. |
FWIW - datasets/export can produce "application/xml", "application/json", or "application/html". My guess is that if someone makes an API call requesting something other than one of those, it's a 406. |
@pdurbin This works but since it only ever seems to say "error" when something goes wrong, maybe that could be looked at? |
@poikilotherm Is there something more meaningful? If it doesn't add value, then no. If it provides context or potentially a direction for action, then yes. I trust your judgement. In the case for errors in this pr, it literally just says error, when there are clearly different causes and I am error prone when first using an API. :( |
@kcondon is right. Only "ERROR" is seen like this...
... if a guest user tries to download from a dataset with multiple unpublished files. Additionally, I can get this error (I'm using PIDs instead of database IDs)...
... if the guest user (no API token) attempts to use the new "download by dataset" endpoint on a dataset with a single restricted file. Let me see what I can do. |
Return UNAUTHORIZED instead of BAD_REQUEST and detailed error messages.
@kcondon et al, good catch on the error handling. I wasn't using the WrappedResponse object properly but as of fc7df59 I am. This means that instead of always sending BAD_REQUEST (400), you should see UNAUTHORIZED (401) as appropriate. You should also see more detailed messages like these:
I updated the tests to reflect these and others. I didn't attempt to do anything with the "you are not authorized to access this object via this api endpoint" message because it's pre-existing and I'm trying to be mindful of scope. If I call into the older "datafiles" (plural) API on which this new "download by dataset" API is build, you get a similar message. Here's an attempt to use that API by the guest user to download a single restricted file: curl http://localhost:8080/api/access/datafiles/4216
Furthermore, this error at least conveys "not authorized" and gives FORBIDDEN (403) so I think it's ok. I'm moving this back to code review. I'm aware that there's more conversation above about error handling as well as in #7134 and #7136 but I imagine it'll be handled in separate issues and pull requests. |
What this PR does / why we need it:
API users have long wanted a way to download all the files in a dataset using its persistent identifier (DOI or Handle).
Which issue(s) this PR closes:
Closes #4529
Special notes for your reviewer:
One thought I had just now is that there's no way to specify original format vs archival format. Should this be added?I added docs and tests forformat=original
in 44cc815.Suggestions on how to test this:
Follow the API docs. Try variations with restricted files, tabular files, guestbook/terms.
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No.
Is there a release notes update needed for this change?:
Yes, included.
Additional documentation:
Included.