add API to download all files by dataset #4529 #7086

pdurbin · 2020-07-14T19:03:10Z

What this PR does / why we need it:

API users have long wanted a way to download all the files in a dataset using its persistent identifier (DOI or Handle).

Which issue(s) this PR closes:

Closes #4529

Special notes for your reviewer:

~~One thought I had just now is that there's no way to specify original format vs archival format. Should this be added?~~ I added docs and tests for format=original in 44cc815.
I made no attempt to address "File API download bypasses terms of use and guestbook" (#2911), which is about how terms and guestbook acceptance can be bypassed via API. In my testing, I was able to bypass a guestbook with required fields for name and email.
I assume that Make Data Count is covered deeper in the API than I went (by existing APIs I'm calling into, I mean).

Suggestions on how to test this:

Follow the API docs. Try variations with restricted files, tabular files, guestbook/terms.
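
For anyone scripting these test variations, here is a minimal sketch of how the request URL could be assembled. The path matches the one that shows up in the error messages later in this thread, and `format=original` is the parameter covered by the docs and tests in 44cc815. The `persistentId` query parameter name, the helper name `dataset_download_url`, and the example DOI are assumptions for illustration only.

```python
from urllib.parse import urlencode


def dataset_download_url(base_url, persistent_id, original_format=False):
    """Build a URL for downloading all files in a dataset by its PID.

    Assumption: the PID is passed as a "persistentId" query parameter
    alongside the ":persistentId" path placeholder.
    """
    params = {"persistentId": persistent_id}
    if original_format:
        # Ask for tabular files in their original (uploaded) format.
        params["format"] = "original"
    # Keep ":" and "/" readable in the DOI or Handle instead of escaping them.
    query = urlencode(params, safe=":/")
    return f"{base_url.rstrip('/')}/api/access/dataset/:persistentId/?{query}"


# Hypothetical server and DOI, for illustration only.
url = dataset_download_url("http://localhost:8080", "doi:10.5072/FK2/EXAMPLE", original_format=True)
```

The resulting URL can be passed to curl, with or without an API token, to exercise the restricted-file and guestbook cases.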

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

No.

Is there a release notes update needed for this change?:

Yes, included.

Additional documentation:

Included.

coveralls · 2020-07-14T19:22:57Z

Coverage decreased (-0.05%) to 19.605% when pulling fc7df59 on 4529-download-all-api into c1a3834 on develop.

pdurbin · 2020-07-15T20:27:19Z

Ok, I'm ready for more review. I updated the description above to reflect that I added docs and tests for format=original (no code changes required).

I also did a quick test of a dataset with a guestbook. I made no attempt to address #2911 which is about how guestbooks and terms can be bypassed via API. The UI says name and email are required (screenshot below) but the API continues to let you have the files.

landreev

I was wondering why we weren't handling that "format=original" parameter as an obvious @QueryParam; finding it instead by going through UriInfo.getQueryParameters()... Probably not worth the trouble of changing it; seeing how it's been like that forever, in the /access/datafiles/ - it's just strange.

landreev

This may be nitpicking - but I'm not sure I like "downloadAll" as the name of the API. Should it be more self-explanatory, and along the lines of the existing APIs?
/api/access/datafile/<id> downloads a datafile;
/api/access/dataset/<id> downloads the files in a dataset?
(I'm open to hearing other opinions on this)

scolapasta · 2020-07-16T16:16:56Z

I'm in favor of: /api/access/dataset/

landreev · 2020-07-16T16:39:25Z

I'm not requesting this as a change, in this PR, but just want to put this on record: there is a potential optimization that can be added - instead of doing a full isAccessAuthorized() lookup for every file, it is possible to cache some permission information (for ex. - you don't really need to figure out whether this user has the permission to view this unpublished dataset every time from scratch). Which could in certain cases result in non-trivial savings... I remember this because we used to do that, in /api/access/datafiles/... Until we realized that even though that API call was originally created for downloading multiple files from the dataset page, arbitrary file ids spanning multiple datasets could be supplied to it directly - so extrapolating any file permission info, under the assumption that all the other files are within the same dataset was obviously wrong and dangerous. But that code is still in github history, and now that we have a call where staying within the same dataset is guaranteed, we could potentially revive it.
It's not 100% clear to me whether it's worth bothering with - so I'm going to open a new issue for it; and we could decide later on whether we want to touch this.
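
To put the idea on record in a concrete form, here is a rough sketch of what that caching could look like. It is not the code from the GitHub history. `is_dataset_viewable` and `is_file_downloadable` are hypothetical stand-ins for the pieces of the permission lookup, and the dict-based file records are only for illustration. The cache is keyed by dataset id, so a file from a different dataset never reuses another dataset's answer.

```python
def authorized_files(user, files, is_dataset_viewable, is_file_downloadable):
    """Return the files the user may download, checking each dataset only once.

    Both check functions are hypothetical callables supplied by the caller.
    """
    dataset_ok = {}  # dataset id -> cached result of the dataset-level check
    allowed = []
    for f in files:
        ds = f["dataset_id"]
        if ds not in dataset_ok:
            # The expensive dataset-level lookup happens once per dataset.
            dataset_ok[ds] = is_dataset_viewable(user, ds)
        # The per-file check (e.g. restricted files) still runs for every file.
        if dataset_ok[ds] and is_file_downloadable(user, f):
            allowed.append(f)
    return allowed
```

With a call that is guaranteed to stay within one dataset, the dataset-level check would run exactly once regardless of the number of files.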

pdurbin · 2020-07-17T16:29:52Z

Ok, in 4da877b I renamed the endpoint from "downloadAll" to "dataset". Back to code review.

kcondon · 2020-07-27T18:15:32Z

@pdurbin issues found so far:

Error handling is sparse: it seems to just say "error" no matter what the error is (lack of perms, bad arg, bad key, etc.).

No information when attempting to download something you do not have perms for, just says "error". In single file download, says:
{"status":"ERROR","code":403,"message":"'/api/v1/access/datafile/53822' you are not authorized to access this object via this api endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org."}

pdurbin · 2020-07-27T19:14:29Z

ThrowableHandler

This was recently changed in #7085

qqmyers · 2020-07-27T19:26:35Z

ThrowableHandler was added prior to #7085. When it was added, it stopped the default handling for exceptions such as javax.ws.rs.NotAcceptableException. What #7085 did was to add more exception/HTTP code-specific handlers to manage the ones Dataverse source throws so that they don't get caught by the new default ThrowableHandler. My guess is that javax.ws.rs.NotAcceptableException is being thrown by the framework code rather than Dataverse source code, so I didn't add a specific handler for it. The fix is probably to just add yet another handler for this specific exception and just send a 406 response as intended. The Redirect handler is probably a good example.
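
As a language-neutral illustration of why a specific handler wins over the catch-all, here is a small Python sketch. It is not Dataverse code: `NotAcceptable` is a stand-in for javax.ws.rs.NotAcceptableException, and the handler registry and the messages it returns are hypothetical.

```python
class NotAcceptable(Exception):
    """Stand-in for javax.ws.rs.NotAcceptableException."""


# Hypothetical registry: exception class -> function returning (status, message).
HANDLERS = {
    Exception: lambda exc: (500, "Internal server error"),  # catch-all
    NotAcceptable: lambda exc: (406, "Not acceptable"),      # specific handler
}


def handle(exc, handlers=HANDLERS):
    """Pick the handler registered for the most specific matching class."""
    # Walk from the exception's own class up through its base classes.
    for cls in type(exc).__mro__:
        if cls in handlers:
            return handlers[cls](exc)
    raise exc
```

Without the `NotAcceptable` entry, the lookup falls through to the catch-all, which is the behavior described above.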

kcondon · 2020-07-27T19:35:42Z

@qqmyers Should it be a separate ticket?

qqmyers · 2020-07-27T19:37:58Z

@kcondon - yeah, not related to this PR that I can see. Also, is this 406 error from a test? If so, it might also make sense to fix the test (unless we really wanted to test 406 instead of seeing if the dcterms metadata can be downloaded).

kcondon · 2020-07-27T19:42:09Z

@qqmyers I'm not sure what's causing it. It appears to happen on its own.

qqmyers · 2020-07-27T19:47:00Z

FWIW - datasets/export can produce "application/xml", "application/json", or "application/html". My guess is that if someone makes an API call requesting something other than one of those, it's a 406.
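
A simplified sketch of that guess, assuming the three media types listed above. The matching here ignores q-values and media type parameters, so it is only an approximation of what a real framework does, and the `negotiate` helper is hypothetical.

```python
PRODUCED = ("application/xml", "application/json", "application/html")


def negotiate(accept_header, produced=PRODUCED):
    """Return (200, media_type) for the first acceptable type, else (406, None)."""
    # Strip parameters such as ";q=0.9" and normalize case.
    requested = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
        if part.strip()
    ]
    if not requested:
        # No Accept header: anything the endpoint produces is acceptable.
        return 200, produced[0]
    for want in requested:
        for have in produced:
            if want in ("*/*", have):
                return 200, have
            # Handle type wildcards such as "application/*".
            if want.endswith("/*") and have.startswith(want[:-1]):
                return 200, have
    return 406, None
```

A request that only accepts, say, `text/csv` would end up with a 406 under this model.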

kcondon · 2020-07-27T20:28:46Z

@pdurbin This works but since it only ever seems to say "error" when something goes wrong, maybe that could be looked at?

poikilotherm · 2020-07-28T11:52:02Z

@kcondon I extended the error handling stuff in #7136. Do you wish for me to add a more meaningful error message for error 406?

kcondon · 2020-07-28T13:23:28Z

@poikilotherm Is there something more meaningful? If it doesn't add value, then no. If it provides context or potentially a direction for action, then yes. I trust your judgement. In the case for errors in this pr, it literally just says error, when there are clearly different causes and I am error prone when first using an API. :(

pdurbin · 2020-07-28T15:01:11Z

@kcondon is right. Only "ERROR" is seen like this...

{
    "status": "ERROR"
}

... if a guest user tries to download from a dataset with multiple unpublished files.

Additionally, I can get this error (I'm using PIDs instead of database IDs)...

{
    "status": "ERROR",
    "code": 403,
    "message": "'/api/v1/access/dataset/%3ApersistentId' you are not authorized to access this object via this api endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org."
}

... if the guest user (no API token) attempts to use the new "download by dataset" endpoint on a dataset with a single restricted file.

Let me see what I can do.

pdurbin · 2020-07-28T18:55:35Z

@kcondon et al, good catch on the error handling. I wasn't using the WrappedResponse object properly but as of fc7df59 I am. This means that instead of always sending BAD_REQUEST (400), you should see UNAUTHORIZED (401) as appropriate. You should also see more detailed messages like these:

User :guest is not permitted to perform requested action.
Bad api key

I updated the tests to reflect these and others.
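
For readers who want the gist without opening the diff, here is a small sketch of the behavior described above. It is not the actual WrappedResponse code. The JSON shape follows the error bodies quoted in this thread, the helper names are hypothetical, and pairing each message with 401 is an assumption.

```python
import json


def error_response(code, message):
    """Build an error body in the status/code/message shape seen in this thread."""
    body = {"status": "ERROR", "code": code, "message": message}
    return code, json.dumps(body)


def check_download(user, api_key_ok, permitted):
    """Return (status, body): a specific 401 error instead of a bare 400."""
    if not api_key_ok:
        return error_response(401, "Bad api key")
    if not permitted:
        return error_response(
            401, f"User {user} is not permitted to perform requested action."
        )
    return 200, None  # the caller would go on to stream the zip
```

The point is that each failure carries its own message rather than collapsing into a generic "ERROR".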

I didn't attempt to do anything with the "you are not authorized to access this object via this api endpoint" message because it's pre-existing and I'm trying to be mindful of scope. If you call into the older "datafiles" (plural) API on which this new "download by dataset" API is built, you get a similar message. Here's an attempt to use that API by the guest user to download a single restricted file:

curl http://localhost:8080/api/access/datafiles/4216

{
  "status": "ERROR",
  "code": 403,
  "message": "'/api/v1/access/datafiles/4216' you are not authorized to access this object via this api endpoint. Please check your code for typos, or consult our API guide at http://guides.dataverse.org."
}

Furthermore, this error at least conveys "not authorized" and gives FORBIDDEN (403) so I think it's ok.

I'm moving this back to code review. I'm aware that there's more conversation above about error handling as well as in #7134 and #7136 but I imagine it'll be handled in separate issues and pull requests.

add API to download all files by dataset #4529

abb72e7

djbrooke added this to the Dataverse 5 milestone Jul 14, 2020

landreev self-requested a review July 14, 2020 20:20

landreev self-assigned this Jul 14, 2020

djbrooke assigned pdurbin and unassigned landreev Jul 15, 2020

pdurbin added 2 commits July 15, 2020 16:05

add docs and tests for format=original #4529

44cc815

add DownloadFilesIT to API test suite #4529

30bb828

pdurbin removed their assignment Jul 15, 2020

pdurbin mentioned this pull request Jul 15, 2020

Data Access API: Download by Dataset #4529

Closed

landreev self-assigned this Jul 16, 2020

landreev reviewed Jul 16, 2020

landreev requested changes Jul 16, 2020

pdurbin assigned pdurbin and unassigned landreev Jul 17, 2020

pdurbin added 2 commits July 17, 2020 12:03

resolve naming conflict between test methods #4529

05e55c2

Added enum so we don't have two methods both with 3 String args.

rename "downloadAll" to "dataset" (Data Access API) #4529

4da877b

pdurbin removed their assignment Jul 17, 2020

landreev self-assigned this Jul 21, 2020

landreev approved these changes Jul 21, 2020

kcondon self-assigned this Jul 21, 2020

kcondon assigned pdurbin and unassigned kcondon Jul 27, 2020

improve error handling #4529

fc7df59

Return UNAUTHORIZED instead of BAD_REQUEST and detailed error messages.

pdurbin removed their assignment Jul 28, 2020

sekmiller self-assigned this Jul 29, 2020

sekmiller approved these changes Jul 29, 2020

sekmiller removed their assignment Jul 29, 2020

kcondon self-assigned this Jul 29, 2020

kcondon merged commit 2613d4e into develop Jul 29, 2020

kcondon deleted the 4529-download-all-api branch July 29, 2020 20:06

pdurbin mentioned this pull request Feb 7, 2022

DOIs for Dataset versions #4499

Open

qqmyers mentioned this pull request Jan 10, 2023

Client-side multifile zip download #9245

Draft
