[Spike - API] Extend API to get files download size without original tabular files size #9958

GPortas · 2023-09-26T11:10:44Z

Overview of the Feature Request

The current API endpoint /api/datasets/$ID/versions/$VERSIONID/downloadsize shows the combined size in bytes of all the files available from a particular dataset version.

This endpoint is not suitable for the current SPA needs, in particular the files tab, because it also counts the size of the original tabular files, as they were before being transformed.

We need the total size of the final downloadable files, without additional data.

It would also be useful for the endpoint to return a format that is easier to consume and transform. The current /downloadsize response has the following format:

data: {
  status: 'OK',
  data: {
    message: 'Total size of the files available for download in this version of the dataset: 330 bytes'
  }
}

We could return the byte count in a separate numeric field, so that clients can read it directly.
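To illustrate why the current format is awkward for clients, here is a minimal Java sketch of the parsing a consumer has to do today to recover the byte count from the message text. The class and method names are hypothetical and used only for illustration; they are not part of Dataverse or of dataverse-client-javascript. The regular expression assumes the message always ends with "<number> bytes", which is only what the example response above suggests.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class DownloadSizeMessageParser {

    // Assumption: the message ends with "<digits> bytes", as in the sample response.
    private static final Pattern TRAILING_BYTES = Pattern.compile("(\\d+) bytes\\s*$");

    // Returns the byte count found at the end of the message, or empty if the text does not match.
    static OptionalLong parseBytes(String message) {
        Matcher matcher = TRAILING_BYTES.matcher(message);
        if (!matcher.find()) {
            return OptionalLong.empty();
        }
        return OptionalLong.of(Long.parseLong(matcher.group(1)));
    }

    public static void main(String[] args) {
        String message = "Total size of the files available for download "
                + "in this version of the dataset: 330 bytes";
        System.out.println(parseBytes(message).getAsLong()); // prints 330
    }
}
```

Any change to the wording of the message would silently break this kind of parsing, which is the motivation for exposing the number as its own field.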

The new API changes will be consumed by: IQSS/dataverse-client-javascript#86

What kind of user is the feature intended for?
API user

What inspired the request?

Add use case to get the total space consumed by the files of a particular dataset version dataverse-client-javascript#86

What existing behavior do you want changed?
N/A

Any brand new behavior do you want to add to Dataverse?
Extend the API to get files download size without original tabular files size

Any open or closed issues related to this feature request?

Add use case to get the total space consumed by the files of a particular dataset version dataverse-client-javascript#86

landreev · 2023-09-26T22:41:45Z

This makes sense, yes. I was surprised to see that the existing /downloadsize API did not have this option. The way our download APIs work, you can request either the (default) converted tab-delimited files (when present) or their saved originals, but never both at the same time, so it is strange that the API counts both.

OK, looking at the GetDatasetStorageSizeCommand that the API relies on, and the service bean method that the command calls, it really looks like those were created for the specific purpose of finding the storage size, that is, how much space the dataset is using on disk/S3, as the name suggests. It must have been repurposed later for that "downloadsize" API, and it was somehow missed that it was counting both.

I just saw the linked PR you made today. That should work, but an alternative (and maybe simpler?) solution would be not to use GetDatasetStorageSizeCommand for the /downloadsize API at all, and simply make it call DatasetUtil.getDownloadSizeNumeric(DatasetVersion dsv, boolean original) instead.
Our current dataset page uses this method for the download size, and the boolean argument selects either the tab-delimited or the original sizes. (Up to you.)

GPortas · 2023-09-27T09:34:04Z

Thank you for your detailed explanation @landreev.

I like the approach of using DatasetUtil.getDownloadSizeNumeric, all the more because the current dataset page uses that method, and because it simplifies the logic compared with the GetDatasetStorageSizeCommand extension that you saw in my draft PR.

In any case, I think we should still use GetDatasetStorageSizeCommand when we want to include the original sizes for API backwards compatibility (maybe not so important here, but it is easy to implement). The reason is that GetDatasetStorageSizeCommand and DatasetUtil.getDownloadSizeNumeric calculate the size differently when including original tabular sizes.

DatasetUtil.getDownloadSizeNumeric: it only adds the original tabular file size to the total count (original param = true) or only the processed one (original = false).
GetDatasetStorageSizeCommand: adds both the size of the original tabular file and the processed one. (https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/DatasetServiceBean.java#L1062)
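The difference between the two calculations can be sketched in Java as follows. This is only an illustration of the behavior described in this thread, not the actual Dataverse code: the FileSizes class, the method names, and the sample sizes are all hypothetical stand-ins for the real entities and methods.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model, for illustration only; not the Dataverse implementation.
final class DownloadSizeSketch {

    // Stand-in for a file in a dataset version.
    static final class FileSizes {
        final long storedSize;   // size of the file served by default (tab-delimited for tabular files)
        final Long originalSize; // size of the saved original, or null for non-tabular files

        FileSizes(long storedSize, Long originalSize) {
            this.storedSize = storedSize;
            this.originalSize = originalSize;
        }
    }

    // Behavior described for DatasetUtil.getDownloadSizeNumeric:
    // count either the original or the processed size of each tabular file, never both.
    static long downloadSize(List<FileSizes> files, boolean original) {
        long total = 0;
        for (FileSizes file : files) {
            if (original && file.originalSize != null) {
                total += file.originalSize.longValue();
            } else {
                total += file.storedSize;
            }
        }
        return total;
    }

    // Behavior described for GetDatasetStorageSizeCommand:
    // count the processed size plus the original size, when an original exists.
    static long storageSize(List<FileSizes> files) {
        long total = 0;
        for (FileSizes file : files) {
            total += file.storedSize;
            if (file.originalSize != null) {
                total += file.originalSize.longValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<FileSizes> files = Arrays.asList(
                new FileSizes(120L, 80L),  // tabular file: 120 bytes converted, 80 bytes original
                new FileSizes(50L, null)); // non-tabular file: 50 bytes
        System.out.println(downloadSize(files, false)); // prints 170
        System.out.println(downloadSize(files, true));  // prints 130
        System.out.println(storageSize(files));         // prints 250
    }
}
```

With these made-up sizes, the three totals differ, which is why keeping the storage-size calculation for the "include originals" case preserves the existing API output.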

I have updated the PR following this approach. f653c21

pdurbin · 2023-12-11T20:45:26Z

I'm not sure why this issue wasn't automatically closed when the following PR was merged but I'm closing it now:

New API option to get the download size of the dataset version files without the original tabular files size #9960

GPortas added the labels "User Role: API User" (Makes use of APIs), "pm.GREI-d-2.7.1" (NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows), and "pm.GREI-d-2.7.2" (NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows) Sep 26, 2023

GPortas added the "Size: 10" label (A percentage of a sprint. 7 hours.) Sep 26, 2023

GPortas self-assigned this Sep 26, 2023

GPortas mentioned this issue Sep 26, 2023

New API option to get the download size of the dataset version files without the original tabular files size #9960

Merged

GPortas added a commit that referenced this issue Sep 27, 2023

Added: #9958 release notes

9d10b99

GPortas added a commit that referenced this issue Oct 2, 2023

Changed: updated release notes for #9958

cbf00d7

pdurbin added this to the 6.1 milestone Oct 13, 2023

pdurbin closed this as completed Dec 11, 2023
