[Spike - API] Extend API to get files download size without original tabular files size #9958

Closed
GPortas opened this issue Sep 26, 2023 · 3 comments
Assignees
Labels
pm.GREI-d-2.7.1 NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows pm.GREI-d-2.7.2 NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows Size: 10 A percentage of a sprint. 7 hours. User Role: API User Makes use of APIs

@GPortas
Contributor

GPortas commented Sep 26, 2023

Overview of the Feature Request

The current API endpoint /api/datasets/$ID/versions/$VERSIONID/downloadsize shows the combined size in bytes of all the files available from a particular dataset version.

This endpoint does not meet the current needs of the SPA, in particular the files tab, because it includes the size of the tabular files before they are transformed.

We need the total size of the final downloadable files, without additional data.

It would also help if the endpoint returned a format that is easier to consume and transform. The current /downloadsize response has the following format:

data: {
  status: 'OK',
  data: {
    message: 'Total size of the files available for download in this version of the dataset: 330 bytes'
  }
}

We could return the byte count in a separate field that can be read directly.
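
To illustrate why the message-only format is awkward to consume, here is a minimal sketch of what a client has to do today to recover the byte count. The interface and function names are hypothetical, and the handling of thousands separators is an assumption, since only the `330 bytes` example appears in this issue.

```typescript
// Shape of the current /downloadsize response, as quoted in this issue.
interface DownloadSizeResponse {
  status: string;
  data: { message: string };
}

// Hypothetical helper: pull the byte count out of the human-readable message.
// Returns null when no "<number> bytes" token is found.
function parseDownloadSizeMessage(response: DownloadSizeResponse): number | null {
  const match = /(\d[\d,]*)\s+bytes/.exec(response.data.message);
  const digits = match?.[1];
  if (digits === undefined) {
    return null;
  }
  // Assumption: larger values may carry thousands separators, so strip commas.
  return Number(digits.replace(/,/g, ""));
}

const current: DownloadSizeResponse = {
  status: "OK",
  data: {
    message:
      "Total size of the files available for download in this version of the dataset: 330 bytes",
  },
};

console.log(parseDownloadSizeMessage(current)); // 330
```

A dedicated numeric field in the response would make this parsing step unnecessary.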

The new API changes will be consumed by: IQSS/dataverse-client-javascript#86

What kind of user is the feature intended for?
API user

What inspired the request?

What existing behavior do you want changed?
N/A

Any brand new behavior do you want to add to Dataverse?
Extend the API to get files download size without original tabular files size

Any open or closed issues related to this feature request?

@GPortas GPortas added User Role: API User Makes use of APIs pm.GREI-d-2.7.1 NIH, yr2, aim7, task1: R&D UI modules for creating datasets and supporting publishing workflows pm.GREI-d-2.7.2 NIH, yr2, aim7, task2: Implement UI modules for creating datasets and publishing workflows labels Sep 26, 2023
@GPortas GPortas added the Size: 10 A percentage of a sprint. 7 hours. label Sep 26, 2023
@GPortas GPortas self-assigned this Sep 26, 2023
@landreev
Contributor

This makes sense, yes. I was surprised to see that the existing /downloadsize API did not have this option. The way our download APIs work, you can request either the (default) converted tab-delimited files (when present) or their saved originals, but never both at the same time - so it is strange that the API counts both.

OK, looking at the GetDatasetStorageSizeCommand that the API relies on, and the service bean method that the command calls, it really looks like those were created for the specific purpose of finding the storage size, i.e. how much space the dataset is using on disk/S3, as the name suggests. It must have been repurposed later for that "downloadsize" API, and the fact that it was counting both was somehow missed.

I just saw the linked PR you made today. That should work, but an alternative (and maybe simpler?) solution would be not to use GetDatasetStorageSizeCommand for the /downloadsize API at all, and simply make it call DatasetUtil.getDownloadSizeNumeric(DatasetVersion dsv, boolean original) instead.
Our current dataset page uses this method for the download size, and the boolean argument selects either the tab-delimited or the original sizes (up to you).
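
The difference between the two totals described in this comment can be sketched as follows. This is an illustration of the description only, not the Dataverse Java code: the `FileSizes` model, the function names, and the sample sizes are all hypothetical.

```typescript
// Hypothetical, simplified file model; these are not the Dataverse entity classes.
interface FileSizes {
  size: number; // size of the default download (tab-delimited for ingested tabular files)
  originalSize?: number; // size of the saved original, present only for ingested tabular files
}

// Storage-style total: counts the converted file AND its saved original.
function storageSize(files: FileSizes[]): number {
  return files.reduce((total, f) => total + f.size + (f.originalSize ?? 0), 0);
}

// Download-style total: counts either the original or the converted form, never both.
function downloadSize(files: FileSizes[], original: boolean): number {
  return files.reduce(
    (total, f) =>
      total + (original && f.originalSize !== undefined ? f.originalSize : f.size),
    0,
  );
}

const files: FileSizes[] = [
  { size: 100, originalSize: 250 }, // ingested tabular file
  { size: 80 }, // ordinary file
];

console.log(storageSize(files)); // 430
console.log(downloadSize(files, false)); // 180
console.log(downloadSize(files, true)); // 330
```

With these made-up sizes, the storage-style total exceeds either download-style total because it counts the tabular file twice.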

@GPortas
Contributor Author

GPortas commented Sep 27, 2023

Thank you for your detailed explanation @landreev.

I like the approach of using DatasetUtil.getDownloadSizeNumeric, all the more so because the current dataset page uses that method, and it simplifies the logic compared to the GetDatasetStorageSizeCommand extension that you saw in my draft PR.

In any case, I think we should still use GetDatasetStorageSizeCommand when we want to include the original sizes for API backwards compatibility (maybe not so important here, but it is easy to implement). The reason is that GetDatasetStorageSizeCommand and DatasetUtil.getDownloadSizeNumeric calculate the size differently when including original tabular sizes.
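
The selection logic described in this paragraph could look roughly like the sketch below. The two calculations are passed in as plain functions that stand in for the Java code named in the thread. The `includeOriginals` flag and the function names are hypothetical, and this does not show how the linked PR implements it.

```typescript
// A size calculation supplied by the caller; stands in for server-side Java code.
type SizeSource = () => number;

// Hypothetical dispatcher: keep the existing calculation when originals are
// included (backwards compatibility), otherwise use the download-size method.
function resolveDownloadSize(
  includeOriginals: boolean,
  storageSizeCommand: SizeSource, // stands in for GetDatasetStorageSizeCommand
  downloadSizeNumeric: SizeSource, // stands in for DatasetUtil.getDownloadSizeNumeric(dsv, false)
): number {
  return includeOriginals ? storageSizeCommand() : downloadSizeNumeric();
}

// Made-up totals, purely for illustration.
const withOriginals = resolveDownloadSize(true, () => 430, () => 180);
const withoutOriginals = resolveDownloadSize(false, () => 430, () => 180);

console.log(withOriginals); // 430
console.log(withoutOriginals); // 180
```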

I have updated the PR following this approach. f653c21

GPortas added a commit that referenced this issue Sep 27, 2023
GPortas added a commit that referenced this issue Oct 2, 2023
@pdurbin pdurbin added this to the 6.1 milestone Oct 13, 2023
@pdurbin
Member

pdurbin commented Dec 11, 2023

I'm not sure why this issue wasn't automatically closed when the following PR was merged but I'm closing it now:

@pdurbin pdurbin closed this as completed Dec 11, 2023

3 participants