API call for total size of specified dataverse #5848

Closed · CCMumma opened this issue May 15, 2019 · 9 comments · Fixed by #5941
CCMumma commented May 15, 2019

We need an API call that does not currently exist: one that reports the total size of a specified dataverse.

Our use case: the Texas Data Repository hosts multiple institutional dataverses, and I need a simple way to determine the size of all of the content, published and unpublished, in each institution's entire dataverse.

pdurbin (Member) commented May 15, 2019

@CCMumma thanks for creating this issue. The following issues are related:

djbrooke (Contributor) commented

@CCMumma - since this concerns storage space, are you only interested in files, or in metadata as well?

djbrooke removed their assignment May 22, 2019
CCMumma (Author) commented Jun 10, 2019

We are more interested in content storage, but including metadata, or having a separate call for metadata, would also be valuable.

djbrooke (Contributor) commented

Thanks @CCMumma!

landreev (Contributor) commented Jun 12, 2019

@CCMumma Hi, I've put together a new API call for reporting the total file storage size. I just wanted to run it by you before moving it along.

The API outputs the sum of the sizes of all files uploaded by users, published and unpublished. For tabular data files it counts both the size of the file the user uploaded and the size of the archival version (the tab-delimited file we generate on ingest). These files are what we consider the real "payload", the archival content of the dataverse. The call counts the files in the specified dataverse and, recursively, in all of its sub-dataverses.

However, if you go to the filesystem where the datasets are stored and add up the sizes of all the files found there, you'll end up with a larger byte count. That's because we also cache some extra files generated as the datasets are served: resized thumbnail copies of image files, metadata exports for published datasets, etc. The logic behind not counting these files is that they are generated on top of the archival content; they can be erased, and the system will regenerate them automatically.

Is this ok for your purposes?
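For illustration, a minimal sketch of querying such an endpoint from Python. The path `/api/dataverses/{id}/storagesize` and the `X-Dataverse-key` header are assumptions (the thread never names the final URL; they follow the general pattern of the Dataverse Native API), and the server, token, and dataverse alias below are placeholders.

```python
# Sketch: ask a Dataverse server for the total (archival) storage size of
# a dataverse. The endpoint path and auth header are assumptions, not taken
# from this thread; SERVER_URL, API_TOKEN, and DATAVERSE_ID are placeholders.
import json
import urllib.request

SERVER_URL = "https://demo.dataverse.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
DATAVERSE_ID = "root"  # alias or numeric id of the dataverse

req = urllib.request.Request(
    f"{SERVER_URL}/api/dataverses/{DATAVERSE_ID}/storagesize",
    headers={"X-Dataverse-key": API_TOKEN},
)
with urllib.request.urlopen(req) as resp:
    # Prints the JSON response envelope returned by the server.
    print(json.load(resp))
```

Per the semantics described above, the reported number covers the archival payload only: user-uploaded files plus generated archival versions of tabular files, recursively over sub-dataverses.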

CCMumma (Author) commented Jun 12, 2019

Thank you so much - that's excellent work. The total size of the 'archival payload' would be a good start, and it's wise to include both published and unpublished content in the number.

The 'total storage used' by a dataverse (including generated files and metadata) would also be valuable for installations like ours in Texas, where we're trying to build a service model that charges fees for storage used above a set maximum per institutional dataverse.

landreev added a commit that referenced this issue Jun 12, 2019
landreev (Contributor) commented
@CCMumma Thank you.
As implemented, the API reports the total size without the cached files by default, but will include everything, exports and all, if an optional includeCached argument is supplied.
I wasn't sure whether that optional mode was needed or useful at all, but it sounds like it could be useful to you, so I am including and documenting it.
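Extending the sketch above, the optional mode might be invoked with a query parameter; the exact spelling `includeCached=true` is an assumption based on the argument name mentioned in the comment.

```python
# Sketch: same hypothetical endpoint, adding the optional includeCached
# argument (query-parameter spelling assumed) so cached thumbnails,
# metadata exports, etc. are counted on top of the archival payload.
import json
import urllib.request

SERVER_URL = "https://demo.dataverse.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
DATAVERSE_ID = "root"

req = urllib.request.Request(
    f"{SERVER_URL}/api/dataverses/{DATAVERSE_ID}/storagesize"
    "?includeCached=true",
    headers={"X-Dataverse-key": API_TOKEN},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # total should now be >= the default-mode total
```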

landreev removed their assignment Jun 12, 2019
CCMumma (Author) commented Jun 12, 2019

That is fantastic news. Thank you for your work.

landreev added a commit that referenced this issue Jun 13, 2019
pdurbin added this to the 4.15 milestone Jun 14, 2019