API call for total size of specified dataverse #5848

Closed · CCMumma opened this issue May 15, 2019 · 9 comments · Fixed by #5941
CCMumma commented May 15, 2019

We need an API call that does not currently exist: one that reports the total size of a specified dataverse.

Our use case: the Texas Data Repository hosts multiple institutional dataverses, and I need a simple way to determine the size of all of the content, published and unpublished, in each institution's entire dataverse.

pdurbin (Member) commented May 15, 2019

@CCMumma thanks for creating this issue. The following issues are related:

djbrooke (Contributor) commented

@CCMumma - since this concerns storage space, are you only interested in files, or in metadata as well?

djbrooke removed their assignment May 22, 2019
CCMumma (Author) commented Jun 10, 2019

We are more interested in content storage, but including metadata, or having a separate call for metadata, would also be valuable.

djbrooke (Contributor) commented

Thanks @CCMumma!

landreev (Contributor) commented Jun 12, 2019

@CCMumma Hi, I've put together a new API call for reporting the total file storage size. I just wanted to run it by you before moving it along.

The API outputs the sum of the sizes of all files uploaded by users, published and unpublished. For tabular data files it counts both the size of the file the user uploaded and the size of the archival version (the tab-delimited file we generate on ingest). These files are what we consider the real "payload", the archival content of the dataverse. The call counts the files in the specified dataverse and, recursively, in all of its sub-dataverses.

However, if you go to the filesystem where the datasets are stored and add up the sizes of all the files found there, you'll end up with a larger byte count. That's because we also cache some extra files generated as the datasets are served: resized thumbnail copies of image files, metadata exports for published datasets, etc. The logic behind not counting these files is that they are generated on top of the archival content; they can be erased, and the system will regenerate them automatically.

Is this ok for your purposes?
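For illustration, a minimal sketch of querying such an endpoint from Python. The path `/api/dataverses/{id}/storagesize` and the `X-Dataverse-key` header are assumptions (the thread never names the final URL; they follow the general pattern of the Dataverse Native API), and the server, token, and dataverse alias below are placeholders.

```python
# Sketch: ask a Dataverse server for the total (archival) storage size of
# a dataverse. The endpoint path and auth header are assumptions, not taken
# from this thread; SERVER_URL, API_TOKEN, and DATAVERSE_ID are placeholders.
import json
import urllib.request

SERVER_URL = "https://demo.dataverse.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
DATAVERSE_ID = "root"  # alias or numeric id of the dataverse

req = urllib.request.Request(
    f"{SERVER_URL}/api/dataverses/{DATAVERSE_ID}/storagesize",
    headers={"X-Dataverse-key": API_TOKEN},
)
with urllib.request.urlopen(req) as resp:
    # Prints the JSON response envelope returned by the server.
    print(json.load(resp))
```

Per the semantics described above, the reported number covers the archival payload only: user-uploaded files plus generated archival versions of tabular files, recursively over sub-dataverses.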

CCMumma (Author) commented Jun 12, 2019

Thank you so much - that's excellent work. The total size of the 'archival payload' would be a good start, and it's wise to include both published and unpublished content in the number.

The 'total storage used' by a dataverse (including generated files and metadata) would also be valuable for installations like ours in Texas, where we're trying to build a service model that charges fees for storage used above a set maximum per institutional dataverse.

landreev added a commit that referenced this issue Jun 12, 2019
landreev (Contributor) commented
@CCMumma Thank you.
As implemented, the API reports the total size without the cached files by default, but will include everything, exports and all, if an optional includeCached argument is supplied.
I wasn't sure whether that optional mode was needed or useful at all, but it sounds like it could be useful to you, so I am including and documenting it.
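Extending the sketch above, the optional mode might be invoked with a query parameter; the exact spelling `includeCached=true` is an assumption based on the argument name mentioned in the comment.

```python
# Sketch: same hypothetical endpoint, adding the optional includeCached
# argument (query-parameter spelling assumed) so cached thumbnails,
# metadata exports, etc. are counted on top of the archival payload.
import json
import urllib.request

SERVER_URL = "https://demo.dataverse.org"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
DATAVERSE_ID = "root"

req = urllib.request.Request(
    f"{SERVER_URL}/api/dataverses/{DATAVERSE_ID}/storagesize"
    "?includeCached=true",
    headers={"X-Dataverse-key": API_TOKEN},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))  # total should now be >= the default-mode total
```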

landreev removed their assignment Jun 12, 2019
CCMumma (Author) commented Jun 12, 2019

That is fantastic news. Thank you for your work.

landreev added a commit that referenced this issue Jun 13, 2019
pdurbin added this to the 4.15 milestone Jun 14, 2019