Enable collection storage quota monitoring in HDV #240

Closed
landreev opened this issue Jan 8, 2024 · 6 comments
Labels
GREI 5 Use cases NIH GREI General work related to any of the NIH GREI aims Size: 10 A percentage of a sprint.

Comments

@landreev
Collaborator

landreev commented Jan 8, 2024

Once #239 is done (6.1 deployed in prod.), we will need to enable quota limits in HDV for all the existing collections, establish the process for setting limits on all new collections going forward, and set up automatic monitoring. More details/policies for this are outlined in the curation repo issue: https://github.com/IQSS/dataverse-HDV-Curation/issues/344#issuecomment-1881648186

@cmbz
Collaborator

cmbz commented Jan 16, 2024

2024/01/16: Re. sizing of this issue...as per Slack comment, @landreev will get feedback from HDV curation team to determine eventual size.

@cmbz
Collaborator

cmbz commented Jan 17, 2024

2024/01/17: Also waiting on resolution of: IQSS/dataverse#10220

@landreev landreev added the Size: 33 A percentage of a sprint. label Feb 12, 2024
@cmbz cmbz moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project Feb 12, 2024
@landreev landreev moved this from SPRINT READY to This Sprint 🏃‍♀️ 🏃 in IQSS Dataverse Project Mar 27, 2024
@landreev landreev moved this from This Sprint 🏃‍♀️ 🏃 to In Progress 💻 in IQSS Dataverse Project Mar 27, 2024
@landreev landreev self-assigned this Mar 27, 2024
@landreev
Collaborator Author

(I asked in the linked sister issue IQSS/dataverse-HDV-Curation#344 about the specifics of differentiating between Harvard vs. non-Harvard affiliates; aside from that, it's mostly ready to go)

@cmbz cmbz added Size: 10 A percentage of a sprint. and removed Size: 33 A percentage of a sprint. labels Apr 10, 2024
@landreev
Collaborator Author

Talked to Sonia directly; she replied in the linked curation issue, so I'm moving forward on this.

@landreev landreev changed the title from "Enable collection quotas in HDV" to "Enable collection storage quota monitoring in HDV" Jul 29, 2024
@landreev
Collaborator Author

landreev commented Jul 29, 2024

@sbarbosadataverse @jggautier
I have set up regular storage use monitoring in prod., per conversations in IQSS/dataverse-HDV-Curation#344. I have a script that runs on schedule (as a cron job) and checks storage use as follows:

For collections: It separately lists the Harvard- and non-Harvard-affiliated top-level collections that are over their respective size limits. (I will post the report generated last night in the curation issue.) As of now there are 2 Harvard collections that are over the 2.5TB limit, although one of them is OMAMA, so it doesn't count. There are 8 non-Harvard collections that are currently over the limit. I use Julian's database queries for determining which collections to consider Harvard vs. not.
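
The collection check described above could be sketched roughly as follows. This is only an illustration, not the actual monitoring script: the `Collection` record, the `over_limit` function, and the decimal definition of a terabyte are all assumptions, and the non-Harvard limit is left as a parameter because its value is not stated in this issue.

```python
from dataclasses import dataclass

TB = 1000**4  # assumption: decimal terabytes; the real script may use binary units


@dataclass
class Collection:
    alias: str           # hypothetical identifier for a top-level collection
    harvard: bool        # affiliation, as decided by a separate database query
    storage_bytes: int   # total storage used by the collection


def over_limit(collections, harvard_limit, other_limit, exempt=frozenset()):
    """Return (harvard_aliases, non_harvard_aliases) for collections over their limit.

    Collections whose alias is in `exempt` are skipped entirely.
    """
    harvard, other = [], []
    for c in collections:
        if c.alias in exempt:
            continue
        limit = harvard_limit if c.harvard else other_limit
        if c.storage_bytes > limit:
            (harvard if c.harvard else other).append(c.alias)
    return harvard, other
```

A call such as `over_limit(collections, int(2.5 * TB), other_limit, exempt={"omama"})` would apply the 2.5TB Harvard limit and leave out an exempt collection; the alias `"omama"` here is a placeholder, not necessarily the real one.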

Please note that for the purposes of counting total storage use, ALL uploaded files are counted - meaning, it includes the sizes of the unpublished files in draft versions and published files that are no longer in the latest version. This is in contrast to the numbers shown in Julian's "Harvard Dataverse Repository metrics" that only include the files in the latest published version. This is an important distinction. One dramatic example is the layline collection - it shows in the metrics report as only 6+GB, but actually uses close to 2TB of storage (!).
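
To make the distinction concrete, here is a minimal sketch of the two ways of counting. The `StoredFile` record and both function names are hypothetical and do not reflect the actual database schema or the metrics code.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    size_bytes: int
    in_latest_published: bool  # True if the file is part of the latest published version


def total_storage(files):
    """Count every stored file: drafts, superseded files, and current files."""
    return sum(f.size_bytes for f in files)


def latest_published_storage(files):
    """Count only the files that are in the latest published version."""
    return sum(f.size_bytes for f in files if f.in_latest_published)
```

For a collection with many unpublished drafts or replaced files, `total_storage` can be far larger than `latest_published_storage`, which is the kind of gap seen with the layline collection.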

There is a separate listing for datasets that are directly in the top-level root collection. We should keep discussing how to count storage there in IQSS/dataverse-HDV-Curation#344, but this is what I'm doing as of now:
It is easy to check the sizes of individual datasets. However, that would not be enough since it's possible for a user to create multiple datasets in root. So I check the total sizes of datasets per creator and select the users with the totals that are over 1TB. The distinction is actually moot at the moment - there are only 2 users with root-level datasets that are larger than 1TB, and they happen to be the only users for whom this is true when counting all their datasets combined. Both the size totals and all their individual datasets are listed. The report includes the users' email addresses (but more checks will likely be needed to determine if a dataset should be considered Harvard-affiliated). Neither of the current over-the-limit users appears to be from Harvard.
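
The per-creator check for root-level datasets could look something like the sketch below. The input format (tuples of creator email, dataset id, and size) and the function name are assumptions made for illustration; the real script works from database queries.

```python
from collections import defaultdict

TB = 1000**4  # assumption: decimal terabytes


def creators_over_limit(datasets, limit_bytes=1 * TB):
    """Group root-level datasets by creator and keep creators whose total exceeds the limit.

    `datasets` is an iterable of (creator_email, dataset_id, size_bytes) tuples.
    Returns {creator_email: (total_bytes, [(dataset_id, size_bytes), ...])}.
    """
    totals = defaultdict(int)
    per_creator = defaultdict(list)
    for creator, dataset_id, size in datasets:
        totals[creator] += size
        per_creator[creator].append((dataset_id, size))
    return {
        creator: (total, per_creator[creator])
        for creator, total in totals.items()
        if total > limit_bytes
    }
```

Summing per creator catches a user who stays under the limit on every individual dataset but exceeds it across several of them.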

Once again, because of the email addresses and potentially other private information, I don't want to post a report example here, but will add it to the curation issue.

As currently configured, this script runs every night; it generates the report and sends it to the list of configured email addresses, which at the moment is just me. There's probably no need to run it this often, so maybe it should be once a week instead (?). Let me know if you want to receive a copy of this report, and how often.
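
As a rough illustration of the reporting step, a report could be packaged for a configurable list of recipients like this. The function name, the subject line, and the addresses are all hypothetical; this is not how the production script is necessarily written.

```python
from email.message import EmailMessage


def build_report_email(report_text, recipients, sender, subject="Storage use report"):
    """Build (but do not send) an email carrying the report for all recipients."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = subject
    msg.set_content(report_text)
    return msg
```

The resulting message could then be handed to `smtplib.SMTP.send_message`, and adding a recipient only means extending the configured list.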

Let's continue the discussion in the curation issue.

@landreev landreev moved this from In Progress 💻 to In Review 🔎 in IQSS Dataverse Project Jul 29, 2024
@cmbz cmbz added the GREI 5 Use cases label Jul 29, 2024
@landreev
Collaborator Author

I documented an example of the report, as generated in prod. on Aug. 1, in IQSS/dataverse-HDV-Curation#344.
I'm going to close this issue. We will continue refining the monitoring and enforcing storage use in the other open issues.

Projects
Status: Done 🧹
Development

No branches or pull requests

3 participants