
MARXAN-1616-scheduled-geodata-cleanup #1164

Merged
merged 7 commits into develop from MARXAN-1616-scheduled-geodata-cleanup on Jul 12, 2022

Conversation

rubvalave (Contributor) commented:

- Adds a Cron job (once per month) that will try to find any non-matching data for projects and scenarios and nuke it (a rough sketch of the overall shape is below).
- It uses the same Project/Scenario Cleanup process that @angelhigueraacid introduced recently.
- It does not have tests for now (the Cron job, set to a 50-second interval while testing, appears to do something, but maybe it is a good idea to test it in an unused environment?).
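A minimal sketch of the shape described above - the service, interface, and method names here are illustrative assumptions, not the actual identifiers from this PR, and the DI wiring for the cleanup dependencies is omitted:

```typescript
import { Injectable } from '@nestjs/common';
import { Cron, CronExpression } from '@nestjs/schedule';

// Hypothetical stand-in for the existing Project/Scenario Cleanup services
// mentioned above; the real services in the codebase may look different.
interface DanglingDataCleanup {
  removeDanglingData(): Promise<void>;
}

@Injectable()
export class ScheduledGeodataCleanupService {
  constructor(
    private readonly projectsCleanup: DanglingDataCleanup,
    private readonly scenariosCleanup: DanglingDataCleanup,
  ) {}

  // Runs on the first day of each month at midnight, per the PR description.
  @Cron(CronExpression.EVERY_1ST_DAY_OF_MONTH_AT_MIDNIGHT)
  async cleanupDanglingGeoData(): Promise<void> {
    await this.projectsCleanup.removeDanglingData();
    await this.scenariosCleanup.removeDanglingData();
  }
}
```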


hotzevzl (Member) left a comment:

thanks for this - and even more, your patience in navigating uncharted territory and the overall madness of my plans.

I am not sure about the approach you chose - see my inline notes: I understand your point (reusing existing logic), but I also find it tempting to do this in much less code.

You've been thinking this through in way more detail than I have, though, so you may have reservations about this.

```typescript
);
const missingScenarioIdsFromScenariosPuData: entitiesWithScenarioId[] = await this.geoEntityManager.query(
  `SELECT spd.scenario_id
   FROM scenarios_pu_data spd
```
hotzevzl (Member) commented on this diff:

I think we should not need this, as db cascades should take care of deleting rows from this table when we delete the related rows from projects_pu
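For illustration, a cascade of this kind would look roughly like the following - the constraint and column names are assumptions for illustration, not copied from the actual schema:

```typescript
import { MigrationInterface, QueryRunner } from 'typeorm';

// Sketch only: with a foreign key like this in place, deleting a row from
// projects_pu automatically removes the dependent scenarios_pu_data rows,
// with no explicit cleanup query needed.
export class AddScenariosPuDataCascade1657000000000
  implements MigrationInterface {
  public async up(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`
      ALTER TABLE scenarios_pu_data
        ADD CONSTRAINT scenarios_pu_data_project_pu_id_fkey
        FOREIGN KEY (project_pu_id) REFERENCES projects_pu (id)
        ON DELETE CASCADE;
    `);
  }

  public async down(queryRunner: QueryRunner): Promise<void> {
    await queryRunner.query(`
      ALTER TABLE scenarios_pu_data
        DROP CONSTRAINT scenarios_pu_data_project_pu_id_fkey;
    `);
  }
}
```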

rubvalave (Contributor, Author) replied:

But all this cleanup is done just in case something gets wonky and the on-delete cascades don't delete anything properly, right? I mean, the other tables I referenced here also have on-delete cascades and I am cleaning them up anyway, right?

hotzevzl (Member) replied:

in general, we should trust PostgreSQL on cascades - if there are other entities that will be deleted via cascades and we're explicitly deleting them here, then we should avoid doing this (it may create more trouble than it tries to solve, especially if done outside of a db transaction). I checked the entities we are deleting in your original implementation last week and IIRC this was the only one that was already covered by cascades.

hotzevzl (Member) added:

tl;dr: basically this task is about manually triggering deletes that would be left to SQL cascades if all the data were in a single db - but where we can rely on db cascades within the geodb, we should let PostgreSQL work for us 😃

```typescript
  });
}

@Cron(CronExpression.EVERY_1ST_DAY_OF_MONTH_AT_MIDNIGHT)
```
hotzevzl (Member) commented on this line:

it's very much ok to hardcode the interval here, though I would still advise moving this to a config item, so that:

  • it's easier to find settings that admins may want to tweak in the future
  • it can be overridden per-environment, in case admins so wish (prod vs staging vs testing, etc - the testing one may be useful in fact)

In order to do so we'd likely need to use the textual cron syntax, but I'd say that's ok (see the sketch below).
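As a rough sketch of what that could look like - the env var name and the daily default are assumptions, not something agreed in this thread:

```typescript
import { Injectable } from '@nestjs/common';
import { Cron } from '@nestjs/schedule';

// Hypothetical config item: the decorator argument is evaluated once, at
// class definition time, so a value read from the environment (or any
// config loader) at module load works with the textual cron syntax.
const CLEANUP_CRON_EXPRESSION =
  process.env.CLEANUP_CRON_EXPRESSION ?? '0 0 * * *'; // default: daily at midnight

@Injectable()
export class CleanupCronService {
  @Cron(CLEANUP_CRON_EXPRESSION)
  async run(): Promise<void> {
    // ...delegate to the actual cleanup logic here
  }
}
```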

And by all means, I would default to an interval of at most one day - that is, running at least once a day, but ideally every six hours or so.

Not because I expect this cleanup task to find so many dangling objects so frequently - these should all be exceptions - but because of restarts of the Node process(es) (new versions being deployed, restarts after failures, etc.). If a restart happens close to the moment the scheduled cron event becomes due in the event loop, we may risk skipping it and waiting until the next scheduled occurrence - so a monthly occurrence may be problematic.

On the other hand, almost all the tables we scan for dangling items have an index on project_id or scenario_id (as relevant), except blm_<partial|final>_results. I don't think such an index is going to be used in a negative constraint (not in ...), though, so it may still make sense to use the set difference query you used in your implementation, but then use the resulting set to delete ... where <scenario|project>_id in (<the set of scenario_ids not belonging to any current scenario>) - sketched below. This would need some digging into query planning at scale... I'd not be too concerned at this stage 😏
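A minimal sketch of that two-step pattern, assuming validScenarioIds has already been fetched from the apidb; the table name is illustrative of the pattern, not a claim about which tables need this:

```typescript
import { EntityManager } from 'typeorm';

// Step 1: compute the set of dangling scenario ids with a set-difference
// query; step 2: delete by explicit id list, so an index on scenario_id
// (where one exists) can be used for the delete itself.
export async function deleteDanglingRows(
  geoEntityManager: EntityManager,
  validScenarioIds: string[],
): Promise<void> {
  // Guard: with an empty valid set, `<> ALL` would match every row.
  if (validScenarioIds.length === 0) return;

  const dangling: { scenario_id: string }[] = await geoEntityManager.query(
    `SELECT DISTINCT scenario_id FROM scenarios_pu_data
      WHERE scenario_id <> ALL($1::uuid[])`,
    [validScenarioIds],
  );
  const danglingIds = dangling.map((row) => row.scenario_id);
  if (danglingIds.length > 0) {
    await geoEntityManager.query(
      `DELETE FROM scenarios_pu_data WHERE scenario_id = ANY($1::uuid[])`,
      [danglingIds],
    );
  }
}
```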

(A further review thread on ENV_VARS.md was marked as resolved.)
@rubvalave force-pushed the MARXAN-1616-scheduled-geodata-cleanup branch from b9025dc to 0717105 on July 11, 2022 at 13:12
@rubvalave merged commit 8f553c2 into develop on July 12, 2022
@rubvalave deleted the MARXAN-1616-scheduled-geodata-cleanup branch on July 12, 2022 at 10:09