-
Notifications
You must be signed in to change notification settings - Fork 333
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add db retention support #2486
Add db retention support #2486
Conversation
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Codecov Report
@@ Coverage Diff @@
## main #2486 +/- ##
============================================
- Coverage 83.86% 83.09% -0.77%
- Complexity 1245 1274 +29
============================================
Files 238 241 +3
Lines 5657 5910 +253
Branches 271 281 +10
============================================
+ Hits 4744 4911 +167
- Misses 769 852 +83
- Partials 144 147 +3
... and 7 files with indirect coverage changes 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
I love the change very much. However, is there a reason why only 3 tables are included? My understanding is that those are consuming the majority of space. |
@JDarDagran, that's a very good point that I didn't (but should have) elaborate on. Due to the migration |
Signed-off-by: wslulciuc <willy@datakin.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very useful feature, @wslulciuc. It should help us clean unnecessary old data!
Would it be worth having a dry-run version of the command as well? If you think it is a good idea, I'm happy for it to be a follow-up PR (even though I'd be anxious to run a delete command without knowing at least how many rows may be deleted).
Where does the project documentation currently live? Is it part of the repo?
Looking forward to seeing some tests as well!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is great to finally have this feature. I agree that we should probably be backwards compatible and not have it on by default.
Or possibly only for new databases? I don't know if we can do that.
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
…VersionAsInputForRun()` Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the retention tests and scenarios they cover and verify 🚀 👍 🏅
The one thing that could be tested more is:
- loading event from
event_full
- applying retention
- verifying if other tables get cleared on cascade while not throwing an error.
Leaving @wslulciuc the decision if this is worth the effort and approving PR.
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Thanks you @julienledem, @pawel-big-lebowski, @tatiana and @JDarDagran for all of your feedback. I've added a detailed descripiton of the retention logic, support for dry runs, and extensive testing (@pawel-big-lebowski I'll follow up with a PR with your suggestions as this PR is getting way to big and the core logic has been covered in the current test suite). Let's merge! 💯 |
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
* Add db migration to add cascade deletion on `fk`s Signed-off-by: wslulciuc <willy@datakin.com> * Add `DbDataRetention` and `dataRetentionInDays` config Signed-off-by: wslulciuc <willy@datakin.com> * Add `DbRetentionJob` Signed-off-by: wslulciuc <willy@datakin.com> * Add `DbRetentionCommand` Signed-off-by: wslulciuc <willy@datakin.com> * Add `frequencyMins` config for runs and rename `dbRetentionInDays` Signed-off-by: wslulciuc <willy@datakin.com> * Add docs to `DbRetentionJob` and minor renaming Signed-off-by: wslulciuc <willy@datakin.com> * Wrap `DbRetention.retentionOnDbOrError()` in `try/catch` Signed-off-by: wslulciuc <willy@datakin.com> * Add docs to DbRetention Signed-off-by: wslulciuc <willy@datakin.com> * continued: Add docs to `DbRetention` Signed-off-by: wslulciuc <willy@datakin.com> * Add handling of `errorOnDbRetention` Signed-off-by: wslulciuc <willy@datakin.com> * Add docs to `DbException` and `DbRetentionException` Signed-off-by: wslulciuc <willy@datakin.com> * `info` -> `debug` when inserting column lineage Signed-off-by: wslulciuc <willy@datakin.com> * Remove `dbRetention.enabled` Signed-off-by: wslulciuc <willy@datakin.com> * Update handling of `StatementException` Signed-off-by: wslulciuc <willy@datakin.com> * Minor changes Signed-off-by: wslulciuc <willy@datakin.com> * Add `docs/faq.md` Signed-off-by: wslulciuc <willy@datakin.com> * continued: `Add docs/faq.md` Signed-off-by: wslulciuc <willy@datakin.com> * continued: Add `docs/faq.md` Signed-off-by: wslulciuc <willy@datakin.com> * continued: Add `docs/faq.md` Signed-off-by: wslulciuc <willy@datakin.com> * Define `DEFAULT_RETENTION_DAYS` constant in `DbRetention` Signed-off-by: wslulciuc <willy@datakin.com> * Make chunk size in retention query configurable Signed-off-by: wslulciuc <willy@datakin.com> * Remove `DATA_RETENTION_IN_DAYS` from `MarquezConfig` Signed-off-by: wslulciuc <willy@datakin.com> * Update docs for chunk size config Signed-off-by: wslulciuc <willy@datakin.com> * Remove error log from `DbRetention.retentionOnDbOrError()` Signed-off-by: wslulciuc <willy@datakin.com> * Use `LOOP` for retention Signed-off-by: wslulciuc <willy@datakin.com> * continued: Use `LOOP` for retention Signed-off-by: wslulciuc <willy@datakin.com> * Use `numberOfRowsPerBatch` Signed-off-by: wslulciuc <willy@datakin.com> * Use `--number-of-rows-per-batch` Signed-off-by: wslulciuc <willy@datakin.com> * Add pause to prevent lock timeouts Signed-off-by: wslulciuc <willy@datakin.com> * Add `FOR UPDATE SKIP LOCKED` Signed-off-by: wslulciuc <willy@datakin.com> * Add `sql()` Signed-off-by: wslulciuc <willy@datakin.com> * Add `--dry-run` Signed-off-by: wslulciuc <willy@datakin.com> * Add `jdbi3-testcontainers` Signed-off-by: wslulciuc <willy@datakin.com> * Remove shortened flag args Signed-off-by: wslulciuc <willy@datakin.com> * Use `marquez.db.DbRetention.DEFAULT_DRY_RUN` Signed-off-by: wslulciuc <willy@datakin.com> * Add DbRetention.retentionOnRuns() Signed-off-by: wslulciuc <willy@datakin.com> * Add `DbMigration.migrateDbOrError(DataSource)` Signed-off-by: wslulciuc <willy@datakin.com> * Add `TestingDb` Signed-off-by: wslulciuc <willy@datakin.com> * Add `DbTest` Signed-off-by: wslulciuc <willy@datakin.com> * Add `testRetentionOnDbOrError_withDatasetsOlderThanXDays()` Signed-off-by: wslulciuc <willy@datakin.com> * Remove `jobs.DbRetentionConfig.dryRun` Signed-off-by: wslulciuc <willy@datakin.com> * Add `--dry-run` option to `faq.md` Signed-off-by: wslulciuc <willy@datakin.com> * continued: Add --dry-run option to faq.md Signed-off-by: wslulciuc <willy@datakin.com> * continued: `Add testRetentionOnDbOrError_withDatasetsOlderThanXDays` Signed-off-by: wslulciuc <willy@datakin.com> * Fix retention query for datasets and dataset versions Signed-off-by: wslulciuc <willy@datakin.com> * Add test for retention on dataset versions Signed-off-by: wslulciuc <willy@datakin.com> * Add comments to tests Signed-off-by: wslulciuc <willy@datakin.com> * Add `testRetentionOnDbOrErrorWithDatasetVersionsOlderThanXDays_skipIfVersionAsInputForRun()` Signed-off-by: wslulciuc <willy@datakin.com> * Add `testRetentionOnDbOrErrorWithJobsOlderThanXDays()` Signed-off-by: wslulciuc <willy@datakin.com> * Add `testRetentionOnDbOrErrorWithJobVersionsOlderThanXDays()` Signed-off-by: wslulciuc <willy@datakin.com> * Add tests for dry run Signed-off-by: wslulciuc <willy@datakin.com> * Add testRetentionOnDbOrErrorWithRunsOlderThanXDays() Signed-off-by: wslulciuc <willy@datakin.com> * Add `testRetentionOnDbOrErrorWithOlEventsOlderThanXDays()` Signed-off-by: wslulciuc <willy@datakin.com> * continued: `Add testRetentionOnDbOrErrorWithOlEventsOlderThanXDays()` Signed-off-by: wslulciuc <willy@datakin.com> * Add `javadocs` to `DbRetention` Signed-off-by: wslulciuc <willy@datakin.com> * Run tests in order of retention Signed-off-by: wslulciuc <willy@datakin.com> --------- Signed-off-by: wslulciuc <willy@datakin.com> Co-authored-by: Harel Shein <harel.shein@astronomer.io>
Problem
Marquez does not have a configurable data retention policy. Therefore, users are left to periodically increase the storage size of their PostgreSQL instance as disk space fills up. Below, you'll find an example graph for the
FreeStorageSpace
metric in RDS (from a real production deployment of Marquez in the wild). For every increase in storage, we observe a spike in freeable storage (I've lost count on how many times I've personally have done this). Closes: #1271Solution
To simplify the retention policy query executed by calling
DbRetention.retentionOnDbOrError()
, we introduced migrationV63__alter_tables_add_on_cascade_delete.sql
that updates all foreign key constraints to addON DELETE CASCADE
; therefore, deleting all derived metadata (=child tables) from tablesdatasets
,jobs
,runs
, andlineage_events
when the retention policy is applied. The migration has the added advantage of defining clear relationships between metadata at the database level, as well as, self documenting the order of deletion. Below, we outlined performance considerations on the sql used for the retention policy, how it's applied and options for configuration.Retention Policy Performance Considerations
In most cases, for small to medium-size metadata (<
50GB
) stored in Marquez, the retention policy could be scheduled to be applied in a maintenance window, in a single transaction. But, as we evaluated the metadata stored in Marquez at Astronomer (>500GB
), we quickly realized that we needed to rethink this approach. Below, we outline the general sql structure applied during retention (per metadata object), and the configurable parameters enabling users to best tune the policy specific to their needs.Applying Retention Policy
When invoking
DbRetention.retentionOnDbOrError()
, retention is applied in the following order (in an order of precedence from left-to-right):jobs
table ifjobs.updated_at
older thanretentionDays
.job_versions
table ifjob_versions.updated_at
older thanretentionDays
; a job version will not be deleted if the job version is the current version of a given job.runs
table ifruns.updated_at
older thanretentionDays
; a run will not be deleted if the run is the current run of a given job version.datasets
table ifdatasets.updated_at
older thanretentionDays
; a dataset will not be deleted if the dataset is an input / output of a given job version.dataset_versions
table ifdataset_versions.created_at
older thanretentionDays
; a dataset version will not be deleted if the dataset version is the current version of a given dataset version, or the input of a run.lineage_events
table iflineage_events.event_time
older thanretentionDays
.Configuring Retention Policy
We propose Marquez support the following ways to config retention policies on metadata:
OPTION
1
: Config-based viamarquez.yml
In
marquez.yml
, define thedbRetention
configuration to enable the jobDbRetentionJob
to periodically apply a retention policy to the database on an interval. The default retention policy is7
days, but this can be adjusted by overridingretentionDays
. You can also override the run interval forDbRetentionJob
by usingfrequencyMins
.To enabled a retention policy, define the
dbRetention
config in yourmarquez.yml
:To disable retention policy, omit the
dbRetention
config from yourmarquez.yml
:OPTION
2
: CLITo run an ad-hoc retention policy on your metadata, use the
db-retention
cmd:Improvements
Checklist
CHANGELOG.md
(Depending on the change, this may not be necessary)..sql
database schema migration according to Flyway's naming convention (if relevant)