Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add db retention support #2486

Merged
merged 60 commits into from
Jul 18, 2023
Merged

Add db retention support #2486

merged 60 commits into from
Jul 18, 2023

Conversation

wslulciuc
Copy link
Member

@wslulciuc wslulciuc commented May 11, 2023

Problem

Marquez does not have a configurable data retention policy. Therefore, users are left to periodically increase the storage size of their PostgreSQL instance as disk space fills up. Below, you'll find an example graph for the FreeStorageSpace metric in RDS (from a real production deployment of Marquez in the wild). For every increase in storage, we observe a spike in freeable storage (I've lost count on how many times I've personally have done this). Closes: #1271

Screen Shot 2023-05-15 at 9 37 01 AM

Solution

To simplify the retention policy query executed by calling DbRetention.retentionOnDbOrError(), we introduced migration V63__alter_tables_add_on_cascade_delete.sql that updates all foreign key constraints to add ON DELETE CASCADE; therefore, deleting all derived metadata (=child tables) from tables datasets, jobs, runs, and lineage_events when the retention policy is applied. The migration has the added advantage of defining clear relationships between metadata at the database level, as well as, self documenting the order of deletion. Below, we outlined performance considerations on the sql used for the retention policy, how it's applied and options for configuration.

Retention Policy Performance Considerations

In most cases, for small to medium-size metadata (< 50GB) stored in Marquez, the retention policy could be scheduled to be applied in a maintenance window, in a single transaction. But, as we evaluated the metadata stored in Marquez at Astronomer (> 500GB), we quickly realized that we needed to rethink this approach. Below, we outline the general sql structure applied during retention (per metadata object), and the configurable parameters enabling users to best tune the policy specific to their needs.

// We define a deletion function per metadata object that is called, per metadata object,
// in DbRetention.retentionOnDbOrError(); the deletion function has a return type of INT
// representing the number of rows deleted.
CREATE OR REPLACE FUNCTION delete_<METADATA-OBJECT>_older_than_x_days()
  RETURNS INT AS $$
DECLARE
  // To improve deletion performance (i.e. avoid deleting all rows at once), we delete rows
  // in batches; this divides the deletion process into smaller chunks; we made the
  // number of rows to delete per batch configurable.
  rows_per_batch INT := [NUMBER_OF_ROWS_PER_BATCH]
  rows_deleted INT;
  rows_deleted_total INT := 0;
BEGIN
  // [CREATE-TEMP-TABLES-If-NEEDED]
  //
  // We divide row deletion into smaller chunks by using a loop and limiting batch size via
  // the configurable 'rows_per_batch' parameter per iteration; this approach limits resource
  // impact and maintains overall database performance; the loop continues until all rows are deleted.
  LOOP
    WITH deleted_rows AS (
      DELETE FROM <METADATA-OBJECT>
        WHERE uuid IN (
          SELECT uuid
            FROM <METADATA-OBJECT>
           WHERE updated_at < CURRENT_TIMESTAMP - INTERVAL '<RETENTION-DAYS> days'
             FOR UPDATE SKIP LOCKED // Acquire row-level lock and skip any locked rows.
             // Use [TEMP-TABLES-If-NEEDED]
           LIMIT rows_per_batch
        ) RETURNING uuid
    )
    SELECT COUNT(*) INTO rows_deleted FROM deleted_rows;
    rows_deleted_total := rows_deleted_total + rows_deleted;
    // Exit when no more rows are deleted in a given iteration.
    EXIT WHEN rows_deleted = 0;
    // Pause to allow other transactions to run.
    PERFORM pg_sleep(0.1);
  END LOOP;
  RETURN rows_deleted_total;
END;
$$ LANGUAGE plpgsql;

Applying Retention Policy

When invoking DbRetention.retentionOnDbOrError(), retention is applied in the following order (in an order of precedence from left-to-right):

jobs > job versions > runs > datasets > dataset versions > lineage events
  1. Delete jobs from jobs table if jobs.updated_at older than retentionDays.
  2. Delete job versions from job_versions table if job_versions.updated_at older than retentionDays; a job version will not be deleted if the job version is the current version of a given job.
  3. Delete runs from runs table if runs.updated_at older than retentionDays; a run will not be deleted if the run is the current run of a given job version.
  4. Delete dataset from datasets table if datasets.updated_at older than retentionDays; a dataset will not be deleted if the dataset is an input / output of a given job version.
  5. Delete dataset versions from dataset_versions table if dataset_versions.created_at older than retentionDays; a dataset version will not be deleted if the dataset version is the current version of a given dataset version, or the input of a run.
  6. Delete lineage events from lineage_events table if lineage_events.event_time older than retentionDays.

Configuring Retention Policy

We propose Marquez support the following ways to config retention policies on metadata:

OPTION 1: Config-based via marquez.yml

In marquez.yml, define the dbRetention configuration to enable the job DbRetentionJob to periodically apply a retention policy to the database on an interval. The default retention policy is 7 days, but this can be adjusted by overriding retentionDays. You can also override the run interval for DbRetentionJob by using frequencyMins.

To enabled a retention policy, define the dbRetention config in your marquez.yml:

# Adjusts retention policy
dbRetention:
  # Apply retention policy at a frequency of every 'X' minutes (default: 15)
  frequencyMins: ${DB_RETENTION_FREQUENCY_MINS:-15}
  # Maximum number of rows deleted per batch (default: 1000)
  numberOfRowsPerBatch:  ${DB_RETENTION_NUMBER_OF_ROWS_PER_BATCH:-1000}
  # Maximum retention days (default: 7)
  retentionDays: ${DB_RETENTION_DAYS:-7}

To disable retention policy, omit the dbRetention config from your marquez.yml:

# Disables retention policy
# dbRetention: ...

OPTION 2: CLI

To run an ad-hoc retention policy on your metadata, use the db-retention cmd:

usage: java -jar marquez-api.jar
       db-retention [--number-of-rows-per-batch NUMBEROFROWSPERBATCH] [--retention-days RETENTIONDAYS] [--dry-run] [-h] [file]

apply one-off ad-hoc retention policy directly to database

positional arguments:
  file                   application configuration file

named arguments:
  --number-of-rows-per-batch NUMBEROFROWSPERBATCH
                         the number of rows deleted per batch (default: 1000)
  --retention-days RETENTIONDAYS
                         the number of days to retain metadata (default: 7)
  --dry-run              only output an estimate of metadata deleted by the retention policy, without applying the policy on database (default: false)
  -h, --help             show this help message and exit
$ java -jar api/build/libs/marquez-api.jar db-retention --retention-days 14 marquez.yml
INFO marquez.db.DbRetention - Applying retention policy of '30' days to jobs...
INFO marquez.db.DbRetention - Deleted '0' jobs in '22' ms!
INFO marquez.db.DbRetention - Applying retention policy of '30' days to job versions...
INFO marquez.db.DbRetention - Deleted '0' job versions in '30' ms!
INFO marquez.db.DbRetention - Applying retention policy of '30' days to runs...
INFO marquez.db.DbRetention - Deleted '4' runs in '124' ms!
INFO marquez.db.DbRetention - Applying retention policy of '30' days to datasets...
INFO marquez.db.DbRetention - Deleted '0' datasets in '22' ms!
INFO marquez.db.DbRetention - Applying retention policy of '30' days to dataset versions...
INFO marquez.db.DbRetention - Deleted '0' dataset versions in '28' ms!
INFO marquez.db.DbRetention - Applying retention policy of '30' days to lineage events...
INFO marquez.db.DbRetention - Deleted '0' lineage events in '15' ms!

Improvements

  • Estimate # number of rows / bytes deleted before running the retention policy as a percentage.

Checklist

  • You've signed-off your work
  • Your changes are accompanied by tests (if relevant)
  • Your change contains a small diff and is self-contained
  • You've updated any relevant documentation (if relevant)
  • You've included a one-line summary of your change for the CHANGELOG.md (Depending on the change, this may not be necessary).
  • You've versioned your .sql database schema migration according to Flyway's naming convention (if relevant)
  • You've included a header in any source code files (if relevant)

wslulciuc added 4 commits May 11, 2023 16:07
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
@boring-cyborg boring-cyborg bot added the api API layer changes label May 11, 2023
@codecov
Copy link

codecov bot commented May 11, 2023

Codecov Report

Merging #2486 (9dd1db3) into main (3d023e4) will decrease coverage by 0.77%.
The diff coverage is 68.95%.

❗ Current head 9dd1db3 differs from pull request most recent head 33d390a. Consider uploading reports for the commit 33d390a to get more accurate results

@@             Coverage Diff              @@
##               main    #2486      +/-   ##
============================================
- Coverage     83.86%   83.09%   -0.77%     
- Complexity     1245     1274      +29     
============================================
  Files           238      241       +3     
  Lines          5657     5910     +253     
  Branches        271      281      +10     
============================================
+ Hits           4744     4911     +167     
- Misses          769      852      +83     
- Partials        144      147       +3     
Impacted Files Coverage Δ
api/src/main/java/marquez/MarquezConfig.java 50.00% <0.00%> (-50.00%) ⬇️
api/src/main/java/marquez/db/Columns.java 79.74% <ø> (ø)
api/src/main/java/marquez/db/RunDao.java 92.40% <ø> (ø)
...c/main/java/marquez/db/exceptions/DbException.java 0.00% <0.00%> (ø)
...va/marquez/db/exceptions/DbRetentionException.java 0.00% <0.00%> (ø)
...pi/src/main/java/marquez/db/models/DatasetRow.java 75.00% <0.00%> (+41.66%) ⬆️
...main/java/marquez/db/models/DatasetVersionRow.java 100.00% <ø> (ø)
...rc/main/java/marquez/db/models/ExtendedRunRow.java 91.66% <ø> (ø)
api/src/main/java/marquez/db/models/JobRow.java 100.00% <ø> (ø)
api/src/main/java/marquez/db/models/RunRow.java 100.00% <ø> (ø)
... and 12 more

... and 7 files with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

wslulciuc added 8 commits May 11, 2023 16:33
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
@wslulciuc wslulciuc marked this pull request as ready for review May 15, 2023 16:52
@wslulciuc wslulciuc added the review Ready for review label May 15, 2023
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
@JDarDagran
Copy link
Contributor

I love the change very much. However, is there a reason why only 3 tables are included? My understanding is that those are consuming the majority of space.

@wslulciuc
Copy link
Member Author

@JDarDagran, that's a very good point that I didn't (but should have) elaborate on. Due to the migration V62__alter_tables_add_on_cascade_delete (introduced as part of this PR), we only need the tables datasets, jobs, and lineage_events. Let me write up a section outlining the usage of this migration file and its importance.

Signed-off-by: wslulciuc <willy@datakin.com>
Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a very useful feature, @wslulciuc. It should help us clean unnecessary old data!

Would it be worth having a dry-run version of the command as well? If you think it is a good idea, I'm happy for it to be a follow-up PR (even though I'd be anxious to run a delete command without knowing at least how many rows may be deleted).

Where does the project documentation currently live? Is it part of the repo?

Looking forward to seeing some tests as well!

Copy link
Member

@julienledem julienledem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is great to finally have this feature. I agree that we should probably be backwards compatible and not have it on by default.
Or possibly only for new databases? I don't know if we can do that.

Signed-off-by: wslulciuc <willy@datakin.com>
@boring-cyborg boring-cyborg bot added the docs label May 16, 2023
wslulciuc added 15 commits July 5, 2023 13:31
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
…VersionAsInputForRun()`

Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
Copy link
Collaborator

@pawel-big-lebowski pawel-big-lebowski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the retention tests and scenarios they cover and verify 🚀 👍 🏅

The one thing that could be tested more is:

  • loading event from event_full
  • applying retention
  • verifying if other tables get cleared on cascade while not throwing an error.

Leaving @wslulciuc the decision if this is worth the effort and approving PR.

@wslulciuc wslulciuc requested review from tatiana and julienledem July 18, 2023 14:48
Signed-off-by: wslulciuc <willy@datakin.com>
Signed-off-by: wslulciuc <willy@datakin.com>
@wslulciuc
Copy link
Member Author

wslulciuc commented Jul 18, 2023

Thanks you @julienledem, @pawel-big-lebowski, @tatiana and @JDarDagran for all of your feedback. I've added a detailed descripiton of the retention logic, support for dry runs, and extensive testing (@pawel-big-lebowski I'll follow up with a PR with your suggestions as this PR is getting way to big and the core logic has been covered in the current test suite). Let's merge! 💯

@wslulciuc wslulciuc enabled auto-merge (squash) July 18, 2023 19:23
Signed-off-by: wslulciuc <willy@datakin.com>
@wslulciuc wslulciuc merged commit a515c91 into main Jul 18, 2023
@wslulciuc wslulciuc deleted the feature/db-retention branch July 18, 2023 19:36
@wslulciuc wslulciuc removed the review Ready for review label Jul 19, 2023
jonathanpmoraes pushed a commit to nubank/NuMarquez that referenced this pull request Feb 6, 2025
* Add db migration to add cascade deletion on `fk`s

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `DbDataRetention` and `dataRetentionInDays` config

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `DbRetentionJob`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `DbRetentionCommand`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `frequencyMins` config for runs and rename `dbRetentionInDays`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add docs to `DbRetentionJob` and minor renaming

Signed-off-by: wslulciuc <willy@datakin.com>

* Wrap `DbRetention.retentionOnDbOrError()` in `try/catch`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add docs to DbRetention

Signed-off-by: wslulciuc <willy@datakin.com>

* continued: Add docs to `DbRetention`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add handling of `errorOnDbRetention`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add docs to `DbException` and `DbRetentionException`

Signed-off-by: wslulciuc <willy@datakin.com>

* `info` -> `debug` when inserting column lineage

Signed-off-by: wslulciuc <willy@datakin.com>

* Remove `dbRetention.enabled`

Signed-off-by: wslulciuc <willy@datakin.com>

* Update handling of `StatementException`

Signed-off-by: wslulciuc <willy@datakin.com>

* Minor changes

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `docs/faq.md`

Signed-off-by: wslulciuc <willy@datakin.com>

* continued: `Add docs/faq.md`

Signed-off-by: wslulciuc <willy@datakin.com>

* continued: Add `docs/faq.md`

Signed-off-by: wslulciuc <willy@datakin.com>

* continued: Add `docs/faq.md`

Signed-off-by: wslulciuc <willy@datakin.com>

* Define `DEFAULT_RETENTION_DAYS` constant in `DbRetention`

Signed-off-by: wslulciuc <willy@datakin.com>

* Make chunk size in retention query configurable

Signed-off-by: wslulciuc <willy@datakin.com>

* Remove `DATA_RETENTION_IN_DAYS` from `MarquezConfig`

Signed-off-by: wslulciuc <willy@datakin.com>

* Update docs for chunk size config

Signed-off-by: wslulciuc <willy@datakin.com>

* Remove error log from `DbRetention.retentionOnDbOrError()`

Signed-off-by: wslulciuc <willy@datakin.com>

* Use `LOOP` for retention

Signed-off-by: wslulciuc <willy@datakin.com>

* continued: Use `LOOP` for retention

Signed-off-by: wslulciuc <willy@datakin.com>

* Use `numberOfRowsPerBatch`

Signed-off-by: wslulciuc <willy@datakin.com>

* Use `--number-of-rows-per-batch`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add pause to prevent lock timeouts

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `FOR UPDATE SKIP LOCKED`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `sql()`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `--dry-run`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `jdbi3-testcontainers`

Signed-off-by: wslulciuc <willy@datakin.com>

* Remove shortened flag args

Signed-off-by: wslulciuc <willy@datakin.com>

* Use `marquez.db.DbRetention.DEFAULT_DRY_RUN`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add DbRetention.retentionOnRuns()

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `DbMigration.migrateDbOrError(DataSource)`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `TestingDb`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `DbTest`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `testRetentionOnDbOrError_withDatasetsOlderThanXDays()`

Signed-off-by: wslulciuc <willy@datakin.com>

* Remove `jobs.DbRetentionConfig.dryRun`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `--dry-run` option to `faq.md`

Signed-off-by: wslulciuc <willy@datakin.com>

* continued: Add --dry-run option to faq.md

Signed-off-by: wslulciuc <willy@datakin.com>

* continued: `Add testRetentionOnDbOrError_withDatasetsOlderThanXDays`

Signed-off-by: wslulciuc <willy@datakin.com>

* Fix retention query for datasets and dataset versions

Signed-off-by: wslulciuc <willy@datakin.com>

* Add test for retention on dataset versions

Signed-off-by: wslulciuc <willy@datakin.com>

* Add comments to tests

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `testRetentionOnDbOrErrorWithDatasetVersionsOlderThanXDays_skipIfVersionAsInputForRun()`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `testRetentionOnDbOrErrorWithJobsOlderThanXDays()`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `testRetentionOnDbOrErrorWithJobVersionsOlderThanXDays()`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add tests for dry run

Signed-off-by: wslulciuc <willy@datakin.com>

* Add testRetentionOnDbOrErrorWithRunsOlderThanXDays()

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `testRetentionOnDbOrErrorWithOlEventsOlderThanXDays()`

Signed-off-by: wslulciuc <willy@datakin.com>

* continued: `Add testRetentionOnDbOrErrorWithOlEventsOlderThanXDays()`

Signed-off-by: wslulciuc <willy@datakin.com>

* Add `javadocs` to `DbRetention`

Signed-off-by: wslulciuc <willy@datakin.com>

* Run tests in order of retention

Signed-off-by: wslulciuc <willy@datakin.com>

---------

Signed-off-by: wslulciuc <willy@datakin.com>
Co-authored-by: Harel Shein <harel.shein@astronomer.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api API layer changes docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: Data Retention Functionality
5 participants