Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support TTL for BigQuery tables #2711

Merged
merged 18 commits into from
Aug 19, 2020

Conversation

kconvey
Copy link
Contributor

@kconvey kconvey commented Aug 17, 2020

resolves #2697

Description

Add time_to_expiration as a BigQuery specific adapter configuration option. This will only be applied if the adapter is not already creating a temporary table (which uses the same method to set a ttl of 12 hours). This option should be an integer, and is in hours.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Aug 17, 2020
Comment on lines 32 to 37
@use_profile('bigquery')
def test_bigquery_location_invalid(self):
_, stdout = self.run_dbt_and_capture()
self.assertIn(
'expiration_timestamp: TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL '
'4 hour)', stdout)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is working the way I'm hoping it is (asserts the ddl contains the option and succeeds), I'll probably add a couple more of these tests for other adapter specific configs like kms_key_name, etc.

@kconvey kconvey changed the title Kconvey model ttl Support TTL for BigQuery tables Aug 17, 2020
@@ -745,6 +746,11 @@ def get_table_options(
expiration = 'TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 12 hour)'
opts['expiration_timestamp'] = expiration

if (config.get('time_to_expiration') is not None) and (not temporary):
expiration = ('TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL '
'{} hour').format(config.get('time_to_expiration'))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name 'time_to_expiration' doesn't provide any hints about the unit. Maybe this could be 'hours_to_expiration'?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtcohen6 How do you feel about this, following #2697?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm totally in favor of hours_to_expiration

Comment on lines 34 to 37
_, stdout = self.run_dbt_and_capture()
self.assertIn(
'expiration_timestamp: TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL '
'4 hour)', stdout)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the example I copied here was one that expected failure on the model, so the query would be dumped in stdout.

I probably want to inspect results from self.run_dbt(), but could use a pointer to the compiled SQL within the results to do this assertIn (it's a little hard to decipher the schema sometimes). Let me know if that makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want results[index].node.injected_sql. You can look for results by node name using results[index].node.name.

Also, don't feel at all obligated to do this, but because we use pytest for tests now you are free to use the (much more ergonomic, at least to me) assert whatever in stdout syntax.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@beckjake The error I got makes me think injected_sql isn't what I'm looking for in results.

E AssertionError: 'expiration_timestamp: TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 4 hour)' not found in 'select 1 as id'

This config adds the expiration_timestamp as part of the ddl, and if this was ddl, it should say something like create or replace table as .... I can't remember if this is present in debug (which I believe just dumps the query), or where else the full ddl might be in the results. Any ideas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, now that I look at this more carefully, I think this will be in the output if you run with --debug, but not the injected_sql. injected_sql contains the value that will end up as the sql value in the materialization. But this change happens ultimately in the create_table_as macro that's called from the materialization.

It is, I suppose, always possible that we don't log all our queries on bigquery? That would be pretty bad behavior on our part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran this on a real project locally and it doesn't look like the ddl is anywhere in run_results.json, but it definitely is in the output with --debug.

@kconvey
Copy link
Contributor Author

kconvey commented Aug 19, 2020

This is passing all tests at this point @beckjake. Thanks for the help & patience with this!

@jtcohen6 bumping the suggestion to call this hours_to_expiration, or something more self describing with respect to the unit. Let me know if you think I should change before this gets merged.

Still think it would be nice if the BigQuery ddl existed somewhere other than debug, but I don't believe it does. Something to think about.

@jtcohen6
Copy link
Contributor

@kconvey I'm in favor of changing this to hours_to_expiration. Thanks for the recommendation @heisencoder!

Copy link
Contributor

@beckjake beckjake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@beckjake beckjake merged commit f043f94 into dbt-labs:dev/marian-anderson Aug 19, 2020
@kconvey kconvey deleted the kconvey-model-ttl branch August 19, 2020 18:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support TTL for BigQuery tables
4 participants